This document provides a high-level introduction to vLLM's architecture as a high-throughput inference engine for Large Language Models (LLMs). It explains the layered system design, request processing pipeline, and how the major components interact to enable efficient model serving.
For detailed information about specific subsystems, see the relevant wiki pages:
| Topic | Wiki Page |
|---|---|
| Configuration and initialization | Page 2 |
| Request scheduling and KV cache management | Page 3 |
| Model execution details | Page 4 |
| Model support and registration | Page 5 |
| Serving APIs | Page 6 |
| Quantization and MoE optimizations | Page 7 |
| Attention backends | Page 8 |
| Distributed parallelism strategies | Page 9 |
| Hardware platform support | Page 10 |
vLLM is a high-throughput inference engine designed for serving Large Language Models efficiently. The current production engine is V1 (V0 has been fully deprecated). It provides:
- Offline batch inference via the `LLM` class
- Asynchronous serving via `AsyncLLM`
- An OpenAI-compatible REST API

The codebase is organized into distinct layers, each responsible for specific concerns, from user-facing APIs down to hardware-specific kernels.
Sources: vllm/entrypoints/llm.py 107-203 vllm/v1/engine/async_llm.py 70-138 docs/usage/v1_guide.md
vLLM follows a layered architecture where each layer has well-defined responsibilities:
Layer 1 provides user interfaces for different use cases: offline inference (LLM), async serving (AsyncLLM), CLI tools, and REST API endpoints.
Layer 2 handles protocol translation between user inputs and engine requests. InputProcessor converts prompts to EngineCoreRequest objects, while OutputProcessor converts engine outputs to RequestOutput objects.
Layer 3 contains the core scheduling logic. EngineCore runs the main loop that schedules requests, manages KV cache allocation, and coordinates execution.
Layer 4 abstracts distributed execution. The Executor spawns and manages Worker processes across GPUs/nodes, each running a GPUModelRunner for local inference.
Layer 5 contains the actual model implementations, attention backends, and hardware-specific kernels.
Sources: vllm/entrypoints/llm.py 205-355 vllm/v1/engine/async_llm.py 73-159 vllm/v1/engine/core.py 80-224
A request flows through vLLM from submission to completion as follows:
Request Submission: Users submit requests through AsyncLLM.generate() or LLM.generate(). The InputProcessor tokenizes prompts and creates EngineCoreRequest objects containing token IDs and sampling parameters.
Scheduling: Each engine step, Scheduler.schedule() selects which requests to process based on resource availability, creates SchedulerOutput with selected requests and allocated KV cache blocks.
Execution: Executor distributes SchedulerOutput to workers via execute_model(). Each GPUModelRunner prepares input tensors, runs the model forward pass, and samples output tokens.
Output Processing: OutputProcessor receives sampled token IDs, detokenizes them to text, computes logprobs if requested, and creates RequestOutput objects for users.
Iteration: The loop continues until all requests reach stop conditions (max tokens, EOS token, etc.).
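The five steps above amount to a continuous-batching loop. The following is a minimal, self-contained sketch of that loop's shape; the class and function names (`Request`, `schedule`, `execute_model`, `step`) are simplified stand-ins for vLLM's real components, not its actual API:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for vLLM's classes, illustrating
# the step loop: schedule -> execute -> process outputs -> check stops.

@dataclass
class Request:
    req_id: str
    prompt_tokens: list
    max_tokens: int
    output_tokens: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        # Stop condition: max tokens reached (real vLLM also checks EOS, etc.)
        return len(self.output_tokens) >= self.max_tokens

def schedule(waiting, budget=2):
    # Pick up to `budget` unfinished requests; the real Scheduler also
    # checks KV cache block availability before admitting a request.
    return [r for r in waiting if not r.finished][:budget]

def execute_model(batch):
    # Stand-in for the forward pass + sampling: emit one token per request.
    return {r.req_id: 42 for r in batch}

def step(requests):
    batch = schedule(requests)
    sampled = execute_model(batch)
    for r in batch:
        r.output_tokens.append(sampled[r.req_id])  # "output processing"
    return [r for r in requests if not r.finished]

requests = [Request("a", [1, 2], max_tokens=3), Request("b", [3], max_tokens=1)]
steps = 0
while requests:
    requests = step(requests)
    steps += 1
print(steps)  # "b" finishes after one step; "a" keeps the loop alive for 3
```

Note how requests join and leave the batch independently between steps; this is the property that lets vLLM keep the GPU saturated with mixed in-flight requests.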
Sources: vllm/v1/engine/async_llm.py 286-355 vllm/v1/engine/core.py 375-404 vllm/v1/engine/output_processor.py 45-152
EngineCore is the heart of vLLM's inference engine, located at vllm/v1/engine/core.py 83-230. It orchestrates the main inference loop:

- Request management: accepts requests via `add_request()`, tracks active requests, and handles aborts via `abort_requests()`
- Scheduling: calls `SchedulerInterface.schedule()` to select requests and allocate KV cache
- Execution: dispatches to `Executor.execute_model()` for distributed inference
- Pipeline parallelism: maintains a deque-based batch queue (`batch_queue`) to eliminate pipeline bubbles

Key methods: `step()`, `step_with_batch_queue()`, `add_request()`, `abort_requests()`, `_initialize_kv_caches()`
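To see why a deque of in-flight batches avoids pipeline bubbles, consider this toy illustration (not vLLM's actual code): with multiple pipeline stages, a new batch can be scheduled before the previous batch's results return, so no stage sits idle waiting:

```python
from collections import deque

# Toy model of EngineCore's batch queue under pipeline parallelism.
pp_stages = 2
batch_queue = deque(maxlen=pp_stages)

def schedule_batch(i):
    return f"batch-{i}"          # stand-in for a SchedulerOutput

def submit(batch):
    return f"future({batch})"    # stand-in for an async execute_model() handle

completed = []
for i in range(4):
    if len(batch_queue) == batch_queue.maxlen:
        # Queue full: wait for the oldest in-flight batch before
        # scheduling another, keeping at most `pp_stages` batches in flight.
        completed.append(batch_queue.popleft())
    batch_queue.append(submit(schedule_batch(i)))

while batch_queue:                # drain remaining in-flight batches
    completed.append(batch_queue.popleft())

print(completed)
```

The key property is that scheduling of batch N+1 overlaps with execution of batch N, which is what "eliminating pipeline bubbles" means in practice.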
The Scheduler (interface at vllm/v1/core/sched/interface.py) makes resource allocation decisions:
- Configured via `SchedulerConfig`
- Allocates KV cache blocks through `KVCacheManager`
- Produces `SchedulerOutput` containing selected requests, block allocations, and metadata for model execution
GPUModelRunner manages model execution on a single GPU. It is located in vllm/v1/worker/gpu_model_runner.py:
- Prepares an `InputBatch` with token IDs, positions, and attention metadata
- Runs the forward pass and samples tokens via the `Sampler` class
- Captures CUDA graphs via `capture_model()`

Key methods: `execute_model()`, `sample_tokens()`, `capture_model()`
vLLM uses a hierarchical configuration system centered around VllmConfig:
EngineArgs at vllm/engine/arg_utils.py 362-613 parses CLI arguments and creates config objects. VllmConfig (in vllm/config/vllm.py) aggregates all configuration and is passed throughout the system. All config classes are exported from vllm/config/__init__.py.
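The shape of that hierarchy can be sketched with plain dataclasses. The class names below follow the real ones, but the fields and the `create_engine_config` helper are illustrative assumptions, not vLLM's actual signatures:

```python
from dataclasses import dataclass

# Simplified mock of the config hierarchy: flat EngineArgs-style inputs
# are parsed into per-concern configs, then aggregated into one VllmConfig.

@dataclass
class ModelConfig:
    model: str
    dtype: str = "auto"

@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

@dataclass
class VllmConfig:
    model_config: ModelConfig
    parallel_config: ParallelConfig

def create_engine_config(model: str, tp: int = 1) -> VllmConfig:
    # In vLLM, EngineArgs performs this aggregation plus validation.
    if tp < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return VllmConfig(
        model_config=ModelConfig(model=model),
        parallel_config=ParallelConfig(tensor_parallel_size=tp),
    )

cfg = create_engine_config("facebook/opt-125m", tp=2)
print(cfg.parallel_config.tensor_parallel_size)  # 2
```

Aggregating everything into one object means any component can receive a single `VllmConfig` rather than a dozen loose parameters.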
Sources: vllm/engine/arg_utils.py 362-613 vllm/config/__init__.py vllm/config/model.py 99-351 vllm/config/parallel.py
The LLM class at vllm/entrypoints/llm.py 107-216 provides a simple synchronous interface for offline inference:
Key methods:
| Method | Description |
|---|---|
| `generate()` | Batch text generation; blocks until all requests complete |
| `chat()` | Multi-turn conversation interface |
| `encode()` | Generate embeddings for pooling models |
| `score()` | Compute similarity scores |
| `apply_model()` | Apply a function to the underlying `nn.Module` |
Internally, LLM uses LLMEngine (an alias for the V1 engine in vllm/v1/engine/llm_engine.py) and communicates with the engine core through an EngineCoreClient, either InprocClient or SyncMPClient.
The AsyncLLM class at vllm/v1/engine/async_llm.py 71-184 provides an async/await interface with streaming support:
Uses EngineCoreClient.make_async_mp_client() (typically AsyncMPClient) to communicate with EngineCore running in a separate background process via ZMQ sockets, enabling concurrent request handling and non-blocking execution.
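The client/core split can be illustrated with two queues and a background worker. This is only a toy stand-in: real vLLM runs EngineCore in a separate process and uses ZMQ sockets, whereas this sketch uses a thread and `queue.Queue` to keep the example self-contained:

```python
import queue
import threading

# Toy stand-in for the AsyncMPClient <-> EngineCore split: the "engine
# core" runs in the background and the client submits requests and
# collects outputs without blocking on model execution.

request_q: "queue.Queue" = queue.Queue()
output_q: "queue.Queue" = queue.Queue()

def engine_core_loop():
    while True:
        req = request_q.get()
        if req is None:           # shutdown sentinel
            break
        req_id, prompt = req
        output_q.put((req_id, f"echo:{prompt}"))  # stand-in for generation

core = threading.Thread(target=engine_core_loop, daemon=True)
core.start()

# Client side: enqueue two requests, then gather their outputs.
request_q.put(("r1", "hello"))
request_q.put(("r2", "world"))
results = dict(output_q.get() for _ in range(2))
request_q.put(None)
core.join()
print(results["r1"], results["r2"])
```

The payoff of this design is that request submission, scheduling, and output streaming all proceed concurrently with GPU execution.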
The API server at vllm/entrypoints/openai/api_server.py provides REST endpoints compatible with OpenAI's API:
| Endpoint | Description |
|---|---|
| `POST /v1/completions` | Text completion |
| `POST /v1/chat/completions` | Chat completion from a list of messages |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/responses` | Responses API with tool calling |
| `GET /health` | Health check |
Built with FastAPI and uses AsyncLLM internally for request processing.
Sources: vllm/entrypoints/llm.py 107-216 vllm/v1/engine/async_llm.py 71-184 vllm/v1/engine/llm_engine.py 48-60 vllm/engine/llm_engine.py
vLLM supports multiple parallelism strategies for scaling to large models and high throughput:
Tensor Parallelism (TP): Splits individual weight matrices across GPUs. Each GPU computes a portion of matrix multiplications, with all-reduce to combine results. Configured via tensor_parallel_size.
Pipeline Parallelism (PP): Assigns consecutive transformer layers to different GPUs. Requests flow through stages sequentially. Configured via pipeline_parallel_size.
Data Parallelism (DP): Replicates the entire model across GPU groups. Each replica serves independent requests. Configured via data_parallel_size.
Expert Parallelism (EP): For Mixture-of-Experts models, distributes experts across GPUs. Configured via enable_expert_parallel.
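The tensor-parallel all-reduce mentioned above can be demonstrated with plain Python lists in place of GPU tensors. This is a numerical illustration of the idea, not vLLM's implementation: the weight matrix is split along the input dimension across "ranks", each rank computes a partial matmul, and the partials are summed (the all-reduce):

```python
# Toy tensor parallelism: row-split weight, partial matmuls, all-reduce sum.

def matmul_vec(x, W):
    # y[j] = sum_i x[i] * W[i][j]
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4x2 weight matrix

# "Rank 0" holds rows 0-1 of W (and the matching slice of x);
# "rank 1" holds rows 2-3.
partial0 = matmul_vec(x[:2], W[:2])
partial1 = matmul_vec(x[2:], W[2:])

# All-reduce: elementwise sum of the partial results across ranks.
y_tp = [a + b for a, b in zip(partial0, partial1)]
y_ref = matmul_vec(x, W)
print(y_tp == y_ref)  # the sharded result matches the unsharded one
```

Because each rank stores only a slice of `W`, per-GPU weight memory shrinks by the TP degree, at the cost of one collective communication per sharded layer.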
The Executor interface (in vllm/v1/executor/) provides the abstraction. MultiprocExecutor uses Python multiprocessing and ZMQ for single-node setups. RayDistributedExecutor uses Ray for multi-node clusters. UniProcExecutor runs in-process for debugging.
Sources: vllm/config/parallel.py vllm/v1/engine/core.py107-160
vLLM's memory efficiency comes from its KV cache management system:
Paged KV Cache: KV cache is divided into fixed-size blocks (typically 16 tokens). Each request's KV cache is a sequence of block references, allowing non-contiguous physical storage.
Prefix Caching: Common prompt prefixes are hashed and stored once. Multiple requests sharing a prefix reuse the same physical blocks, saving memory and computation.
Block Allocation: Scheduler allocates blocks from BlockPool. When memory is full, lower-priority requests are preempted and their blocks freed.
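The bookkeeping behind paged allocation and prefix caching can be sketched as follows. This is a deliberately simplified model (the real BlockPool stores actual key/value tensors, uses reference counting, and hashes block contents differently), but it shows how two requests sharing a 32-token prefix end up sharing physical blocks:

```python
import hashlib

# Toy paged-KV-cache bookkeeping: fixed-size blocks, a free pool, and a
# hash table mapping full prefix blocks to physical block ids.

BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.cached = {}  # hash of token prefix -> physical block id

    def allocate(self, token_ids):
        """Return the block table (physical block ids) for a request."""
        table = []
        prefix = []
        for i in range(0, len(token_ids), BLOCK_SIZE):
            block = token_ids[i:i + BLOCK_SIZE]
            prefix += block
            if len(block) == BLOCK_SIZE:  # only full blocks are cacheable
                h = hashlib.sha256(bytes(prefix)).hexdigest()
                if h in self.cached:      # prefix hit: reuse physical block
                    table.append(self.cached[h])
                    continue
                self.cached[h] = self.free[0]
            table.append(self.free.pop(0))
        return table

pool = BlockPool(num_blocks=8)
shared = list(range(32))                # 32-token shared prompt prefix
t1 = pool.allocate(shared + [100])
t2 = pool.allocate(shared + [101])
print(t1, t2)  # first two blocks are shared; only the last block differs
```

Hashing the cumulative prefix (rather than each block in isolation) is what makes reuse safe: a block is only shared when everything before it is identical too.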
The cache sizing is determined during initialization in EngineCore._initialize_kv_caches():
1. `Executor.get_kv_cache_specs()` returns per-layer cache requirements
2. `Executor.determine_available_memory()` profiles the model to measure peak activation memory
3. `get_kv_cache_configs()` computes the number of allocatable blocks given the available memory
4. `Executor.initialize_from_config()` allocates and warms up the actual cache tensors

Sources: vllm/v1/engine/core.py 231-287 vllm/config/cache.py
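The sizing step boils down to simple arithmetic: available memory divided by the per-block KV footprint. The numbers below are made up for illustration; only the formula reflects the computation described above:

```python
# Back-of-the-envelope KV cache sizing (example numbers, not real profiling).

gpu_memory = 24 * 1024**3               # a 24 GiB card (example)
gpu_memory_utilization = 0.9            # fraction vLLM is allowed to use
weights_and_activations = 16 * 1024**3  # measured by profiling (example)

available = int(gpu_memory * gpu_memory_utilization) - weights_and_activations

# Per-block KV bytes:
#   2 (K and V) * layers * kv_heads * head_dim * block_size * dtype bytes
num_layers, kv_heads, head_dim, block_size, dtype_bytes = 32, 8, 128, 16, 2
per_block = 2 * num_layers * kv_heads * head_dim * block_size * dtype_bytes

num_blocks = available // per_block
print(num_blocks)
```

With these example figures, roughly 6 GiB remains for the cache, yielding a few thousand 16-token blocks; raising `gpu_memory_utilization` or quantizing the cache dtype increases the count directly.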
vLLM supports models through two paths:
Native Models: Optimized implementations under vllm/model_executor/models/ with custom attention, quantization, and kernel support. The registry (_TEXT_GENERATION_MODELS, _MULTIMODAL_MODELS, _EMBEDDING_MODELS) at vllm/model_executor/models/registry.py 70-300 maps HuggingFace architecture names to implementation classes.
Transformers Backend: For models without a native vLLM implementation, the Transformers modeling backend wraps HuggingFace model classes. Configured via model_impl="transformers" or --model-impl transformers.
Multimodal Support: Vision-language models (InternVL, Qwen2-VL, Phi-3.5-vision, LLaVA, etc.) supported through MultiModalRegistry and BaseMultiModalProcessor at vllm/multimodal/.
Quantization: Built-in support for FP8, INT4 (GPTQ, AWQ, Marlin), MXFP4, INT8, and BitsAndBytes quantization methods, configured via ModelConfig.quantization.
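The two resolution paths can be sketched as a dictionary lookup with a fallback. The mapping entries and the `resolve_architecture` helper below are illustrative assumptions about the registry's shape, not its exact contents:

```python
# Toy version of the architecture-name -> implementation mapping: a native
# entry wins unless the user forces the Transformers backend.

_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
}

def resolve_architecture(architectures, model_impl="auto"):
    """Pick a native implementation if one is registered, else fall back."""
    for arch in architectures:
        if model_impl != "transformers" and arch in _TEXT_GENERATION_MODELS:
            return ("native", *_TEXT_GENERATION_MODELS[arch])
    # No native match (or model_impl="transformers"): wrap the HF class.
    return ("transformers", None, None)

print(resolve_architecture(["LlamaForCausalLM"]))   # native path
print(resolve_architecture(["SomeNewModel"]))       # fallback path
```

This is why new HuggingFace models often work in vLLM before a native implementation lands: the fallback path only needs the HF modeling code, while the native path unlocks the optimized kernels.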
Sources: vllm/model_executor/models/registry.py 70-300 vllm/config/model.py 99-351
The typical initialization follows these phases:
1. Configuration: build the `VllmConfig` and validate settings
2. Worker setup: the Executor spawns Worker processes, each initializing a GPUModelRunner
3. KV cache initialization: `EngineCore._initialize_kv_caches()` profiles memory and allocates cache blocks

After initialization, the engine enters its main serving loop, ready to process requests.
Sources: vllm/v1/engine/core.py 83-224 vllm/v1/worker/gpu_worker.py 218-529 vllm/v1/worker/gpu_model_runner.py 740-1194