This document provides a high-level introduction to vLLM's architecture as a high-throughput inference engine for Large Language Models (LLMs). It explains the layered system design, request processing pipeline, and how the major components interact to enable efficient model serving.
For detailed information about specific subsystems, see the relevant wiki pages:
| Topic | Wiki Page |
|---|---|
| Configuration and initialization | Page 2 |
| Request scheduling and KV cache management | Page 3 |
| Model execution details | Page 4 |
| Model support and registration | Page 5 |
| Serving APIs | Page 6 |
| Quantization and MoE optimizations | Page 7 |
| Attention backends | Page 8 |
| Distributed parallelism strategies | Page 9 |
| Hardware platform support | Page 10 |
vLLM is a high-throughput inference engine designed for serving Large Language Models efficiently. The current production engine is V1 (V0 has been fully deprecated). It provides:
- Offline batch inference via the `LLM` class
- Asynchronous serving via `AsyncLLM`
- An OpenAI-compatible REST API

The codebase is organized into distinct layers, each responsible for specific concerns, from user-facing APIs down to hardware-specific kernels.
Sources: vllm/entrypoints/llm.py 107-203 vllm/v1/engine/async_llm.py 70-138 docs/usage/v1_guide.md
vLLM follows a layered architecture where each layer has well-defined responsibilities:
Layer 1 provides user interfaces for different use cases: offline inference (LLM), async serving (AsyncLLM), CLI tools, and REST API endpoints.
Layer 2 handles protocol translation between user inputs and engine requests. InputProcessor converts prompts to EngineCoreRequest objects, while OutputProcessor converts engine outputs to RequestOutput objects.
Layer 3 contains the core scheduling logic. EngineCore runs the main loop that schedules requests, manages KV cache allocation, and coordinates execution.
Layer 4 abstracts distributed execution. The Executor spawns and manages Worker processes across GPUs/nodes, each running a GPUModelRunner for local inference.
Layer 5 contains the actual model implementations, attention backends, and hardware-specific kernels.
Sources: vllm/entrypoints/llm.py 205-355 vllm/v1/engine/async_llm.py 73-159 vllm/v1/engine/core.py 80-224
A request flows through vLLM from submission to completion as follows:
Request Submission: Users submit requests through AsyncLLM.generate() or LLM.generate(). The InputProcessor tokenizes prompts and creates EngineCoreRequest objects containing token IDs and sampling parameters.
Scheduling: Each engine step, Scheduler.schedule() selects which requests to process based on resource availability, creates SchedulerOutput with selected requests and allocated KV cache blocks.
Execution: Executor distributes SchedulerOutput to workers via execute_model(). Each GPUModelRunner prepares input tensors, runs the model forward pass, and samples output tokens.
Output Processing: OutputProcessor receives sampled token IDs, detokenizes them to text, computes logprobs if requested, and creates RequestOutput objects for users.
Iteration: The loop continues until all requests reach stop conditions (max tokens, EOS token, etc.).
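The five steps above amount to a continuous-batching loop. The following is a minimal, self-contained sketch of that loop's shape; the class and function names (`Request`, `schedule`, `execute_model`, `step`) are simplified stand-ins for vLLM's real components, not its actual API:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for vLLM's classes, illustrating
# the step loop: schedule -> execute -> process outputs -> check stops.

@dataclass
class Request:
    req_id: str
    prompt_tokens: list
    max_tokens: int
    output_tokens: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        # Stop condition: max tokens reached (real vLLM also checks EOS, etc.)
        return len(self.output_tokens) >= self.max_tokens

def schedule(waiting, budget=2):
    # Pick up to `budget` unfinished requests; the real Scheduler also
    # checks KV cache block availability before admitting a request.
    return [r for r in waiting if not r.finished][:budget]

def execute_model(batch):
    # Stand-in for the forward pass + sampling: emit one token per request.
    return {r.req_id: 42 for r in batch}

def step(requests):
    batch = schedule(requests)
    sampled = execute_model(batch)
    for r in batch:
        r.output_tokens.append(sampled[r.req_id])  # "output processing"
    return [r for r in requests if not r.finished]

requests = [Request("a", [1, 2], max_tokens=3), Request("b", [3], max_tokens=1)]
steps = 0
while requests:
    requests = step(requests)
    steps += 1
print(steps)  # "b" finishes after one step; "a" keeps the loop alive for 3
```

Note how requests join and leave the batch independently between steps; this is the property that lets vLLM keep the GPU saturated with mixed in-flight requests.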
Sources: vllm/v1/engine/async_llm.py 286-355 vllm/v1/engine/core.py 375-404 vllm/v1/engine/output_processor.py 45-152
EngineCore is the heart of vLLM's inference engine, located at vllm/v1/engine/core.py 83-230. It orchestrates the main inference loop:

- Request management: accepts requests via `add_request()`, tracks active requests, and handles aborts via `abort_requests()`
- Scheduling: calls `SchedulerInterface.schedule()` to select requests and allocate KV cache
- Execution: dispatches to `Executor.execute_model()` for distributed inference
- Pipeline parallelism: maintains a deque-based batch queue (`batch_queue`) to eliminate pipeline bubbles

Key methods: `step()`, `step_with_batch_queue()`, `add_request()`, `abort_requests()`, `_initialize_kv_caches()`
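To see why a deque of in-flight batches avoids pipeline bubbles, consider this toy illustration (not vLLM's actual code): with multiple pipeline stages, a new batch can be scheduled before the previous batch's results return, so no stage sits idle waiting:

```python
from collections import deque

# Toy model of EngineCore's batch queue under pipeline parallelism.
pp_stages = 2
batch_queue = deque(maxlen=pp_stages)

def schedule_batch(i):
    return f"batch-{i}"          # stand-in for a SchedulerOutput

def submit(batch):
    return f"future({batch})"    # stand-in for an async execute_model() handle

completed = []
for i in range(4):
    if len(batch_queue) == batch_queue.maxlen:
        # Queue full: wait for the oldest in-flight batch before
        # scheduling another, keeping at most `pp_stages` batches in flight.
        completed.append(batch_queue.popleft())
    batch_queue.append(submit(schedule_batch(i)))

while batch_queue:                # drain remaining in-flight batches
    completed.append(batch_queue.popleft())

print(completed)
```

The key property is that scheduling of batch N+1 overlaps with execution of batch N, which is what "eliminating pipeline bubbles" means in practice.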
The Scheduler (interface at vllm/v1/core/sched/interface.py) makes resource allocation decisions:
- Configured via `SchedulerConfig`
- Allocates KV cache blocks through `KVCacheManager`
- Produces `SchedulerOutput` containing selected requests, block allocations, and metadata for model execution
GPUModelRunner manages model execution on a single GPU. It is located in vllm/v1/worker/gpu_model_runner.py:
- Prepares an `InputBatch` with token IDs, positions, and attention metadata
- Runs the forward pass and samples tokens via the `Sampler` class
- Captures CUDA graphs via `capture_model()`

Key methods: `execute_model()`, `sample_tokens()`, `capture_model()`
vLLM uses a hierarchical configuration system centered around VllmConfig:
EngineArgs at vllm/engine/arg_utils.py 362-613 parses CLI arguments and creates config objects. VllmConfig (in vllm/config/vllm.py) aggregates all configuration and is passed throughout the system. All config classes are exported from vllm/config/__init__.py.
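The shape of that hierarchy can be sketched with plain dataclasses. The class names below follow the real ones, but the fields and the `create_engine_config` helper are illustrative assumptions, not vLLM's actual signatures:

```python
from dataclasses import dataclass

# Simplified mock of the config hierarchy: flat EngineArgs-style inputs
# are parsed into per-concern configs, then aggregated into one VllmConfig.

@dataclass
class ModelConfig:
    model: str
    dtype: str = "auto"

@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

@dataclass
class VllmConfig:
    model_config: ModelConfig
    parallel_config: ParallelConfig

def create_engine_config(model: str, tp: int = 1) -> VllmConfig:
    # In vLLM, EngineArgs performs this aggregation plus validation.
    if tp < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return VllmConfig(
        model_config=ModelConfig(model=model),
        parallel_config=ParallelConfig(tensor_parallel_size=tp),
    )

cfg = create_engine_config("facebook/opt-125m", tp=2)
print(cfg.parallel_config.tensor_parallel_size)  # 2
```

Aggregating everything into one object means any component can receive a single `VllmConfig` rather than a dozen loose parameters.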
Sources: vllm/engine/arg_utils.py 362-613 vllm/config/__init__.py vllm/config/model.py 99-351 vllm/config/parallel.py
The LLM class at vllm/entrypoints/llm.py 107-216 provides a simple synchronous interface for offline inference:
Key methods:
| Method | Description |
|---|---|
| `generate()` | Batch text generation; blocks until all requests complete |
| `chat()` | Multi-turn conversation interface |
| `encode()` | Generate embeddings for pooling models |
| `score()` | Compute similarity scores |
| `apply_model()` | Apply a function to the underlying `nn.Module` |
Internally, LLM uses LLMEngine (an alias for the V1 engine in vllm/v1/engine/llm_engine.py) and communicates with the engine core through an EngineCoreClient, either InprocClient or SyncMPClient.
The AsyncLLM class at vllm/v1/engine/async_llm.py 71-184 provides an async/await interface with streaming support:
Uses EngineCoreClient.make_async_mp_client() (typically AsyncMPClient) to communicate with EngineCore running in a separate background process via ZMQ sockets, enabling concurrent request handling and non-blocking execution.
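The client/core split can be illustrated with two queues and a background worker. This is only a toy stand-in: real vLLM runs EngineCore in a separate process and uses ZMQ sockets, whereas this sketch uses a thread and `queue.Queue` to keep the example self-contained:

```python
import queue
import threading

# Toy stand-in for the AsyncMPClient <-> EngineCore split: the "engine
# core" runs in the background and the client submits requests and
# collects outputs without blocking on model execution.

request_q: "queue.Queue" = queue.Queue()
output_q: "queue.Queue" = queue.Queue()

def engine_core_loop():
    while True:
        req = request_q.get()
        if req is None:           # shutdown sentinel
            break
        req_id, prompt = req
        output_q.put((req_id, f"echo:{prompt}"))  # stand-in for generation

core = threading.Thread(target=engine_core_loop, daemon=True)
core.start()

# Client side: enqueue two requests, then gather their outputs.
request_q.put(("r1", "hello"))
request_q.put(("r2", "world"))
results = dict(output_q.get() for _ in range(2))
request_q.put(None)
core.join()
print(results["r1"], results["r2"])
```

The payoff of this design is that request submission, scheduling, and output streaming all proceed concurrently with GPU execution.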
The API server at vllm/entrypoints/openai/api_server.py provides REST endpoints compatible with OpenAI's API:
| Endpoint | Description |
|---|---|
| `POST /v1/completions` | Text completion |
| `POST /v1/chat/completions` | Chat completion from a list of messages |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/responses` | Responses API with tool calling |
| `GET /health` | Health check |
Built with FastAPI and uses AsyncLLM internally for request processing.
Sources: vllm/entrypoints/llm.py 107-216 vllm/v1/engine/async_llm.py 71-184 vllm/v1/engine/llm_engine.py 48-60 vllm/engine/llm_engine.py
vLLM supports multiple parallelism strategies for scaling to large models and high throughput:
Tensor Parallelism (TP): Splits individual weight matrices across GPUs. Each GPU computes a portion of matrix multiplications, with all-reduce to combine results. Configured via tensor_parallel_size.
Pipeline Parallelism (PP): Assigns consecutive transformer layers to different GPUs. Requests flow through stages sequentially. Configured via pipeline_parallel_size.
Data Parallelism (DP): Replicates the entire model across GPU groups. Each replica serves independent requests. Configured via data_parallel_size.
Expert Parallelism (EP): For Mixture-of-Experts models, distributes experts across GPUs. Configured via enable_expert_parallel.
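The tensor-parallel all-reduce mentioned above can be demonstrated with plain Python lists in place of GPU tensors. This is a numerical illustration of the idea, not vLLM's implementation: the weight matrix is split along the input dimension across "ranks", each rank computes a partial matmul, and the partials are summed (the all-reduce):

```python
# Toy tensor parallelism: row-split weight, partial matmuls, all-reduce sum.

def matmul_vec(x, W):
    # y[j] = sum_i x[i] * W[i][j]
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4x2 weight matrix

# "Rank 0" holds rows 0-1 of W (and the matching slice of x);
# "rank 1" holds rows 2-3.
partial0 = matmul_vec(x[:2], W[:2])
partial1 = matmul_vec(x[2:], W[2:])

# All-reduce: elementwise sum of the partial results across ranks.
y_tp = [a + b for a, b in zip(partial0, partial1)]
y_ref = matmul_vec(x, W)
print(y_tp == y_ref)  # the sharded result matches the unsharded one
```

Because each rank stores only a slice of `W`, per-GPU weight memory shrinks by the TP degree, at the cost of one collective communication per sharded layer.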
The Executor interface (in vllm/v1/executor/) provides the abstraction. MultiprocExecutor uses Python multiprocessing and ZMQ for single-node setups. RayDistributedExecutor uses Ray for multi-node clusters. UniProcExecutor runs in-process for debugging.
Sources: vllm/config/parallel.py vllm/v1/engine/core.py107-160
vLLM's memory efficiency comes from its KV cache management system:
Paged KV Cache: KV cache is divided into fixed-size blocks (typically 16 tokens). Each request's KV cache is a sequence of block references, allowing non-contiguous physical storage.
Prefix Caching: Common prompt prefixes are hashed and stored once. Multiple requests sharing a prefix reuse the same physical blocks, saving memory and computation.
Block Allocation: Scheduler allocates blocks from BlockPool. When memory is full, lower-priority requests are preempted and their blocks freed.
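The bookkeeping behind paged allocation and prefix caching can be sketched as follows. This is a deliberately simplified model (the real BlockPool stores actual key/value tensors, uses reference counting, and hashes block contents differently), but it shows how two requests sharing a 32-token prefix end up sharing physical blocks:

```python
import hashlib

# Toy paged-KV-cache bookkeeping: fixed-size blocks, a free pool, and a
# hash table mapping full prefix blocks to physical block ids.

BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.cached = {}  # hash of token prefix -> physical block id

    def allocate(self, token_ids):
        """Return the block table (physical block ids) for a request."""
        table = []
        prefix = []
        for i in range(0, len(token_ids), BLOCK_SIZE):
            block = token_ids[i:i + BLOCK_SIZE]
            prefix += block
            if len(block) == BLOCK_SIZE:  # only full blocks are cacheable
                h = hashlib.sha256(bytes(prefix)).hexdigest()
                if h in self.cached:      # prefix hit: reuse physical block
                    table.append(self.cached[h])
                    continue
                self.cached[h] = self.free[0]
            table.append(self.free.pop(0))
        return table

pool = BlockPool(num_blocks=8)
shared = list(range(32))                # 32-token shared prompt prefix
t1 = pool.allocate(shared + [100])
t2 = pool.allocate(shared + [101])
print(t1, t2)  # first two blocks are shared; only the last block differs
```

Hashing the cumulative prefix (rather than each block in isolation) is what makes reuse safe: a block is only shared when everything before it is identical too.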
The cache sizing is determined during initialization in EngineCore._initialize_kv_caches():
1. `Executor.get_kv_cache_specs()` returns per-layer cache requirements
2. `Executor.determine_available_memory()` profiles the model to measure peak activation memory
3. `get_kv_cache_configs()` computes the number of allocatable blocks given the available memory
4. `Executor.initialize_from_config()` allocates and warms up the actual cache tensors

Sources: vllm/v1/engine/core.py 231-287 vllm/config/cache.py
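The sizing step boils down to simple arithmetic: available memory divided by the per-block KV footprint. The numbers below are made up for illustration; only the formula reflects the computation described above:

```python
# Back-of-the-envelope KV cache sizing (example numbers, not real profiling).

gpu_memory = 24 * 1024**3               # a 24 GiB card (example)
gpu_memory_utilization = 0.9            # fraction vLLM is allowed to use
weights_and_activations = 16 * 1024**3  # measured by profiling (example)

available = int(gpu_memory * gpu_memory_utilization) - weights_and_activations

# Per-block KV bytes:
#   2 (K and V) * layers * kv_heads * head_dim * block_size * dtype bytes
num_layers, kv_heads, head_dim, block_size, dtype_bytes = 32, 8, 128, 16, 2
per_block = 2 * num_layers * kv_heads * head_dim * block_size * dtype_bytes

num_blocks = available // per_block
print(num_blocks)
```

With these example figures, roughly 6 GiB remains for the cache, yielding a few thousand 16-token blocks; raising `gpu_memory_utilization` or quantizing the cache dtype increases the count directly.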
vLLM supports models through two paths:
Native Models: Optimized implementations under vllm/model_executor/models/ with custom attention, quantization, and kernel support. The registry (_TEXT_GENERATION_MODELS, _MULTIMODAL_MODELS, _EMBEDDING_MODELS) at vllm/model_executor/models/registry.py 70-300 maps HuggingFace architecture names to implementation classes.
Transformers Backend: For models without a native vLLM implementation, the Transformers modeling backend wraps HuggingFace model classes. Configured via model_impl="transformers" or --model-impl transformers.
Multimodal Support: Vision-language models (InternVL, Qwen2-VL, Phi-3.5-vision, LLaVA, etc.) supported through MultiModalRegistry and BaseMultiModalProcessor at vllm/multimodal/.
Quantization: Built-in support for FP8, INT4 (GPTQ, AWQ, Marlin), MXFP4, INT8, and BitsAndBytes quantization methods, configured via ModelConfig.quantization.
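The two resolution paths can be sketched as a dictionary lookup with a fallback. The mapping entries and the `resolve_architecture` helper below are illustrative assumptions about the registry's shape, not its exact contents:

```python
# Toy version of the architecture-name -> implementation mapping: a native
# entry wins unless the user forces the Transformers backend.

_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
}

def resolve_architecture(architectures, model_impl="auto"):
    """Pick a native implementation if one is registered, else fall back."""
    for arch in architectures:
        if model_impl != "transformers" and arch in _TEXT_GENERATION_MODELS:
            return ("native", *_TEXT_GENERATION_MODELS[arch])
    # No native match (or model_impl="transformers"): wrap the HF class.
    return ("transformers", None, None)

print(resolve_architecture(["LlamaForCausalLM"]))   # native path
print(resolve_architecture(["SomeNewModel"]))       # fallback path
```

This is why new HuggingFace models often work in vLLM before a native implementation lands: the fallback path only needs the HF modeling code, while the native path unlocks the optimized kernels.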
Sources: vllm/model_executor/models/registry.py 70-300 vllm/config/model.py 99-351
The typical initialization follows these phases:
1. Configuration: build the `VllmConfig` and validate settings
2. Worker setup: the Executor spawns Worker processes, each initializing a GPUModelRunner
3. KV cache initialization: `EngineCore._initialize_kv_caches()` profiles memory and allocates cache blocks

After initialization, the engine enters its main serving loop, ready to process requests.
Sources: vllm/v1/engine/core.py 83-224 vllm/v1/worker/gpu_worker.py 218-529 vllm/v1/worker/gpu_model_runner.py 740-1194