This page describes the overall architecture of the vLLM v1 inference engine: its layered components, how they communicate, and how a request flows from API submission through GPU execution and back to the caller.
For configuration of these components at startup, see Configuration and Initialization. For details on how model inference is executed on the GPU, see Model Execution. For the HTTP serving layer built on top of this engine, see Serving APIs.
vLLM's engine is organized into four logical layers:
| Layer | Purpose | Key Abstractions |
|---|---|---|
| Client API | Accept and return requests | LLM, AsyncLLM, EngineClient |
| Engine Core | Schedule, execute, coordinate | EngineCore, EngineCoreProc, EngineCoreClient |
| Scheduler & Cache | Batch requests, manage KV cache | Scheduler, KVCacheManager |
| Executor & Workers | Run model on GPU(s) | Executor, worker processes |
The v1 engine uses a process-split architecture: the client-facing layer and the GPU execution loop run in separate processes (or optionally in-process for simpler deployments), communicating via ZMQ sockets.
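The split can be mimicked in miniature with two in-process queues standing in for the ZMQ input and output sockets. This is a toy analogy only — the real engine uses separate processes and ZMQ sockets, not a thread and `queue.Queue` — but it shows the shape of the request/result loop:

```python
import queue
import threading

# Toy stand-in for the process split: a "core" loop reads requests from
# an input channel and writes results to an output channel. In vLLM these
# channels are ZMQ sockets and the core runs in a separate process.
def core_loop(inbox: queue.Queue, outbox: queue.Queue) -> None:
    while True:
        request = inbox.get()
        if request is None:  # shutdown sentinel
            break
        req_id, prompt = request
        outbox.put((req_id, f"generated<{prompt}>"))  # fake "model output"

inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
core = threading.Thread(target=core_loop, args=(inbox, outbox))
core.start()

inbox.put(("req-0", "hello"))
req_id, text = outbox.get()
inbox.put(None)  # tell the core loop to exit
core.join()
print(req_id, text)  # req-0 generated<hello>
```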
Architecture overview diagram:
Sources: vllm/v1/engine/core.py83-230 vllm/v1/engine/async_llm.py71-210 vllm/v1/engine/llm_engine.py48-185 vllm/v1/engine/core_client.py66-130
There are two main entry points into the engine:
### LLM — Synchronous Offline Inference

`LLM` (vllm/entrypoints/llm.py107-392) is intended for offline batch inference. It wraps `LLMEngine` and drives the engine loop synchronously by calling `step()` in a loop until all requests are complete.

- `LLM.generate()` accepts prompts and `SamplingParams`, submits them, and blocks until completion.
- `LLM.enqueue()` / `LLM.wait_for_completion()` allow a producer/consumer split.
- `LLMEngine.from_engine_args()` is used to instantiate the engine.

### AsyncLLM — Asynchronous Online Serving

`AsyncLLM` (vllm/v1/engine/async_llm.py71-210) implements `EngineClient` (vllm/engine/protocol.py41-130) and is the engine used by the HTTP server. It is asyncio-native and designed for concurrent request handling.

- `AsyncLLM.generate()` is an AsyncGenerator that yields `RequestOutput` objects as tokens are produced.
- It runs a background task (`output_handler`) that continuously pulls `EngineCoreOutputs` from the engine core and routes them to per-request queues.
- It uses an `AsyncMPClient` that communicates with `EngineCoreProc` in a background process via ZMQ.

### LLMEngine — Legacy Compatibility Wrapper

`LLMEngine` (vllm/v1/engine/llm_engine.py48-185) is kept for backward compatibility. It wraps `EngineCoreClient`, `InputProcessor`, and `OutputProcessor` into a synchronous step-based loop. vllm/engine/llm_engine.py is a thin alias to the v1 implementation.
Similarly, vllm/engine/async_llm_engine.py aliases AsyncLLM as AsyncLLMEngine.
### EngineClient Protocol

`EngineClient` (vllm/engine/protocol.py41-130) is the abstract base class that both `LLMEngine` and `AsyncLLM` satisfy. It declares the shared interface: `generate()`, `add_request()`, `abort_requests()`, etc.
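A stripped-down sketch of such a protocol (the method names come from the text above, but the signatures and the `InprocEngineSketch` implementation are hypothetical simplifications, not vLLM's actual API):

```python
import abc

class EngineClientSketch(abc.ABC):
    """Simplified stand-in for an EngineClient-style protocol: the
    surface both the sync and async engines would implement."""

    @abc.abstractmethod
    def generate(self, prompt: str, request_id: str):
        """Submit a prompt and yield outputs for it."""

    @abc.abstractmethod
    def add_request(self, request_id: str, prompt: str) -> None:
        """Enqueue a request without waiting for its outputs."""

    @abc.abstractmethod
    def abort_requests(self, request_ids: list) -> None:
        """Cancel in-flight requests by ID."""

class InprocEngineSketch(EngineClientSketch):
    """Trivial in-process implementation, for illustration only."""
    def __init__(self):
        self.requests = {}

    def generate(self, prompt, request_id):
        self.add_request(request_id, prompt)
        yield f"output-for-{request_id}"  # fake single-step output

    def add_request(self, request_id, prompt):
        self.requests[request_id] = prompt

    def abort_requests(self, request_ids):
        for rid in request_ids:
            self.requests.pop(rid, None)
```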
Sources: vllm/entrypoints/llm.py107-395 vllm/v1/engine/async_llm.py71-265 vllm/v1/engine/llm_engine.py48-185 vllm/engine/protocol.py41-130 vllm/engine/llm_engine.py1-7 vllm/engine/async_llm_engine.py1-7
## EngineCore

`EngineCore` (vllm/v1/engine/core.py83-768) is the inner execution loop. It is responsible for:

- Creating the `Executor` (which spawns worker processes and loads the model).
- Initializing the KV caches via `_initialize_kv_caches()`.
- Creating the `Scheduler`.
- Running the `step()` loop: schedule → execute → update.

Key methods of `EngineCore`:
| Method | Description |
|---|---|
| `_initialize_kv_caches()` | Profiles GPU memory, computes block counts, initializes workers |
| `step()` | Single step: schedule, execute model, update scheduler from output |
| `step_with_batch_queue()` | Pipelined step for pipeline-parallel deployments |
| `add_request()` | Forwards a `Request` to the scheduler |
| `abort_requests()` | Cancels requests by ID |
| `post_step()` | After a step, propagates draft token IDs for speculative decoding |
The `step()` method (vllm/v1/engine/core.py379-408) is the heart of the execution loop:

```python
# Simplified from vllm/v1/engine/core.py
scheduler_output = scheduler.schedule()
future = executor.execute_model(scheduler_output, non_block=True)
grammar_output = scheduler.get_grammar_bitmask(scheduler_output)
model_output = future.result()
if model_output is None:
    model_output = executor.sample_tokens(grammar_output)
engine_core_outputs = scheduler.update_from_output(scheduler_output, model_output)
```
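To see how that loop drives requests to completion, here is a self-contained toy with stub scheduler and executor objects (all names here are hypothetical stand-ins, not vLLM classes) whose `step()` follows the same schedule → execute → update shape:

```python
class ToyScheduler:
    """Stub scheduler: hands out one waiting request per step."""
    def __init__(self, prompts):
        self.waiting = list(prompts)
        self.finished = []

    def schedule(self):
        # The real scheduler builds a whole batch; one request keeps this tiny.
        return self.waiting.pop(0)

    def update_from_output(self, scheduled, output):
        self.finished.append(scheduled)
        return output

class ToyExecutor:
    """Stub executor: 'runs the model' on whatever was scheduled."""
    def execute_model(self, scheduled):
        return f"tokens<{scheduled}>"

def step(scheduler, executor):
    scheduled = scheduler.schedule()                        # 1. schedule
    output = executor.execute_model(scheduled)              # 2. execute
    return scheduler.update_from_output(scheduled, output)  # 3. update

sched, ex = ToyScheduler(["a", "b"]), ToyExecutor()
outputs = []
while sched.waiting:
    outputs.append(step(sched, ex))
print(outputs)  # ['tokens<a>', 'tokens<b>']
```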
## EngineCoreProc

`EngineCoreProc` (vllm/v1/engine/core.py770) wraps `EngineCore` for multiprocess deployments. It:

- Exposes `EngineCore` over ZMQ sockets (an input socket for requests, an output socket for results).
- Uses `MsgpackEncoder`/`MsgpackDecoder` for zero-copy serialization of requests and outputs.
- Sends `ENGINE_CORE_DEAD` sentinel bytes on the output socket if it dies, so clients detect failure.

## EngineCoreClient and Its Implementations

`EngineCoreClient` (vllm/v1/engine/core_client.py66-270) is the abstract client interface that `LLMEngine` and `AsyncLLM` use to communicate with `EngineCore`. Three implementations are used depending on deployment mode:
| Class | Mode | Communication |
|---|---|---|
| `InprocClient` | In-process (no multiprocessing) | Direct Python call |
| `SyncMPClient` | Background process, synchronous | ZMQ PUSH/PULL |
| `AsyncMPClient` | Background process, asyncio | ZMQ asyncio sockets |
EngineCoreClient.make_client() (vllm/v1/engine/core_client.py78-100) selects the appropriate implementation. For data-parallel deployments, DPAsyncMPClient or DPLBAsyncMPClient are used instead.
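The selection logic can be pictured as a small factory. This is a sketch, not the real method: the actual `make_client()` also wires up configuration and the data-parallel variants, and the behavior for the in-process/asyncio combination is an assumption here:

```python
def make_client_sketch(multiprocess_mode: bool, asyncio_mode: bool) -> str:
    """Choose a client implementation the way a make_client-style
    factory would. Returns the class name as a string for illustration."""
    if not multiprocess_mode:
        if asyncio_mode:
            # Assumed unsupported: InprocClient is synchronous per the table.
            raise ValueError("asyncio mode requires multiprocessing")
        return "InprocClient"
    return "AsyncMPClient" if asyncio_mode else "SyncMPClient"
```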
Class relationship diagram:
Sources: vllm/v1/engine/core.py83-768 vllm/v1/engine/core_client.py66-360 vllm/v1/engine/async_llm.py71-265 vllm/v1/engine/llm_engine.py48-185
## Scheduler

`Scheduler` (vllm/v1/core/sched/scheduler.py63-270) runs inside `EngineCore` and manages the request queues and KV cache allocation. It maintains:

- `self.waiting` — a priority queue of requests not yet scheduled.
- `self.running` — a list of requests currently being processed.
- `self.requests` — a dict from `request_id` to `Request` object.

The `schedule()` method (vllm/v1/core/sched/scheduler.py322) produces a `SchedulerOutput` that tells the executor exactly which tokens to process for each request, which KV blocks are allocated, and which encoder inputs to run.
Scheduling proceeds in two passes each step: running requests are scheduled first, then waiting requests are admitted while token budget and KV cache blocks remain.
The update_from_output() method processes ModelRunnerOutput from the executor: appending output tokens to requests, checking stop conditions (EOS, stop tokens, max length), and marking requests as finished.
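A toy version of this bookkeeping (queue structures and stop-condition checks only; scheduling policy, KV accounting, and all class/constant names here are simplified stand-ins):

```python
from dataclasses import dataclass, field

EOS, MAX_TOKENS = -1, 4  # toy stop conditions, not vLLM's real values

@dataclass
class ToyRequest:
    request_id: str
    output_token_ids: list = field(default_factory=list)
    finished: bool = False

class ToySched:
    """Mirrors the waiting/running/requests structures from the text."""
    def __init__(self):
        self.waiting, self.running, self.requests = [], [], {}

    def add_request(self, req):
        self.requests[req.request_id] = req
        self.waiting.append(req)

    def schedule(self):
        # Toy policy: admit everything (the real scheduler respects budgets).
        while self.waiting:
            self.running.append(self.waiting.pop(0))
        return list(self.running)

    def update_from_output(self, sampled):
        """sampled: {request_id: token_id} produced by the executor.
        Append tokens, check stop conditions, retire finished requests."""
        finished = []
        for req in list(self.running):
            tok = sampled[req.request_id]
            req.output_token_ids.append(tok)
            if tok == EOS or len(req.output_token_ids) >= MAX_TOKENS:
                req.finished = True
                self.running.remove(req)
                finished.append(req.request_id)
        return finished
```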
## KVCacheManager

`KVCacheManager` (vllm/v1/core/kv_cache_manager.py94-310) manages the physical KV cache blocks on behalf of the scheduler. Its main operations:
| Method | Description |
|---|---|
| `get_computed_blocks()` | Find prefix-cached blocks for a new request |
| `allocate_slots()` | Allocate new blocks for tokens to be computed |
| `free()` | Release all blocks held by a request |
Internally it delegates to a KVCacheCoordinator which owns per-attention-type SingleTypeKVCacheManager instances. Each SingleTypeKVCacheManager (vllm/v1/core/single_type_kv_cache_manager.py28) handles the block accounting for one class of attention (full attention, sliding window, MLA, Mamba, etc.).
The BlockPool (vllm/v1/core/block_pool.py1) manages the raw KVCacheBlock objects using a doubly-linked FreeKVCacheBlockQueue (vllm/v1/core/kv_cache_utils.py158-310) for O(1) LRU eviction.
For prefix caching, blocks are hashed using hash_block_tokens() and stored in a BlockHashToBlockMap. When a new request arrives, get_computed_blocks() walks the hash chain of its prompt to find the longest matching prefix already in cache.
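The chained-hash idea can be sketched as follows. This is an illustration of the technique, not vLLM's actual `hash_block_tokens` (the hash function, block size, and all names here are assumptions):

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (toy value)

def hash_block(parent_hash: str, block_tokens: tuple) -> str:
    """Chain each block's hash to its parent so the hash identifies the
    entire prefix up to and including this block."""
    payload = f"{parent_hash}:{block_tokens}".encode()
    return hashlib.sha256(payload).hexdigest()

def block_hashes(token_ids: list) -> list:
    """Hash every *full* block of the prompt, chaining parent hashes."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        parent = hash_block(parent, tuple(token_ids[i:i + BLOCK_SIZE]))
        hashes.append(parent)
    return hashes

def longest_cached_prefix(token_ids: list, cache: dict) -> int:
    """Walk the prompt's hash chain, counting tokens covered by blocks
    already in the cache — the idea behind get_computed_blocks()."""
    hit_tokens = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit_tokens += BLOCK_SIZE
    return hit_tokens

cache = {h: object() for h in block_hashes([1, 2, 3, 4, 5, 6, 7, 8])}
print(longest_cached_prefix([1, 2, 3, 4, 9, 9, 9, 9], cache))  # 4
```

Because each hash incorporates its parent, a single dictionary lookup per block suffices: a hit on block *k* implies the entire prefix through block *k* matches.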
Scheduler and KV Cache data flow:
Sources: vllm/v1/core/sched/scheduler.py63-270 vllm/v1/core/sched/scheduler.py322-810 vllm/v1/core/kv_cache_manager.py94-310 vllm/v1/core/kv_cache_utils.py158-310 vllm/v1/core/block_pool.py1-100
## Request and RequestStatus

`Request` (vllm/v1/request.py59-270) is the internal representation of a request inside the engine. It is created from an `EngineCoreRequest` (the serializable wire format) and carries:

- `prompt_token_ids`, `prompt_embeds` — input tokens or embeddings.
- `sampling_params` / `pooling_params`.
- `num_computed_tokens` — how many tokens the model has processed so far.
- `_output_token_ids` — tokens generated so far.
- `mm_features` — multimodal encoder features.
- `block_hashes` — per-block hashes used for prefix caching.
- `spec_token_ids` — speculative decode draft tokens.

`RequestStatus` (vllm/v1/request.py1) is an enum with states:
| Status | Meaning |
|---|---|
| `WAITING` | Queued, not yet scheduled |
| `WAITING_FOR_FSM` | Waiting for grammar/FSM compilation |
| `WAITING_FOR_REMOTE_KVS` | Waiting for P/D KV transfer to complete |
| `RUNNING` | Currently being processed by a worker |
| `PREEMPTED` | Evicted from running due to memory pressure |
| `FINISHED_STOPPED` | Stopped by EOS or stop token |
| `FINISHED_LENGTH_CAPPED` | Reached max_tokens |
| `FINISHED_ABORTED` | Cancelled by client |
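As a sketch, such a state enum pairs naturally with a "finished" predicate (the `is_finished` helper here is hypothetical; only the state names come from the table above):

```python
import enum

class RequestStatusSketch(enum.Enum):
    """Toy mirror of the RequestStatus states listed above."""
    WAITING = enum.auto()
    WAITING_FOR_FSM = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @property
    def is_finished(self) -> bool:
        # A request leaves the engine once it reaches any FINISHED_* state.
        return self.name.startswith("FINISHED_")
```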
Request lifecycle sequence:
Sources: vllm/v1/engine/async_llm.py289-420 vllm/v1/engine/core.py379-408 vllm/v1/engine/output_processor.py109-300 vllm/v1/request.py59-270
## InputProcessor

`InputProcessor` (vllm/v1/engine/input_processor.py) converts user-facing prompt types (`PromptType`, `ProcessorInputs`) into `EngineCoreRequest` objects. Its responsibilities include tokenizing prompts, processing multimodal inputs, and assigning each request a unique `request_id`.

## OutputProcessor

`OutputProcessor` (vllm/v1/engine/output_processor.py1-400) runs on the client side (same process as `AsyncLLM` or `LLMEngine`). It:

- Maintains a `RequestState` per active request, tracking detokenization state and logprobs.
- Converts each `EngineCoreOutput` (sampled token IDs, logprobs) into `RequestOutput` / `CompletionOutput` objects.
- Uses an `IncrementalDetokenizer` to convert token IDs to text incrementally, avoiding a full re-decode on each step.
- Handles the `n > 1` case by fanning out requests via `ParentRequest`.
- Feeds the `RequestOutputCollector` queues used by `AsyncLLM.generate()`.

`RequestOutputCollector` (vllm/v1/engine/output_processor.py45-106) is a lightweight asyncio-compatible queue (one per request) that merges streaming delta outputs when the consumer lags behind the producer.
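The merge-on-lag behavior can be sketched with an `asyncio.Event` and a single pending slot (a toy version under assumptions: deltas are lists of token IDs here, and the class name is invented):

```python
import asyncio

class CollectorSketch:
    """Toy per-request output queue: if the consumer has not drained the
    previous delta yet, the new delta is merged into it instead of
    growing an unbounded queue."""
    def __init__(self):
        self._event = asyncio.Event()
        self._pending = None

    def put(self, delta: list) -> None:
        if self._pending is None:
            self._pending = list(delta)
        else:
            self._pending.extend(delta)  # consumer lagged: merge deltas
        self._event.set()

    async def get(self) -> list:
        await self._event.wait()
        out, self._pending = self._pending, None
        self._event.clear()
        return out

async def demo() -> list:
    c = CollectorSketch()
    c.put([1, 2])
    c.put([3])  # merged with [1, 2] before the consumer reads
    return await c.get()

print(asyncio.run(demo()))  # [1, 2, 3]
```

The consumer sees one merged delta rather than two, which keeps memory bounded no matter how far it falls behind the producer.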
Sources: vllm/v1/engine/output_processor.py45-400 vllm/v1/engine/async_llm.py140-165
When multiprocessing is enabled (the default for online serving), the engine splits into two process groups: a client-side group (the API layer with `InputProcessor` and `OutputProcessor`) and an engine-core group (`EngineCoreProc` with the scheduler and executor).
Communication uses:

- ZMQ sockets between `AsyncMPClient` and `EngineCoreProc`.
- `MsgpackEncoder` / `MsgpackDecoder` (vllm/v1/serial_utils.py) for efficient serialization.
- `ShmRingBuffer` for large tensor payloads (see Communication Infrastructure).

The handshake on startup uses a separate ZMQ socket pair where `EngineCoreProc` sends `EngineHandshakeMetadata` (vllm/v1/engine/utils.py) after initialization completes. The client polls for up to `HANDSHAKE_TIMEOUT_MINS` (5 minutes by default, vllm/v1/engine/core.py78).
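The `ENGINE_CORE_DEAD` sentinel mentioned above gives clients a way to distinguish "engine crashed" from "no output yet". A toy client-side drain loop (the sentinel value comes from the text; the framing, error type, and function are invented for illustration):

```python
ENGINE_CORE_DEAD = b"ENGINE_CORE_DEAD"  # sentinel frame named in the text

class EngineDeadError(RuntimeError):
    """Raised when the engine-core process has died."""

def drain_outputs(frames) -> list:
    """Toy client-side output loop: decode frames until the dead
    sentinel shows up, then surface the failure to callers."""
    results = []
    for frame in frames:
        if frame == ENGINE_CORE_DEAD:
            raise EngineDeadError("engine core process died")
        results.append(frame.decode())
    return results

print(drain_outputs([b"tok-1", b"tok-2"]))  # ['tok-1', 'tok-2']
```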
When InprocClient is used instead (for the synchronous LLM class or when VLLM_ENABLE_V1_MULTIPROCESSING=0), EngineCore runs directly in the same process and step() is called explicitly.
Sources: vllm/v1/engine/core.py770-900 vllm/v1/engine/core_client.py364-450 vllm/v1/engine/core_client.py271-310
When pipeline parallelism is enabled (pipeline_parallel_size > 1), EngineCore uses step_with_batch_queue() instead of step(). This maintains a deque of in-flight (future, scheduler_output, exec_future) tuples, allowing the scheduler to produce the next batch while waiting for the current one to finish — eliminating pipeline bubbles.
The queue size is set by Executor.max_concurrent_batches and the deque maxlen is set accordingly (vllm/v1/engine/core.py188-194).
The `step_fn` attribute on `EngineCore` is set to either `step` or `step_with_batch_queue` at initialization.
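The deque-of-in-flight-batches idea can be sketched with `concurrent.futures` standing in for the executor's futures. Everything here is a simplified stand-in (including `MAX_CONCURRENT_BATCHES`, which plays the role of `Executor.max_concurrent_batches`):

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BATCHES = 2  # stand-in for Executor.max_concurrent_batches

def run_batch(batch):
    """Fake 'execute model' for one batch."""
    return [f"out<{x}>" for x in batch]

def step_with_batch_queue_sketch(batches):
    """Toy pipelined loop: keep up to MAX_CONCURRENT_BATCHES batches in
    flight so submitting the next batch overlaps with executing the
    current one, instead of waiting for each result before scheduling."""
    results = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_BATCHES) as pool:
        for batch in batches:
            if len(in_flight) == MAX_CONCURRENT_BATCHES:
                results.extend(in_flight.popleft().result())  # drain oldest
            in_flight.append(pool.submit(run_batch, batch))
        while in_flight:  # drain the tail
            results.extend(in_flight.popleft().result())
    return results

step_fn = step_with_batch_queue_sketch  # chosen at init when PP > 1
print(step_fn([["a"], ["b"], ["c"]]))  # ['out<a>', 'out<b>', 'out<c>']
```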
Sources: vllm/v1/engine/core.py188-215 vllm/v1/engine/core.py420-535
These components are initialized inside EngineCore and interact with the scheduler and executor:
| Component | Class / Module | Purpose |
|---|---|---|
| Structured output | StructuredOutputManager | Compiles grammars/FSMs, provides token bitmasks |
| Multimodal cache | mm_receiver_cache | Deduplicates multimodal feature objects across requests |
| KV transfer | KVConnectorBase_V1 | P/D disaggregation: transfers KV cache between instances |
| Speculative decode | use_spec_decode, update_draft_token_ids() | Manages draft token IDs from speculative proposers |
| Prefix caching hash | request_block_hasher | Computes per-block hashes for prefix cache lookup |
For details on each component, see its dedicated page.
Sources: vllm/v1/engine/core.py128-215