This page describes the overall architecture of the vLLM v1 inference engine: its layered components, how they communicate, and how a request flows from API submission through GPU execution and back to the caller.
For configuration of these components at startup, see Configuration and Initialization. For details on how model inference is executed on the GPU, see Model Execution. For the HTTP serving layer built on top of this engine, see Serving APIs.
vLLM's engine is organized into four logical layers:
| Layer | Purpose | Key Abstractions |
|---|---|---|
| Client API | Accept and return requests | LLM, AsyncLLM, EngineClient |
| Engine Core | Schedule, execute, coordinate | EngineCore, EngineCoreProc, EngineCoreClient |
| Scheduler & Cache | Batch requests, manage KV cache | Scheduler, KVCacheManager |
| Executor & Workers | Run model on GPU(s) | Executor, worker processes |
The v1 engine uses a process-split architecture: the client-facing layer and the GPU execution loop run in separate processes (or optionally in-process for simpler deployments), communicating via ZMQ sockets.
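The split can be mimicked in miniature with two in-process queues standing in for the ZMQ input and output sockets. This is a toy analogy only — the real engine uses separate processes and ZMQ sockets, not a thread and `queue.Queue` — but it shows the shape of the request/result loop:

```python
import queue
import threading

# Toy stand-in for the process split: a "core" loop reads requests from
# an input channel and writes results to an output channel. In vLLM these
# channels are ZMQ sockets and the core runs in a separate process.
def core_loop(inbox: queue.Queue, outbox: queue.Queue) -> None:
    while True:
        request = inbox.get()
        if request is None:  # shutdown sentinel
            break
        req_id, prompt = request
        outbox.put((req_id, f"generated<{prompt}>"))  # fake "model output"

inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
core = threading.Thread(target=core_loop, args=(inbox, outbox))
core.start()

inbox.put(("req-0", "hello"))
req_id, text = outbox.get()
inbox.put(None)  # tell the core loop to exit
core.join()
print(req_id, text)  # req-0 generated<hello>
```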
Architecture overview diagram:
Sources: vllm/v1/engine/core.py83-230 vllm/v1/engine/async_llm.py71-210 vllm/v1/engine/llm_engine.py48-185 vllm/v1/engine/core_client.py66-130
There are two main entry points into the engine:
### LLM — Synchronous Offline Inference

`LLM` (vllm/entrypoints/llm.py107-392) is intended for offline batch inference. It wraps `LLMEngine` and drives the engine loop synchronously by calling `step()` in a loop until all requests are complete.

- `LLM.generate()` accepts prompts and `SamplingParams`, submits them, and blocks until completion.
- `LLM.enqueue()` / `LLM.wait_for_completion()` allow a producer/consumer split.
- `LLMEngine.from_engine_args()` is used to instantiate the engine.

### AsyncLLM — Asynchronous Online Serving

`AsyncLLM` (vllm/v1/engine/async_llm.py71-210) implements `EngineClient` (vllm/engine/protocol.py41-130) and is the engine used by the HTTP server. It is asyncio-native and designed for concurrent request handling.

- `AsyncLLM.generate()` is an AsyncGenerator that yields `RequestOutput` objects as tokens are produced.
- It runs a background task (`output_handler`) that continuously pulls `EngineCoreOutputs` from the engine core and routes them to per-request queues.
- It uses an `AsyncMPClient` that communicates with `EngineCoreProc` in a background process via ZMQ.

### LLMEngine — Legacy Compatibility Wrapper

`LLMEngine` (vllm/v1/engine/llm_engine.py48-185) is kept for backward compatibility. It wraps `EngineCoreClient`, `InputProcessor`, and `OutputProcessor` into a synchronous step-based loop. vllm/engine/llm_engine.py is a thin alias to the v1 implementation.
Similarly, vllm/engine/async_llm_engine.py aliases AsyncLLM as AsyncLLMEngine.
### EngineClient Protocol

`EngineClient` (vllm/engine/protocol.py41-130) is the abstract base class that both `LLMEngine` and `AsyncLLM` satisfy. It declares the shared interface: `generate()`, `add_request()`, `abort_requests()`, etc.
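A stripped-down sketch of such a protocol (the method names come from the text above, but the signatures and the `InprocEngineSketch` implementation are hypothetical simplifications, not vLLM's actual API):

```python
import abc

class EngineClientSketch(abc.ABC):
    """Simplified stand-in for an EngineClient-style protocol: the
    surface both the sync and async engines would implement."""

    @abc.abstractmethod
    def generate(self, prompt: str, request_id: str):
        """Submit a prompt and yield outputs for it."""

    @abc.abstractmethod
    def add_request(self, request_id: str, prompt: str) -> None:
        """Enqueue a request without waiting for its outputs."""

    @abc.abstractmethod
    def abort_requests(self, request_ids: list) -> None:
        """Cancel in-flight requests by ID."""

class InprocEngineSketch(EngineClientSketch):
    """Trivial in-process implementation, for illustration only."""
    def __init__(self):
        self.requests = {}

    def generate(self, prompt, request_id):
        self.add_request(request_id, prompt)
        yield f"output-for-{request_id}"  # fake single-step output

    def add_request(self, request_id, prompt):
        self.requests[request_id] = prompt

    def abort_requests(self, request_ids):
        for rid in request_ids:
            self.requests.pop(rid, None)
```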
Sources: vllm/entrypoints/llm.py107-395 vllm/v1/engine/async_llm.py71-265 vllm/v1/engine/llm_engine.py48-185 vllm/engine/protocol.py41-130 vllm/engine/llm_engine.py1-7 vllm/engine/async_llm_engine.py1-7
## EngineCore

`EngineCore` (vllm/v1/engine/core.py83-768) is the inner execution loop. It is responsible for:

- Creating the `Executor` (which spawns worker processes and loads the model).
- Initializing the KV caches via `_initialize_kv_caches()`.
- Creating the `Scheduler`.
- Running the `step()` loop: schedule → execute → update.

Key methods of `EngineCore`:
| Method | Description |
|---|---|
| `_initialize_kv_caches()` | Profiles GPU memory, computes block counts, initializes workers |
| `step()` | Single step: schedule, execute model, update scheduler from output |
| `step_with_batch_queue()` | Pipelined step for pipeline-parallel deployments |
| `add_request()` | Forwards a `Request` to the scheduler |
| `abort_requests()` | Cancels requests by ID |
| `post_step()` | After a step, propagates draft token IDs for speculative decoding |
The `step()` method (vllm/v1/engine/core.py379-408) is the heart of the execution loop:

```python
# Simplified from vllm/v1/engine/core.py
scheduler_output = scheduler.schedule()
future = executor.execute_model(scheduler_output, non_block=True)
grammar_output = scheduler.get_grammar_bitmask(scheduler_output)
model_output = future.result()
if model_output is None:
    model_output = executor.sample_tokens(grammar_output)
engine_core_outputs = scheduler.update_from_output(scheduler_output, model_output)
```
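To see how that loop drives requests to completion, here is a self-contained toy with stub scheduler and executor objects (all names here are hypothetical stand-ins, not vLLM classes) whose `step()` follows the same schedule → execute → update shape:

```python
class ToyScheduler:
    """Stub scheduler: hands out one waiting request per step."""
    def __init__(self, prompts):
        self.waiting = list(prompts)
        self.finished = []

    def schedule(self):
        # The real scheduler builds a whole batch; one request keeps this tiny.
        return self.waiting.pop(0)

    def update_from_output(self, scheduled, output):
        self.finished.append(scheduled)
        return output

class ToyExecutor:
    """Stub executor: 'runs the model' on whatever was scheduled."""
    def execute_model(self, scheduled):
        return f"tokens<{scheduled}>"

def step(scheduler, executor):
    scheduled = scheduler.schedule()                        # 1. schedule
    output = executor.execute_model(scheduled)              # 2. execute
    return scheduler.update_from_output(scheduled, output)  # 3. update

sched, ex = ToyScheduler(["a", "b"]), ToyExecutor()
outputs = []
while sched.waiting:
    outputs.append(step(sched, ex))
print(outputs)  # ['tokens<a>', 'tokens<b>']
```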
## EngineCoreProc

`EngineCoreProc` (vllm/v1/engine/core.py770) wraps `EngineCore` for multiprocess deployments. It:

- Exposes `EngineCore` over ZMQ sockets (an input socket for requests, an output socket for results).
- Uses `MsgpackEncoder`/`MsgpackDecoder` for zero-copy serialization of requests and outputs.
- Sends `ENGINE_CORE_DEAD` sentinel bytes on the output socket if it dies, so clients detect failure.

## EngineCoreClient and Its Implementations

`EngineCoreClient` (vllm/v1/engine/core_client.py66-270) is the abstract client interface that `LLMEngine` and `AsyncLLM` use to communicate with `EngineCore`. Three implementations are used depending on deployment mode:
| Class | Mode | Communication |
|---|---|---|
| `InprocClient` | In-process (no multiprocessing) | Direct Python call |
| `SyncMPClient` | Background process, synchronous | ZMQ PUSH/PULL |
| `AsyncMPClient` | Background process, asyncio | ZMQ asyncio sockets |
EngineCoreClient.make_client() (vllm/v1/engine/core_client.py78-100) selects the appropriate implementation. For data-parallel deployments, DPAsyncMPClient or DPLBAsyncMPClient are used instead.
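The selection logic can be pictured as a small factory. This is a sketch, not the real method: the actual `make_client()` also wires up configuration and the data-parallel variants, and the behavior for the in-process/asyncio combination is an assumption here:

```python
def make_client_sketch(multiprocess_mode: bool, asyncio_mode: bool) -> str:
    """Choose a client implementation the way a make_client-style
    factory would. Returns the class name as a string for illustration."""
    if not multiprocess_mode:
        if asyncio_mode:
            # Assumed unsupported: InprocClient is synchronous per the table.
            raise ValueError("asyncio mode requires multiprocessing")
        return "InprocClient"
    return "AsyncMPClient" if asyncio_mode else "SyncMPClient"
```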
Class relationship diagram:
Sources: vllm/v1/engine/core.py83-768 vllm/v1/engine/core_client.py66-360 vllm/v1/engine/async_llm.py71-265 vllm/v1/engine/llm_engine.py48-185
## Scheduler

`Scheduler` (vllm/v1/core/sched/scheduler.py63-270) runs inside `EngineCore` and manages the request queues and KV cache allocation. It maintains:

- `self.waiting` — a priority queue of requests not yet scheduled.
- `self.running` — a list of requests currently being processed.
- `self.requests` — a dict from `request_id` to `Request` object.

The `schedule()` method (vllm/v1/core/sched/scheduler.py322) produces a `SchedulerOutput` that tells the executor exactly which tokens to process for each request, which KV blocks are allocated, and which encoder inputs to run.
Scheduling proceeds in two passes each step: running requests are scheduled first, then waiting requests are admitted while token budget and KV cache blocks remain.
The update_from_output() method processes ModelRunnerOutput from the executor: appending output tokens to requests, checking stop conditions (EOS, stop tokens, max length), and marking requests as finished.
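A toy version of this bookkeeping (queue structures and stop-condition checks only; scheduling policy, KV accounting, and all class/constant names here are simplified stand-ins):

```python
from dataclasses import dataclass, field

EOS, MAX_TOKENS = -1, 4  # toy stop conditions, not vLLM's real values

@dataclass
class ToyRequest:
    request_id: str
    output_token_ids: list = field(default_factory=list)
    finished: bool = False

class ToySched:
    """Mirrors the waiting/running/requests structures from the text."""
    def __init__(self):
        self.waiting, self.running, self.requests = [], [], {}

    def add_request(self, req):
        self.requests[req.request_id] = req
        self.waiting.append(req)

    def schedule(self):
        # Toy policy: admit everything (the real scheduler respects budgets).
        while self.waiting:
            self.running.append(self.waiting.pop(0))
        return list(self.running)

    def update_from_output(self, sampled):
        """sampled: {request_id: token_id} produced by the executor.
        Append tokens, check stop conditions, retire finished requests."""
        finished = []
        for req in list(self.running):
            tok = sampled[req.request_id]
            req.output_token_ids.append(tok)
            if tok == EOS or len(req.output_token_ids) >= MAX_TOKENS:
                req.finished = True
                self.running.remove(req)
                finished.append(req.request_id)
        return finished
```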
## KVCacheManager

`KVCacheManager` (vllm/v1/core/kv_cache_manager.py94-310) manages the physical KV cache blocks on behalf of the scheduler. Its main operations:
| Method | Description |
|---|---|
| `get_computed_blocks()` | Find prefix-cached blocks for a new request |
| `allocate_slots()` | Allocate new blocks for tokens to be computed |
| `free()` | Release all blocks held by a request |
Internally it delegates to a KVCacheCoordinator which owns per-attention-type SingleTypeKVCacheManager instances. Each SingleTypeKVCacheManager (vllm/v1/core/single_type_kv_cache_manager.py28) handles the block accounting for one class of attention (full attention, sliding window, MLA, Mamba, etc.).
The BlockPool (vllm/v1/core/block_pool.py1) manages the raw KVCacheBlock objects using a doubly-linked FreeKVCacheBlockQueue (vllm/v1/core/kv_cache_utils.py158-310) for O(1) LRU eviction.
For prefix caching, blocks are hashed using hash_block_tokens() and stored in a BlockHashToBlockMap. When a new request arrives, get_computed_blocks() walks the hash chain of its prompt to find the longest matching prefix already in cache.
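The chained-hash idea can be sketched as follows. This is an illustration of the technique, not vLLM's actual `hash_block_tokens` (the hash function, block size, and all names here are assumptions):

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (toy value)

def hash_block(parent_hash: str, block_tokens: tuple) -> str:
    """Chain each block's hash to its parent so the hash identifies the
    entire prefix up to and including this block."""
    payload = f"{parent_hash}:{block_tokens}".encode()
    return hashlib.sha256(payload).hexdigest()

def block_hashes(token_ids: list) -> list:
    """Hash every *full* block of the prompt, chaining parent hashes."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        parent = hash_block(parent, tuple(token_ids[i:i + BLOCK_SIZE]))
        hashes.append(parent)
    return hashes

def longest_cached_prefix(token_ids: list, cache: dict) -> int:
    """Walk the prompt's hash chain, counting tokens covered by blocks
    already in the cache — the idea behind get_computed_blocks()."""
    hit_tokens = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit_tokens += BLOCK_SIZE
    return hit_tokens

cache = {h: object() for h in block_hashes([1, 2, 3, 4, 5, 6, 7, 8])}
print(longest_cached_prefix([1, 2, 3, 4, 9, 9, 9, 9], cache))  # 4
```

Because each hash incorporates its parent, a single dictionary lookup per block suffices: a hit on block *k* implies the entire prefix through block *k* matches.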
Scheduler and KV Cache data flow:
Sources: vllm/v1/core/sched/scheduler.py63-270 vllm/v1/core/sched/scheduler.py322-810 vllm/v1/core/kv_cache_manager.py94-310 vllm/v1/core/kv_cache_utils.py158-310 vllm/v1/core/block_pool.py1-100
## Request and RequestStatus

`Request` (vllm/v1/request.py59-270) is the internal representation of a request inside the engine. It is created from an `EngineCoreRequest` (the serializable wire format) and carries:

- `prompt_token_ids`, `prompt_embeds` — input tokens or embeddings.
- `sampling_params` / `pooling_params`.
- `num_computed_tokens` — how many tokens the model has processed so far.
- `_output_token_ids` — tokens generated so far.
- `mm_features` — multimodal encoder features.
- `block_hashes` — per-block hashes used for prefix caching.
- `spec_token_ids` — speculative decode draft tokens.

`RequestStatus` (vllm/v1/request.py1) is an enum with states:
| Status | Meaning |
|---|---|
| `WAITING` | Queued, not yet scheduled |
| `WAITING_FOR_FSM` | Waiting for grammar/FSM compilation |
| `WAITING_FOR_REMOTE_KVS` | Waiting for P/D KV transfer to complete |
| `RUNNING` | Currently being processed by a worker |
| `PREEMPTED` | Evicted from running due to memory pressure |
| `FINISHED_STOPPED` | Stopped by EOS or stop token |
| `FINISHED_LENGTH_CAPPED` | Reached max_tokens |
| `FINISHED_ABORTED` | Cancelled by client |
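As a sketch, such a state enum pairs naturally with a "finished" predicate (the `is_finished` helper here is hypothetical; only the state names come from the table above):

```python
import enum

class RequestStatusSketch(enum.Enum):
    """Toy mirror of the RequestStatus states listed above."""
    WAITING = enum.auto()
    WAITING_FOR_FSM = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @property
    def is_finished(self) -> bool:
        # A request leaves the engine once it reaches any FINISHED_* state.
        return self.name.startswith("FINISHED_")
```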
Request lifecycle sequence:
Sources: vllm/v1/engine/async_llm.py289-420 vllm/v1/engine/core.py379-408 vllm/v1/engine/output_processor.py109-300 vllm/v1/request.py59-270
## InputProcessor

`InputProcessor` (vllm/v1/engine/input_processor.py) converts user-facing prompt types (`PromptType`, `ProcessorInputs`) into `EngineCoreRequest` objects. Its responsibilities include tokenizing prompts, processing multimodal inputs, and assigning each request a unique `request_id`.

## OutputProcessor

`OutputProcessor` (vllm/v1/engine/output_processor.py1-400) runs on the client side (same process as `AsyncLLM` or `LLMEngine`). It:

- Maintains a `RequestState` per active request, tracking detokenization state and logprobs.
- Converts each `EngineCoreOutput` (sampled token IDs, logprobs) into `RequestOutput` / `CompletionOutput` objects.
- Uses an `IncrementalDetokenizer` to convert token IDs to text incrementally, avoiding a full re-decode on each step.
- Handles the `n > 1` case by fanning out requests via `ParentRequest`.
- Feeds the `RequestOutputCollector` queues used by `AsyncLLM.generate()`.

`RequestOutputCollector` (vllm/v1/engine/output_processor.py45-106) is a lightweight asyncio-compatible queue (one per request) that merges streaming delta outputs when the consumer lags behind the producer.
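The merge-on-lag behavior can be sketched with an `asyncio.Event` and a single pending slot (a toy version under assumptions: deltas are lists of token IDs here, and the class name is invented):

```python
import asyncio

class CollectorSketch:
    """Toy per-request output queue: if the consumer has not drained the
    previous delta yet, the new delta is merged into it instead of
    growing an unbounded queue."""
    def __init__(self):
        self._event = asyncio.Event()
        self._pending = None

    def put(self, delta: list) -> None:
        if self._pending is None:
            self._pending = list(delta)
        else:
            self._pending.extend(delta)  # consumer lagged: merge deltas
        self._event.set()

    async def get(self) -> list:
        await self._event.wait()
        out, self._pending = self._pending, None
        self._event.clear()
        return out

async def demo() -> list:
    c = CollectorSketch()
    c.put([1, 2])
    c.put([3])  # merged with [1, 2] before the consumer reads
    return await c.get()

print(asyncio.run(demo()))  # [1, 2, 3]
```

The consumer sees one merged delta rather than two, which keeps memory bounded no matter how far it falls behind the producer.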
Sources: vllm/v1/engine/output_processor.py45-400 vllm/v1/engine/async_llm.py140-165
When multiprocessing is enabled (the default for online serving), the engine splits into two process groups: a client-side group (the API layer with `InputProcessor` and `OutputProcessor`) and an engine-core group (`EngineCoreProc` with the scheduler and executor).
Communication uses:

- ZMQ sockets between `AsyncMPClient` and `EngineCoreProc`.
- `MsgpackEncoder` / `MsgpackDecoder` (vllm/v1/serial_utils.py) for efficient serialization.
- `ShmRingBuffer` for large tensor payloads (see Communication Infrastructure).

The handshake on startup uses a separate ZMQ socket pair where `EngineCoreProc` sends `EngineHandshakeMetadata` (vllm/v1/engine/utils.py) after initialization completes. The client polls for up to `HANDSHAKE_TIMEOUT_MINS` (5 minutes by default, vllm/v1/engine/core.py78).
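The `ENGINE_CORE_DEAD` sentinel mentioned above gives clients a way to distinguish "engine crashed" from "no output yet". A toy client-side drain loop (the sentinel value comes from the text; the framing, error type, and function are invented for illustration):

```python
ENGINE_CORE_DEAD = b"ENGINE_CORE_DEAD"  # sentinel frame named in the text

class EngineDeadError(RuntimeError):
    """Raised when the engine-core process has died."""

def drain_outputs(frames) -> list:
    """Toy client-side output loop: decode frames until the dead
    sentinel shows up, then surface the failure to callers."""
    results = []
    for frame in frames:
        if frame == ENGINE_CORE_DEAD:
            raise EngineDeadError("engine core process died")
        results.append(frame.decode())
    return results

print(drain_outputs([b"tok-1", b"tok-2"]))  # ['tok-1', 'tok-2']
```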
When InprocClient is used instead (for the synchronous LLM class or when VLLM_ENABLE_V1_MULTIPROCESSING=0), EngineCore runs directly in the same process and step() is called explicitly.
Sources: vllm/v1/engine/core.py770-900 vllm/v1/engine/core_client.py364-450 vllm/v1/engine/core_client.py271-310
When pipeline parallelism is enabled (pipeline_parallel_size > 1), EngineCore uses step_with_batch_queue() instead of step(). This maintains a deque of in-flight (future, scheduler_output, exec_future) tuples, allowing the scheduler to produce the next batch while waiting for the current one to finish — eliminating pipeline bubbles.
The queue size is set by Executor.max_concurrent_batches and the deque maxlen is set accordingly (vllm/v1/engine/core.py188-194).
The `step_fn` attribute on `EngineCore` is set to either `step` or `step_with_batch_queue` at initialization.
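The deque-of-in-flight-batches idea can be sketched with `concurrent.futures` standing in for the executor's futures. Everything here is a simplified stand-in (including `MAX_CONCURRENT_BATCHES`, which plays the role of `Executor.max_concurrent_batches`):

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BATCHES = 2  # stand-in for Executor.max_concurrent_batches

def run_batch(batch):
    """Fake 'execute model' for one batch."""
    return [f"out<{x}>" for x in batch]

def step_with_batch_queue_sketch(batches):
    """Toy pipelined loop: keep up to MAX_CONCURRENT_BATCHES batches in
    flight so submitting the next batch overlaps with executing the
    current one, instead of waiting for each result before scheduling."""
    results = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_BATCHES) as pool:
        for batch in batches:
            if len(in_flight) == MAX_CONCURRENT_BATCHES:
                results.extend(in_flight.popleft().result())  # drain oldest
            in_flight.append(pool.submit(run_batch, batch))
        while in_flight:  # drain the tail
            results.extend(in_flight.popleft().result())
    return results

step_fn = step_with_batch_queue_sketch  # chosen at init when PP > 1
print(step_fn([["a"], ["b"], ["c"]]))  # ['out<a>', 'out<b>', 'out<c>']
```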
Sources: vllm/v1/engine/core.py188-215 vllm/v1/engine/core.py420-535
These components are initialized inside EngineCore and interact with the scheduler and executor:
| Component | Class / Module | Purpose |
|---|---|---|
| Structured output | StructuredOutputManager | Compiles grammars/FSMs, provides token bitmasks |
| Multimodal cache | mm_receiver_cache | Deduplicates multimodal feature objects across requests |
| KV transfer | KVConnectorBase_V1 | P/D disaggregation: transfers KV cache between instances |
| Speculative decode | use_spec_decode, update_draft_token_ids() | Manages draft token IDs from speculative proposers |
| Prefix caching hash | request_block_hasher | Computes per-block hashes for prefix cache lookup |
For details on each component, see its dedicated page.
Sources: vllm/v1/engine/core.py128-215