This page covers how vLLM assembles a batch of active inference requests into GPU tensors for each forward pass. It describes the InputBatch, CachedRequestState, InputBuffers, BlockTable, and related data structures that bridge the scheduler's logical view of requests and the raw GPU tensors consumed by the model.
For the scheduler that produces the SchedulerOutput consumed here, see 3.3. For the GPU model runner that receives these tensors and executes the forward pass, see 4.1. For KV cache block allocation, see 3.4.
Input batch management sits between the scheduler and the model forward pass. On each step, the scheduler emits a SchedulerOutput describing which requests are active and how many tokens each should process. The batch management layer translates this into packed GPU tensors.
Input Batch Management: Position in the Execution Pipeline
Sources: vllm/v1/worker/gpu_model_runner.py537-581 vllm/v1/worker/gpu_input_batch.py81-265
There are two parallel implementations of input batch management, corresponding to two generations of the GPU model runner:
| File | Class(es) | Used by |
|---|---|---|
| vllm/v1/worker/gpu_input_batch.py | InputBatch, CachedRequestState | vllm/v1/worker/gpu_model_runner.py |
| vllm/v1/worker/gpu/input_batch.py | InputBuffers, InputBatch (dataclass) | vllm/v1/worker/gpu/model_runner.py |
The legacy implementation uses a stateful, mutable InputBatch that persists across steps and tracks all active requests. The newer implementation separates persistent buffers (InputBuffers) from a per-step snapshot dataclass (InputBatch).
CachedRequestState is a dataclass defined in vllm/v1/worker/gpu_input_batch.py29-79 that stores the complete mutable state for a single request on the model runner side.
CachedRequestState
├── req_id: str
├── prompt_token_ids: list[int] | None
├── mm_features: list[MultiModalFeatureSpec]
├── sampling_params: SamplingParams | None
├── generator: torch.Generator | None
├── block_ids: tuple[list[int], ...] # per KV-cache group
├── num_computed_tokens: int
├── output_token_ids: list[int]
├── mrope_positions: torch.Tensor | None # Qwen2-VL and similar
├── xdrope_positions: torch.Tensor | None
├── lora_request: LoRARequest | None
├── prompt_embeds: torch.Tensor | None
├── prev_num_draft_len: int # spec decode
├── pooling_params: PoolingParams | None
└── pooling_states: PoolingStates | None
num_tokens (property) returns num_prompt_tokens + len(output_token_ids). get_token_id(idx) retrieves a token by absolute position from either prompt or output, raising an error if the token was provided via prompt_embeds.
GPUModelRunner keeps a dict of all known requests: self.requests: dict[str, CachedRequestState] (vllm/v1/worker/gpu_model_runner.py538). This is separate from InputBatch, which only contains the currently scheduled subset.
Sources: vllm/v1/worker/gpu_input_batch.py29-79 vllm/v1/worker/gpu_model_runner.py537-542
InputBatch (gpu_input_batch.py)
InputBatch is a persistent object initialized once in GPUModelRunner.__init__ (vllm/v1/worker/gpu_model_runner.py557-581). It tracks up to max_num_reqs scheduled requests and holds pre-allocated CPU and GPU buffers for every attribute needed by sampling and attention.
Constructor parameters:
| Parameter | Description |
|---|---|
| max_num_reqs | Maximum concurrent scheduled requests |
| max_model_len | Maximum sequence length (sets token buffer width) |
| max_num_batched_tokens | Maximum tokens per batch |
| device | Target GPU device |
| pin_memory | Whether to pin CPU tensors for fast H2D copy |
| vocab_size | Vocabulary size (for allowed token masks) |
| block_sizes | Block size per KV cache group |
| kernel_block_sizes | Kernel-facing block size per group |
Most sampling parameters exist in two forms: a CPU numpy array (for fast Python-side updates) and a GPU tensor (for the model forward pass). The GPU tensor is updated from the CPU version before execution.
CPU/GPU Buffer Pairs in InputBatch
For sampling parameters, sets like greedy_reqs, random_reqs, top_p_reqs, top_k_reqs, frequency_penalties_reqs, etc. track which requests have non-default values, enabling fast short-circuit paths.
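The CPU/GPU pairing can be illustrated with a small sketch. Here numpy stands in for both sides (in vLLM the device side is a torch.Tensor on the GPU, and the upload is an H2D copy); the class name and methods are illustrative, not vLLM API.

```python
import numpy as np

class BufferPair:
    """Toy model of a paired CPU array + device tensor for one sampling param."""

    def __init__(self, max_num_reqs: int):
        self.cpu = np.zeros(max_num_reqs, dtype=np.float32)  # fast Python-side updates
        self.gpu = np.zeros(max_num_reqs, dtype=np.float32)  # stand-in for the device tensor

    def set(self, req_index: int, value: float) -> None:
        # Cheap per-request update; no device traffic yet.
        self.cpu[req_index] = value

    def upload(self, num_reqs: int) -> None:
        # One bulk copy of the active region before the forward pass.
        self.gpu[:num_reqs] = self.cpu[:num_reqs]

temps = BufferPair(max_num_reqs=8)
temps.set(0, 0.7)
temps.set(1, 1.0)
temps.upload(num_reqs=2)
```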
Sources: vllm/v1/worker/gpu_input_batch.py153-270
Token IDs are stored in a 2D CPU tensor token_ids_cpu_tensor of shape (max_num_reqs, max_model_len) backed by a numpy array token_ids_cpu. Each row i contains all tokens for request at index i: first the prompt tokens, then the output tokens.
A parallel boolean array is_token_ids marks whether each position was a real token ID (True) or a prompt embedding position (False). When a request uses prompt_embeds, the corresponding rows in token_ids_cpu are not populated, and req_prompt_embeds (a dict from req_index to tensor) is used instead.
For speculative decoding, draft token IDs are written into the token_ids_cpu positions beyond the confirmed output tokens: spec_token_ids is a per-request list of pending draft tokens, and num_tokens_no_spec tracks the position boundary.
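The row layout described above (prompt, then outputs, then pending drafts) can be shown with plain numpy at toy sizes; the real buffer is (max_num_reqs, max_model_len) and the token values here are arbitrary.

```python
import numpy as np

max_num_reqs, max_model_len = 4, 16
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int32)

req_index = 0
prompt = [11, 12, 13]
outputs = [21, 22]
drafts = [31, 32]  # speculative tokens, not yet verified

# Prompt tokens first, then confirmed outputs.
token_ids_cpu[req_index, :len(prompt)] = prompt
token_ids_cpu[req_index, len(prompt):len(prompt) + len(outputs)] = outputs

# Draft tokens are written past the confirmed boundary.
num_tokens_no_spec = len(prompt) + len(outputs)
token_ids_cpu[req_index, num_tokens_no_spec:num_tokens_no_spec + len(drafts)] = drafts

print(token_ids_cpu[0, :7].tolist())  # [11, 12, 13, 21, 22, 31, 32]
```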
Sources: vllm/v1/worker/gpu_input_batch.py111-131 vllm/v1/worker/gpu_input_batch.py304-342
Adding a request (add_request):
1. Calls _register_add_request, which either fills a recently vacated index or appends at num_reqs.
2. Copies prompt_token_ids and output_token_ids into token_ids_cpu.
3. Writes sampling parameters into the CPU arrays (temperature_cpu, top_p_cpu, etc.).
4. Updates req_id_to_index, greedy_reqs/random_reqs, and the penalty sets.
5. Calls block_table.add_row() with the request's KV cache block IDs.
6. Updates request_lora_mapping and lora_id_to_request_ids.

Removing a request (remove_request):
Marks the index as empty by setting _req_ids[req_index] = None and recording it in batch_update_builder.removed. The actual slot is not zeroed immediately; it is reclaimed in the next condense() call.
Condensing (condense):
Slides non-empty requests from higher indices down into the empty slots left by removed requests. This maintains a compact, contiguous active region. Condensation calls swap_states() for each pair of slots that need to move, which swaps every array row, dict entry, and the block table row.
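A toy model of this compaction, assuming each empty slot (lowest first) is filled by swapping in the request from the highest occupied index; the real swap_states() moves every per-request array row, dict entry, and block table row, not just one list element.

```python
def condense(slots: list) -> list:
    """Fill holes (None entries) from the highest occupied indices."""
    empty = [i for i, s in enumerate(slots) if s is None]
    last = len(slots) - 1
    for hole in empty:
        # Skip trailing holes to find the highest live request.
        while last >= 0 and slots[last] is None:
            last -= 1
        if last <= hole:
            break  # everything at or below `hole` is already compact
        # swap_states() analogue: move the live entry down, vacate the top.
        slots[hole], slots[last] = slots[last], None
    return slots

print(condense(["a", None, "c", None, "e"]))  # ['a', 'e', 'c', None, None]
```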
Request Lifecycle Diagram
Sources: vllm/v1/worker/gpu_input_batch.py278-650
SamplingMetadata is assembled from InputBatch fields by _make_sampling_metadata(). It is not rebuilt every step — only when the batch composition changes (tracked via batch_update_builder.batch_changed). SamplingMetadata is then passed to the Sampler during the output phase.
Key fields propagated from InputBatch to SamplingMetadata:
| InputBatch field | SamplingMetadata field |
|---|---|
| temperature, greedy_reqs, random_reqs | temperature, all_greedy, all_random |
| top_p | top_p |
| top_k | top_k |
| frequency_penalties | frequency_penalties |
| presence_penalties | presence_penalties |
| repetition_penalties | repetition_penalties |
| generators | generators |
| num_logprobs | max_num_logprobs |
| allowed_token_ids_mask | allowed_token_ids_mask |
| bad_words_token_ids | bad_words_token_ids |
| spec_token_ids | spec_token_ids |
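The rebuild-only-on-change pattern can be sketched as below. This is a deliberately minimal stand-in tracking just the greedy/random sets and a change flag; the real builder covers all the fields in the table above.

```python
class BatchSketch:
    """Toy batch that caches its sampling metadata across steps."""

    def __init__(self):
        self.greedy_reqs: set = set()
        self.random_reqs: set = set()
        self.batch_changed = True  # set by adds/removes/swaps
        self._metadata = None
        self.rebuild_count = 0

    def make_sampling_metadata(self) -> dict:
        if not self.batch_changed and self._metadata is not None:
            return self._metadata  # composition unchanged: reuse
        self.rebuild_count += 1
        n = len(self.greedy_reqs) + len(self.random_reqs)
        self._metadata = {
            # Set-based short-circuits: all-greedy / all-random fast paths.
            "all_greedy": n > 0 and not self.random_reqs,
            "all_random": n > 0 and not self.greedy_reqs,
        }
        self.batch_changed = False
        return self._metadata

batch = BatchSketch()
batch.greedy_reqs.add("req-0")
meta = batch.make_sampling_metadata()
meta2 = batch.make_sampling_metadata()  # cached: no rebuild
```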
Sources: vllm/v1/worker/gpu_input_batch.py720-780 vllm/v1/sample/metadata.py
BlockTable and MultiGroupBlockTable
BlockTable (vllm/v1/worker/block_table.py16) tracks which physical KV cache blocks each request occupies. Each row i of block_table.block_table stores the block IDs for request i.
BlockTable
├── block_table: CpuGpuBuffer [max_num_reqs, max_num_blocks_per_req] int32
├── num_blocks_per_row: np.ndarray [max_num_reqs]
├── block_size: int (kernel-facing block size)
├── blocks_per_kv_block: int (ratio when hybrid block sizes used)
└── use_hybrid_blocks: bool
MultiGroupBlockTable (vllm/v1/worker/block_table.py) wraps a list of BlockTable instances, one per KV cache group. This supports models where different attention layers use different block types (e.g., full attention plus sliding window attention).
Hybrid Block Sizes: When the memory allocator uses a different block size than the attention kernel (e.g., 32-token memory blocks, 16-token kernel blocks), each memory block is mapped to blocks_per_kv_block kernel blocks. The BlockTable transparently handles this remapping.
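One plausible form of this remapping, assuming each memory block is split into blocks_per_kv_block contiguous kernel-sized sub-blocks (the exact ID scheme in BlockTable may differ; this sketch only illustrates the ratio arithmetic):

```python
def expand_blocks(memory_block_ids: list[int], blocks_per_kv_block: int) -> list[int]:
    """Map memory-allocator block IDs to kernel-facing block IDs (assumed scheme)."""
    kernel_ids = []
    for mem_id in memory_block_ids:
        base = mem_id * blocks_per_kv_block
        kernel_ids.extend(range(base, base + blocks_per_kv_block))
    return kernel_ids

# 32-token memory blocks, 16-token kernel blocks -> ratio 2:
print(expand_blocks([3, 7], blocks_per_kv_block=2))  # [6, 7, 14, 15]
```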
From the block table, slot mappings are derived for each token: the flat position within the KV cache where that token's key/value vectors should be written. The method get_slot_mapping() computes a 1D tensor of shape [num_scheduled_tokens] giving the absolute slot index per token.
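The slot arithmetic is simple to state: the token at absolute position p lives in block block_ids[p // block_size] at offset p % block_size, so its flat slot is block_id * block_size + offset. A numpy sketch (function name and shapes illustrative, not the vLLM signature):

```python
import numpy as np

def slot_mapping(block_ids: list[int], positions: np.ndarray, block_size: int) -> np.ndarray:
    """Flat KV cache slot index for each scheduled token position."""
    block_ids_arr = np.asarray(block_ids)
    blocks = block_ids_arr[positions // block_size]   # physical block per token
    return blocks * block_size + positions % block_size

# Request occupies physical blocks 5 and 9 with 16-token blocks; this step
# schedules tokens at positions 0..3 and 16..17.
slots = slot_mapping([5, 9], np.array([0, 1, 2, 3, 16, 17]), block_size=16)
print(slots.tolist())  # [80, 81, 82, 83, 144, 145]
```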
Sources: vllm/v1/worker/block_table.py16-200
InputBuffers and InputBatch Dataclass
The refactored model runner (vllm/v1/worker/gpu/model_runner.py) separates persistent GPU buffers from the per-step batch descriptor.
InputBuffers
InputBuffers (vllm/v1/worker/gpu/input_batch.py12-33) holds pre-allocated GPU tensors that are reused every step:
| Field | Shape | Dtype | Purpose |
|---|---|---|---|
| input_ids | [max_num_tokens] | int32 | Token IDs for all positions |
| positions | [max_num_tokens] | int64 | Absolute position indices |
| query_start_loc | [max_num_reqs + 1] | int32 | Cumulative query lengths |
| seq_lens | [max_num_reqs] | int32 | Full sequence lengths |
| dcp_local_seq_lens | [max_num_reqs] | int32 | DCP-local sequence lengths |
These are written in-place during prepare_inputs() and sliced to the actual batch size before passing to the model.
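The allocate-once, write-in-place, slice-to-size pattern looks roughly like this (numpy stands in for GPU tensors; class and method names are illustrative):

```python
import numpy as np

class Buffers:
    """Persistent buffers sized for the maximum batch, reused every step."""

    def __init__(self, max_num_tokens: int, max_num_reqs: int):
        self.input_ids = np.zeros(max_num_tokens, dtype=np.int32)
        self.query_start_loc = np.zeros(max_num_reqs + 1, dtype=np.int32)

    def prepare(self, token_ids_per_req: list) -> np.ndarray:
        # Pack each request's scheduled tokens contiguously, recording
        # cumulative offsets as we go.
        offset = 0
        for i, toks in enumerate(token_ids_per_req):
            self.input_ids[offset:offset + len(toks)] = toks
            offset += len(toks)
            self.query_start_loc[i + 1] = offset
        return self.input_ids[:offset]  # sliced view of the active prefix, no copy

bufs = Buffers(max_num_tokens=64, max_num_reqs=8)
packed = bufs.prepare([[1, 2, 3], [4, 5]])
print(packed.tolist())                     # [1, 2, 3, 4, 5]
print(bufs.query_start_loc[:3].tolist())   # [0, 3, 5]
```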
InputBatch Dataclass
InputBatch in vllm/v1/worker/gpu/input_batch.py is a lightweight, immutable dataclass assembled once per step by prepare_inputs():
| Field | Description |
|---|---|
| req_ids | Ordered list of request IDs for this batch |
| num_reqs | Number of requests |
| idx_mapping | [num_reqs] GPU tensor mapping batch index → RequestState index |
| idx_mapping_np | CPU numpy version of idx_mapping |
| expanded_idx_mapping | For spec decode: repeated per draft token |
| expanded_local_pos | Position of each logit within its request |
| num_scheduled_tokens | Per-request token count (numpy) |
| num_tokens | Total scheduled tokens |
| num_tokens_after_padding | Total tokens after CUDA graph padding |
| num_draft_tokens | Total draft tokens for spec decode |
| query_start_loc | [num_reqs + 1] GPU tensor of cumulative token counts |
| seq_lens | [num_reqs] GPU tensor of full sequence lengths |
| input_ids | [num_tokens_after_padding] GPU tensor |
| positions | [num_tokens_after_padding] GPU tensor |
| logits_indices | [total_num_logits] tensor of positions to sample from |
| cu_num_logits | [num_reqs + 1] cumulative logit counts |
| has_structured_output_reqs | Whether any request uses structured output |
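A worked example of the cumulative fields, assuming three requests (one prefill chunk of 3 tokens plus two single-token decodes, no speculative tokens):

```python
import numpy as np

num_scheduled_tokens = np.array([3, 1, 1], dtype=np.int32)
query_start_loc = np.zeros(len(num_scheduled_tokens) + 1, dtype=np.int32)
query_start_loc[1:] = np.cumsum(num_scheduled_tokens)
num_tokens = int(query_start_loc[-1])

# Without spec decode, logits come from each request's last scheduled token.
logits_indices = query_start_loc[1:] - 1

print(query_start_loc.tolist())  # [0, 3, 4, 5]
print(logits_indices.tolist())   # [2, 3, 4]
```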
Sources: vllm/v1/worker/gpu/input_batch.py12-143 vllm/v1/worker/gpu/model_runner.py185-189 vllm/v1/worker/gpu/model_runner.py563-695
RequestState
RequestState (vllm/v1/worker/gpu/states.py9) is the equivalent of the legacy dict of CachedRequestState objects, but centralized into a single object with pre-allocated GPU/UVA tensors:
| Tensor | Shape | Description |
|---|---|---|
| all_token_ids | [max_num_reqs, max_model_len] | All tokens (prompt + output) per request, stored in UVA memory |
| total_len | [max_num_reqs] | Total token count per request |
| num_computed_tokens | [max_num_reqs] | Number of tokens already KV-computed |
| last_sampled_tokens | [max_num_reqs, 1] | Most recent sampled token per request |
| draft_tokens | [max_num_reqs, num_spec_steps] | Pending draft tokens |
| prefill_len | [max_num_reqs] | Prefill length (≥ prompt_len) |
Storing all_token_ids in UVA (Unified Virtual Addressing) memory avoids dedicating GPU memory to the full token history: the buffer lives in host memory but remains addressable by GPU kernels when they read token IDs during prefill input preparation.
Sources: vllm/v1/worker/gpu/states.py9-150
Step-by-step data flow (legacy model runner)
Sources: vllm/v1/worker/gpu_model_runner.py899-1050
Data flow (new model runner)
Sources: vllm/v1/worker/gpu/model_runner.py500-695
Both model runner implementations use Triton kernels for performance-critical input preparation:
| Kernel | File | Purpose |
|---|---|---|
| _prepare_prefill_inputs_kernel | gpu/input_batch.py | Writes prefill token IDs into input_ids by reading from all_token_ids |
| _prepare_pos_seq_lens_kernel | gpu/input_batch.py | Writes positions and seq_lens in one pass |
| _combine_sampled_and_draft_tokens_kernel | gpu/input_batch.py | Inserts last sampled token and draft tokens into input_ids, computes logits_indices |
The _combine_sampled_and_draft_tokens_kernel is particularly important: for decode steps it writes the previous step's sampled token at the correct position, and for speculative decoding it writes the draft tokens. It also computes logits_indices — the flat positions in the output hidden states from which to extract logits for sampling.
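A pure-Python stand-in for what that combine step produces, assuming two decode requests where each contributes its last sampled token followed by its drafts (the real kernel writes directly into the persistent input_ids buffer on the GPU):

```python
last_sampled = [42, 99]
drafts = [[7, 8], []]  # request 1 has no draft tokens this step

input_ids, logits_indices, offset = [], [], 0
for sampled, dr in zip(last_sampled, drafts):
    toks = [sampled] + dr
    input_ids.extend(toks)
    # Logits are needed at every position of this request's segment so the
    # sampler can verify each draft token (or just sample, if no drafts).
    logits_indices.extend(range(offset, offset + len(toks)))
    offset += len(toks)

print(input_ids)       # [42, 7, 8, 99]
print(logits_indices)  # [0, 1, 2, 3]
```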
Sources: vllm/v1/worker/gpu/input_batch.py146-355
Class Relationships and Key Buffer Dimensions
R = max_num_reqs, L = max_model_len, V = vocab_size, B = max_num_blocks_per_req
Sources: vllm/v1/worker/gpu_input_batch.py81-265 vllm/v1/worker/block_table.py16-100
Persistent batch object: InputBatch is allocated once and mutated across steps rather than reallocated. This avoids repeated memory allocation overhead and makes incremental updates (add one request, remove another) cheap.
Pinned CPU memory: CPU tensors use pin_memory=True when available, enabling asynchronous non-blocking H2D copies via DMA engines rather than blocking CPU-GPU transfers.
Condense on remove: When requests finish, their slots become gaps. condense() compacts the batch by sliding remaining requests down. This keeps the active region contiguous, which is required for the packed tensor format consumed by attention kernels.
Lazy GPU upload: CPU arrays are updated immediately on add_request / remove_request, but GPU tensors are only written before the forward pass in _prepare_inputs(). This batches all H2D copies into a single phase.
batch_update_builder tracking: Changes (adds, removes, swaps) are recorded via BatchUpdateBuilder and used to generate differential updates to the LogitsProcessors state rather than rebuilding it from scratch.
Sources: vllm/v1/worker/gpu_input_batch.py232-265 vllm/v1/worker/gpu_model_runner.py544-581