This page covers how vLLM assembles a batch of active inference requests into GPU tensors for each forward pass. It describes the InputBatch, CachedRequestState, InputBuffers, BlockTable, and related data structures that bridge the scheduler's logical view of requests and the raw GPU tensors consumed by the model.
For the scheduler that produces the SchedulerOutput consumed here, see 3.3. For the GPU model runner that receives these tensors and executes the forward pass, see 4.1. For KV cache block allocation, see 3.4.
Input batch management sits between the scheduler and the model forward pass. On each step, the scheduler emits a SchedulerOutput describing which requests are active and how many tokens each should process. The batch management layer translates this into packed GPU tensors.
Input Batch Management: Position in the Execution Pipeline
Sources: vllm/v1/worker/gpu_model_runner.py537-581 vllm/v1/worker/gpu_input_batch.py81-265
There are two parallel implementations of input batch management, corresponding to two generations of the GPU model runner:
| File | Class(es) | Used by |
|---|---|---|
| vllm/v1/worker/gpu_input_batch.py | InputBatch, CachedRequestState | vllm/v1/worker/gpu_model_runner.py |
| vllm/v1/worker/gpu/input_batch.py | InputBuffers, InputBatch (dataclass) | vllm/v1/worker/gpu/model_runner.py |
The legacy implementation uses a stateful, mutable InputBatch that persists across steps and tracks all active requests. The newer implementation separates persistent buffers (InputBuffers) from a per-step snapshot dataclass (InputBatch).
CachedRequestState is a dataclass defined in vllm/v1/worker/gpu_input_batch.py29-79 that stores the complete mutable state for a single request on the model runner side.
CachedRequestState
├── req_id: str
├── prompt_token_ids: list[int] | None
├── mm_features: list[MultiModalFeatureSpec]
├── sampling_params: SamplingParams | None
├── generator: torch.Generator | None
├── block_ids: tuple[list[int], ...] # per KV-cache group
├── num_computed_tokens: int
├── output_token_ids: list[int]
├── mrope_positions: torch.Tensor | None # Qwen2-VL and similar
├── xdrope_positions: torch.Tensor | None
├── lora_request: LoRARequest | None
├── prompt_embeds: torch.Tensor | None
├── prev_num_draft_len: int # spec decode
├── pooling_params: PoolingParams | None
└── pooling_states: PoolingStates | None
num_tokens (property) returns num_prompt_tokens + len(output_token_ids). get_token_id(idx) retrieves a token by absolute position from either prompt or output, raising an error if the token was provided via prompt_embeds.
GPUModelRunner keeps a dict of all known requests: self.requests: dict[str, CachedRequestState] (vllm/v1/worker/gpu_model_runner.py538). This is separate from InputBatch, which only contains the currently scheduled subset.
Sources: vllm/v1/worker/gpu_input_batch.py29-79 vllm/v1/worker/gpu_model_runner.py537-542
InputBatch (gpu_input_batch.py)
InputBatch is a persistent object initialized once in GPUModelRunner.__init__ (vllm/v1/worker/gpu_model_runner.py557-581). It tracks up to max_num_reqs scheduled requests and holds pre-allocated CPU and GPU buffers for every attribute needed by sampling and attention.
Constructor parameters:
| Parameter | Description |
|---|---|
| max_num_reqs | Maximum concurrent scheduled requests |
| max_model_len | Maximum sequence length (sets token buffer width) |
| max_num_batched_tokens | Maximum tokens per batch |
| device | Target GPU device |
| pin_memory | Whether to pin CPU tensors for fast H2D copy |
| vocab_size | Vocabulary size (for allowed token masks) |
| block_sizes | Block size per KV cache group |
| kernel_block_sizes | Kernel-facing block size per group |
Most sampling parameters exist in two forms: a CPU numpy array (for fast Python-side updates) and a GPU tensor (for the model forward pass). The GPU tensor is updated from the CPU version before execution.
CPU/GPU Buffer Pairs in InputBatch
For sampling parameters, sets like greedy_reqs, random_reqs, top_p_reqs, top_k_reqs, frequency_penalties_reqs, etc. track which requests have non-default values, enabling fast short-circuit paths.
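The CPU/GPU pairing can be illustrated with a small sketch. Here numpy stands in for both sides (in vLLM the device side is a torch.Tensor on the GPU, and the upload is an H2D copy); the class name and methods are illustrative, not vLLM API.

```python
import numpy as np

class BufferPair:
    """Toy model of a paired CPU array + device tensor for one sampling param."""

    def __init__(self, max_num_reqs: int):
        self.cpu = np.zeros(max_num_reqs, dtype=np.float32)  # fast Python-side updates
        self.gpu = np.zeros(max_num_reqs, dtype=np.float32)  # stand-in for the device tensor

    def set(self, req_index: int, value: float) -> None:
        # Cheap per-request update; no device traffic yet.
        self.cpu[req_index] = value

    def upload(self, num_reqs: int) -> None:
        # One bulk copy of the active region before the forward pass.
        self.gpu[:num_reqs] = self.cpu[:num_reqs]

temps = BufferPair(max_num_reqs=8)
temps.set(0, 0.7)
temps.set(1, 1.0)
temps.upload(num_reqs=2)
```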
Sources: vllm/v1/worker/gpu_input_batch.py153-270
Token IDs are stored in a 2D CPU tensor token_ids_cpu_tensor of shape (max_num_reqs, max_model_len) backed by a numpy array token_ids_cpu. Each row i contains all tokens for request at index i: first the prompt tokens, then the output tokens.
A parallel boolean array is_token_ids marks whether each position was a real token ID (True) or a prompt embedding position (False). When a request uses prompt_embeds, the corresponding rows in token_ids_cpu are not populated, and req_prompt_embeds (a dict from req_index to tensor) is used instead.
For speculative decoding, draft token IDs are written into the token_ids_cpu positions beyond the confirmed output tokens: spec_token_ids is a per-request list of pending draft tokens, and num_tokens_no_spec tracks the position boundary.
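The row layout described above (prompt, then outputs, then pending drafts) can be shown with plain numpy at toy sizes; the real buffer is (max_num_reqs, max_model_len) and the token values here are arbitrary.

```python
import numpy as np

max_num_reqs, max_model_len = 4, 16
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int32)

req_index = 0
prompt = [11, 12, 13]
outputs = [21, 22]
drafts = [31, 32]  # speculative tokens, not yet verified

# Prompt tokens first, then confirmed outputs.
token_ids_cpu[req_index, :len(prompt)] = prompt
token_ids_cpu[req_index, len(prompt):len(prompt) + len(outputs)] = outputs

# Draft tokens are written past the confirmed boundary.
num_tokens_no_spec = len(prompt) + len(outputs)
token_ids_cpu[req_index, num_tokens_no_spec:num_tokens_no_spec + len(drafts)] = drafts

print(token_ids_cpu[0, :7].tolist())  # [11, 12, 13, 21, 22, 31, 32]
```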
Sources: vllm/v1/worker/gpu_input_batch.py111-131 vllm/v1/worker/gpu_input_batch.py304-342
Adding a request (add_request):
1. Calls _register_add_request, which either fills a recently vacated index or appends at num_reqs.
2. Copies prompt_token_ids and output_token_ids into token_ids_cpu.
3. Writes sampling parameters into the CPU arrays (temperature_cpu, top_p_cpu, etc.).
4. Updates req_id_to_index, greedy_reqs/random_reqs, and the penalty sets.
5. Calls block_table.add_row() with the request's KV cache block IDs.
6. Updates request_lora_mapping and lora_id_to_request_ids.

Removing a request (remove_request):
Marks the index as empty by setting _req_ids[req_index] = None and recording it in batch_update_builder.removed. The actual slot is not zeroed immediately; it is reclaimed in the next condense() call.
Condensing (condense):
Slides non-empty requests from higher indices down into the empty slots left by removed requests. This maintains a compact, contiguous active region. Condensation calls swap_states() for each pair of slots that need to move, which swaps every array row, dict entry, and the block table row.
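A toy model of this compaction, assuming each empty slot (lowest first) is filled by swapping in the request from the highest occupied index; the real swap_states() moves every per-request array row, dict entry, and block table row, not just one list element.

```python
def condense(slots: list) -> list:
    """Fill holes (None entries) from the highest occupied indices."""
    empty = [i for i, s in enumerate(slots) if s is None]
    last = len(slots) - 1
    for hole in empty:
        # Skip trailing holes to find the highest live request.
        while last >= 0 and slots[last] is None:
            last -= 1
        if last <= hole:
            break  # everything at or below `hole` is already compact
        # swap_states() analogue: move the live entry down, vacate the top.
        slots[hole], slots[last] = slots[last], None
    return slots

print(condense(["a", None, "c", None, "e"]))  # ['a', 'e', 'c', None, None]
```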
Request Lifecycle Diagram
Sources: vllm/v1/worker/gpu_input_batch.py278-650
SamplingMetadata is assembled from InputBatch fields by _make_sampling_metadata(). It is not rebuilt every step — only when the batch composition changes (tracked via batch_update_builder.batch_changed). SamplingMetadata is then passed to the Sampler during the output phase.
Key fields propagated from InputBatch to SamplingMetadata:
| InputBatch field | SamplingMetadata field |
|---|---|
| temperature, greedy_reqs, random_reqs | temperature, all_greedy, all_random |
| top_p | top_p |
| top_k | top_k |
| frequency_penalties | frequency_penalties |
| presence_penalties | presence_penalties |
| repetition_penalties | repetition_penalties |
| generators | generators |
| num_logprobs | max_num_logprobs |
| allowed_token_ids_mask | allowed_token_ids_mask |
| bad_words_token_ids | bad_words_token_ids |
| spec_token_ids | spec_token_ids |
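The rebuild-only-on-change pattern can be sketched as below. This is a deliberately minimal stand-in tracking just the greedy/random sets and a change flag; the real builder covers all the fields in the table above.

```python
class BatchSketch:
    """Toy batch that caches its sampling metadata across steps."""

    def __init__(self):
        self.greedy_reqs: set = set()
        self.random_reqs: set = set()
        self.batch_changed = True  # set by adds/removes/swaps
        self._metadata = None
        self.rebuild_count = 0

    def make_sampling_metadata(self) -> dict:
        if not self.batch_changed and self._metadata is not None:
            return self._metadata  # composition unchanged: reuse
        self.rebuild_count += 1
        n = len(self.greedy_reqs) + len(self.random_reqs)
        self._metadata = {
            # Set-based short-circuits: all-greedy / all-random fast paths.
            "all_greedy": n > 0 and not self.random_reqs,
            "all_random": n > 0 and not self.greedy_reqs,
        }
        self.batch_changed = False
        return self._metadata

batch = BatchSketch()
batch.greedy_reqs.add("req-0")
meta = batch.make_sampling_metadata()
meta2 = batch.make_sampling_metadata()  # cached: no rebuild
```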
Sources: vllm/v1/worker/gpu_input_batch.py720-780 vllm/v1/sample/metadata.py
BlockTable and MultiGroupBlockTable
BlockTable (vllm/v1/worker/block_table.py16) tracks which physical KV cache blocks each request occupies. Each row i of block_table.block_table stores the block IDs for request i.
BlockTable
├── block_table: CpuGpuBuffer [max_num_reqs, max_num_blocks_per_req] int32
├── num_blocks_per_row: np.ndarray [max_num_reqs]
├── block_size: int (kernel-facing block size)
├── blocks_per_kv_block: int (ratio when hybrid block sizes used)
└── use_hybrid_blocks: bool
MultiGroupBlockTable (vllm/v1/worker/block_table.py) wraps a list of BlockTable instances, one per KV cache group. This supports models where different attention layers use different block types (e.g., full attention plus sliding window attention).
Hybrid Block Sizes: When the memory allocator uses a different block size than the attention kernel (e.g., 32-token memory blocks, 16-token kernel blocks), each memory block is mapped to blocks_per_kv_block kernel blocks. The BlockTable transparently handles this remapping.
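One plausible form of this remapping, assuming each memory block is split into blocks_per_kv_block contiguous kernel-sized sub-blocks (the exact ID scheme in BlockTable may differ; this sketch only illustrates the ratio arithmetic):

```python
def expand_blocks(memory_block_ids: list[int], blocks_per_kv_block: int) -> list[int]:
    """Map memory-allocator block IDs to kernel-facing block IDs (assumed scheme)."""
    kernel_ids = []
    for mem_id in memory_block_ids:
        base = mem_id * blocks_per_kv_block
        kernel_ids.extend(range(base, base + blocks_per_kv_block))
    return kernel_ids

# 32-token memory blocks, 16-token kernel blocks -> ratio 2:
print(expand_blocks([3, 7], blocks_per_kv_block=2))  # [6, 7, 14, 15]
```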
From the block table, slot mappings are derived for each token: the flat position within the KV cache where that token's key/value vectors should be written. The method get_slot_mapping() computes a 1D tensor of shape [num_scheduled_tokens] giving the absolute slot index per token.
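The slot arithmetic is simple to state: the token at absolute position p lives in block block_ids[p // block_size] at offset p % block_size, so its flat slot is block_id * block_size + offset. A numpy sketch (function name and shapes illustrative, not the vLLM signature):

```python
import numpy as np

def slot_mapping(block_ids: list[int], positions: np.ndarray, block_size: int) -> np.ndarray:
    """Flat KV cache slot index for each scheduled token position."""
    block_ids_arr = np.asarray(block_ids)
    blocks = block_ids_arr[positions // block_size]   # physical block per token
    return blocks * block_size + positions % block_size

# Request occupies physical blocks 5 and 9 with 16-token blocks; this step
# schedules tokens at positions 0..3 and 16..17.
slots = slot_mapping([5, 9], np.array([0, 1, 2, 3, 16, 17]), block_size=16)
print(slots.tolist())  # [80, 81, 82, 83, 144, 145]
```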
Sources: vllm/v1/worker/block_table.py16-200
InputBuffers and InputBatch Dataclass
The refactored model runner (vllm/v1/worker/gpu/model_runner.py) separates persistent GPU buffers from the per-step batch descriptor.
InputBuffers
InputBuffers (vllm/v1/worker/gpu/input_batch.py12-33) holds pre-allocated GPU tensors that are reused every step:
| Field | Shape | Dtype | Purpose |
|---|---|---|---|
| input_ids | [max_num_tokens] | int32 | Token IDs for all positions |
| positions | [max_num_tokens] | int64 | Absolute position indices |
| query_start_loc | [max_num_reqs + 1] | int32 | Cumulative query lengths |
| seq_lens | [max_num_reqs] | int32 | Full sequence lengths |
| dcp_local_seq_lens | [max_num_reqs] | int32 | DCP-local sequence lengths |
These are written in-place during prepare_inputs() and sliced to the actual batch size before passing to the model.
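The allocate-once, write-in-place, slice-to-size pattern looks roughly like this (numpy stands in for GPU tensors; class and method names are illustrative):

```python
import numpy as np

class Buffers:
    """Persistent buffers sized for the maximum batch, reused every step."""

    def __init__(self, max_num_tokens: int, max_num_reqs: int):
        self.input_ids = np.zeros(max_num_tokens, dtype=np.int32)
        self.query_start_loc = np.zeros(max_num_reqs + 1, dtype=np.int32)

    def prepare(self, token_ids_per_req: list) -> np.ndarray:
        # Pack each request's scheduled tokens contiguously, recording
        # cumulative offsets as we go.
        offset = 0
        for i, toks in enumerate(token_ids_per_req):
            self.input_ids[offset:offset + len(toks)] = toks
            offset += len(toks)
            self.query_start_loc[i + 1] = offset
        return self.input_ids[:offset]  # sliced view of the active prefix, no copy

bufs = Buffers(max_num_tokens=64, max_num_reqs=8)
packed = bufs.prepare([[1, 2, 3], [4, 5]])
print(packed.tolist())                     # [1, 2, 3, 4, 5]
print(bufs.query_start_loc[:3].tolist())   # [0, 3, 5]
```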
InputBatch Dataclass
InputBatch in vllm/v1/worker/gpu/input_batch.py is a lightweight, immutable dataclass assembled once per step by prepare_inputs():
| Field | Description |
|---|---|
| req_ids | Ordered list of request IDs for this batch |
| num_reqs | Number of requests |
| idx_mapping | [num_reqs] GPU tensor mapping batch index → RequestState index |
| idx_mapping_np | CPU numpy version of idx_mapping |
| expanded_idx_mapping | For spec decode: repeated per draft token |
| expanded_local_pos | Position of each logit within its request |
| num_scheduled_tokens | Per-request token count (numpy) |
| num_tokens | Total scheduled tokens |
| num_tokens_after_padding | Total tokens after CUDA graph padding |
| num_draft_tokens | Total draft tokens for spec decode |
| query_start_loc | [num_reqs + 1] GPU tensor of cumulative token counts |
| seq_lens | [num_reqs] GPU tensor of full sequence lengths |
| input_ids | [num_tokens_after_padding] GPU tensor |
| positions | [num_tokens_after_padding] GPU tensor |
| logits_indices | [total_num_logits] tensor of positions to sample from |
| cu_num_logits | [num_reqs + 1] cumulative logit counts |
| has_structured_output_reqs | Whether any request uses structured output |
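A worked example of the cumulative fields, assuming three requests (one prefill chunk of 3 tokens plus two single-token decodes, no speculative tokens):

```python
import numpy as np

num_scheduled_tokens = np.array([3, 1, 1], dtype=np.int32)
query_start_loc = np.zeros(len(num_scheduled_tokens) + 1, dtype=np.int32)
query_start_loc[1:] = np.cumsum(num_scheduled_tokens)
num_tokens = int(query_start_loc[-1])

# Without spec decode, logits come from each request's last scheduled token.
logits_indices = query_start_loc[1:] - 1

print(query_start_loc.tolist())  # [0, 3, 4, 5]
print(logits_indices.tolist())   # [2, 3, 4]
```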
Sources: vllm/v1/worker/gpu/input_batch.py12-143 vllm/v1/worker/gpu/model_runner.py185-189 vllm/v1/worker/gpu/model_runner.py563-695
RequestState
RequestState (vllm/v1/worker/gpu/states.py9) is the equivalent of the legacy dict of CachedRequestState objects, but centralized into a single object with pre-allocated GPU/UVA tensors:
| Tensor | Shape | Description |
|---|---|---|
| all_token_ids | [max_num_reqs, max_model_len] | All tokens (prompt + output) per request, stored in UVA memory |
| total_len | [max_num_reqs] | Total token count per request |
| num_computed_tokens | [max_num_reqs] | Number of tokens already KV-computed |
| last_sampled_tokens | [max_num_reqs, 1] | Most recent sampled token per request |
| draft_tokens | [max_num_reqs, num_spec_steps] | Pending draft tokens |
| prefill_len | [max_num_reqs] | Prefill length (≥ prompt_len) |
Storing all_token_ids in UVA (Unified Virtual Addressing) memory avoids dedicating GPU memory to the full token history: the buffer lives in host memory but remains addressable by GPU kernels when they read token IDs during prefill input preparation.
Sources: vllm/v1/worker/gpu/states.py9-150
Step-by-step data flow (legacy model runner)
Sources: vllm/v1/worker/gpu_model_runner.py899-1050
Data flow (new model runner)
Sources: vllm/v1/worker/gpu/model_runner.py500-695
Both model runner implementations use Triton kernels for performance-critical input preparation:
| Kernel | File | Purpose |
|---|---|---|
| _prepare_prefill_inputs_kernel | gpu/input_batch.py | Writes prefill token IDs into input_ids by reading from all_token_ids |
| _prepare_pos_seq_lens_kernel | gpu/input_batch.py | Writes positions and seq_lens in one pass |
| _combine_sampled_and_draft_tokens_kernel | gpu/input_batch.py | Inserts last sampled token and draft tokens into input_ids, computes logits_indices |
The _combine_sampled_and_draft_tokens_kernel is particularly important: for decode steps it writes the previous step's sampled token at the correct position, and for speculative decoding it writes the draft tokens. It also computes logits_indices — the flat positions in the output hidden states from which to extract logits for sampling.
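A pure-Python stand-in for what that combine step produces, assuming two decode requests where each contributes its last sampled token followed by its drafts (the real kernel writes directly into the persistent input_ids buffer on the GPU):

```python
last_sampled = [42, 99]
drafts = [[7, 8], []]  # request 1 has no draft tokens this step

input_ids, logits_indices, offset = [], [], 0
for sampled, dr in zip(last_sampled, drafts):
    toks = [sampled] + dr
    input_ids.extend(toks)
    # Logits are needed at every position of this request's segment so the
    # sampler can verify each draft token (or just sample, if no drafts).
    logits_indices.extend(range(offset, offset + len(toks)))
    offset += len(toks)

print(input_ids)       # [42, 7, 8, 99]
print(logits_indices)  # [0, 1, 2, 3]
```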
Sources: vllm/v1/worker/gpu/input_batch.py146-355
Class Relationships and Key Buffer Dimensions
R = max_num_reqs, L = max_model_len, V = vocab_size, B = max_num_blocks_per_req
Sources: vllm/v1/worker/gpu_input_batch.py81-265 vllm/v1/worker/block_table.py16-100
Persistent batch object: InputBatch is allocated once and mutated across steps rather than reallocated. This avoids repeated memory allocation overhead and makes incremental updates (add one request, remove another) cheap.
Pinned CPU memory: CPU tensors use pin_memory=True when available, enabling asynchronous non-blocking H2D copies via DMA engines rather than blocking CPU-GPU transfers.
Condense on remove: When requests finish, their slots become gaps. condense() compacts the batch by sliding remaining requests down. This keeps the active region contiguous, which is required for the packed tensor format consumed by attention kernels.
Lazy GPU upload: CPU arrays are updated immediately on add_request / remove_request, but GPU tensors are only written before the forward pass in _prepare_inputs(). This batches all H2D copies into a single phase.
batch_update_builder tracking: Changes (adds, removes, swaps) are recorded via BatchUpdateBuilder and used to generate differential updates to the LogitsProcessors state rather than rebuilding it from scratch.
Sources: vllm/v1/worker/gpu_input_batch.py232-265 vllm/v1/worker/gpu_model_runner.py544-581