This page documents how vLLM transforms raw model outputs — sampled token IDs and log probabilities emitted by EngineCore — into the RequestOutput and CompletionOutput objects returned to callers. The pipeline covered here runs in the client process (not the engine worker process), and handles incremental detokenization, stop-string detection, log probability accumulation, streaming control, and parallel-sampling aggregation.
For documentation on how tokens are sampled by the GPU model runner, see page 4.4. For how the scheduler produces EngineCoreOutput objects in the first place, see page 3.3.
Every iteration the engine produces an EngineCoreOutputs batch. The client receives it and feeds it through the OutputProcessor, which maintains per-request state and drives detokenization and logprob accumulation. The assembled RequestOutput objects are either returned synchronously (offline LLM) or pushed into per-request RequestOutputCollector queues (online AsyncLLM).
Figure: End-to-end output processing pipeline
Sources: vllm/v1/engine/output_processor.py413-445 vllm/v1/engine/llm_engine.py294-332 vllm/v1/engine/async_llm.py660-700
These types are defined in vllm/v1/engine/__init__.py and are serialized over ZMQ using msgspec.
| Class | Description |
|---|---|
| EngineCoreOutput | One request's output for a single step: token IDs, logprobs, finish reason |
| EngineCoreOutputs | Batch of EngineCoreOutput objects plus scheduler stats and timestamp |
| FinishReason | IntEnum: STOP=0, LENGTH=1, ABORT=2, ERROR=3 |
EngineCoreOutput fields (vllm/v1/engine/__init__.py150-181):
| Field | Type | Meaning |
|---|---|---|
| request_id | str | Internal request identifier |
| new_token_ids | list[int] | Tokens generated in this step |
| new_logprobs | LogprobsLists \| None | Per-token sample logprobs |
| new_prompt_logprobs_tensors | LogprobsTensors \| None | Prompt-token logprobs |
| pooling_output | torch.Tensor \| None | Embedding for pooling models |
| finish_reason | FinishReason \| None | Non-None when request is complete |
| stop_reason | int \| str \| None | Stop token ID or string |
| num_cached_tokens | int | Prefix cache hit count |
| routed_experts | np.ndarray \| None | MoE routing info |
Defined in vllm/outputs.py.
RequestOutput — top-level object returned to callers:
- request_id, prompt, prompt_token_ids, prompt_logprobs
- outputs: list[CompletionOutput] — one per completion sequence
- finished: bool
- metrics: RequestStateStats | None
- num_cached_tokens: int | None

CompletionOutput — a single generated sequence:

- index: int — position within n parallel completions
- text: str — detokenized output
- token_ids: Sequence[int]
- cumulative_logprob: float | None
- logprobs: SampleLogprobs | None
- finish_reason: str | None
- stop_reason: int | str | None

PoolingRequestOutput — used for embedding/pooling tasks, wraps a PoolingOutput containing the raw tensor.
Sources: vllm/outputs.py22-200 vllm/v1/engine/__init__.py150-228
OutputProcessor (vllm/v1/engine/output_processor.py413) is instantiated once per engine client (both LLMEngine and AsyncLLM) and owns all live request state.
OutputProcessor
├── request_states: dict[str, RequestState]
├── parent_requests: dict[str, ParentRequest] # for n > 1
├── external_req_ids: defaultdict[str, list[str]]
├── lora_states: LoRARequestStates
├── tokenizer: TokenizerLike | None
└── stream_interval: int
Key methods:
| Method | Purpose |
|---|---|
| add_request(request, prompt, parent_req, index, queue) | Registers a new request, creates its RequestState |
| process_outputs(outputs, timestamp, iteration_stats) | Main per-step processing loop |
| abort_requests(request_ids, internal) | Removes request states, optionally notifies queues |
| get_num_unfinished_requests() | Used by engine loop termination check |
| propagate_error(e) | Pushes an exception to all active queues |
process_outputs() returns an OutputProcessorOutput dataclass with:
- request_outputs: list[RequestOutput | PoolingRequestOutput]
- reqs_to_abort: list[str] — requests that hit a client-side stop string and must be cancelled at EngineCore

Sources: vllm/v1/engine/output_processor.py109-113 vllm/v1/engine/output_processor.py413-460
RequestState (vllm/v1/engine/output_processor.py129-410) holds the complete mutable state for one in-flight request.
Figure: RequestState class composition
RequestState.from_new_request() is the factory that wires up the detokenizer and logprobs processor from the incoming EngineCoreRequest.
Sources: vllm/v1/engine/output_processor.py129-268
Detokenization is performed incrementally token-by-token to support streaming. The IncrementalDetokenizer base class and its two implementations are in vllm/v1/engine/detokenizer.py.
Figure: Detokenizer selection
Sources: vllm/v1/engine/detokenizer.py30-65
Uses the tokenizers library's DecodeStream (vllm/v1/engine/detokenizer.py169-220):
- Primed with prompt_token_ids at construction (native prefill)
- decode_next(token_id) steps the stream
- Honors spaces_between_special_tokens by tracking added-token IDs

The slow implementation falls back to detokenize_incrementally() from vllm/tokenizers/detokenizer_utils.py and maintains a prefix-aware decode buffer.
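The core idea behind incremental detokenization, stripped of tokenizer details, is to re-decode a buffer and surface only the newly produced characters. The following is a toy sketch using a hypothetical fixed vocabulary, not the tokenizers library:

```python
# Sketch of the decode-buffer idea behind incremental detokenization,
# using a toy vocabulary (hypothetical; real code uses a tokenizer).

TOY_VOCAB = {0: "Hel", 1: "lo", 2: " wor", 3: "ld"}

class ToyIncrementalDetokenizer:
    def __init__(self):
        self.token_ids: list[int] = []
        self.emitted = 0  # number of characters already surfaced

    def decode_next(self, token_id: int) -> str:
        """Append one token and return only the newly produced text."""
        self.token_ids.append(token_id)
        # Re-decode the whole buffer; a real implementation keeps a
        # prefix offset so it only re-decodes a small suffix.
        full = "".join(TOY_VOCAB[t] for t in self.token_ids)
        delta = full[self.emitted:]
        self.emitted = len(full)
        return delta
```

Because multi-byte and merged tokens can change earlier text, real implementations must decode with context rather than token-by-token in isolation; this sketch sidesteps that by always re-decoding the full buffer.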
The update() method on BaseIncrementalDetokenizer (vllm/v1/engine/detokenizer.py97-144) performs stop-string matching after detokenizing new tokens:
- Appends the newly detokenized text to output_text
- Runs check_stop_strings() on the new suffix of output_text
- On a match, truncates output_text (unless include_stop_str_in_output=True)
- Returns the matched stop string (or None) to the caller

When a stop string is detected client-side, the request ID is added to reqs_to_abort in OutputProcessorOutput, and the LLMEngine or AsyncLLM sends an abort signal back to EngineCore.
A stop_buffer_length is maintained to hold back the last max(len(s) for s in stop) - 1 characters during streaming, so that partial stop strings are not prematurely emitted.
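The holdback logic can be sketched as a small generator (hypothetical helper names; this is not vLLM's check_stop_strings implementation):

```python
# Sketch of the stop-string holdback described above. Holds back
# max(len(s)) - 1 characters so a partial stop string is never emitted.

def stream_with_holdback(chunks, stop_strings):
    """Yield text chunks, holding back enough trailing characters to
    avoid emitting a partial stop string; truncate at the first match."""
    buffer_len = max(len(s) for s in stop_strings) - 1
    pending = ""
    for chunk in chunks:
        pending += chunk
        for stop in stop_strings:
            idx = pending.find(stop)
            if idx != -1:
                yield pending[:idx]  # emit text before the stop string
                return
        if len(pending) > buffer_len:
            # Safe to emit everything except the last buffer_len chars.
            yield pending[:-buffer_len] if buffer_len else pending
            pending = pending[-buffer_len:] if buffer_len else ""
    yield pending  # flush the holdback on natural end of generation
```

For example, with stop string "STOP" split across two chunks, no fragment of "STOP" ever reaches the consumer.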
Sources: vllm/v1/engine/detokenizer.py68-166
LogprobsProcessor (vllm/v1/engine/logprobs.py29) accumulates per-position logprob data across steps.
It is created via from_new_request() which reads SamplingParams.logprobs and SamplingParams.prompt_logprobs to determine what to collect.
Per-step update (update(engine_core_output)):
- If new_logprobs is present: calls _update_sample_logprobs(), which iterates over each position in LogprobsLists, detokenizes top-k tokens, and calls append_logprobs_for_next_position() to build the SampleLogprobs list
- If new_prompt_logprobs_tensors is present: calls _update_prompt_logprobs() during the prefill phase

The cumulative_logprob is updated by summing the sampled token's logprob at each position.
In DELTA mode, pop_prompt_logprobs() is called once to return and discard prompt logprobs after they are first emitted, preventing re-transmission on subsequent steps.
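A minimal sketch of the accumulation step (hypothetical shapes; vLLM stores per-position top-k dicts in SampleLogprobs rather than plain dicts):

```python
import math

# Sketch of cumulative-logprob accumulation across steps.

class ToyLogprobsProcessor:
    def __init__(self):
        self.sample_logprobs = []      # one {token_id: logprob} dict per position
        self.cumulative_logprob = 0.0

    def update(self, sampled_ids, step_logprobs):
        """step_logprobs[i] maps candidate token IDs to logprobs for position i."""
        for token_id, candidates in zip(sampled_ids, step_logprobs):
            self.sample_logprobs.append(candidates)
            # Add the logprob of the token that was actually sampled.
            self.cumulative_logprob += candidates[token_id]
```

Summing logprobs is equivalent to multiplying probabilities, so exp(cumulative_logprob) is the joint probability of the sampled sequence under the model.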
Sources: vllm/v1/engine/logprobs.py29 vllm/outputs.py22-50
RequestState.make_request_output() (vllm/v1/engine/output_processor.py269-331) assembles the final output object. Its logic:
1. If finish_reason is None and output_kind == FINAL_ONLY, returns None (suppress intermediate outputs).
2. If stream_interval > 1, only emits an output when finished, on the first token, or when detokenizer.num_output_tokens() - sent_tokens_offset >= stream_interval.
3. If pooling_output is not None, returns a PoolingRequestOutput wrapping a PoolingOutput.
4. Otherwise calls _new_completion_output() to build a CompletionOutput.
5. If parent_req is not None (parallel sampling), delegates to parent_req.get_outputs() to aggregate child results.
6. Calls _new_request_output() to wrap everything in a RequestOutput.

_new_completion_output() (vllm/v1/engine/output_processor.py376-407):

- Gets text from detokenizer.get_next_output_text(finished, delta)
- Uses detokenizer.output_token_ids for the full token list

Sources: vllm/v1/engine/output_processor.py269-410
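The stream_interval gate reduces to a small predicate; the helper below is a hypothetical reduction of that decision, not the real method:

```python
# Sketch of the stream_interval emission gate described above.

def should_emit(num_output_tokens: int, sent_offset: int,
                stream_interval: int, finished: bool) -> bool:
    """Decide whether an intermediate output should be surfaced now."""
    if finished:
        return True               # always flush the final output
    if num_output_tokens <= 1:
        return True               # emit the first token promptly (low TTFT)
    return num_output_tokens - sent_offset >= stream_interval
```

Raising stream_interval trades streaming granularity for less per-token Python overhead, which matters at high request concurrency.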
RequestOutputCollector (vllm/v1/engine/output_processor.py45-106) is an asyncio-compatible single-slot buffer used in the AsyncLLM path to hand off outputs from the output_handler task to the generate() coroutine.
Figure: RequestOutputCollector interaction
Key behaviors:
- get_nowait() returns immediately if data is ready, avoiding asyncio task switching under load.
- If output_kind == DELTA and the producer is ahead of the consumer, successive put() calls merge outputs via RequestOutput.add() — concatenating text and extending token_ids and logprobs.
- put(exception) stores the exception; the next get() or get_nowait() raises it.

The output_kind is set at construction time from SamplingParams.output_kind:

- DELTA → aggregate=True, merge successive outputs
- CUMULATIVE / FINAL_ONLY → aggregate=False, replace

Sources: vllm/v1/engine/output_processor.py45-106 vllm/outputs.py145-173
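The single-slot-with-merge behavior can be sketched as follows (hypothetical, simplified: payloads are plain strings rather than RequestOutput objects, and exception handling is omitted):

```python
import asyncio

# Sketch of a single-slot collector that merges deltas when the
# producer runs ahead of the consumer.

class ToyOutputCollector:
    def __init__(self, aggregate: bool):
        self.aggregate = aggregate
        self.output: str | None = None
        self.ready = asyncio.Event()

    def put(self, item: str):
        if self.output is not None and self.aggregate:
            self.output += item    # merge deltas if the consumer lags
        else:
            self.output = item     # otherwise replace the slot
        self.ready.set()

    async def get(self) -> str:
        await self.ready.wait()
        item, self.output = self.output, None
        self.ready.clear()
        return item

async def demo():
    c = ToyOutputCollector(aggregate=True)
    c.put("Hello")
    c.put(" world")                # producer ran twice before consumer woke
    return await c.get()
```

Unlike an unbounded asyncio.Queue, this design caps memory per request at one merged output no matter how far the consumer falls behind.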
In the AsyncLLM path, a background asyncio task (output_handler) continuously pulls from EngineCore and pushes to per-request queues.
Figure: AsyncLLM output handler loop
The chunk size is controlled by VLLM_V1_OUTPUT_PROC_CHUNK_SIZE to avoid blocking the asyncio event loop for too long when large batches arrive.
Sources: vllm/v1/engine/async_llm.py641-760
For offline inference with LLM, LLMEngine.step() (vllm/v1/engine/llm_engine.py294-332) executes synchronously:
1. engine_core.get_output() — pulls one EngineCoreOutputs (blocks until available)
2. output_processor.process_outputs(outputs.outputs, ...) — processes all outputs
3. engine_core.abort_requests(processed_outputs.reqs_to_abort) — sends stop-string aborts back
4. logger_manager.record(...) if stats logging enabled
5. Returns list[RequestOutput | PoolingRequestOutput]

The LLM.generate() method collects these results by calling _run_engine() in a loop until output_processor.has_unfinished_requests() returns False.
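The collection loop has roughly this shape (toy stand-ins for the engine and outputs; the real loop lives in vllm/entrypoints/llm.py, and ordering of results is an assumption of this sketch):

```python
# Sketch of a _run_engine-style synchronous collection loop.

class ToyEngine:
    def __init__(self, scripted_steps):
        self.scripted_steps = list(scripted_steps)  # outputs per step

    def has_unfinished_requests(self) -> bool:
        return bool(self.scripted_steps)

    def step(self):
        return self.scripted_steps.pop(0)

def run_engine(engine):
    finished = []
    while engine.has_unfinished_requests():
        for out in engine.step():
            if out["finished"]:
                finished.append(out)
    # Return results in request-id order regardless of completion order.
    return sorted(finished, key=lambda o: o["request_id"])
```

The point of the sketch: requests finish in arbitrary order across steps, so the caller-facing list must be reordered before being returned.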
Sources: vllm/v1/engine/llm_engine.py294-332 vllm/entrypoints/llm.py424-484
When SamplingParams.n > 1, both LLMEngine and AsyncLLM fan out the request into n child requests before sending to EngineCore.
Figure: Parallel sampling fan-out and aggregation
Each child has its own RequestState, its own IncrementalDetokenizer, and its own LogprobsProcessor. ParentRequest.get_outputs() collects CompletionOutput from each child as they complete. A combined RequestOutput is only emitted when at least one child has new output to report. The finished flag on the combined output is True only when all n children are done.
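The finished-only-when-all-children-finish aggregation can be sketched like this (hypothetical ParentRequest stand-in; the real logic is in vllm/v1/engine/parallel_sampling.py):

```python
# Sketch of parent-request aggregation for n > 1 parallel sampling.

class ToyParentRequest:
    def __init__(self, n: int):
        self.n = n
        self.child_outputs: dict[int, dict] = {}  # index -> latest output
        self.finished_children: set[int] = set()

    def update_child(self, index: int, text: str, finished: bool):
        self.child_outputs[index] = {"index": index, "text": text}
        if finished:
            self.finished_children.add(index)

    def get_outputs(self):
        """Combined view: finished only when all n children are done."""
        outputs = [self.child_outputs[i] for i in sorted(self.child_outputs)]
        return outputs, len(self.finished_children) == self.n
```

Each child keeps its own index so the combined CompletionOutput list has a stable order even when children finish out of order.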
Sources: vllm/v1/engine/async_llm.py387-404 vllm/v1/engine/llm_engine.py277-292 vllm/v1/engine/parallel_sampling.py
RequestOutputKind is part of SamplingParams and controls what RequestState.make_request_output() emits:
| RequestOutputKind | Behavior |
|---|---|
| CUMULATIVE | Each output contains all tokens generated so far. token_ids = full list, text = full string. |
| DELTA | Each output contains only tokens generated since the previous output. RequestOutputCollector merges if consumer lags. |
| FINAL_ONLY | No intermediate outputs. A single output is emitted when the request finishes. |
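The three kinds can be contrasted with a toy emitter (hypothetical function; the real behavior lives in RequestState.make_request_output):

```python
# Toy illustration of the three RequestOutputKind behaviors.

def stream(kind: str, steps: list[list[int]]) -> list[list[int]]:
    """steps: token lists produced per engine step.
    Returns the sequence of outputs a consumer would observe."""
    emitted, so_far = [], []
    for i, new in enumerate(steps):
        so_far.extend(new)
        finished = (i == len(steps) - 1)
        if kind == "FINAL_ONLY" and not finished:
            continue  # suppress all intermediate outputs
        payload = list(so_far) if kind in ("CUMULATIVE", "FINAL_ONLY") else list(new)
        emitted.append(payload)
    return emitted
```

For steps [[1], [2, 3], [4]] this yields [[1], [1, 2, 3], [1, 2, 3, 4]] for CUMULATIVE, [[1], [2, 3], [4]] for DELTA, and a single [[1, 2, 3, 4]] for FINAL_ONLY.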
Sources: vllm/v1/engine/output_processor.py278-331 vllm/v1/engine/output_processor.py376-407
FinishReason (vllm/v1/engine/__init__.py42-62) is an IntEnum used compactly in serialization:
| Value | String | Cause |
|---|---|---|
| STOP = 0 | "stop" | EOS token or matched stop string |
| LENGTH = 1 | "length" | max_tokens exhausted or max_model_len reached |
| ABORT = 2 | "abort" | Client disconnected or explicit abort |
| ERROR = 3 | "error" | Retryable internal error (e.g., KV load failure) |
Stop conditions are detected in two places:

- EngineCore side: EOS-token and stop-token matching, handled by the scheduler. Results in finish_reason=STOP on the EngineCoreOutput.
- OutputProcessor side: text-level stop-string matching via IncrementalDetokenizer.update(). When detected, the OutputProcessor adds the request to reqs_to_abort and sends an abort to EngineCore to halt further generation.

The stop_reason field carries either the matched stop string (str) or the stop token ID (int).
Sources: vllm/v1/engine/__init__.py42-62 vllm/v1/engine/detokenizer.py97-144 vllm/v1/engine/llm_engine.py314-317