This page documents how vLLM transforms raw model outputs — sampled token IDs and log probabilities emitted by EngineCore — into the RequestOutput and CompletionOutput objects returned to callers. The pipeline covered here runs in the client process (not the engine worker process), and handles incremental detokenization, stop-string detection, log probability accumulation, streaming control, and parallel-sampling aggregation.
For documentation on how tokens are sampled by the GPU model runner, see page 4.4. For how the scheduler produces EngineCoreOutput objects in the first place, see page 3.3.
Every iteration the engine produces an EngineCoreOutputs batch. The client receives it and feeds it through the OutputProcessor, which maintains per-request state and drives detokenization and logprob accumulation. The assembled RequestOutput objects are either returned synchronously (offline LLM) or pushed into per-request RequestOutputCollector queues (online AsyncLLM).
Figure: End-to-end output processing pipeline
Sources: vllm/v1/engine/output_processor.py413-445 vllm/v1/engine/llm_engine.py294-332 vllm/v1/engine/async_llm.py660-700
These types are defined in vllm/v1/engine/__init__.py and are serialized over ZMQ using msgspec.
| Class | Description |
|---|---|
| EngineCoreOutput | One request's output for a single step: token IDs, logprobs, finish reason |
| EngineCoreOutputs | Batch of EngineCoreOutput objects plus scheduler stats and timestamp |
| FinishReason | IntEnum: STOP=0, LENGTH=1, ABORT=2, ERROR=3 |
EngineCoreOutput fields (vllm/v1/engine/__init__.py150-181):
| Field | Type | Meaning |
|---|---|---|
| request_id | str | Internal request identifier |
| new_token_ids | list[int] | Tokens generated in this step |
| new_logprobs | LogprobsLists \| None | Per-token sample logprobs |
| new_prompt_logprobs_tensors | LogprobsTensors \| None | Prompt-token logprobs |
| pooling_output | torch.Tensor \| None | Embedding for pooling models |
| finish_reason | FinishReason \| None | Non-None when request is complete |
| stop_reason | int \| str \| None | Stop token ID or string |
| num_cached_tokens | int | Prefix cache hit count |
| routed_experts | np.ndarray \| None | MoE routing info |
Defined in vllm/outputs.py.
RequestOutput — top-level object returned to callers:
- request_id, prompt, prompt_token_ids, prompt_logprobs
- outputs: list[CompletionOutput] — one per completion sequence
- finished: bool
- metrics: RequestStateStats | None
- num_cached_tokens: int | None

CompletionOutput — a single generated sequence:

- index: int — position within n parallel completions
- text: str — detokenized output
- token_ids: Sequence[int]
- cumulative_logprob: float | None
- logprobs: SampleLogprobs | None
- finish_reason: str | None
- stop_reason: int | str | None

PoolingRequestOutput — used for embedding/pooling tasks, wraps a PoolingOutput containing the raw tensor.
Sources: vllm/outputs.py22-200 vllm/v1/engine/__init__.py150-228
OutputProcessor (vllm/v1/engine/output_processor.py413) is instantiated once per engine client (both LLMEngine and AsyncLLM) and owns all live request state.
OutputProcessor
├── request_states: dict[str, RequestState]
├── parent_requests: dict[str, ParentRequest] # for n > 1
├── external_req_ids: defaultdict[str, list[str]]
├── lora_states: LoRARequestStates
├── tokenizer: TokenizerLike | None
└── stream_interval: int
Key methods:
| Method | Purpose |
|---|---|
| add_request(request, prompt, parent_req, index, queue) | Registers a new request, creates its RequestState |
| process_outputs(outputs, timestamp, iteration_stats) | Main per-step processing loop |
| abort_requests(request_ids, internal) | Removes request states, optionally notifies queues |
| get_num_unfinished_requests() | Used by engine loop termination check |
| propagate_error(e) | Pushes an exception to all active queues |
process_outputs() returns an OutputProcessorOutput dataclass with:
- request_outputs: list[RequestOutput | PoolingRequestOutput]
- reqs_to_abort: list[str] — requests that hit a client-side stop string and must be cancelled at EngineCore

Sources: vllm/v1/engine/output_processor.py109-113 vllm/v1/engine/output_processor.py413-460
RequestState (vllm/v1/engine/output_processor.py129-410) holds the complete mutable state for one in-flight request.
Figure: RequestState class composition
RequestState.from_new_request() is the factory that wires up the detokenizer and logprobs processor from the incoming EngineCoreRequest.
Sources: vllm/v1/engine/output_processor.py129-268
Detokenization is performed incrementally token-by-token to support streaming. The IncrementalDetokenizer base class and its two implementations are in vllm/v1/engine/detokenizer.py.
Figure: Detokenizer selection
Sources: vllm/v1/engine/detokenizer.py30-65
Uses the tokenizers library's DecodeStream (vllm/v1/engine/detokenizer.py169-220):
- Primed with prompt_token_ids at construction (native prefill)
- decode_next(token_id) steps the stream
- Honors spaces_between_special_tokens by tracking added-token IDs

The slow implementation falls back to detokenize_incrementally() from vllm/tokenizers/detokenizer_utils.py and maintains a prefix-aware decode buffer.
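The core idea behind incremental detokenization, stripped of tokenizer details, is to re-decode a buffer and surface only the newly produced characters. The following is a toy sketch using a hypothetical fixed vocabulary, not the tokenizers library:

```python
# Sketch of the decode-buffer idea behind incremental detokenization,
# using a toy vocabulary (hypothetical; real code uses a tokenizer).

TOY_VOCAB = {0: "Hel", 1: "lo", 2: " wor", 3: "ld"}

class ToyIncrementalDetokenizer:
    def __init__(self):
        self.token_ids: list[int] = []
        self.emitted = 0  # number of characters already surfaced

    def decode_next(self, token_id: int) -> str:
        """Append one token and return only the newly produced text."""
        self.token_ids.append(token_id)
        # Re-decode the whole buffer; a real implementation keeps a
        # prefix offset so it only re-decodes a small suffix.
        full = "".join(TOY_VOCAB[t] for t in self.token_ids)
        delta = full[self.emitted:]
        self.emitted = len(full)
        return delta
```

Because multi-byte and merged tokens can change earlier text, real implementations must decode with context rather than token-by-token in isolation; this sketch sidesteps that by always re-decoding the full buffer.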
The update() method on BaseIncrementalDetokenizer (vllm/v1/engine/detokenizer.py97-144) performs stop-string matching after detokenizing new tokens:
- Appends the newly detokenized text to output_text
- Runs check_stop_strings() on the new suffix of output_text
- On a match, truncates output_text (unless include_stop_str_in_output=True)
- Returns the matched stop string (or None) to the caller

When a stop string is detected client-side, the request ID is added to reqs_to_abort in OutputProcessorOutput, and the LLMEngine or AsyncLLM sends an abort signal back to EngineCore.
A stop_buffer_length is maintained to hold back the last max(len(s) for s in stop) - 1 characters during streaming, so that partial stop strings are not prematurely emitted.
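The holdback logic can be sketched as a small generator (hypothetical helper names; this is not vLLM's check_stop_strings implementation):

```python
# Sketch of the stop-string holdback described above. Holds back
# max(len(s)) - 1 characters so a partial stop string is never emitted.

def stream_with_holdback(chunks, stop_strings):
    """Yield text chunks, holding back enough trailing characters to
    avoid emitting a partial stop string; truncate at the first match."""
    buffer_len = max(len(s) for s in stop_strings) - 1
    pending = ""
    for chunk in chunks:
        pending += chunk
        for stop in stop_strings:
            idx = pending.find(stop)
            if idx != -1:
                yield pending[:idx]  # emit text before the stop string
                return
        if len(pending) > buffer_len:
            # Safe to emit everything except the last buffer_len chars.
            yield pending[:-buffer_len] if buffer_len else pending
            pending = pending[-buffer_len:] if buffer_len else ""
    yield pending  # flush the holdback on natural end of generation
```

For example, with stop string "STOP" split across two chunks, no fragment of "STOP" ever reaches the consumer.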
Sources: vllm/v1/engine/detokenizer.py68-166
LogprobsProcessor (vllm/v1/engine/logprobs.py29) accumulates per-position logprob data across steps.
It is created via from_new_request() which reads SamplingParams.logprobs and SamplingParams.prompt_logprobs to determine what to collect.
Per-step update (update(engine_core_output)):
- If new_logprobs is present: calls _update_sample_logprobs(), which iterates over each position in LogprobsLists, detokenizes top-k tokens, and calls append_logprobs_for_next_position() to build the SampleLogprobs list
- If new_prompt_logprobs_tensors is present: calls _update_prompt_logprobs() during the prefill phase

The cumulative_logprob is updated by summing the sampled token's logprob at each position.
In DELTA mode, pop_prompt_logprobs() is called once to return and discard prompt logprobs after they are first emitted, preventing re-transmission on subsequent steps.
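A minimal sketch of the accumulation step (hypothetical shapes; vLLM stores per-position top-k dicts in SampleLogprobs rather than plain dicts):

```python
import math

# Sketch of cumulative-logprob accumulation across steps.

class ToyLogprobsProcessor:
    def __init__(self):
        self.sample_logprobs = []      # one {token_id: logprob} dict per position
        self.cumulative_logprob = 0.0

    def update(self, sampled_ids, step_logprobs):
        """step_logprobs[i] maps candidate token IDs to logprobs for position i."""
        for token_id, candidates in zip(sampled_ids, step_logprobs):
            self.sample_logprobs.append(candidates)
            # Add the logprob of the token that was actually sampled.
            self.cumulative_logprob += candidates[token_id]
```

Summing logprobs is equivalent to multiplying probabilities, so exp(cumulative_logprob) is the joint probability of the sampled sequence under the model.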
Sources: vllm/v1/engine/logprobs.py29 vllm/outputs.py22-50
RequestState.make_request_output() (vllm/v1/engine/output_processor.py269-331) assembles the final output object. Its logic:
1. If finish_reason is None and output_kind == FINAL_ONLY, returns None (suppress intermediate outputs).
2. If stream_interval > 1, only emits an output when finished, on the first token, or when detokenizer.num_output_tokens() - sent_tokens_offset >= stream_interval.
3. If pooling_output is not None, returns a PoolingRequestOutput wrapping a PoolingOutput.
4. Otherwise calls _new_completion_output() to build a CompletionOutput.
5. If parent_req is not None (parallel sampling), delegates to parent_req.get_outputs() to aggregate child results.
6. Calls _new_request_output() to wrap everything in a RequestOutput.

_new_completion_output() (vllm/v1/engine/output_processor.py376-407):

- Gets text from detokenizer.get_next_output_text(finished, delta)
- Uses detokenizer.output_token_ids for the full token list

Sources: vllm/v1/engine/output_processor.py269-410
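The stream_interval gate reduces to a small predicate; the helper below is a hypothetical reduction of that decision, not the real method:

```python
# Sketch of the stream_interval emission gate described above.

def should_emit(num_output_tokens: int, sent_offset: int,
                stream_interval: int, finished: bool) -> bool:
    """Decide whether an intermediate output should be surfaced now."""
    if finished:
        return True               # always flush the final output
    if num_output_tokens <= 1:
        return True               # emit the first token promptly (low TTFT)
    return num_output_tokens - sent_offset >= stream_interval
```

Raising stream_interval trades streaming granularity for less per-token Python overhead, which matters at high request concurrency.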
RequestOutputCollector (vllm/v1/engine/output_processor.py45-106) is an asyncio-compatible single-slot buffer used in the AsyncLLM path to hand off outputs from the output_handler task to the generate() coroutine.
Figure: RequestOutputCollector interaction
Key behaviors:
- get_nowait() returns immediately if data is ready, avoiding asyncio task switching under load.
- If output_kind == DELTA and the producer is ahead of the consumer, successive put() calls merge outputs via RequestOutput.add() — concatenating text and extending token_ids and logprobs.
- put(exception) stores the exception; the next get() or get_nowait() raises it.

The output_kind is set at construction time from SamplingParams.output_kind:

- DELTA → aggregate=True, merge successive outputs
- CUMULATIVE / FINAL_ONLY → aggregate=False, replace

Sources: vllm/v1/engine/output_processor.py45-106 vllm/outputs.py145-173
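The single-slot-with-merge behavior can be sketched as follows (hypothetical, simplified: payloads are plain strings rather than RequestOutput objects, and exception handling is omitted):

```python
import asyncio

# Sketch of a single-slot collector that merges deltas when the
# producer runs ahead of the consumer.

class ToyOutputCollector:
    def __init__(self, aggregate: bool):
        self.aggregate = aggregate
        self.output: str | None = None
        self.ready = asyncio.Event()

    def put(self, item: str):
        if self.output is not None and self.aggregate:
            self.output += item    # merge deltas if the consumer lags
        else:
            self.output = item     # otherwise replace the slot
        self.ready.set()

    async def get(self) -> str:
        await self.ready.wait()
        item, self.output = self.output, None
        self.ready.clear()
        return item

async def demo():
    c = ToyOutputCollector(aggregate=True)
    c.put("Hello")
    c.put(" world")                # producer ran twice before consumer woke
    return await c.get()
```

Unlike an unbounded asyncio.Queue, this design caps memory per request at one merged output no matter how far the consumer falls behind.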
In the AsyncLLM path, a background asyncio task (output_handler) continuously pulls from EngineCore and pushes to per-request queues.
Figure: AsyncLLM output handler loop
The chunk size is controlled by VLLM_V1_OUTPUT_PROC_CHUNK_SIZE to avoid blocking the asyncio event loop for too long when large batches arrive.
Sources: vllm/v1/engine/async_llm.py641-760
For offline inference with LLM, LLMEngine.step() (vllm/v1/engine/llm_engine.py294-332) executes synchronously:
1. engine_core.get_output() — pulls one EngineCoreOutputs (blocks until available)
2. output_processor.process_outputs(outputs.outputs, ...) — processes all outputs
3. engine_core.abort_requests(processed_outputs.reqs_to_abort) — sends stop-string aborts back
4. logger_manager.record(...) if stats logging enabled
5. Returns list[RequestOutput | PoolingRequestOutput]

The LLM.generate() method collects these results by calling _run_engine() in a loop until output_processor.has_unfinished_requests() returns False.
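The collection loop has roughly this shape (toy stand-ins for the engine and outputs; the real loop lives in vllm/entrypoints/llm.py, and ordering of results is an assumption of this sketch):

```python
# Sketch of a _run_engine-style synchronous collection loop.

class ToyEngine:
    def __init__(self, scripted_steps):
        self.scripted_steps = list(scripted_steps)  # outputs per step

    def has_unfinished_requests(self) -> bool:
        return bool(self.scripted_steps)

    def step(self):
        return self.scripted_steps.pop(0)

def run_engine(engine):
    finished = []
    while engine.has_unfinished_requests():
        for out in engine.step():
            if out["finished"]:
                finished.append(out)
    # Return results in request-id order regardless of completion order.
    return sorted(finished, key=lambda o: o["request_id"])
```

The point of the sketch: requests finish in arbitrary order across steps, so the caller-facing list must be reordered before being returned.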
Sources: vllm/v1/engine/llm_engine.py294-332 vllm/entrypoints/llm.py424-484
When SamplingParams.n > 1, both LLMEngine and AsyncLLM fan out the request into n child requests before sending to EngineCore.
Figure: Parallel sampling fan-out and aggregation
Each child has its own RequestState, its own IncrementalDetokenizer, and its own LogprobsProcessor. ParentRequest.get_outputs() collects CompletionOutput from each child as they complete. A combined RequestOutput is only emitted when at least one child has new output to report. The finished flag on the combined output is True only when all n children are done.
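The finished-only-when-all-children-finish aggregation can be sketched like this (hypothetical ParentRequest stand-in; the real logic is in vllm/v1/engine/parallel_sampling.py):

```python
# Sketch of parent-request aggregation for n > 1 parallel sampling.

class ToyParentRequest:
    def __init__(self, n: int):
        self.n = n
        self.child_outputs: dict[int, dict] = {}  # index -> latest output
        self.finished_children: set[int] = set()

    def update_child(self, index: int, text: str, finished: bool):
        self.child_outputs[index] = {"index": index, "text": text}
        if finished:
            self.finished_children.add(index)

    def get_outputs(self):
        """Combined view: finished only when all n children are done."""
        outputs = [self.child_outputs[i] for i in sorted(self.child_outputs)]
        return outputs, len(self.finished_children) == self.n
```

Each child keeps its own index so the combined CompletionOutput list has a stable order even when children finish out of order.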
Sources: vllm/v1/engine/async_llm.py387-404 vllm/v1/engine/llm_engine.py277-292 vllm/v1/engine/parallel_sampling.py
RequestOutputKind is part of SamplingParams and controls what RequestState.make_request_output() emits:
| RequestOutputKind | Behavior |
|---|---|
| CUMULATIVE | Each output contains all tokens generated so far. token_ids = full list, text = full string. |
| DELTA | Each output contains only tokens generated since the previous output. RequestOutputCollector merges if consumer lags. |
| FINAL_ONLY | No intermediate outputs. A single output is emitted when the request finishes. |
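The three kinds can be contrasted with a toy emitter (hypothetical function; the real behavior lives in RequestState.make_request_output):

```python
# Toy illustration of the three RequestOutputKind behaviors.

def stream(kind: str, steps: list[list[int]]) -> list[list[int]]:
    """steps: token lists produced per engine step.
    Returns the sequence of outputs a consumer would observe."""
    emitted, so_far = [], []
    for i, new in enumerate(steps):
        so_far.extend(new)
        finished = (i == len(steps) - 1)
        if kind == "FINAL_ONLY" and not finished:
            continue  # suppress all intermediate outputs
        payload = list(so_far) if kind in ("CUMULATIVE", "FINAL_ONLY") else list(new)
        emitted.append(payload)
    return emitted
```

For steps [[1], [2, 3], [4]] this yields [[1], [1, 2, 3], [1, 2, 3, 4]] for CUMULATIVE, [[1], [2, 3], [4]] for DELTA, and a single [[1, 2, 3, 4]] for FINAL_ONLY.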
Sources: vllm/v1/engine/output_processor.py278-331 vllm/v1/engine/output_processor.py376-407
FinishReason (vllm/v1/engine/__init__.py42-62) is an IntEnum used compactly in serialization:
| Value | String | Cause |
|---|---|---|
| STOP = 0 | "stop" | EOS token or matched stop string |
| LENGTH = 1 | "length" | max_tokens exhausted or max_model_len reached |
| ABORT = 2 | "abort" | Client disconnected or explicit abort |
| ERROR = 3 | "error" | Retryable internal error (e.g., KV load failure) |
Stop conditions are detected in two places:

- EngineCore side: EOS-token and stop-token matching, handled by the scheduler. Results in finish_reason=STOP on the EngineCoreOutput.
- OutputProcessor side: text-level stop-string matching via IncrementalDetokenizer.update(). When detected, the OutputProcessor adds the request to reqs_to_abort and sends an abort to EngineCore to halt further generation.

The stop_reason field carries either the matched stop string (str) or the stop token ID (int).
Sources: vllm/v1/engine/__init__.py42-62 vllm/v1/engine/detokenizer.py97-144 vllm/v1/engine/llm_engine.py314-317