This page documents vLLM's implementation of the OpenAI Responses API serving layer, the tool calling plugin system shared between Chat Completions and Responses APIs, function call parsing, and MCP tool integration. For the broader OpenAI-compatible server setup and route registration, see OpenAI-Compatible API Server. For structured output backends used by named and required tool calling, see Structured Output Generation. For reasoning parsers (which integrate with both APIs), see Output Processing.
The Responses API (/v1/responses) is an OpenAI-compatible endpoint that extends beyond Chat Completions with support for stateful multi-turn sessions, built-in server-side tools, MCP tool servers, background execution, reasoning items, and per-response storage.
OpenAIServingResponses in vllm/entrypoints/openai/responses/serving.py160-267 extends OpenAIServing from vllm/entrypoints/openai/engine/serving.py224 and is the primary handler for the /v1/responses endpoint.
Key fields established at construction:
| Field | Type | Description |
|---|---|---|
| parser | ParserManager | Unified parser wrapping the tool parser + reasoning parser |
| use_harmony | bool | True when model_type == "gpt_oss" (OpenAI OSS models) |
| enable_store | bool | Controlled by the VLLM_ENABLE_RESPONSES_API_STORE env var |
| response_store | dict[str, ResponsesResponse] | In-memory map of response ID → stored response |
| msg_store | dict[str, list[ChatCompletionMessageParam]] | Map of response ID → input messages |
| event_store | dict[str, tuple[deque, asyncio.Event]] | Background streaming event queues |
| tool_server | ToolServer \| None | Optional built-in/MCP tool execution server |
| tool_call_id_type | str | "kimi_k2" or "random" depending on the model |
| background_tasks | dict[str, asyncio.Task] | Active background request tasks |
Sources: vllm/entrypoints/openai/responses/serving.py160-267
ResponsesRequest from vllm/entrypoints/openai/responses/protocol.py is the Pydantic model for incoming requests. Key fields:
| Field | Type | Description |
|---|---|---|
| input | str \| list | Plain text or structured input items (messages, reasoning items, tool outputs) |
| instructions | str \| None | System-level prompt prefix |
| tools | list[Tool] | Tool definitions: function, code_interpreter, web_search_preview, mcp, etc. |
| tool_choice | str | "auto", "required", "none", or a named function |
| stream | bool | Whether to return SSE events |
| store | bool | Whether to persist the response for later retrieval |
| background | bool | Run asynchronously and poll for the result |
| previous_response_id | str \| None | Links to a prior stored response for multi-turn |
| previous_input_messages | list \| None | Alternative to previous_response_id |
| max_output_tokens | int \| None | Maximum number of tokens to generate |
| reasoning | dict \| None | Reasoning effort ("low", "medium", "high") |
Sources: vllm/entrypoints/openai/responses/serving.py293-324 vllm/entrypoints/openai/responses/serving.py596-639
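To make the fields above concrete, here is a sketch of a request payload exercising several of them. The model name, tool definition, and values are illustrative, not taken from the source:

```python
# Sketch of a /v1/responses request payload using the fields described
# above. Model name and the get_weather tool are hypothetical examples.
payload = {
    "model": "my-model",                       # hypothetical model name
    "input": "What is the weather in Paris?",  # plain-text input
    "instructions": "Answer concisely.",       # system-level prompt prefix
    "tools": [
        {
            "type": "function",
            "name": "get_weather",             # hypothetical function tool
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "tool_choice": "auto",
    "stream": False,
    "store": True,             # requires VLLM_ENABLE_RESPONSES_API_STORE=1
    "max_output_tokens": 256,
    "reasoning": {"effort": "medium"},
}

# The payload would be POSTed to the server's /v1/responses endpoint,
# e.g. requests.post(f"{base_url}/v1/responses", json=payload).
```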
Responses API Request Flow (OpenAIServingResponses.create_responses):
Sources: vllm/entrypoints/openai/responses/serving.py326-594 vllm/entrypoints/openai/responses/serving.py657-804
The context type is selected inside create_responses() at vllm/entrypoints/openai/responses/serving.py453-479:
| Condition | Context Class |
|---|---|
| use_harmony=True, non-streaming | HarmonyContext |
| use_harmony=True, streaming | StreamingHarmonyContext |
| use_harmony=False, VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 | ParsableContext |
| use_harmony=False, default | SimpleContext |
| Status | When Set |
|---|---|
"queued" | Background request created, not yet started |
"completed" | Generation finished successfully |
"incomplete" | Stopped due to max_output_tokens |
"cancelled" | Request cancelled via /v1/responses/{id}/cancel |
Sources: vllm/entrypoints/openai/responses/serving.py684-703
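A client polling a background response can treat these statuses as terminal or non-terminal. The helper below is an illustrative sketch (the status strings come from the table above; the constant and function are not part of vLLM):

```python
# Terminal vs. non-terminal response statuses, per the table above.
# TERMINAL_STATUSES / is_terminal are illustrative helpers, not vLLM APIs.
TERMINAL_STATUSES = {"completed", "incomplete", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once a background response will no longer change."""
    return status in TERMINAL_STATUSES

# A polling loop would GET /v1/responses/{id} repeatedly
# until is_terminal(response["status"]) becomes True.
```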
ConversationContext vllm/entrypoints/openai/responses/context.py107-140 is the abstract base class for tracking model output across multiple tool call turns. Each subclass handles a different execution mode.
ConversationContext Class Hierarchy and Capabilities:
Sources: vllm/entrypoints/openai/responses/context.py107-300
SimpleContext vllm/entrypoints/openai/responses/context.py165 is the default path for non-Harmony models:
- Stores the engine's final RequestOutput in last_output / final_output.
- Tracks num_prompt_tokens, num_output_tokens, num_cached_tokens.
- Records input_messages / output_messages for enable_response_messages support.
- Computes num_reasoning_tokens post-generation from the reasoning parser, using the accumulated _accumulated_token_ids.

HarmonyContext and StreamingHarmonyContext are used when model_type == "gpt_oss". These contexts use the openai_harmony library's StreamableParser to decode token IDs into structured Message objects token-by-token.
- call_tool() dispatches to the appropriate ToolServer capability and appends the tool result as a new harmony message.
- need_builtin_tool_call() checks the parser's last message to see if the model addressed a built-in tool recipient.

Per-turn metrics are tracked in TurnMetrics vllm/entrypoints/openai/responses/context.py75-104:

| TurnMetrics Field | Meaning |
|---|---|
| input_tokens | Prompt token count for this turn |
| output_tokens | Generated token count for this turn |
| cached_input_tokens | KV-cached tokens in the prompt |
| tool_output_tokens | Tokens contributed by tool results (computed as prompt growth minus the previous turn's output) |
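The tool_output_tokens rule ("prompt growth minus previous output") can be illustrated numerically. This standalone sketch mirrors the bookkeeping described above, not vLLM's actual code:

```python
# Illustrative per-turn accounting for tool_output_tokens: tokens added
# to the prompt by tool results are whatever prompt growth is NOT
# explained by the previous turn's generated output.
def tool_output_tokens(prev_prompt: int, prev_output: int, curr_prompt: int) -> int:
    return curr_prompt - (prev_prompt + prev_output)

# Turn 1: 100-token prompt, model generates 30 tokens (incl. a tool call).
# Turn 2: the prompt has grown to 150 tokens, so the tool result
# contributed 150 - (100 + 30) = 20 tokens.
print(tool_output_tokens(100, 30, 150))  # 20
```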
StreamingHarmonyContext extends HarmonyContext to expose last_content_delta for incremental SSE emission.
Sources: vllm/entrypoints/openai/responses/context.py75-104 tests/entrypoints/test_context.py86-175
Enabled via VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1. This context instantiates a ResponsesParser (from vllm/entrypoints/openai/parser/responses_parser.py) that processes token IDs during generation rather than post-hoc. It supports reasoning items and function tool calls. Output items are assembled via parser.make_response_output_items_from_parsable_context().
Sources: vllm/entrypoints/openai/responses/serving.py460-476 tests/entrypoints/openai/responses/test_parsable_context.py30-57
The Responses API streaming output follows the OpenAI Responses SSE protocol. All event-building logic lives in vllm/entrypoints/openai/responses/streaming_events.py.
StreamingState vllm/entrypoints/openai/responses/streaming_events.py90 is a mutable dataclass tracking state between emitted SSE events:
| Field | Purpose |
|---|---|
| current_content_index | Active content part index |
| current_item_index | Active output item index |
| current_item_id | ID of the currently open output item |
| last_item_type | Type of the previously finalized item (for done events) |
| last_recipient | Harmony recipient used to detect tool call transitions |
| Function | Events Emitted |
|---|---|
| emit_content_delta_events(ctx, state) | response.output_text.delta, response.reasoning_text.delta |
| emit_previous_item_done_events(ctx, state) | response.content_part.done, response.output_item.done |
| emit_tool_action_events(ctx, state) | Function call, MCP call, code interpreter, and web search events |
Each stream must emit matched open/close events:
| Done / Completed | Added / In-Progress / Delta |
|---|---|
| response.completed | response.created |
| response.output_item.done | response.output_item.added |
| response.content_part.done | response.content_part.added |
| response.output_text.done | response.output_text.delta |
| response.reasoning_text.done | response.reasoning_text.delta |
| response.function_call_arguments.done | response.function_call_arguments.delta |
| response.mcp_call.completed | response.mcp_call.in_progress |
| response.code_interpreter_call.completed | response.code_interpreter_call.in_progress |
| response.web_search_call.completed | response.web_search_call.in_progress |
Sources: vllm/entrypoints/openai/responses/streaming_events.py1-100 tests/entrypoints/openai/responses/conftest.py22-46
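One way to sanity-check a stream against the pairing table is to verify that every strictly paired opening event has a matching close (delta events may repeat, so they are excluded). The PAIRS table and check_pairs helper below are illustrative, not part of vLLM's test suite:

```python
# Verify each strictly paired opening SSE event type is matched by its
# closing counterpart, per the pairing table above. Delta-style events
# (e.g. response.output_text.delta) can repeat, so they are not checked.
PAIRS = {
    "response.created": "response.completed",
    "response.output_item.added": "response.output_item.done",
    "response.content_part.added": "response.content_part.done",
    "response.mcp_call.in_progress": "response.mcp_call.completed",
}

def check_pairs(event_types: list[str]) -> bool:
    return all(
        event_types.count(opening) == event_types.count(closing)
        for opening, closing in PAIRS.items()
    )

events = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
    "response.output_text.delta",
    "response.output_text.done",
    "response.content_part.done",
    "response.output_item.done",
    "response.completed",
]
print(check_pairs(events))  # True
```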
The Chat Completion API (/v1/chat/completions) supports tool calling through a ToolParser plugin that post-processes model output text to extract structured tool calls.
ToolParser (abstract) and ToolParserManager (registry) are defined in vllm/tool_parsers/abstract_tool_parser.py and exported from vllm/tool_parsers/__init__.py
ToolParser required interface:
| Method | Called When |
|---|---|
| extract_tool_calls(model_output, request) | Non-streaming: processes the full output string |
| extract_tool_calls_streaming(prev_text, curr_text, delta_text, prev_ids, curr_ids, delta_ids, request) | Streaming: called on each token delta |
| adjust_request(request) (optional) | Before generation: e.g., to disable skip_special_tokens |
ToolParserManager.register_lazy_module(name, module_path, class_name) registers parsers without importing them at startup. Parsers are loaded on first use.
ToolParser Registry and Chat Completion Dispatch:
Sources: vllm/tool_parsers/__init__.py1-100 vllm/entrypoints/openai/chat_completion/serving.py127-137
| Parser Name | Class (in vllm/tool_parsers/) | Supported Models |
|---|---|---|
| hermes | Hermes2ProToolParser | Nous Hermes 2 Pro+, Qwen2.5, Granite 4.0 |
| mistral | MistralToolParser | Mistral 7B-Instruct v0.3+ |
| llama3_json | Llama3JsonToolParser | Llama 3.1, 3.2 (JSON-based) |
| llama4_pythonic | Llama4PythonicToolParser | Llama 4 |
| pythonic | PythonicToolParser | Llama 3.2, ToolACE-8B, Llama 4 |
| internlm | InternLMToolParser | InternLM 2.5+ |
| jamba | JambaToolParser | AI21 Jamba 1.5 |
| xlam | xLAMToolParser | Salesforce Llama-xLAM, Qwen-xLAM |
| granite | GraniteToolParser | IBM Granite 3.x |
| granite-20b-fc | Granite20bFCToolParser | IBM Granite 20B FC |
| deepseek_v3 | DeepSeekV3ToolParser | DeepSeek-V3-0324, R1-0528 |
| deepseek_v31 | DeepSeekV31ToolParser | DeepSeek-V3.1 |
| minimax_m1 | (minimax parser) | MiniMax M1 |
| openai | (openai parser) | OpenAI OSS 20B/120B |
| kimi_k2 | KimiK2ToolParser | Kimi-K2-Instruct |
| hunyuan_a13b | (hunyuan parser) | Hunyuan A13B |
| longcat | (longcat parser) | LongCat Flash Chat |
| glm45 | Glm4MoeModelToolParser | GLM-4.5, GLM-4.6 |
| glm47 | Glm47MoeModelToolParser | GLM-4.7 |
| functiongemma | FunctionGemmaToolParser | FunctionGemma 270M |
| qwen3_xml | (qwen3 xml parser) | Qwen3-Coder |
| olmo3 | (olmo3 parser) | OLMo 3 |
| gigachat3 | (gigachat parser) | GigaChat 3 |
Sources: vllm/tool_parsers/__init__.py24-100 docs/features/tool_calling.md127-432
| tool_choice | Behavior | Backend |
|---|---|---|
| "none" | No tool calls; plain-text response. Optionally suppress tool defs with --exclude-tools-when-tool-choice-none | N/A |
| "auto" | Model decides; output parsed by the ToolParser | --enable-auto-tool-choice + --tool-call-parser |
| "required" | Model forced to call ≥1 tool; output constrained to the tool JSON schema | Structured outputs |
| {"type": "function", "function": {"name": "f"}} | Forces a call to the named function | Structured outputs |
For "auto", ToolParser.extract_tool_calls() inspects the raw generation text for model-specific tool call syntax (XML tags, JSON arrays, Python lists, etc.).
For "required" and named tool choice, vLLM uses the structured output backend (xgrammar or outlines) to constrain the model to a valid JSON schema matching the tool parameter definitions.
Sources: docs/features/tool_calling.md87-110 vllm/entrypoints/openai/chat_completion/serving.py256-283
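For a named function choice, the constraint is (conceptually) the tool's parameter schema pinned to that tool's name. The sketch below illustrates the idea only; vLLM's actual schema construction for xgrammar/outlines differs, and get_weather is a hypothetical tool:

```python
# Conceptual sketch: deriving a JSON schema that constrains generation
# to a single named tool call. Illustrative only, not vLLM's code.
def named_call_schema(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "type": "object",
        "properties": {
            "name": {"const": fn["name"]},           # pin the function name
            "parameters": fn.get("parameters", {"type": "object"}),
        },
        "required": ["name", "parameters"],
    }

tool = {
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
schema = named_call_schema(tool)
print(schema["properties"]["name"])  # {'const': 'get_weather'}
```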
The Responses API supports three tool categories with different execution responsibilities:
| Category | Tool Types | Execution Location |
|---|---|---|
| Function tools | "function" | Client-side: vLLM generates call, client executes and provides result |
| Built-in tools | "code_interpreter", "web_search_preview", "container" | Server-side via ToolServer |
| MCP tools | "mcp" | Server-side via MCP protocol sessions |
construct_tool_dicts() from vllm/entrypoints/openai/responses/utils.py converts tools entries into dictionaries for prompt construction. The model generates JSON arguments; these are parsed and returned as ResponseFunctionToolCall items in response.output.
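The client-side round trip for function tools can be sketched as two steps: the model emits a function_call item, and the client executes it and sends back a function_call_output input item. The IDs and tool below are illustrative:

```python
import json

# Sketch of the client-side function-tool round trip. The call_id and
# get_weather tool are hypothetical; the item shapes follow the
# Responses API function_call / function_call_output types.
call_item = {                        # appears in response.output
    "type": "function_call",
    "call_id": "call_123",           # hypothetical ID
    "name": "get_weather",
    "arguments": '{"city": "Paris"}',
}

args = json.loads(call_item["arguments"])
result = f"Sunny in {args['city']}"  # client-side execution (stubbed)

followup_input = [
    call_item,                       # echo the call for context
    {
        "type": "function_call_output",
        "call_id": call_item["call_id"],
        "output": result,
    },
]
# followup_input would be sent as the `input` of the next /v1/responses
# request (or the turns linked via previous_response_id).
```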
extract_tool_types() vllm/entrypoints/openai/responses/serving.py403 extracts the set of tool type strings from the request's tools list for routing to built-in handlers.
When --tool-server demo is specified, vLLM instantiates a ToolServer (from vllm/entrypoints/mcp/tool_server.py) that provides:
- browser → handles web_search_preview requests
- python → handles code_interpreter requests
- container → handles container requests

The canonical mapping is BUILTIN_TOOL_TO_MCP_SERVER_LABEL in vllm/entrypoints/openai/parser/harmony_utils.py42-49. At request time, create_responses() matches requested_tool_types against tool_server.has_tool() to populate builtin_tool_list vllm/entrypoints/openai/responses/serving.py403-425
MCP tools allow requests to specify arbitrary external tool servers. A request includes Mcp type entries with server_label, server_url, and optional allowed_tools.
_extract_allowed_tools_from_mcp_requests() vllm/entrypoints/openai/responses/serving.py121-157 builds a server_label → allowed_tools mapping. The "*" wildcard is normalized to None (all tools allowed).
_initialize_tool_sessions() vllm/entrypoints/openai/responses/serving.py641-655 calls context.init_tool_sessions(), which establishes per-request MCP connections managed by an AsyncExitStack.
MCP Tool Execution Sequence:
Sources: vllm/entrypoints/openai/responses/serving.py641-655 vllm/entrypoints/openai/responses/context.py107-140 tests/entrypoints/openai/responses/test_mcp_tools.py39-98
The Harmony path activates when hf_config.model_type == "gpt_oss" vllm/entrypoints/openai/responses/serving.py223 and uses the openai_harmony library for encoding and parsing.
harmony_utils.py (vllm/entrypoints/openai/parser/harmony_utils.py) provides:
| Function | Purpose |
|---|---|
| get_encoding() | Loads and caches HarmonyEncodingName.HARMONY_GPT_OSS |
| render_for_completion(messages) | Converts list[Message] → list[int] token IDs via the harmony encoding |
| get_stop_tokens_for_assistant_actions() | Returns the <\|return\|> and <\|call\|> stop token IDs added to stop_token_ids |
| get_system_message(...) | Builds a SYSTEM role Message with model identity, reasoning effort, date, tool descriptions |
| get_developer_message(instructions, tools) | Builds a DEVELOPER role Message with instructions and function tool definitions |
| get_streamable_parser_for_assistant() | Creates a StreamableParser for incremental token-by-token decoding |
| parse_chat_inputs_to_harmony_messages(chat_msgs) | Converts Chat Completion format messages to harmony Message objects |
| auto_drop_analysis_messages(msgs) | Removes stale analysis messages from multi-turn context |
Sources: vllm/entrypoints/openai/parser/harmony_utils.py60-332
Harmony messages use a channel attribute to encode their semantic role:
| Channel | Recipient | Interpretation |
|---|---|---|
"final" | N/A | Visible assistant text output to the user |
"analysis" | N/A | Hidden chain-of-thought reasoning (counts toward num_reasoning_tokens) |
"commentary" | None | Visible preamble text (does not count as reasoning) |
"commentary" | tool name | Hidden tool call arguments (counts toward num_reasoning_tokens) |
Sources: vllm/entrypoints/openai/parser/harmony_utils.py226-316 tests/entrypoints/test_context.py239-274
vllm/entrypoints/openai/responses/harmony.py handles bidirectional conversion:
response_input_to_harmony() converts ResponsesRequest.input items (messages, reasoning items, function_call, function_call_output) to harmony Message objects.harmony_to_response_output() converts completed harmony Message list to list[ResponseOutputItem] (message, reasoning, function_call, mcp_call, web_search output items).parser_state_to_response_output() converts the in-progress StreamableParser state to response output items during SSE streaming.Sources: vllm/entrypoints/openai/responses/harmony.py1-50 tests/entrypoints/openai/responses/test_harmony_utils.py
The responses_full_generator() assembles ResponseUsage from context-reported metrics vllm/entrypoints/openai/responses/serving.py762-785:
ResponseUsage
├── input_tokens (sum of all turns' prompt tokens)
├── output_tokens (sum of all turns' generated tokens)
├── input_tokens_details
│ ├── cached_tokens (sum of KV-cached tokens)
│ ├── input_tokens_per_turn
│ └── cached_tokens_per_turn
└── output_tokens_details
├── reasoning_tokens (analysis channel + tool-directed commentary)
├── tool_output_tokens (prompt growth from tool results)
├── output_tokens_per_turn
└── tool_output_tokens_per_turn
Sources: vllm/entrypoints/openai/responses/serving.py762-785
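The per-turn aggregation above can be sketched as follows. The field names come from the ResponseUsage tree and TurnMetrics; the helper itself (and the omission of reasoning_tokens, which comes from channel accounting rather than summation) is illustrative:

```python
# Illustrative aggregation of per-turn metrics into the ResponseUsage
# shape shown above. Each entry in `turns` mirrors TurnMetrics fields.
def build_usage(turns: list[dict]) -> dict:
    return {
        "input_tokens": sum(t["input_tokens"] for t in turns),
        "output_tokens": sum(t["output_tokens"] for t in turns),
        "input_tokens_details": {
            "cached_tokens": sum(t["cached_input_tokens"] for t in turns),
            "input_tokens_per_turn": [t["input_tokens"] for t in turns],
            "cached_tokens_per_turn": [t["cached_input_tokens"] for t in turns],
        },
        "output_tokens_details": {
            "tool_output_tokens": sum(t["tool_output_tokens"] for t in turns),
            "output_tokens_per_turn": [t["output_tokens"] for t in turns],
            "tool_output_tokens_per_turn": [t["tool_output_tokens"] for t in turns],
        },
    }

turns = [
    {"input_tokens": 100, "output_tokens": 30,
     "cached_input_tokens": 0, "tool_output_tokens": 0},
    {"input_tokens": 150, "output_tokens": 40,
     "cached_input_tokens": 100, "tool_output_tokens": 20},
]
usage = build_usage(turns)
print(usage["input_tokens"], usage["output_tokens"])  # 250 70
```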
To add a parser for a new model family:
1. Implement ToolParser:
```python
# Import paths follow the layout described above and may differ
# across vLLM versions.
from vllm.tool_parsers import ToolParser
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    DeltaMessage,
    ExtractedToolCallInformation,
)


class MyModelToolParser(ToolParser):
    def __init__(self, tokenizer):  # tokenizer: a TokenizerLike instance
        super().__init__(tokenizer)

    def adjust_request(
        self, request: ChatCompletionRequest
    ) -> ChatCompletionRequest:
        # e.g., disable skip_special_tokens if the model uses special
        # tokens to delimit tool calls
        return request

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        # Parse the full model_output string and return structured tool calls
        ...

    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids,
        current_token_ids,
        delta_token_ids,
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        # Return an incremental DeltaMessage, or None if there is not
        # yet enough context to emit a delta
        ...
```
2. Register with ToolParserManager:
```python
ToolParserManager.register_lazy_module(
    name="my_model",
    module_path="vllm.tool_parsers.my_model_parser",
    class_name="MyModelToolParser",
)
```
3. Use at server startup: select the parser with --tool-call-parser my_model (together with --enable-auto-tool-choice).
For external plugins, pass --tool-parser-plugin /absolute/path/to/plugin.py instead of registering in the package.
Sources: docs/features/tool_calling.md462-525 vllm/tool_parsers/__init__.py1-50
| Flag | Purpose |
|---|---|
| --enable-auto-tool-choice | Enable tool_choice="auto" mode in Chat Completions |
| --tool-call-parser <name> | Select the tool parser (e.g., hermes, llama3_json, pythonic) |
| --tool-parser-plugin <path> | Load an external tool parser plugin |
| --chat-template <path> | Path to a tool-compatible Jinja2 chat template |
| --reasoning-parser <name> | Reasoning parser (also used by the Responses API for num_reasoning_tokens) |
| --tool-server demo | Enable the built-in tool server (code interpreter, web search, container) |
| --exclude-tools-when-tool-choice-none | Omit tool definitions from the prompt when tool_choice="none" |
| Variable | Default | Purpose |
|---|---|---|
| VLLM_ENABLE_RESPONSES_API_STORE | 0 | Enable response storage; required for store=True and background requests |
| VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT | 0 | Use ParsableContext for token-level parsing during generation |
| VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS | "" | Comma-separated MCP server labels treated as system tools |
| VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS | 0 | When 1, embed instructions in the system message; otherwise in the developer message |
| VLLM_SYSTEM_START_DATE | current date | Pin the conversation start date in harmony system messages |
Sources: vllm/entrypoints/openai/responses/serving.py215-265 vllm/entrypoints/openai/parser/harmony_utils.py80-96 tests/entrypoints/openai/responses/test_mcp_tools.py113-130