This page documents vLLM's implementation of the OpenAI Responses API serving layer, the tool calling plugin system shared between Chat Completions and Responses APIs, function call parsing, and MCP tool integration. For the broader OpenAI-compatible server setup and route registration, see OpenAI-Compatible API Server. For structured output backends used by named and required tool calling, see Structured Output Generation. For reasoning parsers (which integrate with both APIs), see Output Processing.
The Responses API (/v1/responses) is an OpenAI-compatible endpoint that extends beyond Chat Completions with support for stateful multi-turn sessions, built-in server-side tools, MCP tool servers, background execution, reasoning items, and per-response storage.
OpenAIServingResponses in vllm/entrypoints/openai/responses/serving.py160-267 extends OpenAIServing from vllm/entrypoints/openai/engine/serving.py224 and is the primary handler for the /v1/responses endpoint.
Key fields established at construction:
| Field | Type | Description |
|---|---|---|
| parser | ParserManager | Unified parser wrapping the tool parser + reasoning parser |
| use_harmony | bool | True when model_type == "gpt_oss" (OpenAI OSS models) |
| enable_store | bool | Controlled by the VLLM_ENABLE_RESPONSES_API_STORE env var |
| response_store | dict[str, ResponsesResponse] | In-memory map of response ID → stored response |
| msg_store | dict[str, list[ChatCompletionMessageParam]] | Map of response ID → input messages |
| event_store | dict[str, tuple[deque, asyncio.Event]] | Background streaming event queues |
| tool_server | ToolServer \| None | Optional built-in/MCP tool execution server |
| tool_call_id_type | str | "kimi_k2" or "random" depending on the model |
| background_tasks | dict[str, asyncio.Task] | Active background request tasks |
Sources: vllm/entrypoints/openai/responses/serving.py160-267
ResponsesRequest from vllm/entrypoints/openai/responses/protocol.py is the Pydantic model for incoming requests. Key fields:
| Field | Type | Description |
|---|---|---|
| input | str \| list | Plain text or structured input items (messages, reasoning items, tool outputs) |
| instructions | str \| None | System-level prompt prefix |
| tools | list[Tool] | Tool definitions: function, code_interpreter, web_search_preview, mcp, etc. |
| tool_choice | str | "auto", "required", "none", or a named function |
| stream | bool | Whether to return SSE events |
| store | bool | Whether to persist the response for later retrieval |
| background | bool | Run asynchronously and poll for the result |
| previous_response_id | str \| None | Links to a prior stored response for multi-turn |
| previous_input_messages | list \| None | Alternative to previous_response_id |
| max_output_tokens | int \| None | Maximum number of tokens to generate |
| reasoning | dict \| None | Reasoning effort ("low", "medium", "high") |
Sources: vllm/entrypoints/openai/responses/serving.py293-324 vllm/entrypoints/openai/responses/serving.py596-639
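To make the fields above concrete, here is a sketch of a request payload exercising several of them. The model name, tool definition, and values are illustrative, not taken from the source:

```python
# Sketch of a /v1/responses request payload using the fields described
# above. Model name and the get_weather tool are hypothetical examples.
payload = {
    "model": "my-model",                       # hypothetical model name
    "input": "What is the weather in Paris?",  # plain-text input
    "instructions": "Answer concisely.",       # system-level prompt prefix
    "tools": [
        {
            "type": "function",
            "name": "get_weather",             # hypothetical function tool
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "tool_choice": "auto",
    "stream": False,
    "store": True,             # requires VLLM_ENABLE_RESPONSES_API_STORE=1
    "max_output_tokens": 256,
    "reasoning": {"effort": "medium"},
}

# The payload would be POSTed to the server's /v1/responses endpoint,
# e.g. requests.post(f"{base_url}/v1/responses", json=payload).
```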
Responses API Request Flow (OpenAIServingResponses.create_responses):
Sources: vllm/entrypoints/openai/responses/serving.py326-594 vllm/entrypoints/openai/responses/serving.py657-804
The context type is selected inside create_responses() at vllm/entrypoints/openai/responses/serving.py453-479:
| Condition | Context Class |
|---|---|
| use_harmony=True, non-streaming | HarmonyContext |
| use_harmony=True, streaming | StreamingHarmonyContext |
| use_harmony=False, VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 | ParsableContext |
| use_harmony=False, default | SimpleContext |
| Status | When Set |
|---|---|
"queued" | Background request created, not yet started |
"completed" | Generation finished successfully |
"incomplete" | Stopped due to max_output_tokens |
"cancelled" | Request cancelled via /v1/responses/{id}/cancel |
Sources: vllm/entrypoints/openai/responses/serving.py684-703
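A client polling a background response can treat these statuses as terminal or non-terminal. The helper below is an illustrative sketch (the status strings come from the table above; the constant and function are not part of vLLM):

```python
# Terminal vs. non-terminal response statuses, per the table above.
# TERMINAL_STATUSES / is_terminal are illustrative helpers, not vLLM APIs.
TERMINAL_STATUSES = {"completed", "incomplete", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once a background response will no longer change."""
    return status in TERMINAL_STATUSES

# A polling loop would GET /v1/responses/{id} repeatedly
# until is_terminal(response["status"]) becomes True.
```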
ConversationContext vllm/entrypoints/openai/responses/context.py107-140 is the abstract base class for tracking model output across multiple tool call turns. Each subclass handles a different execution mode.
ConversationContext Class Hierarchy and Capabilities:
Sources: vllm/entrypoints/openai/responses/context.py107-300
SimpleContext vllm/entrypoints/openai/responses/context.py165 is the default path for non-Harmony models:
- Stores the engine's final RequestOutput in last_output / final_output.
- Tracks num_prompt_tokens, num_output_tokens, num_cached_tokens.
- Records input_messages / output_messages for enable_response_messages support.
- Computes num_reasoning_tokens post-generation from the reasoning parser, using the accumulated _accumulated_token_ids.

HarmonyContext and StreamingHarmonyContext are used when model_type == "gpt_oss". These contexts use the openai_harmony library's StreamableParser to decode token IDs into structured Message objects token-by-token.
- call_tool() dispatches to the appropriate ToolServer capability and appends the tool result as a new harmony message.
- need_builtin_tool_call() checks the parser's last message to see if the model addressed a built-in tool recipient.

Per-turn metrics are tracked in TurnMetrics vllm/entrypoints/openai/responses/context.py75-104:

| TurnMetrics Field | Meaning |
|---|---|
| input_tokens | Prompt token count for this turn |
| output_tokens | Generated token count for this turn |
| cached_input_tokens | KV-cached tokens in the prompt |
| tool_output_tokens | Tokens contributed by tool results (computed as prompt growth minus the previous turn's output) |
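The tool_output_tokens rule ("prompt growth minus previous output") can be illustrated numerically. This standalone sketch mirrors the bookkeeping described above, not vLLM's actual code:

```python
# Illustrative per-turn accounting for tool_output_tokens: tokens added
# to the prompt by tool results are whatever prompt growth is NOT
# explained by the previous turn's generated output.
def tool_output_tokens(prev_prompt: int, prev_output: int, curr_prompt: int) -> int:
    return curr_prompt - (prev_prompt + prev_output)

# Turn 1: 100-token prompt, model generates 30 tokens (incl. a tool call).
# Turn 2: the prompt has grown to 150 tokens, so the tool result
# contributed 150 - (100 + 30) = 20 tokens.
print(tool_output_tokens(100, 30, 150))  # 20
```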
StreamingHarmonyContext extends HarmonyContext to expose last_content_delta for incremental SSE emission.
Sources: vllm/entrypoints/openai/responses/context.py75-104 tests/entrypoints/test_context.py86-175
Enabled via VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1. This context instantiates a ResponsesParser (from vllm/entrypoints/openai/parser/responses_parser.py) that processes token IDs during generation rather than post-hoc. It supports reasoning items and function tool calls. Output items are assembled via parser.make_response_output_items_from_parsable_context().
Sources: vllm/entrypoints/openai/responses/serving.py460-476 tests/entrypoints/openai/responses/test_parsable_context.py30-57
The Responses API streaming output follows the OpenAI Responses SSE protocol. All event-building logic lives in vllm/entrypoints/openai/responses/streaming_events.py.
StreamingState vllm/entrypoints/openai/responses/streaming_events.py90 is a mutable dataclass tracking state between emitted SSE events:
| Field | Purpose |
|---|---|
| current_content_index | Active content part index |
| current_item_index | Active output item index |
| current_item_id | ID of the currently open output item |
| last_item_type | Type of the previously finalized item (for done events) |
| last_recipient | Harmony recipient used to detect tool call transitions |
| Function | Events Emitted |
|---|---|
| emit_content_delta_events(ctx, state) | response.output_text.delta, response.reasoning_text.delta |
| emit_previous_item_done_events(ctx, state) | response.content_part.done, response.output_item.done |
| emit_tool_action_events(ctx, state) | Function call, MCP call, code interpreter, and web search events |
Each stream must emit matched open/close events:
| Done / Completed | Added / In-Progress / Delta |
|---|---|
| response.completed | response.created |
| response.output_item.done | response.output_item.added |
| response.content_part.done | response.content_part.added |
| response.output_text.done | response.output_text.delta |
| response.reasoning_text.done | response.reasoning_text.delta |
| response.function_call_arguments.done | response.function_call_arguments.delta |
| response.mcp_call.completed | response.mcp_call.in_progress |
| response.code_interpreter_call.completed | response.code_interpreter_call.in_progress |
| response.web_search_call.completed | response.web_search_call.in_progress |
Sources: vllm/entrypoints/openai/responses/streaming_events.py1-100 tests/entrypoints/openai/responses/conftest.py22-46
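One way to sanity-check a stream against the pairing table is to verify that every strictly paired opening event has a matching close (delta events may repeat, so they are excluded). The PAIRS table and check_pairs helper below are illustrative, not part of vLLM's test suite:

```python
# Verify each strictly paired opening SSE event type is matched by its
# closing counterpart, per the pairing table above. Delta-style events
# (e.g. response.output_text.delta) can repeat, so they are not checked.
PAIRS = {
    "response.created": "response.completed",
    "response.output_item.added": "response.output_item.done",
    "response.content_part.added": "response.content_part.done",
    "response.mcp_call.in_progress": "response.mcp_call.completed",
}

def check_pairs(event_types: list[str]) -> bool:
    return all(
        event_types.count(opening) == event_types.count(closing)
        for opening, closing in PAIRS.items()
    )

events = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
    "response.output_text.delta",
    "response.output_text.done",
    "response.content_part.done",
    "response.output_item.done",
    "response.completed",
]
print(check_pairs(events))  # True
```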
The Chat Completion API (/v1/chat/completions) supports tool calling through a ToolParser plugin that post-processes model output text to extract structured tool calls.
ToolParser (abstract) and ToolParserManager (registry) are defined in vllm/tool_parsers/abstract_tool_parser.py and exported from vllm/tool_parsers/__init__.py
ToolParser required interface:
| Method | Called When |
|---|---|
| extract_tool_calls(model_output, request) | Non-streaming: processes the full output string |
| extract_tool_calls_streaming(prev_text, curr_text, delta_text, prev_ids, curr_ids, delta_ids, request) | Streaming: called on each token delta |
| adjust_request(request) (optional) | Before generation: e.g., to disable skip_special_tokens |
ToolParserManager.register_lazy_module(name, module_path, class_name) registers parsers without importing them at startup. Parsers are loaded on first use.
ToolParser Registry and Chat Completion Dispatch:
Sources: vllm/tool_parsers/__init__.py1-100 vllm/entrypoints/openai/chat_completion/serving.py127-137
| Parser Name | Class (in vllm/tool_parsers/) | Supported Models |
|---|---|---|
| hermes | Hermes2ProToolParser | Nous Hermes 2 Pro+, Qwen2.5, Granite 4.0 |
| mistral | MistralToolParser | Mistral 7B-Instruct v0.3+ |
| llama3_json | Llama3JsonToolParser | Llama 3.1, 3.2 (JSON-based) |
| llama4_pythonic | Llama4PythonicToolParser | Llama 4 |
| pythonic | PythonicToolParser | Llama 3.2, ToolACE-8B, Llama 4 |
| internlm | InternLMToolParser | InternLM 2.5+ |
| jamba | JambaToolParser | AI21 Jamba 1.5 |
| xlam | xLAMToolParser | Salesforce Llama-xLAM, Qwen-xLAM |
| granite | GraniteToolParser | IBM Granite 3.x |
| granite-20b-fc | Granite20bFCToolParser | IBM Granite 20B FC |
| deepseek_v3 | DeepSeekV3ToolParser | DeepSeek-V3-0324, R1-0528 |
| deepseek_v31 | DeepSeekV31ToolParser | DeepSeek-V3.1 |
| minimax_m1 | (minimax parser) | MiniMax M1 |
| openai | (openai parser) | OpenAI OSS 20B/120B |
| kimi_k2 | KimiK2ToolParser | Kimi-K2-Instruct |
| hunyuan_a13b | (hunyuan parser) | Hunyuan A13B |
| longcat | (longcat parser) | LongCat Flash Chat |
| glm45 | Glm4MoeModelToolParser | GLM-4.5, GLM-4.6 |
| glm47 | Glm47MoeModelToolParser | GLM-4.7 |
| functiongemma | FunctionGemmaToolParser | FunctionGemma 270M |
| qwen3_xml | (qwen3 xml parser) | Qwen3-Coder |
| olmo3 | (olmo3 parser) | OLMo 3 |
| gigachat3 | (gigachat parser) | GigaChat 3 |
Sources: vllm/tool_parsers/__init__.py24-100 docs/features/tool_calling.md127-432
| tool_choice | Behavior | Backend |
|---|---|---|
| "none" | No tool calls; plain-text response. Optionally suppress tool defs with --exclude-tools-when-tool-choice-none | N/A |
| "auto" | Model decides; output parsed by the ToolParser | --enable-auto-tool-choice + --tool-call-parser |
| "required" | Model forced to call ≥1 tool; output constrained to the tool JSON schema | Structured outputs |
| {"type": "function", "function": {"name": "f"}} | Forces a call to the named function | Structured outputs |
For "auto", ToolParser.extract_tool_calls() inspects the raw generation text for model-specific tool call syntax (XML tags, JSON arrays, Python lists, etc.).
For "required" and named tool choice, vLLM uses the structured output backend (xgrammar or outlines) to constrain the model to a valid JSON schema matching the tool parameter definitions.
Sources: docs/features/tool_calling.md87-110 vllm/entrypoints/openai/chat_completion/serving.py256-283
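For a named function choice, the constraint is (conceptually) the tool's parameter schema pinned to that tool's name. The sketch below illustrates the idea only; vLLM's actual schema construction for xgrammar/outlines differs, and get_weather is a hypothetical tool:

```python
# Conceptual sketch: deriving a JSON schema that constrains generation
# to a single named tool call. Illustrative only, not vLLM's code.
def named_call_schema(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "type": "object",
        "properties": {
            "name": {"const": fn["name"]},           # pin the function name
            "parameters": fn.get("parameters", {"type": "object"}),
        },
        "required": ["name", "parameters"],
    }

tool = {
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
schema = named_call_schema(tool)
print(schema["properties"]["name"])  # {'const': 'get_weather'}
```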
The Responses API supports three tool categories with different execution responsibilities:
| Category | Tool Types | Execution Location |
|---|---|---|
| Function tools | "function" | Client-side: vLLM generates call, client executes and provides result |
| Built-in tools | "code_interpreter", "web_search_preview", "container" | Server-side via ToolServer |
| MCP tools | "mcp" | Server-side via MCP protocol sessions |
construct_tool_dicts() from vllm/entrypoints/openai/responses/utils.py converts tools entries into dictionaries for prompt construction. The model generates JSON arguments; these are parsed and returned as ResponseFunctionToolCall items in response.output.
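The client-side round trip for function tools can be sketched as two steps: the model emits a function_call item, and the client executes it and sends back a function_call_output input item. The IDs and tool below are illustrative:

```python
import json

# Sketch of the client-side function-tool round trip. The call_id and
# get_weather tool are hypothetical; the item shapes follow the
# Responses API function_call / function_call_output types.
call_item = {                        # appears in response.output
    "type": "function_call",
    "call_id": "call_123",           # hypothetical ID
    "name": "get_weather",
    "arguments": '{"city": "Paris"}',
}

args = json.loads(call_item["arguments"])
result = f"Sunny in {args['city']}"  # client-side execution (stubbed)

followup_input = [
    call_item,                       # echo the call for context
    {
        "type": "function_call_output",
        "call_id": call_item["call_id"],
        "output": result,
    },
]
# followup_input would be sent as the `input` of the next /v1/responses
# request (or the turns linked via previous_response_id).
```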
extract_tool_types() vllm/entrypoints/openai/responses/serving.py403 extracts the set of tool type strings from the request's tools list for routing to built-in handlers.
When --tool-server demo is specified, vLLM instantiates a ToolServer (from vllm/entrypoints/mcp/tool_server.py) that provides:
- browser → handles web_search_preview requests
- python → handles code_interpreter requests
- container → handles container requests

The canonical mapping is BUILTIN_TOOL_TO_MCP_SERVER_LABEL in vllm/entrypoints/openai/parser/harmony_utils.py42-49. At request time, create_responses() matches requested_tool_types against tool_server.has_tool() to populate builtin_tool_list vllm/entrypoints/openai/responses/serving.py403-425
MCP tools allow requests to specify arbitrary external tool servers. A request includes Mcp type entries with server_label, server_url, and optional allowed_tools.
_extract_allowed_tools_from_mcp_requests() vllm/entrypoints/openai/responses/serving.py121-157 builds a server_label → allowed_tools mapping. The "*" wildcard is normalized to None (all tools allowed).
_initialize_tool_sessions() vllm/entrypoints/openai/responses/serving.py641-655 calls context.init_tool_sessions(), which establishes per-request MCP connections managed by an AsyncExitStack.
MCP Tool Execution Sequence:
Sources: vllm/entrypoints/openai/responses/serving.py641-655 vllm/entrypoints/openai/responses/context.py107-140 tests/entrypoints/openai/responses/test_mcp_tools.py39-98
The Harmony path activates when hf_config.model_type == "gpt_oss" vllm/entrypoints/openai/responses/serving.py223 and uses the openai_harmony library for encoding and parsing.
harmony_utils.py (vllm/entrypoints/openai/parser/harmony_utils.py) provides:
| Function | Purpose |
|---|---|
| get_encoding() | Loads and caches HarmonyEncodingName.HARMONY_GPT_OSS |
| render_for_completion(messages) | Converts list[Message] → list[int] token IDs via the harmony encoding |
| get_stop_tokens_for_assistant_actions() | Returns the <\|return\|> and <\|call\|> stop token IDs added to stop_token_ids |
| get_system_message(...) | Builds a SYSTEM role Message with model identity, reasoning effort, date, tool descriptions |
| get_developer_message(instructions, tools) | Builds a DEVELOPER role Message with instructions and function tool definitions |
| get_streamable_parser_for_assistant() | Creates a StreamableParser for incremental token-by-token decoding |
| parse_chat_inputs_to_harmony_messages(chat_msgs) | Converts Chat Completion format messages to harmony Message objects |
| auto_drop_analysis_messages(msgs) | Removes stale analysis messages from multi-turn context |
Sources: vllm/entrypoints/openai/parser/harmony_utils.py60-332
Harmony messages use a channel attribute to encode their semantic role:
| Channel | Recipient | Interpretation |
|---|---|---|
"final" | N/A | Visible assistant text output to the user |
"analysis" | N/A | Hidden chain-of-thought reasoning (counts toward num_reasoning_tokens) |
"commentary" | None | Visible preamble text (does not count as reasoning) |
"commentary" | tool name | Hidden tool call arguments (counts toward num_reasoning_tokens) |
Sources: vllm/entrypoints/openai/parser/harmony_utils.py226-316 tests/entrypoints/test_context.py239-274
vllm/entrypoints/openai/responses/harmony.py handles bidirectional conversion:
response_input_to_harmony() converts ResponsesRequest.input items (messages, reasoning items, function_call, function_call_output) to harmony Message objects.harmony_to_response_output() converts completed harmony Message list to list[ResponseOutputItem] (message, reasoning, function_call, mcp_call, web_search output items).parser_state_to_response_output() converts the in-progress StreamableParser state to response output items during SSE streaming.Sources: vllm/entrypoints/openai/responses/harmony.py1-50 tests/entrypoints/openai/responses/test_harmony_utils.py
The responses_full_generator() assembles ResponseUsage from context-reported metrics vllm/entrypoints/openai/responses/serving.py762-785:
ResponseUsage
├── input_tokens (sum of all turns' prompt tokens)
├── output_tokens (sum of all turns' generated tokens)
├── input_tokens_details
│ ├── cached_tokens (sum of KV-cached tokens)
│ ├── input_tokens_per_turn
│ └── cached_tokens_per_turn
└── output_tokens_details
├── reasoning_tokens (analysis channel + tool-directed commentary)
├── tool_output_tokens (prompt growth from tool results)
├── output_tokens_per_turn
└── tool_output_tokens_per_turn
Sources: vllm/entrypoints/openai/responses/serving.py762-785
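The per-turn aggregation above can be sketched as follows. The field names come from the ResponseUsage tree and TurnMetrics; the helper itself (and the omission of reasoning_tokens, which comes from channel accounting rather than summation) is illustrative:

```python
# Illustrative aggregation of per-turn metrics into the ResponseUsage
# shape shown above. Each entry in `turns` mirrors TurnMetrics fields.
def build_usage(turns: list[dict]) -> dict:
    return {
        "input_tokens": sum(t["input_tokens"] for t in turns),
        "output_tokens": sum(t["output_tokens"] for t in turns),
        "input_tokens_details": {
            "cached_tokens": sum(t["cached_input_tokens"] for t in turns),
            "input_tokens_per_turn": [t["input_tokens"] for t in turns],
            "cached_tokens_per_turn": [t["cached_input_tokens"] for t in turns],
        },
        "output_tokens_details": {
            "tool_output_tokens": sum(t["tool_output_tokens"] for t in turns),
            "output_tokens_per_turn": [t["output_tokens"] for t in turns],
            "tool_output_tokens_per_turn": [t["tool_output_tokens"] for t in turns],
        },
    }

turns = [
    {"input_tokens": 100, "output_tokens": 30,
     "cached_input_tokens": 0, "tool_output_tokens": 0},
    {"input_tokens": 150, "output_tokens": 40,
     "cached_input_tokens": 100, "tool_output_tokens": 20},
]
usage = build_usage(turns)
print(usage["input_tokens"], usage["output_tokens"])  # 250 70
```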
To add a parser for a new model family:
1. Implement ToolParser:
```python
# Import paths follow the layout described above and may differ
# across vLLM versions.
from vllm.tool_parsers import ToolParser
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    DeltaMessage,
    ExtractedToolCallInformation,
)


class MyModelToolParser(ToolParser):
    def __init__(self, tokenizer):  # tokenizer: a TokenizerLike instance
        super().__init__(tokenizer)

    def adjust_request(
        self, request: ChatCompletionRequest
    ) -> ChatCompletionRequest:
        # e.g., disable skip_special_tokens if the model uses special
        # tokens to delimit tool calls
        return request

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        # Parse the full model_output string and return structured tool calls
        ...

    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids,
        current_token_ids,
        delta_token_ids,
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        # Return an incremental DeltaMessage, or None if there is not
        # yet enough context to emit a delta
        ...
```
2. Register with ToolParserManager:
```python
ToolParserManager.register_lazy_module(
    name="my_model",
    module_path="vllm.tool_parsers.my_model_parser",
    class_name="MyModelToolParser",
)
```
3. Use at server startup: select the parser with --tool-call-parser my_model (together with --enable-auto-tool-choice).
For external plugins, pass --tool-parser-plugin /absolute/path/to/plugin.py instead of registering in the package.
Sources: docs/features/tool_calling.md462-525 vllm/tool_parsers/__init__.py1-50
| Flag | Purpose |
|---|---|
| --enable-auto-tool-choice | Enable tool_choice="auto" mode in Chat Completions |
| --tool-call-parser <name> | Select the tool parser (e.g., hermes, llama3_json, pythonic) |
| --tool-parser-plugin <path> | Load an external tool parser plugin |
| --chat-template <path> | Path to a tool-compatible Jinja2 chat template |
| --reasoning-parser <name> | Reasoning parser (also used by the Responses API for num_reasoning_tokens) |
| --tool-server demo | Enable the built-in tool server (code interpreter, web search, container) |
| --exclude-tools-when-tool-choice-none | Omit tool definitions from the prompt when tool_choice="none" |
| Variable | Default | Purpose |
|---|---|---|
| VLLM_ENABLE_RESPONSES_API_STORE | 0 | Enable response storage; required for store=True and background requests |
| VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT | 0 | Use ParsableContext for token-level parsing during generation |
| VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS | "" | Comma-separated MCP server labels treated as system tools |
| VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS | 0 | When 1, embed instructions in the system message; otherwise in the developer message |
| VLLM_SYSTEM_START_DATE | current date | Pin the conversation start date in harmony system messages |
Sources: vllm/entrypoints/openai/responses/serving.py215-265 vllm/entrypoints/openai/parser/harmony_utils.py80-96 tests/entrypoints/openai/responses/test_mcp_tools.py113-130