This page documents the FastAPI-based HTTP server that exposes vLLM's inference capabilities through an OpenAI-compatible REST API. It covers the server's entry points, startup sequence, route registration, middleware stack, engine client lifecycle, and multi-process deployment modes.
For documentation on how chat messages and multimodal content are parsed before being dispatched to the engine, see Chat Utilities and Multimodal Input Handling. For structured output and tool-calling integration, see Structured Output Generation and Responses API and Tool Calling. For the underlying engine client and async engine architecture, see Engine Core and Client APIs.
High-Level Request Flow
Sources: vllm/entrypoints/cli/serve.py42-134 vllm/entrypoints/openai/api_server.py464-530
The vLLM CLI is defined in vllm/entrypoints/cli/main.py and dispatches subcommands through the CLISubcommand base class. The serve subcommand is registered by ServeSubcommand in vllm/entrypoints/cli/serve.py.
The dispatch logic in ServeSubcommand.cmd() examines args.api_server_count:
| Condition | Mode | Function |
|---|---|---|
| api_server_count == 1 | Single process | run_server(args) |
| api_server_count > 1 | Multi-process (data parallel) | run_multi_api_server(args) |
| api_server_count < 1 | Headless (engine only) | run_headless(args) |
The default value of api_server_count is derived from data_parallel_size unless overridden.
Sources: vllm/entrypoints/cli/main.py16-79 vllm/entrypoints/cli/serve.py42-134
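The dispatch above can be sketched as follows. This is a minimal illustration, not the actual ServeSubcommand.cmd() code; the function name dispatch_mode and the returned strings are placeholders naming the functions from the table:

```python
# Hypothetical sketch of the dispatch in ServeSubcommand.cmd(); the return
# strings name the functions from the table above and are not real vLLM values.
def dispatch_mode(api_server_count: int) -> str:
    if api_server_count == 1:
        return "run_server"            # single in-process API server
    if api_server_count > 1:
        return "run_multi_api_server"  # one OS process per API server
    return "run_headless"              # engine cores only, no HTTP server
```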
The key startup steps performed by run_server_worker() vllm/entrypoints/openai/api_server.py474-530:
1. setup_server(args) – validates arguments, binds the socket (TCP or Unix domain), sets ulimit, installs a SIGTERM handler.
2. build_async_engine_client(args) – async context manager that creates the AsyncLLM engine and yields an EngineClient for the lifetime of the server.
3. engine_client.get_supported_tasks() – queries the engine to discover which task types the loaded model supports.
4. build_app(args, supported_tasks) – constructs the FastAPI application, registers all route routers, and applies middleware.
5. init_app_state(engine_client, app.state, args, supported_tasks) – populates app.state with the engine client, serving models, tokenization service, and per-task serving handlers.
6. serve_http(app, sock=sock, ...) – starts the uvicorn server loop.

Sources: vllm/entrypoints/openai/api_server.py419-530
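The startup sequence can be condensed into a runnable sketch. All names besides the step order are simplified stand-ins, not the real vLLM signatures:

```python
import asyncio
import contextlib

# Illustrative stubs mirroring the startup steps above; the dict payloads
# are placeholders, not real vLLM objects.
@contextlib.asynccontextmanager
async def build_async_engine_client(args):
    yield {"engine": "client"}  # real code yields an EngineClient

async def run_server_worker(args):
    """Condensed sketch of the startup sequence."""
    sock = ("0.0.0.0", args["port"])              # setup_server(args)
    async with build_async_engine_client(args) as engine_client:
        supported_tasks = ("generate",)           # engine_client.get_supported_tasks()
        app = {"routes": list(supported_tasks)}   # build_app(args, supported_tasks)
        state = {"engine_client": engine_client}  # init_app_state(...)
        return sock, app, state                   # real code: serve_http(app, sock=sock)

sock, app, state = asyncio.run(run_server_worker({"port": 8000}))
```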
build_app() vllm/entrypoints/openai/api_server.py158-288 creates the FastAPI instance and conditionally registers routers based on supported_tasks.
Routes registered unconditionally:
| Router | Attachment function | Example paths |
|---|---|---|
| General serve routes | register_vllm_serve_api_routers(app) | /health, /version, /metrics, /tokenize, /detokenize |
| Model listing | register_models_api_router(app) | /v1/models |
| SageMaker compat | register_sagemaker_api_router(app, supported_tasks) | /invocations, /ping |
Routes registered when "generate" in supported_tasks:
| Router | Attachment function | Example paths |
|---|---|---|
| Text generation | register_generate_api_routers(app) | /v1/completions, /v1/chat/completions |
| Disaggregated serving | attach_disagg_router(app) | Internal disagg endpoints |
| RLHF | attach_rlhf_router(app) | RLHF endpoints |
| Elastic EP scaling | elastic_ep_attach_router(app) | /scale_elastic_ep, /pause, /resume |
Routes registered when "transcription" in supported_tasks:
| Router | Attachment function | Example paths |
|---|---|---|
| Speech to text | register_speech_to_text_api_router(app) | /v1/audio/transcriptions |
Routes registered when "realtime" in supported_tasks:
| Router | Attachment function | Example paths |
|---|---|---|
| Realtime | register_realtime_api_router(app) | WebSocket realtime endpoint |
Routes registered for pooling tasks (embeddings, scoring, reranking):
| Router | Attachment function | Example paths |
|---|---|---|
| Pooling | register_pooling_api_routers(app, supported_tasks) | /v1/embeddings, /v1/score, /v1/rerank |
Sources: vllm/entrypoints/openai/api_server.py158-288
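The conditional registration in the tables above can be sketched as follows. The App class stands in for FastAPI, and the router names and pooling-task names ("embed", "score") are illustrative, not the exact vLLM identifiers:

```python
# Minimal sketch of build_app()'s conditional router registration; App is a
# stand-in for FastAPI and the router names are simplified labels.
class App:
    def __init__(self):
        self.routers = []

    def include(self, name):
        self.routers.append(name)

def build_app(supported_tasks):
    app = App()
    for name in ("serve", "models", "sagemaker"):  # unconditional routers
        app.include(name)
    if "generate" in supported_tasks:
        app.include("generate")       # plus disagg, RLHF, elastic EP routers
    if "transcription" in supported_tasks:
        app.include("speech_to_text")
    if "realtime" in supported_tasks:
        app.include("realtime")
    if {"embed", "score"} & set(supported_tasks):  # pooling tasks (assumed names)
        app.include("pooling")
    return app
```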
Middleware is applied in this order (last applied runs first):
| Middleware | Condition | Description |
|---|---|---|
| CORSMiddleware | Always | Configurable via --allowed-origins, --allowed-methods, --allowed-headers, --allow-credentials |
| AuthenticationMiddleware | --api-key or VLLM_API_KEY set | Bearer token validation on /v1/ paths only |
| XRequestIdMiddleware | --enable-request-id-headers | Adds X-Request-Id header to responses |
| ScalingMiddleware | Always | Checks for scaling state before processing |
| Custom middleware | --middleware args | Loaded dynamically from import paths; can be a class or async function |
Exception handlers are registered for HTTPException (http_exception_handler) and RequestValidationError (validation_exception_handler).
Sources: vllm/entrypoints/openai/api_server.py241-288
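The class-versus-async-function distinction for --middleware entries can be sketched with a small resolver. load_middleware is a hypothetical helper illustrating the idea; the real logic is inline in build_app():

```python
import importlib
import inspect

# Hypothetical helper showing how a --middleware import path could be
# resolved and classified; not the actual vLLM code.
def load_middleware(import_path: str):
    module_path, _, attr = import_path.rpartition(".")
    obj = getattr(importlib.import_module(module_path), attr)
    if inspect.isclass(obj):
        return "class", obj        # added via app.add_middleware(obj)
    if inspect.iscoroutinefunction(obj):
        return "function", obj     # wrapped via app.middleware("http")
    raise ValueError(f"{import_path} is neither a class nor an async function")
```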
init_app_state() vllm/entrypoints/openai/api_server.py291-379 populates app.state with all serving objects. This runs after the engine is ready and before the HTTP server starts accepting requests.
| app.state field | Type | Description |
|---|---|---|
| engine_client | EngineClient | Connection to the underlying AsyncLLM |
| vllm_config | VllmConfig | Full engine configuration |
| args | Namespace | CLI args |
| openai_serving_models | OpenAIServingModels | Handles /v1/models and LoRA module registry |
| openai_serving_tokenization | OpenAIServingTokenization | Handles /tokenize, /detokenize |
| (generate-specific) | OpenAIServingCompletion, OpenAIServingChat, etc. | Initialized by init_generate_state() |
| log_stats | bool | Whether to log per-request stats |
| enable_server_load_tracking | bool | Whether to track server_load_metrics |
OpenAIServingModels also calls init_static_loras() to pre-load LoRA adapters listed in --lora-modules.
Sources: vllm/entrypoints/openai/api_server.py291-379
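The state-population step can be sketched as follows; field names follow the table above, but the serving objects are string placeholders rather than the real classes:

```python
from types import SimpleNamespace

# Sketch of init_app_state(); the string values stand in for the real
# serving objects (OpenAIServingModels, OpenAIServingChat, ...).
def init_app_state(state, engine_client, supported_tasks, log_stats=True):
    state.engine_client = engine_client
    state.log_stats = log_stats
    state.openai_serving_models = "OpenAIServingModels"
    state.openai_serving_tokenization = "OpenAIServingTokenization"
    if "generate" in supported_tasks:
        # real code delegates to init_generate_state() for chat/completions
        state.openai_serving_chat = "OpenAIServingChat"
    return state

state = init_app_state(SimpleNamespace(), "engine", ("generate",))
```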
The engine client is managed by build_async_engine_client() and build_async_engine_client_from_engine_args() as async context managers.
Key behaviors:
- When VLLM_WORKER_MULTIPROC_METHOD=forkserver, the forkserver is pre-loaded with vllm.v1.engine.async_llm before the engine is started.
- client_config (containing input_address, output_address, client_count, client_index) is passed through to AsyncLLM to configure ZMQ connections in multi-server mode.
- async_llm.shutdown() is called in the finally block of the context manager regardless of how the server exits.

Sources: vllm/entrypoints/openai/api_server.py69-155
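The shutdown guarantee can be demonstrated with a self-contained async context manager; FakeAsyncLLM is a stand-in that only models the shutdown() call:

```python
import asyncio
import contextlib

class FakeAsyncLLM:
    """Stand-in for AsyncLLM; only models the shutdown() call."""
    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True

@contextlib.asynccontextmanager
async def build_async_engine_client():
    engine = FakeAsyncLLM()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs regardless of how the server exits

async def main():
    async with build_async_engine_client() as client:
        pass  # server loop would run here
    return client.shut_down

clean_shutdown = asyncio.run(main())
```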
When api_server_count > 1, run_multi_api_server() vllm/entrypoints/cli/serve.py218-291 spawns separate OS processes for each API server, with each connected to a dedicated engine core.
Each worker process receives a client_config dict containing:
| Key | Description |
|---|---|
| input_address | ZMQ address for sending requests to the engine |
| output_address | ZMQ address for receiving results from the engine |
| client_count | Total number of API server processes |
| client_index | Index of this process |
| stats_update_address | Optional address for receiving load-balancing stats from DPCoordinator |
APIServerProcessManager vllm/v1/utils.py159-225 uses multiprocessing.get_context("spawn") so each worker starts in a clean state. A weakref.finalize ensures worker processes are terminated when the manager is garbage-collected.
Sources: vllm/entrypoints/cli/serve.py218-307 vllm/v1/utils.py159-225
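The weakref.finalize cleanup pattern can be sketched as follows. Lists stand in for worker Process objects so the example is self-contained; the real manager calls terminate()/join() on each worker:

```python
import weakref

# Sketch of the cleanup guarantee: weakref.finalize runs a callback when the
# manager is collected. Lists stand in for multiprocessing.Process objects.
class ProcessManager:
    def __init__(self, procs):
        self.procs = procs
        # the callback must not reference self, or the manager could never die
        self._finalizer = weakref.finalize(self, ProcessManager._terminate, procs)

    @staticmethod
    def _terminate(procs):
        for p in procs:
            p.append("terminated")  # real code: p.terminate(); p.join()

procs = [[], []]
manager = ProcessManager(procs)
del manager  # CPython refcounting fires the finalizer immediately
```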
Arguments are defined across two dataclasses in vllm/entrypoints/openai/cli_args.py:
BaseFrontendArgs contains the arguments that exclude host/port/SSL and other HTTP-server-specific settings. It is used in contexts where a frontend server runs embedded (e.g., Ray Serve integration).
| Argument | Type | Default | Description |
|---|---|---|---|
| --lora-modules | list[LoRAModulePath] | None | LoRA adapters in name=path or JSON format |
| --chat-template | str | None | Path or inline chat template |
| --chat-template-content-format | str | "auto" | "string" or "openai" |
| --response-role | str | "assistant" | Role returned in generation prompts |
| --enable-auto-tool-choice | bool | False | Enable auto tool calling |
| --tool-call-parser | str | None | Parser for tool call output |
| --tool-parser-plugin | str | "" | Import path for custom tool parser |
| --max-log-len | int | None | Max prompt chars to log |
| --disable-frontend-multiprocessing | bool | False | Run frontend in-process with engine |
FrontendArgs (extends BaseFrontendArgs)

| Argument | Type | Default | Description |
|---|---|---|---|
| --host | str | None | Bind host |
| --port | int | 8000 | Bind port |
| --uds | str | None | Unix domain socket path (overrides host/port) |
| --api-key | list[str] | None | Required Bearer tokens |
| --allowed-origins | list[str] | ["*"] | CORS allowed origins |
| --allowed-methods | list[str] | ["*"] | CORS allowed methods |
| --allowed-headers | list[str] | ["*"] | CORS allowed headers |
| --ssl-keyfile | str | None | TLS key file path |
| --ssl-certfile | str | None | TLS cert file path |
| --middleware | list[str] | [] | Additional ASGI middleware import paths |
| --uvicorn-log-level | str | "info" | Log verbosity for uvicorn |
| --root-path | str | None | FastAPI root_path for proxy deployments |
| --disable-fastapi-docs | bool | False | Disable Swagger/ReDoc UI |
| --enable-request-id-headers | bool | False | Emit X-Request-Id header |
make_arg_parser() vllm/entrypoints/openai/cli_args.py313-350 combines FrontendArgs.add_cli_args() and AsyncEngineArgs.add_cli_args() into a single parser. This is the parser used by both the vllm serve subcommand and the direct python -m vllm.entrypoints.openai.api_server invocation.
Sources: vllm/entrypoints/openai/cli_args.py69-373
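The parser-combining approach can be sketched with plain argparse; the flags shown are a tiny illustrative subset, not the real FrontendArgs/AsyncEngineArgs flag sets:

```python
import argparse

# Sketch of how make_arg_parser() layers frontend and engine flags onto one
# parser; the helper names and flag subset are illustrative only.
def add_frontend_args(parser):
    parser.add_argument("--host", type=str, default=None)
    parser.add_argument("--port", type=int, default=8000)
    return parser

def add_engine_args(parser):
    parser.add_argument("--model", type=str, default=None)
    return parser

def make_arg_parser():
    parser = argparse.ArgumentParser(description="vLLM OpenAI server (sketch)")
    return add_engine_args(add_frontend_args(parser))

args = make_arg_parser().parse_args(["--port", "9000", "--model", "my-model"])
```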
validate_api_server_args() vllm/entrypoints/openai/api_server.py401-416 runs before the engine starts and raises KeyError for:
- --enable-auto-tool-choice with an unregistered --tool-call-parser
- An unregistered --reasoning-parser (from structured_outputs_config)

validate_parsed_serve_args() vllm/entrypoints/openai/cli_args.py353-365 runs at CLI parse time and raises TypeError for:

- --enable-auto-tool-choice without --tool-call-parser
- --enable-log-outputs without --enable-log-requests

Tool parsers are registered via ToolParserManager and reasoning parsers via ReasoningParserManager. Plugins are loaded from paths specified in --tool-parser-plugin and --reasoning-parser-plugin.
Sources: vllm/entrypoints/openai/api_server.py401-433 vllm/entrypoints/openai/cli_args.py353-365
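The parse-time checks can be sketched as a standalone function; the argument names mirror the CLI flags, and TypeError matches the documented behavior, but the signature is simplified:

```python
# Sketch of the parse-time validation rules listed above; not the real
# validate_parsed_serve_args() signature, which takes a parsed Namespace.
def validate_parsed_serve_args(enable_auto_tool_choice, tool_call_parser,
                               enable_log_outputs, enable_log_requests):
    if enable_auto_tool_choice and not tool_call_parser:
        raise TypeError("--enable-auto-tool-choice requires --tool-call-parser")
    if enable_log_outputs and not enable_log_requests:
        raise TypeError("--enable-log-outputs requires --enable-log-requests")
```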
setup_server() vllm/entrypoints/openai/api_server.py419-461 binds the socket before the engine is initialized to avoid a race condition with Ray (see GitHub #8204).
| Mode | Function | Socket type |
|---|---|---|
| TCP (default) | create_server_socket((host, port)) | AF_INET or AF_INET6 |
| Unix domain socket (--uds) | create_server_unix_socket(path) | AF_UNIX |
Both functions set SO_REUSEADDR and SO_REUSEPORT on TCP sockets, enabling multiple workers to share the same port. set_ulimit() is also called to raise the open-file-descriptor limit so uvicorn doesn't silently drop connections under high concurrency.
Sources: vllm/entrypoints/openai/api_server.py382-461
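The TCP branch can be sketched with the standard socket module; this mirrors the SO_REUSEADDR/SO_REUSEPORT behavior described above but is a simplified stand-in, using port 0 so the demo binds an ephemeral port:

```python
import socket

# Sketch of TCP socket setup with the options described above; not the
# actual create_server_socket() implementation.
def create_server_socket(addr):
    family = socket.AF_INET6 if ":" in addr[0] else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # not available on every platform
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(addr)
    return sock

sock = create_server_socket(("127.0.0.1", 0))  # port 0: OS picks a free port
bound_port = sock.getsockname()[1]
sock.close()
```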
The AuthenticationMiddleware only enforces the API key on paths starting with /v1/. Several endpoints on the same HTTP server are not protected by this middleware:
Always unprotected:
- /health, /ping, /version, /metrics – operational endpoints
- /tokenize, /detokenize – utility endpoints
- /invocations – SageMaker-compatible inference (same capability as /v1/completions)
- /pause, /resume, /scale_elastic_ep – operational control

Conditionally available (not protected):

- /tokenizer_info – only when --enable-tokenizer-info-endpoint; may expose chat templates
- /server_info, /collective_rpc, /sleep, /wake_up, /reset_prefix_cache, etc. – only when VLLM_SERVER_DEV_MODE=1; must not be enabled in production

The recommended deployment pattern is to place vLLM behind a reverse proxy that allowlists only the endpoints clients should access.
Sources: docs/usage/security.md112-224
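The path-prefix rule can be sketched as a small check; authorize is a hypothetical helper illustrating why the endpoints above stay reachable without a key, not the middleware's actual code:

```python
# Sketch of the /v1/ path-prefix rule enforced by AuthenticationMiddleware;
# the authorize() helper and its signature are illustrative only.
def authorize(path, auth_header, api_keys):
    if not path.startswith("/v1/"):
        return True  # /health, /metrics, /invocations, ... pass through
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header.removeprefix("Bearer ") in api_keys
```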
run_headless() vllm/entrypoints/cli/serve.py137-215 starts engine core processes without any API server. This is used in disaggregated or multi-node setups where the front-end runs on a different host.
- --headless implies api_server_count=0.
- vllm_config is created with headless=True.
- CoreEngineProcManager launches one engine process per local data-parallel rank.
- When node_rank_within_dp > 0 (a non-head node in a multi-node pipeline/tensor-parallel group), a MultiprocExecutor is started instead.

Sources: vllm/entrypoints/cli/serve.py137-215