This page documents the HTTP API exposed by llama-server (tools/server/). It covers all endpoints, request and response formats, authentication, SSE streaming, sampling parameters, and the internal task dispatch system.
For /chat/completions formatting, see 3.9.

The server is built from three primary components, each in a dedicated file:
| Component | File | Role |
|---|---|---|
server_http_context | tools/server/server-http.h | HTTP layer (httplib), route binding, SSE chunked responses |
server_context | tools/server/server-context.h | Inference engine, slot pool, task execution loop |
server_routes | tools/server/server.cpp | Handler functions wired to HTTP routes via ctx_http.get/post |
Component Architecture:
Sources: tools/server/server.cpp99-322 tools/server/server-context.h1-50 tools/server/server-queue.h1-30
Route-to-Handler Binding:
Sources: tools/server/server.cpp165-199
The binary is llama-server. Without --model, it starts in router mode (see 6.4).
Key startup flags:
| Flag | Default | Description |
|---|---|---|
-m, --model FNAME | (none) | Path to GGUF model |
--host HOST | 127.0.0.1 | Listening address |
--port PORT | 8080 | Listening port (0 = random) |
-np, --parallel N | 1 | Number of inference slots |
-c, --ctx-size N | 0 (from model) | Context window size |
--api-key KEY | (none) | Bearer token for auth |
--api-key-file FNAME | (none) | File with API keys (one per line) |
--metrics | disabled | Enable /metrics endpoint |
--no-slots | false | Disable /slots endpoint |
--embedding | disabled | Restrict to embedding mode |
--reranking | disabled | Enable reranking endpoint |
--cont-batching | enabled | Continuous batching |
--chat-template TEMPLATE | (from model) | Custom Jinja2 chat template string |
--chat-template-file FILE | (none) | Read chat template from file |
--jinja | false | Use Jinja2 renderer (vs. built-in) |
--reasoning-format FORMAT | none | none, deepseek, deepseek-legacy, auto |
--slot-save-path PATH | (none) | Directory for slot KV save/load |
--ssl-key-file FNAME | (none) | TLS private key (PEM) |
--ssl-cert-file FNAME | (none) | TLS certificate (PEM) |
Sources: tools/server/README.md30-119 common/arg.cpp73-80
When --api-key or --api-key-file is set, all non-public endpoints require:
Authorization: Bearer <key>
Public endpoints (no key required):
- GET /health, GET /v1/health
- GET /models, GET /v1/models, GET /api/tags

On auth failure the server returns HTTP 401:
Route registration clearly marks which handlers bypass the key check at tools/server/server.cpp165-175
Sources: tools/server/server.cpp165-199 tools/server/server-common.cpp16-57
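As a sketch, an authenticated request can be built with Python's standard library; the URL and key below are placeholders, not server defaults:

```python
import urllib.request

def make_request(url, api_key=None):
    """Build a GET request, attaching the Bearer token only when a key is configured."""
    req = urllib.request.Request(url)
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    return req

# With a key: every non-public endpoint needs the header.
req = make_request("http://127.0.0.1:8080/props", api_key="sk-local-demo")
# Without a key: public endpoints such as /health work unauthenticated.
public = make_request("http://127.0.0.1:8080/health")
```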
**GET /health · GET /v1/health**
Always public. Returns HTTP 503 while the model is loading (SERVER_STATE_LOADING_MODEL), HTTP 200 when ready (SERVER_STATE_READY).
While loading:
Server states are defined at tools/server/server-context.cpp43-46
**GET /metrics**
Prometheus-format metrics. Requires --metrics. Includes token throughput, slot activity, KV cache utilization, and request latencies.
**GET /props**
Returns model and server configuration: model name, context size, chat template info, supported features, web UI settings.
**POST /props**
Sets per-session properties such as the default system prompt and antiprompt strings.
Sources: tools/server/server-context.cpp43-46
**GET /models · GET /v1/models · GET /api/tags**
OpenAI- and Ollama-compatible model listing. Always public.
**POST /api/show**
Ollama-compatible model metadata endpoint.
**POST /completion · POST /completions**
Native llama.cpp completion. Dispatches SERVER_TASK_TYPE_COMPLETION with TASK_RESPONSE_TYPE_NONE.
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
prompt | string or array | required | Prompt text or array of token IDs |
n_predict | int | -1 | Max tokens (-1 = unlimited) |
stream | bool | false | SSE streaming |
temperature | float | 0.80 | Sampling temperature |
top_k | int | 40 | Top-K sampling |
top_p | float | 0.95 | Top-P nucleus sampling |
min_p | float | 0.05 | Min-P threshold |
seed | int | random | RNG seed |
stop | array | [] | Stop strings |
cache_prompt | bool | true | Reuse KV cache for matching prefix |
n_keep | int | 0 | Tokens to preserve on context shift |
grammar | string | "" | GBNF grammar string |
json_schema | object | null | JSON schema (auto-converted to grammar) |
logit_bias | array | [] | [{"token": id, "bias": float}] |
n_probs | int | 0 | Top-N probabilities to return |
lora | array | [] | [{"id": int, "scale": float}] per-request LoRA |
timings_per_token | bool | false | Include per-token timing in stream events |
return_tokens | bool | false | Include generated token IDs in response |
n_cache_reuse | int | 0 | Min chunk size for KV shift reuse |
t_max_predict_ms | int | -1 | Max generation wall-clock time (ms) |
speculative | object | {} | Speculative decoding overrides |
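A minimal request body for /completion, using a handful of the fields from the table above (the values are illustrative, not recommended settings):

```python
import json

# Minimal /completion request body. Only the fields present here are
# overridden; everything else falls back to the server defaults.
payload = {
    "prompt": "Write a haiku about autumn:",
    "n_predict": 64,          # cap generation at 64 tokens
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.95,
    "stop": ["\n\n"],         # stop at the first blank line
    "cache_prompt": True,     # reuse KV cache for a matching prefix
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")
```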
Non-streaming response:
SSE stream events (one per token):
data: {"content": "generated", "stop": false}
data: {"content": " text", "stop": false}
data: {"content": "", "stop": true, "timings": {...}, "tokens_predicted": 42}
Sources: tools/server/server-task.h48-86 tools/server/server-task.cpp97-135
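The stream above can be consumed by scanning `data:` lines and accumulating `content` until an event with `"stop": true` arrives; a minimal sketch against a captured transcript:

```python
import json

def parse_native_sse(raw):
    """Accumulate `content` fields from a native /completion SSE transcript
    until an event with "stop": true arrives."""
    out = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        event = json.loads(line[len("data: "):])
        out.append(event.get("content", ""))
        if event.get("stop"):
            break
    return "".join(out)

transcript = (
    'data: {"content": "generated", "stop": false}\n'
    'data: {"content": " text", "stop": false}\n'
    'data: {"content": "", "stop": true, "tokens_predicted": 2}\n'
)
print(parse_native_sse(transcript))  # → "generated text"
```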
**POST /v1/completions**
OpenAI-compatible text completion. Dispatches with TASK_RESPONSE_TYPE_OAI_CMPL. Accepts: model, prompt, max_tokens, temperature, stream, stop, logit_bias, frequency_penalty, presence_penalty, seed, n, suffix.
Response:
Sources: tools/server/server-task.h34
**POST /chat/completions · POST /v1/chat/completions · POST /api/chat**
OpenAI-compatible chat. Applies the model's chat template to format messages. Dispatches with TASK_RESPONSE_TYPE_OAI_CHAT.
Key request fields:
| Field | Type | Default | Description |
|---|---|---|---|
model | string | (ignored) | Passed through in response |
messages | array | required | [{"role": "user"/"assistant"/"system", "content": "..."}] |
max_tokens | int | -1 | Max generated tokens |
stream | bool | false | SSE streaming |
temperature | float | 0.80 | Sampling temperature |
top_p | float | 0.95 | Top-P sampling |
frequency_penalty | float | 0.0 | Frequency penalty |
presence_penalty | float | 0.0 | Presence penalty |
seed | int | random | RNG seed |
stop | string or array | [] | Stop sequences |
tools | array | [] | Tool/function definitions |
tool_choice | string or object | "auto" | Tool selection strategy |
response_format | object | null | {"type": "json_object"} or {"type": "json_schema", "json_schema": {...}} |
logit_bias | object | null | {token_id_str: bias} |
Message content may be a string or an array of content parts for multimodal input:
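Assuming the OpenAI content-parts convention for mixed input, such a message might look like the following (the data URL is a placeholder; a real request would embed a base64-encoded image):

```python
# A chat message whose content is an array of typed parts rather than a
# plain string: one text part and one image part.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this picture?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}
```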
Non-streaming response:
reasoning_content is populated when --reasoning-format is set and the model emits chain-of-thought tokens (e.g., DeepSeek-R1 <think> blocks). See common_reasoning_format in common/common.h330-338 and 3.9 for template details.
SSE stream format:
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "part"}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": {...}}
data: [DONE]
Sources: tools/server/server-task.h34 tools/server/tests/unit/test_chat_completion.py28-52
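A sketch of reassembling the assistant's text from these chunks, stopping at the [DONE] sentinel:

```python
import json

def accumulate_chat_stream(raw):
    """Concatenate delta.content across chat.completion.chunk events,
    stopping at the [DONE] sentinel."""
    text = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
    return "".join(text)

transcript = (
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant", "content": ""}, "index": 0}]}\n'
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hello"}, "index": 0}]}\n'
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}]}\n'
    'data: [DONE]\n'
)
print(accumulate_chat_stream(transcript))  # → "Hello"
```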
**POST /v1/responses**
OpenAI Responses API format. Routes to the same inference path as chat completions but uses TASK_RESPONSE_TYPE_OAI_RESP response formatting.
Sources: tools/server/server-task.h36 tools/server/server.cpp180
**POST /v1/messages**
Anthropic-compatible chat. Uses TASK_RESPONSE_TYPE_ANTHROPIC.
Request:
Response:
**POST /v1/messages/count_tokens**
Returns the token count for an Anthropic-format request without running inference.
Sources: tools/server/server-task.h38 tools/server/server.cpp181-182
**POST /infill**
Fill-in-the-middle for code models with FIM token support. Dispatches SERVER_TASK_TYPE_INFILL.
Request:
Response format is identical to /completion.
Sources: tools/server/server-task.h22 tools/server/server.cpp183
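A hedged sketch of an infill request body; the input_prefix/input_suffix field names follow common llama.cpp usage and should be checked against the server version in use:

```python
# Fill-in-the-middle request: the model generates tokens to bridge the
# gap between the prefix and the suffix.
infill_request = {
    "input_prefix": "def fibonacci(n):\n    ",
    "input_suffix": "\n    return result\n",
    "n_predict": 128,
}
```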
**POST /embeddings · POST /embedding · POST /v1/embeddings**
Generate embeddings. Requires --embedding at startup. Dispatches SERVER_TASK_TYPE_EMBEDDING.
Request:
input may be a single string or an array of strings.
Response (/embeddings, native):
Response (/v1/embeddings, OAI-compatible):
Sources: tools/server/server-task.h18 tools/server/server.cpp184-186
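Returned vectors are typically consumed via cosine similarity; a dependency-free sketch, with toy 3-dimensional vectors standing in for real model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical toy vectors: similarity is exactly 1.0.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
print(round(cosine_similarity(v1, v2), 3))  # → 1.0
```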
**POST /rerank · POST /reranking · POST /v1/rerank · POST /v1/reranking**
Score document relevance against a query. Requires --reranking. Dispatches SERVER_TASK_TYPE_RERANK.
Request:
Response:
Sources: tools/server/server-task.h19 tools/server/server.cpp187-190
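Assuming each result entry carries index and relevance_score fields (an assumption to verify against the server version in use), ranking documents client-side is a one-liner:

```python
# Sort a rerank response's results by relevance_score, highest first.
# The scores here are made up for illustration.
response = {
    "results": [
        {"index": 0, "relevance_score": 0.12},
        {"index": 1, "relevance_score": 0.87},
        {"index": 2, "relevance_score": 0.45},
    ]
}
ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
print([r["index"] for r in ranked])  # → [1, 2, 0]
```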
**POST /tokenize**
Tokenize text without inference.
Request:
Response:
**POST /detokenize**
Convert token IDs back to text.
Request:
Response:
Sources: tools/server/server.cpp191-192
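As an illustration of the round trip (the field names "content" and "tokens", and the token IDs themselves, are assumptions rather than captured output):

```python
# Tokenize request: text in, token IDs out.
tokenize_request = {"content": "Hello, world!"}
# A hypothetical tokenize response (IDs are placeholders).
tokenize_response = {"tokens": [9906, 11, 1917, 0]}
# Detokenize feeds those IDs straight back to recover the text.
detokenize_request = {"tokens": tokenize_response["tokens"]}
```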
**POST /apply-template**
Apply the model's chat template to a messages array and return the formatted string.
Request:
Response:
Sources: tools/server/server.cpp193
LoRA adapters are loaded at startup via --lora or --lora-scaled. The API adjusts scales at runtime.
**GET /lora-adapters**
Dispatches SERVER_TASK_TYPE_GET_LORA.
**POST /lora-adapters**
Dispatches SERVER_TASK_TYPE_SET_LORA. Sets adapter scales for subsequent requests.
LoRA scale management and cache invalidation logic is in tools/server/server-common.cpp91-154
Sources: tools/server/server.cpp195-196 tools/server/server-task.h27-28
The server maintains a pool of server_slot objects (one per --parallel value). Each slot holds a llama_context* and optional mtmd_context*.
**GET /slots**
Returns the state of all inference slots. Disabled when --no-slots is set.
**POST /slots/:id_slot**
Save, restore, or erase a slot's KV cache state. Requires --slot-save-path.
| action | Task Type | Description |
|---|---|---|
"save" | SERVER_TASK_TYPE_SLOT_SAVE | Write KV cache to file |
"restore" | SERVER_TASK_TYPE_SLOT_RESTORE | Load KV cache from file |
"erase" | SERVER_TASK_TYPE_SLOT_ERASE | Clear slot KV cache |
Request (save):
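The save and restore bodies presumably carry just a filename relative to --slot-save-path (an assumption to verify against the server version in use):

```python
# Save the slot's KV cache under --slot-save-path, then restore it later
# from the same file.
save_request = {"filename": "slot-0.bin"}
restore_request = {"filename": "slot-0.bin"}
```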
Sources: tools/server/server.cpp197-199 tools/server/server-task.h27-29
Slot State Machine (server_slot):
Sources: tools/server/server-context.cpp34-41 tools/server/server-context.cpp300-319
When "stream": true is set, the server responds with:
Content-Type: text/event-stream
Transfer-Encoding: chunked

Native format (/completion, /completions):
data: {"content": "chunk one", "stop": false}
data: {"content": " chunk two", "stop": false}
data: {"content": "", "stop": true, "timings": {...}, "tokens_predicted": 42}
OpenAI chat format (/chat/completions, /v1/chat/completions):
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant", "content": ""}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hello"}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": {...}}
data: [DONE]
When "timings_per_token": true is set in the request, each stream event includes a timings object with per-token latency data.
Sources: tools/server/server-task.h49-51 tools/server/server-context.cpp1-30
All errors return a consistent JSON envelope:
Error type reference (from error_type enum):
| type string | HTTP Code | Trigger |
|---|---|---|
invalid_request_error | 400 | Malformed body, unknown parameters |
exceed_context_size_error | 400 | Prompt exceeds context window |
authentication_error | 401 | Missing or invalid API key |
permission_error | 403 | Access denied |
not_found_error | 404 | Unknown resource |
not_supported_error | 501 | Feature not enabled or unavailable |
unavailable_error | 503 | No slots free, server still loading |
server_error | 500 | Unexpected internal exception |
Every route handler is wrapped by ex_wrapper (tools/server/server.cpp35-67), which catches C++ exceptions and converts them to the appropriate error response. std::invalid_argument maps to 400; all other exceptions map to 500.
Sources: tools/server/server-common.cpp16-57 tools/server/server.cpp35-67
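A client sketch that parses the envelope (its shape is assumed here to be {"error": {"code", "message", "type"}}) and flags only the 503-class error as retryable:

```python
import json

# unavailable_error (503) means "no free slot" or "still loading":
# the same request can simply be retried later.
RETRYABLE = {"unavailable_error"}

def classify_error(body):
    """Parse the JSON error envelope; return (type, should_retry)."""
    err = json.loads(body)["error"]
    return err["type"], err["type"] in RETRYABLE

body = '{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}'
print(classify_error(body))  # → ('unavailable_error', True)
```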
All generation endpoints accept these fields, mapping to common_params_sampling (common/common.h180-245):
| Parameter | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.80 | Softmax temperature (0 = greedy) |
dynatemp_range | float | 0.0 | Dynamic temperature range (0 = disabled) |
dynatemp_exponent | float | 1.0 | Dynamic temperature exponent |
top_k | int | 40 | Top-K cutoff (0 = all tokens) |
top_p | float | 0.95 | Top-P nucleus threshold |
min_p | float | 0.05 | Min-P threshold |
typical_p | float | 1.0 | Locally-typical sampling (1.0 = disabled) |
top_n_sigma | float | -1.0 | Top-N-Sigma (-1 = disabled) |
repeat_penalty | float | 1.0 | Repetition penalty (1.0 = disabled) |
repeat_last_n | int | 64 | Window for repetition penalty |
presence_penalty | float | 0.0 | OAI presence penalty |
frequency_penalty | float | 0.0 | OAI frequency penalty |
dry_multiplier | float | 0.0 | DRY repetition penalty multiplier |
dry_base | float | 1.75 | DRY base value |
dry_allowed_length | int | 2 | DRY allowed sequence length |
dry_penalty_last_n | int | -1 | DRY scan window (-1 = context size) |
xtc_probability | float | 0.0 | XTC sampling probability |
xtc_threshold | float | 0.1 | XTC threshold |
mirostat | int | 0 | Mirostat mode (0=off, 1=v1, 2=v2) |
mirostat_tau | float | 5.0 | Mirostat target entropy |
mirostat_eta | float | 0.1 | Mirostat learning rate |
seed | int | 0xFFFFFFFF | RNG seed (0xFFFFFFFF = random) |
n_probs | int | 0 | Top-N token probabilities to return |
min_keep | int | 0 | Minimum tokens each sampler must retain |
grammar | string | "" | GBNF grammar string |
grammar_lazy | bool | false | Grammar activates only on trigger tokens |
json_schema | object | null | JSON schema (auto-converted to GBNF grammar) |
ignore_eos | bool | false | Ignore EOS/EOG tokens |
logit_bias | array | [] | [{"token": id, "bias": float}] adjustments |
See 3.8 for the sampling chain architecture.
Sources: common/common.h180-245 tools/server/server-task.cpp28-89
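The override semantics can be pictured as a dictionary overlay: request-supplied fields win, everything else falls back to the defaults from the table above (only a subset of fields shown):

```python
# A subset of the server-side sampling defaults from the table above.
SAMPLING_DEFAULTS = {
    "temperature": 0.80,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.0,
    "mirostat": 0,
}

def resolve_sampling(request_body):
    """Overlay request-supplied sampling fields on the defaults."""
    params = dict(SAMPLING_DEFAULTS)
    params.update({k: v for k, v in request_body.items() if k in SAMPLING_DEFAULTS})
    return params

print(resolve_sampling({"temperature": 0.2, "top_k": 1})["temperature"])  # → 0.2
```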
Task Dispatch Flow:
Sources: tools/server/server-queue.h1-50 tools/server/server-context.cpp1-30
Task types (server_task_type in tools/server/server-task.h16-29):
| Task Type | Endpoint |
|---|---|
SERVER_TASK_TYPE_COMPLETION | /completion, /chat/completions, /v1/completions, etc. |
SERVER_TASK_TYPE_EMBEDDING | /embeddings, /v1/embeddings |
SERVER_TASK_TYPE_RERANK | /rerank, /v1/rerank |
SERVER_TASK_TYPE_INFILL | /infill |
SERVER_TASK_TYPE_CANCEL | Client disconnect |
SERVER_TASK_TYPE_METRICS | /metrics |
SERVER_TASK_TYPE_SLOT_SAVE | POST /slots/:id with "action": "save" |
SERVER_TASK_TYPE_SLOT_RESTORE | POST /slots/:id with "action": "restore" |
SERVER_TASK_TYPE_SLOT_ERASE | POST /slots/:id with "action": "erase" |
SERVER_TASK_TYPE_GET_LORA | GET /lora-adapters |
SERVER_TASK_TYPE_SET_LORA | POST /lora-adapters |
Response format selection (task_response_type in tools/server/server-task.h32-39):
| task_response_type | Format |
|---|---|
TASK_RESPONSE_TYPE_NONE | Native llama.cpp format |
TASK_RESPONSE_TYPE_OAI_CHAT | OpenAI chat completions |
TASK_RESPONSE_TYPE_OAI_CMPL | OpenAI text completions |
TASK_RESPONSE_TYPE_OAI_RESP | OpenAI Responses API |
TASK_RESPONSE_TYPE_OAI_EMBD | OpenAI embeddings |
TASK_RESPONSE_TYPE_ANTHROPIC | Anthropic Messages API |
Sources: tools/server/server-task.h16-39
When llama-server is started without --model, it operates as a router that manages child server processes. Two additional endpoints are registered:
| Endpoint | Description |
|---|---|
POST /models/load | Start a child server with a specified model |
POST /models/unload | Stop a running child server |
All other endpoints proxy to the appropriate child instance via models_routes->proxy_get and models_routes->proxy_post. See 6.4 for full documentation.
Sources: tools/server/server.cpp124-163