This page documents the HTTP API exposed by llama-server (tools/server/). It covers all endpoints, request and response formats, authentication, SSE streaming, sampling parameters, and the internal task dispatch system.
For /chat/completions formatting, see 3.9.

The server is built from three primary components, each in a dedicated file:
| Component | File | Role |
|---|---|---|
server_http_context | tools/server/server-http.h | HTTP layer (httplib), route binding, SSE chunked responses |
server_context | tools/server/server-context.h | Inference engine, slot pool, task execution loop |
server_routes | tools/server/server.cpp | Handler functions wired to HTTP routes via ctx_http.get/post |
Component Architecture:
Sources: tools/server/server.cpp99-322 tools/server/server-context.h1-50 tools/server/server-queue.h1-30
Route-to-Handler Binding:
Sources: tools/server/server.cpp165-199
The binary is llama-server. Without --model, it starts in router mode (see 6.4).
Key startup flags:
| Flag | Default | Description |
|---|---|---|
-m, --model FNAME | (none) | Path to GGUF model |
--host HOST | 127.0.0.1 | Listening address |
--port PORT | 8080 | Listening port (0 = random) |
-np, --parallel N | 1 | Number of inference slots |
-c, --ctx-size N | 0 (from model) | Context window size |
--api-key KEY | (none) | Bearer token for auth |
--api-key-file FNAME | (none) | File with API keys (one per line) |
--metrics | disabled | Enable /metrics endpoint |
--no-slots | false | Disable /slots endpoint |
--embedding | disabled | Restrict to embedding mode |
--reranking | disabled | Enable reranking endpoint |
--cont-batching | enabled | Continuous batching |
--chat-template TEMPLATE | (from model) | Custom Jinja2 chat template string |
--chat-template-file FILE | (none) | Read chat template from file |
--jinja | false | Use Jinja2 renderer (vs. built-in) |
--reasoning-format FORMAT | none | none, deepseek, deepseek-legacy, auto |
--slot-save-path PATH | (none) | Directory for slot KV save/load |
--ssl-key-file FNAME | (none) | TLS private key (PEM) |
--ssl-cert-file FNAME | (none) | TLS certificate (PEM) |
Sources: tools/server/README.md30-119 common/arg.cpp73-80
When --api-key or --api-key-file is set, all non-public endpoints require:
Authorization: Bearer <key>
Public endpoints (no key required):
- GET /health, GET /v1/health
- GET /models, GET /v1/models, GET /api/tags

On auth failure the server returns HTTP 401:
Route registration clearly marks which handlers bypass the key check at tools/server/server.cpp165-175
Sources: tools/server/server.cpp165-199 tools/server/server-common.cpp16-57
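As a sketch, an authenticated request can be built with Python's standard library; the URL and key below are placeholders, not server defaults:

```python
import urllib.request

def make_request(url, api_key=None):
    """Build a GET request, attaching the Bearer token only when a key is configured."""
    req = urllib.request.Request(url)
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    return req

# With a key: every non-public endpoint needs the header.
req = make_request("http://127.0.0.1:8080/props", api_key="sk-local-demo")
# Without a key: public endpoints such as /health work unauthenticated.
public = make_request("http://127.0.0.1:8080/health")
```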
**GET /health · GET /v1/health**
Always public. Returns HTTP 503 while the model is loading (SERVER_STATE_LOADING_MODEL), HTTP 200 when ready (SERVER_STATE_READY).
While loading:
Server states are defined at tools/server/server-context.cpp43-46
**GET /metrics**
Prometheus-format metrics. Requires --metrics. Includes token throughput, slot activity, KV cache utilization, and request latencies.
**GET /props**
Returns model and server configuration: model name, context size, chat template info, supported features, web UI settings.
**POST /props**
Sets per-session properties such as the default system prompt and antiprompt strings.
Sources: tools/server/server-context.cpp43-46
**GET /models · GET /v1/models · GET /api/tags**
OpenAI- and Ollama-compatible model listing. Always public.
**POST /api/show**
Ollama-compatible model metadata endpoint.
**POST /completion · POST /completions**
Native llama.cpp completion. Dispatches SERVER_TASK_TYPE_COMPLETION with TASK_RESPONSE_TYPE_NONE.
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
prompt | string or array | required | Prompt text or array of token IDs |
n_predict | int | -1 | Max tokens (-1 = unlimited) |
stream | bool | false | SSE streaming |
temperature | float | 0.80 | Sampling temperature |
top_k | int | 40 | Top-K sampling |
top_p | float | 0.95 | Top-P nucleus sampling |
min_p | float | 0.05 | Min-P threshold |
seed | int | random | RNG seed |
stop | array | [] | Stop strings |
cache_prompt | bool | true | Reuse KV cache for matching prefix |
n_keep | int | 0 | Tokens to preserve on context shift |
grammar | string | "" | GBNF grammar string |
json_schema | object | null | JSON schema (auto-converted to grammar) |
logit_bias | array | [] | [{"token": id, "bias": float}] |
n_probs | int | 0 | Top-N probabilities to return |
lora | array | [] | [{"id": int, "scale": float}] per-request LoRA |
timings_per_token | bool | false | Include per-token timing in stream events |
return_tokens | bool | false | Include generated token IDs in response |
n_cache_reuse | int | 0 | Min chunk size for KV shift reuse |
t_max_predict_ms | int | -1 | Max generation wall-clock time (ms) |
speculative | object | {} | Speculative decoding overrides |
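A minimal request body for /completion, using a handful of the fields from the table above (the values are illustrative, not recommended settings):

```python
import json

# Minimal /completion request body. Only the fields present here are
# overridden; everything else falls back to the server defaults.
payload = {
    "prompt": "Write a haiku about autumn:",
    "n_predict": 64,          # cap generation at 64 tokens
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.95,
    "stop": ["\n\n"],         # stop at the first blank line
    "cache_prompt": True,     # reuse KV cache for a matching prefix
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")
```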
Non-streaming response:
SSE stream events (one per token):
data: {"content": "generated", "stop": false}
data: {"content": " text", "stop": false}
data: {"content": "", "stop": true, "timings": {...}, "tokens_predicted": 42}
Sources: tools/server/server-task.h48-86 tools/server/server-task.cpp97-135
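The stream above can be consumed by scanning `data:` lines and accumulating `content` until an event with `"stop": true` arrives; a minimal sketch against a captured transcript:

```python
import json

def parse_native_sse(raw):
    """Accumulate `content` fields from a native /completion SSE transcript
    until an event with "stop": true arrives."""
    out = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        event = json.loads(line[len("data: "):])
        out.append(event.get("content", ""))
        if event.get("stop"):
            break
    return "".join(out)

transcript = (
    'data: {"content": "generated", "stop": false}\n'
    'data: {"content": " text", "stop": false}\n'
    'data: {"content": "", "stop": true, "tokens_predicted": 2}\n'
)
print(parse_native_sse(transcript))  # → "generated text"
```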
**POST /v1/completions**
OpenAI-compatible text completion. Dispatches with TASK_RESPONSE_TYPE_OAI_CMPL. Accepts: model, prompt, max_tokens, temperature, stream, stop, logit_bias, frequency_penalty, presence_penalty, seed, n, suffix.
Response:
Sources: tools/server/server-task.h34
**POST /chat/completions · POST /v1/chat/completions · POST /api/chat**
OpenAI-compatible chat. Applies the model's chat template to format messages. Dispatches with TASK_RESPONSE_TYPE_OAI_CHAT.
Key request fields:
| Field | Type | Default | Description |
|---|---|---|---|
model | string | (ignored) | Passed through in response |
messages | array | required | [{"role": "user"/"assistant"/"system", "content": "..."}] |
max_tokens | int | -1 | Max generated tokens |
stream | bool | false | SSE streaming |
temperature | float | 0.80 | Sampling temperature |
top_p | float | 0.95 | Top-P sampling |
frequency_penalty | float | 0.0 | Frequency penalty |
presence_penalty | float | 0.0 | Presence penalty |
seed | int | random | RNG seed |
stop | string or array | [] | Stop sequences |
tools | array | [] | Tool/function definitions |
tool_choice | string or object | "auto" | Tool selection strategy |
response_format | object | null | {"type": "json_object"} or {"type": "json_schema", "json_schema": {...}} |
logit_bias | object | null | {token_id_str: bias} |
Message content may be a string or an array of content parts for multimodal input:
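Assuming the OpenAI content-parts convention for mixed input, such a message might look like the following (the data URL is a placeholder; a real request would embed a base64-encoded image):

```python
# A chat message whose content is an array of typed parts rather than a
# plain string: one text part and one image part.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this picture?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}
```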
Non-streaming response:
reasoning_content is populated when --reasoning-format is set and the model emits chain-of-thought tokens (e.g., DeepSeek-R1 <think> blocks). See common_reasoning_format in common/common.h330-338 and 3.9 for template details.
SSE stream format:
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "part"}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": {...}}
data: [DONE]
Sources: tools/server/server-task.h34 tools/server/tests/unit/test_chat_completion.py28-52
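A sketch of reassembling the assistant's text from these chunks, stopping at the [DONE] sentinel:

```python
import json

def accumulate_chat_stream(raw):
    """Concatenate delta.content across chat.completion.chunk events,
    stopping at the [DONE] sentinel."""
    text = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
    return "".join(text)

transcript = (
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant", "content": ""}, "index": 0}]}\n'
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hello"}, "index": 0}]}\n'
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}]}\n'
    'data: [DONE]\n'
)
print(accumulate_chat_stream(transcript))  # → "Hello"
```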
**POST /v1/responses**
OpenAI Responses API format. Routes to the same inference path as chat completions but uses TASK_RESPONSE_TYPE_OAI_RESP response formatting.
Sources: tools/server/server-task.h36 tools/server/server.cpp180
**POST /v1/messages**
Anthropic-compatible chat. Uses TASK_RESPONSE_TYPE_ANTHROPIC.
Request:
Response:
**POST /v1/messages/count_tokens**
Returns the token count for an Anthropic-format request without running inference.
Sources: tools/server/server-task.h38 tools/server/server.cpp181-182
**POST /infill**
Fill-in-the-middle for code models with FIM token support. Dispatches SERVER_TASK_TYPE_INFILL.
Request:
Response format is identical to /completion.
Sources: tools/server/server-task.h22 tools/server/server.cpp183
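A hedged sketch of an infill request body; the input_prefix/input_suffix field names follow common llama.cpp usage and should be checked against the server version in use:

```python
# Fill-in-the-middle request: the model generates tokens to bridge the
# gap between the prefix and the suffix.
infill_request = {
    "input_prefix": "def fibonacci(n):\n    ",
    "input_suffix": "\n    return result\n",
    "n_predict": 128,
}
```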
**POST /embeddings · POST /embedding · POST /v1/embeddings**
Generate embeddings. Requires --embedding at startup. Dispatches SERVER_TASK_TYPE_EMBEDDING.
Request:
input may be a single string or an array of strings.
Response (/embeddings, native):
Response (/v1/embeddings, OAI-compatible):
Sources: tools/server/server-task.h18 tools/server/server.cpp184-186
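Returned vectors are typically consumed via cosine similarity; a dependency-free sketch, with toy 3-dimensional vectors standing in for real model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical toy vectors: similarity is exactly 1.0.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
print(round(cosine_similarity(v1, v2), 3))  # → 1.0
```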
**POST /rerank · POST /reranking · POST /v1/rerank · POST /v1/reranking**
Score document relevance against a query. Requires --reranking. Dispatches SERVER_TASK_TYPE_RERANK.
Request:
Response:
Sources: tools/server/server-task.h19 tools/server/server.cpp187-190
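Assuming each result entry carries index and relevance_score fields (an assumption to verify against the server version in use), ranking documents client-side is a one-liner:

```python
# Sort a rerank response's results by relevance_score, highest first.
# The scores here are made up for illustration.
response = {
    "results": [
        {"index": 0, "relevance_score": 0.12},
        {"index": 1, "relevance_score": 0.87},
        {"index": 2, "relevance_score": 0.45},
    ]
}
ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
print([r["index"] for r in ranked])  # → [1, 2, 0]
```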
**POST /tokenize**
Tokenize text without inference.
Request:
Response:
**POST /detokenize**
Convert token IDs back to text.
Request:
Response:
Sources: tools/server/server.cpp191-192
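As an illustration of the round trip (the field names "content" and "tokens", and the token IDs themselves, are assumptions rather than captured output):

```python
# Tokenize request: text in, token IDs out.
tokenize_request = {"content": "Hello, world!"}
# A hypothetical tokenize response (IDs are placeholders).
tokenize_response = {"tokens": [9906, 11, 1917, 0]}
# Detokenize feeds those IDs straight back to recover the text.
detokenize_request = {"tokens": tokenize_response["tokens"]}
```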
**POST /apply-template**
Apply the model's chat template to a messages array and return the formatted string.
Request:
Response:
Sources: tools/server/server.cpp193
LoRA adapters are loaded at startup via --lora or --lora-scaled. The API adjusts scales at runtime.
**GET /lora-adapters**
Dispatches SERVER_TASK_TYPE_GET_LORA.
**POST /lora-adapters**
Dispatches SERVER_TASK_TYPE_SET_LORA. Sets adapter scales for subsequent requests.
LoRA scale management and cache invalidation logic is in tools/server/server-common.cpp91-154
Sources: tools/server/server.cpp195-196 tools/server/server-task.h27-28
The server maintains a pool of server_slot objects (one per --parallel value). Each slot holds a llama_context* and optional mtmd_context*.
**GET /slots**
Returns the state of all inference slots. Disabled when --no-slots is set.
**POST /slots/:id_slot**
Save, restore, or erase a slot's KV cache state. Requires --slot-save-path.
| action | Task Type | Description |
|---|---|---|
"save" | SERVER_TASK_TYPE_SLOT_SAVE | Write KV cache to file |
"restore" | SERVER_TASK_TYPE_SLOT_RESTORE | Load KV cache from file |
"erase" | SERVER_TASK_TYPE_SLOT_ERASE | Clear slot KV cache |
Request (save):
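The save and restore bodies presumably carry just a filename relative to --slot-save-path (an assumption to verify against the server version in use):

```python
# Save the slot's KV cache under --slot-save-path, then restore it later
# from the same file.
save_request = {"filename": "slot-0.bin"}
restore_request = {"filename": "slot-0.bin"}
```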
Sources: tools/server/server.cpp197-199 tools/server/server-task.h27-29
Slot State Machine (server_slot):
Sources: tools/server/server-context.cpp34-41 tools/server/server-context.cpp300-319
When "stream": true is set, the server responds with:
Content-Type: text/event-stream
Transfer-Encoding: chunked

Native format (/completion, /completions):
data: {"content": "chunk one", "stop": false}
data: {"content": " chunk two", "stop": false}
data: {"content": "", "stop": true, "timings": {...}, "tokens_predicted": 42}
OpenAI chat format (/chat/completions, /v1/chat/completions):
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant", "content": ""}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hello"}, "index": 0}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}], "usage": {...}}
data: [DONE]
When "timings_per_token": true is set in the request, each stream event includes a timings object with per-token latency data.
Sources: tools/server/server-task.h49-51 tools/server/server-context.cpp1-30
All errors return a consistent JSON envelope:
Error type reference (from error_type enum):
| type string | HTTP Code | Trigger |
|---|---|---|
invalid_request_error | 400 | Malformed body, unknown parameters |
exceed_context_size_error | 400 | Prompt exceeds context window |
authentication_error | 401 | Missing or invalid API key |
permission_error | 403 | Access denied |
not_found_error | 404 | Unknown resource |
not_supported_error | 501 | Feature not enabled or unavailable |
unavailable_error | 503 | No slots free, server still loading |
server_error | 500 | Unexpected internal exception |
Every route handler is wrapped by ex_wrapper (tools/server/server.cpp35-67), which catches C++ exceptions and converts them to the appropriate error response. std::invalid_argument maps to 400; all other exceptions map to 500.
Sources: tools/server/server-common.cpp16-57 tools/server/server.cpp35-67
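A client sketch that parses the envelope (its shape is assumed here to be {"error": {"code", "message", "type"}}) and flags only the 503-class error as retryable:

```python
import json

# unavailable_error (503) means "no free slot" or "still loading":
# the same request can simply be retried later.
RETRYABLE = {"unavailable_error"}

def classify_error(body):
    """Parse the JSON error envelope; return (type, should_retry)."""
    err = json.loads(body)["error"]
    return err["type"], err["type"] in RETRYABLE

body = '{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}'
print(classify_error(body))  # → ('unavailable_error', True)
```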
All generation endpoints accept these fields, mapping to common_params_sampling (common/common.h180-245):
| Parameter | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.80 | Softmax temperature (0 = greedy) |
dynatemp_range | float | 0.0 | Dynamic temperature range (0 = disabled) |
dynatemp_exponent | float | 1.0 | Dynamic temperature exponent |
top_k | int | 40 | Top-K cutoff (0 = all tokens) |
top_p | float | 0.95 | Top-P nucleus threshold |
min_p | float | 0.05 | Min-P threshold |
typical_p | float | 1.0 | Locally-typical sampling (1.0 = disabled) |
top_n_sigma | float | -1.0 | Top-N-Sigma (-1 = disabled) |
repeat_penalty | float | 1.0 | Repetition penalty (1.0 = disabled) |
repeat_last_n | int | 64 | Window for repetition penalty |
presence_penalty | float | 0.0 | OAI presence penalty |
frequency_penalty | float | 0.0 | OAI frequency penalty |
dry_multiplier | float | 0.0 | DRY repetition penalty multiplier |
dry_base | float | 1.75 | DRY base value |
dry_allowed_length | int | 2 | DRY allowed sequence length |
dry_penalty_last_n | int | -1 | DRY scan window (-1 = context size) |
xtc_probability | float | 0.0 | XTC sampling probability |
xtc_threshold | float | 0.1 | XTC threshold |
mirostat | int | 0 | Mirostat mode (0=off, 1=v1, 2=v2) |
mirostat_tau | float | 5.0 | Mirostat target entropy |
mirostat_eta | float | 0.1 | Mirostat learning rate |
seed | int | 0xFFFFFFFF | RNG seed (0xFFFFFFFF = random) |
n_probs | int | 0 | Top-N token probabilities to return |
min_keep | int | 0 | Minimum tokens each sampler must retain |
grammar | string | "" | GBNF grammar string |
grammar_lazy | bool | false | Grammar activates only on trigger tokens |
json_schema | object | null | JSON schema (auto-converted to GBNF grammar) |
ignore_eos | bool | false | Ignore EOS/EOG tokens |
logit_bias | array | [] | [{"token": id, "bias": float}] adjustments |
See 3.8 for the sampling chain architecture.
Sources: common/common.h180-245 tools/server/server-task.cpp28-89
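The override semantics can be pictured as a dictionary overlay: request-supplied fields win, everything else falls back to the defaults from the table above (only a subset of fields shown):

```python
# A subset of the server-side sampling defaults from the table above.
SAMPLING_DEFAULTS = {
    "temperature": 0.80,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.0,
    "mirostat": 0,
}

def resolve_sampling(request_body):
    """Overlay request-supplied sampling fields on the defaults."""
    params = dict(SAMPLING_DEFAULTS)
    params.update({k: v for k, v in request_body.items() if k in SAMPLING_DEFAULTS})
    return params

print(resolve_sampling({"temperature": 0.2, "top_k": 1})["temperature"])  # → 0.2
```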
Task Dispatch Flow:
Sources: tools/server/server-queue.h1-50 tools/server/server-context.cpp1-30
Task types (server_task_type in tools/server/server-task.h16-29):
| Task Type | Endpoint |
|---|---|
SERVER_TASK_TYPE_COMPLETION | /completion, /chat/completions, /v1/completions, etc. |
SERVER_TASK_TYPE_EMBEDDING | /embeddings, /v1/embeddings |
SERVER_TASK_TYPE_RERANK | /rerank, /v1/rerank |
SERVER_TASK_TYPE_INFILL | /infill |
SERVER_TASK_TYPE_CANCEL | Client disconnect |
SERVER_TASK_TYPE_METRICS | /metrics |
SERVER_TASK_TYPE_SLOT_SAVE | POST /slots/:id with "action": "save" |
SERVER_TASK_TYPE_SLOT_RESTORE | POST /slots/:id with "action": "restore" |
SERVER_TASK_TYPE_SLOT_ERASE | POST /slots/:id with "action": "erase" |
SERVER_TASK_TYPE_GET_LORA | GET /lora-adapters |
SERVER_TASK_TYPE_SET_LORA | POST /lora-adapters |
Response format selection (task_response_type in tools/server/server-task.h32-39):
| task_response_type | Format |
|---|---|
TASK_RESPONSE_TYPE_NONE | Native llama.cpp format |
TASK_RESPONSE_TYPE_OAI_CHAT | OpenAI chat completions |
TASK_RESPONSE_TYPE_OAI_CMPL | OpenAI text completions |
TASK_RESPONSE_TYPE_OAI_RESP | OpenAI Responses API |
TASK_RESPONSE_TYPE_OAI_EMBD | OpenAI embeddings |
TASK_RESPONSE_TYPE_ANTHROPIC | Anthropic Messages API |
Sources: tools/server/server-task.h16-39
When llama-server is started without --model, it operates as a router that manages child server processes. Two additional endpoints are registered:
| Endpoint | Description |
|---|---|
POST /models/load | Start a child server with a specified model |
POST /models/unload | Stop a running child server |
All other endpoints proxy to the appropriate child instance via models_routes->proxy_get and models_routes->proxy_post. See 6.4 for full documentation.
Sources: tools/server/server.cpp124-163