This page documents how llama.cpp downloads and manages model files from remote repositories. It covers the -hf CLI flag, the common/download.cpp module, HuggingFace integration, ETag-based caching, resume support, parallel GGUF split downloads, and the MODEL_ENDPOINT override mechanism.
For documentation on the GGUF binary file format itself, see 7.1. For quantization of locally-held models, see 7.3. For the llama-server router mode's /models/load endpoint, see 6.4.
llama.cpp tools can run against a locally-held .gguf file or download one automatically from a remote model host. The -hf flag on any tool triggers the download subsystem, which handles caching, resumable transfers, multi-part GGUF splits, and endpoint overrides.
Sources: README.md44-54
By default the download subsystem resolves models from HuggingFace (https://huggingface.co). The MODEL_ENDPOINT environment variable overrides this, allowing alternative model registries (e.g. ModelScope) to be used instead.
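The override is a plain environment variable, so it composes with any tool that uses the -hf flow. The snippet below is a sketch: the ModelScope URL and the model name are illustrative, and the fallback expression mirrors the documented default rather than quoting the actual implementation.

```shell
# Typical invocation against an alternative registry (illustrative;
# requires a built llama-cli and a reachable registry):
#   MODEL_ENDPOINT=https://modelscope.cn/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M

# The endpoint fallback the subsystem applies is equivalent to:
endpoint="${MODEL_ENDPOINT:-https://huggingface.co}"
echo "resolved endpoint: $endpoint"
```

Because the variable is read at startup, no rebuild or config file is needed to switch registries.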
A HuggingFace token can be supplied to access private or gated repositories via a bearer token header. When OpenSSL support is absent at build time, HTTPS connections will fail with a clear error message directing the user to rebuild with -DLLAMA_BUILD_BORINGSSL=ON or -DLLAMA_OPENSSL=ON.
Sources: README.md300-318 common/http.h60-69
The download subsystem lives in common/download.cpp and common/download.h, compiled into the common static library alongside common.cpp, arg.cpp, and others.
Download module — principal flow
Sources: common/download.cpp1-490 common/CMakeLists.txt64-66
| Symbol | File | Role |
|---|---|---|
| common_params_model | common/common.h | Holds repo name, file path, tag, and bearer token for a model to download |
| common_remote_params | common/download.h | Parameters for a generic remote GET (timeout, max_size, headers) |
| common_header_list | common/download.h | std::vector<std::pair<std::string,std::string>> of custom HTTP headers |
| ProgressBar | common/download.cpp | Thread-safe multi-line terminal progress display (TTY-aware) |
| common_http_url | common/http.h | Parsed URL parts: scheme, user, password, host, path |
Public API summary
| Function | Signature (simplified) | Purpose |
|---|---|---|
| common_download_split_repo_tag | (string) → {repo, tag} | Parses owner/repo[:quant] into components |
| common_download_model | (common_params_model, token, offline, headers) → bool | High-level: fetches manifest, downloads all required files |
| common_download_file_single | (url, path, token, offline, headers) → int | Downloads one file; returns HTTP status or -1 |
| common_remote_get_content | (url, common_remote_params) → {status, vector<char>} | Generic in-memory HTTP GET |
Sources: common/download.cpp133-141 common/download.cpp439-490
The -hf argument accepts the format <owner>/<repo>[:quant], where the optional :quant suffix names a quantization variant. common_download_split_repo_tag validates and splits this string:
"ggml-org/gemma-3-1b-it-GGUF:Q4_K_M"repo = "ggml-org/gemma-3-1b-it-GGUF", tag = "Q4_K_M""latest"Repo names must match ^[A-Za-z0-9_.\-]+\/[A-Za-z0-9_.\-]+$ — anything else throws std::invalid_argument.
Sources: common/download.cpp54-141
common_download_file_single_online downloads one file from a remote URL to a local path:

1. Issue a HEAD request to read ETag, Content-Length, and Accept-Ranges.
2. Compare the returned ETag against the cached .etag file stored alongside the local path.
3. On a match, treat the result as 304 ("Not Modified — use cache") and skip the transfer.
4. Otherwise download to <path>.downloadInProgress as a staging path.
5. If the server advertises Accept-Ranges: bytes, resume incomplete transfers using a Range: bytes=N- header.
6. On completion, move the staged file into place and write the new ETag to <path>.etag.

ETag caching
Sources: common/download.cpp290-403
common_download_file_multiple takes a std::vector<std::pair<url, path>> and launches one std::async future per file, then waits for all to complete. This is used when a model consists of multiple GGUF split shards.
Sources: common/download.cpp457-488
Large models are often distributed as multiple sharded .gguf files (e.g. model-00001-of-00004.gguf). The download subsystem handles this via a manifest file:
- The manifest path is computed by get_manifest_path(repo, tag).
- The filename has the form manifest={owner}={repo}={tag}.json and is stored in the llama.cpp cache directory (via fs_get_cache_file).
- Slashes (/) in the repo name are replaced with = so the filename remains valid on Windows.
- common_download_model fetches the manifest (if absent or stale), reads the list of shard files, then calls common_download_file_multiple to download all shards in parallel.
- gguf.h is included in download.cpp to read GGUF metadata from split files and verify their integrity.

Manifest cache naming
| Input | Cached manifest path |
|---|---|
| ggml-org/Llama-3.2-1B-GGUF, tag Q4_K_M | manifest=ggml-org=Llama-3.2-1B-GGUF=Q4_K_M.json |
| owner/model, tag latest | manifest=owner=model=latest.json |
Sources: common/download.cpp59-67 common/download.cpp1-10
common/http.h provides a thin wrapper over the vendored cpp-httplib library.
| Function | Purpose |
|---|---|
| common_http_parse_url(url) | Returns common_http_url with scheme, user, password, host, path |
| common_http_client(url) | Returns {httplib::Client, common_http_url}; throws if HTTPS is requested without TLS support |
| common_http_show_masked_url(parts) | Redacts credentials in URLs for log output (****:****@) |
HTTPS support is conditional on a TLS library being found at build time. The CMake option LLAMA_OPENSSL=ON (default) enables system OpenSSL detection. LLAMA_BUILD_BORINGSSL=ON and LLAMA_BUILD_LIBRESSL=ON fetch and build their respective libraries as subprojects. If none is found, HTTP-only builds will refuse to open https:// URLs at runtime.
Sources: common/http.h1-85 vendor/cpp-httplib/CMakeLists.txt36-183
Every public download function accepts an offline boolean. When true:

- Cached files are used without network validation, as if the server had returned 304.
- If a required file is not present in the cache, the function fails (returning -1 rather than attempting a network request).

This allows llama.cpp tools to operate in air-gapped environments as long as models were previously downloaded.
Sources: common/download.cpp439-455
llama_params_fit

llama_params_fit adjusts model parameters (such as the number of GPU layers or the context size) to fit within available hardware memory. It is called after a model is loaded or selected, prior to inference-context creation. It queries device memory capacities through the backend system and scales down parameters that would exceed available VRAM or RAM, enabling automatic model–hardware matching without manual tuning.
This is typically invoked by the argument-parsing layer (common/arg.cpp) when the -hf flow selects a model automatically, so that the selected quantization variant is used at a configuration that will actually fit.
Sources: common/download.cpp490 common/CMakeLists.txt44-103
Download subsystem — code entity map
Sources: common/download.cpp1-22 common/CMakeLists.txt44-103