This page documents how llama.cpp downloads and manages model files from remote repositories. It covers the -hf CLI flag, the common/download.cpp module, HuggingFace integration, ETag-based caching, resume support, parallel GGUF split downloads, and the MODEL_ENDPOINT override mechanism.
For documentation on the GGUF binary file format itself, see 7.1. For quantization of locally-held models, see 7.3. For the llama-server router mode's /models/load endpoint, see 6.4.
llama.cpp tools can run against a locally-held .gguf file or download one automatically from a remote model host. The -hf flag on any tool triggers the download subsystem, which handles caching, resumable transfers, multi-part GGUF splits, and endpoint overrides.
Sources: README.md44-54
By default the download subsystem resolves models from HuggingFace (https://huggingface.co). The MODEL_ENDPOINT environment variable overrides this, allowing alternative model registries (e.g. ModelScope) to be used instead.
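The override is a plain environment variable, so it composes with any tool that uses the -hf flow. The snippet below is a sketch: the ModelScope URL and the model name are illustrative, and the fallback expression mirrors the documented default rather than quoting the actual implementation.

```shell
# Typical invocation against an alternative registry (illustrative;
# requires a built llama-cli and a reachable registry):
#   MODEL_ENDPOINT=https://modelscope.cn/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M

# The endpoint fallback the subsystem applies is equivalent to:
endpoint="${MODEL_ENDPOINT:-https://huggingface.co}"
echo "resolved endpoint: $endpoint"
```

Because the variable is read at startup, no rebuild or config file is needed to switch registries.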
A HuggingFace token can be supplied to access private or gated repositories via a bearer token header. When OpenSSL support is absent at build time, HTTPS connections will fail with a clear error message directing the user to rebuild with -DLLAMA_BUILD_BORINGSSL=ON or -DLLAMA_OPENSSL=ON.
Sources: README.md300-318 common/http.h60-69
The download subsystem lives in common/download.cpp and common/download.h, compiled into the common static library alongside common.cpp, arg.cpp, and others.
Download module — principal flow
Sources: common/download.cpp1-490 common/CMakeLists.txt64-66
| Symbol | File | Role |
|---|---|---|
| common_params_model | common/common.h | Holds repo name, file path, tag, and bearer token for a model to download |
| common_remote_params | common/download.h | Parameters for a generic remote GET (timeout, max_size, headers) |
| common_header_list | common/download.h | std::vector<std::pair<std::string,std::string>> of custom HTTP headers |
| ProgressBar | common/download.cpp | Thread-safe multi-line terminal progress display (TTY-aware) |
| common_http_url | common/http.h | Parsed URL parts: scheme, user, password, host, path |
Public API summary
| Function | Signature (simplified) | Purpose |
|---|---|---|
| common_download_split_repo_tag | (string) → {repo, tag} | Parses owner/repo[:quant] into components |
| common_download_model | (common_params_model, token, offline, headers) → bool | High-level: fetches manifest, downloads all required files |
| common_download_file_single | (url, path, token, offline, headers) → int | Downloads one file; returns HTTP status or -1 |
| common_remote_get_content | (url, common_remote_params) → {status, vector<char>} | Generic in-memory HTTP GET |
Sources: common/download.cpp133-141 common/download.cpp439-490
The -hf argument accepts the format <owner>/<repo>[:quant], where the optional :quant suffix names a quantization variant. common_download_split_repo_tag validates and splits this string:
"ggml-org/gemma-3-1b-it-GGUF:Q4_K_M"repo = "ggml-org/gemma-3-1b-it-GGUF", tag = "Q4_K_M""latest"Repo names must match ^[A-Za-z0-9_.\-]+\/[A-Za-z0-9_.\-]+$ — anything else throws std::invalid_argument.
Sources: common/download.cpp54-141
common_download_file_single_online downloads one file from a remote URL to a local path:

1. Issue a HEAD request to read ETag, Content-Length, and Accept-Ranges.
2. Compare the returned ETag against the cached .etag file stored alongside the local path.
3. On a match, treat the result as 304 ("Not Modified — use cache") and skip the transfer.
4. Otherwise download to <path>.downloadInProgress as a staging path.
5. If the server advertises Accept-Ranges: bytes, resume incomplete transfers using a Range: bytes=N- header.
6. On completion, move the staged file into place and write the new ETag to <path>.etag.

ETag caching
Sources: common/download.cpp290-403
common_download_file_multiple takes a std::vector<std::pair<url, path>> and launches one std::async future per file, then waits for all to complete. This is used when a model consists of multiple GGUF split shards.
Sources: common/download.cpp457-488
Large models are often distributed as multiple sharded .gguf files (e.g. model-00001-of-00004.gguf). The download subsystem handles this via a manifest file:
- The manifest path is computed by get_manifest_path(repo, tag).
- The filename has the form manifest={owner}={repo}={tag}.json and is stored in the llama.cpp cache directory (via fs_get_cache_file).
- Slashes (/) in the repo name are replaced with = so the filename remains valid on Windows.
- common_download_model fetches the manifest (if absent or stale), reads the list of shard files, then calls common_download_file_multiple to download all shards in parallel.
- gguf.h is included in download.cpp to read GGUF metadata from split files and verify their integrity.

Manifest cache naming
| Input | Cached manifest path |
|---|---|
| ggml-org/Llama-3.2-1B-GGUF, tag Q4_K_M | manifest=ggml-org=Llama-3.2-1B-GGUF=Q4_K_M.json |
| owner/model, tag latest | manifest=owner=model=latest.json |
Sources: common/download.cpp59-67 common/download.cpp1-10
common/http.h provides a thin wrapper over the vendored cpp-httplib library.
| Function | Purpose |
|---|---|
| common_http_parse_url(url) | Returns common_http_url with scheme, user, password, host, path |
| common_http_client(url) | Returns {httplib::Client, common_http_url}; throws if HTTPS is requested without TLS support |
| common_http_show_masked_url(parts) | Redacts credentials in URLs for log output (****:****@) |
HTTPS support is conditional on a TLS library being found at build time. The CMake option LLAMA_OPENSSL=ON (default) enables system OpenSSL detection. LLAMA_BUILD_BORINGSSL=ON and LLAMA_BUILD_LIBRESSL=ON fetch and build their respective libraries as subprojects. If none is found, HTTP-only builds will refuse to open https:// URLs at runtime.
Sources: common/http.h1-85 vendor/cpp-httplib/CMakeLists.txt36-183
Every public download function accepts an offline boolean. When true:

- Cached files are used without network validation, as if the server had returned 304.
- If a required file is not present in the cache, the function fails (returning -1 rather than attempting a network request).

This allows llama.cpp tools to operate in air-gapped environments as long as models were previously downloaded.
Sources: common/download.cpp439-455
llama_params_fit

llama_params_fit adjusts model parameters (such as the number of GPU layers or the context size) to fit within available hardware memory. It is called after a model is loaded or selected, prior to inference-context creation. It queries device memory capacities through the backend system and scales down parameters that would exceed available VRAM or RAM, enabling automatic model–hardware matching without manual tuning.
This is typically invoked by the argument-parsing layer (common/arg.cpp) when the -hf flow selects a model automatically, so that the selected quantization variant is used at a configuration that will actually fit.
Sources: common/download.cpp490 common/CMakeLists.txt44-103
Download subsystem — code entity map
Sources: common/download.cpp1-22 common/CMakeLists.txt44-103