This page provides an orientation for contributors to the llama.cpp repository. It covers the overall structure of the development workflow: how the build system is organized, how tests are structured and run, how CI/CD pipelines are configured, how GGML changes are synchronized from the upstream library, and how the project is packaged for deployment.
For detailed treatment of any individual topic, see the corresponding child page.
The repository is organized as a layered set of CMake subprojects. The GGML tensor library lives under ggml/ and is synchronized from an external repository. The core libllama library, tools, examples, and tests are built on top of it.
Diagram: Top-level repository subproject relationships
Sources: .github/workflows/build.yml:1-60, ci/run.sh:1-50, scripts/sync-ggml-am.sh:1-30
The build system is CMake-based. The top-level CMakeLists.txt controls which subcomponents are compiled. The most commonly used CMake flags follow a naming convention: LLAMA_* flags control the llama.cpp layer; GGML_* flags control the GGML tensor library layer.
Key CMake boolean flags used across CI jobs:
| Flag | Default | Purpose |
|---|---|---|
| LLAMA_FATAL_WARNINGS | OFF | Treat compiler warnings as errors |
| LLAMA_BUILD_TESTS | ON | Build the test suite |
| LLAMA_BUILD_SERVER | ON | Build llama-server |
| LLAMA_BUILD_TOOLS | ON | Build CLI tools |
| LLAMA_BUILD_EXAMPLES | ON | Build example programs |
| GGML_CUDA | OFF | Enable NVIDIA CUDA backend |
| GGML_METAL | ON (Apple) | Enable Apple Metal backend |
| GGML_VULKAN | OFF | Enable Vulkan backend |
| GGML_HIP | OFF | Enable AMD HIP/ROCm backend |
| GGML_SYCL | OFF | Enable Intel SYCL backend |
| GGML_MUSA | OFF | Enable Moore Threads MUSA backend |
| GGML_RPC | OFF | Enable RPC distributed backend |
| GGML_BACKEND_DL | OFF | Enable dynamic loading of backends |
| GGML_CPU_ALL_VARIANTS | OFF | Build all CPU SIMD variants |
| GGML_NATIVE | ON | Use host CPU architecture flags |
| BUILD_SHARED_LIBS | ON | Build shared libraries |
A typical CPU-only build:

```shell
cmake -B build -DLLAMA_FATAL_WARNINGS=ON -DGGML_RPC=ON
cmake --build build --config Release -j $(nproc)
```

A CUDA build:

```shell
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89-real
cmake --build build
```
Sources: .github/workflows/build.yml:79-97, ci/run.sh:48-166, .github/workflows/release.yml:21-22
Tests live in tests/ and are registered with CTest. tests/CMakeLists.txt defines three helper functions for registering tests:

- llama_build(source) — compiles an executable but does not register it as a test
- llama_test(target) — registers an already-built target as a CTest test
- llama_build_and_test(source) — compiles and registers in one call; automatically links common and get-model.cpp

Tests are labeled so they can be run selectively with ctest -L <label>:
| Label | Purpose |
|---|---|
main | Standard unit and integration tests; always run in CI |
model | Tests requiring a downloaded model file |
python | Tests invoking Python scripts |
Diagram: Test registration flow in tests/CMakeLists.txt
Sources: tests/CMakeLists.txt:1-117
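As an illustration, registrations using these helpers might look like the following sketch (the NAME and ARGS keywords and the vocab file path are assumptions based on typical CTest wrapper functions, not copied from the actual file):

```cmake
# Compile and register in one call; links common and get-model.cpp automatically.
llama_build_and_test(test-sampling.cpp)

# Build the tokenizer test once, then register one CTest entry per vocab file.
llama_build(test-tokenizer-0.cpp)
llama_test(test-tokenizer-0
    NAME test-tokenizer-0-llama-spm
    ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
```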
| Test target | Source file | Notes |
|---|---|---|
| test-tokenizer-0 | test-tokenizer-0.cpp | Built once; run with many vocab .gguf files |
| test-sampling | test-sampling.cpp | Sampling algorithm correctness |
| test-grammar-parser | test-grammar-parser.cpp | GBNF grammar parsing |
| test-chat | test-chat.cpp | Chat template rendering |
| test-json-schema-to-grammar | test-json-schema-to-grammar.cpp | JSON schema → GBNF |
| test-backend-ops | test-backend-ops.cpp | Cross-backend tensor op validation |
| test-quantize-fns | test-quantize-fns.cpp | Quantization function correctness |
| test-thread-safety | test-thread-safety.cpp | Parallel inference correctness; requires model |
| test-alloc | test-alloc.cpp | GGML allocator unit tests |
| test-gguf | test-gguf.cpp | GGUF file parsing correctness |
| test-opt | test-opt.cpp | Optimizer correctness |
| test-mtmd-c-api | test-mtmd-c-api.c | Multimodal C API surface test |
Tests that use internal (non-exported) symbols are disabled on Windows when building shared libraries (BUILD_SHARED_LIBS=ON). Tests that depend on backend-specific symbols are skipped when GGML_BACKEND_DL=ON.
Running the test suite locally:
```shell
cd build
ctest -L main --verbose --timeout 900
```
Model-dependent tests require a model file and are run with the model label or via the LLAMACPP_TEST_MODELFILE environment variable.
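For example, a model-dependent run might look like this (the model path is illustrative):

```shell
LLAMACPP_TEST_MODELFILE=./models/ggml-model.gguf ctest -L model --verbose
```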
Sources: tests/CMakeLists.txt:119-267, ci/run.sh:210-340
The primary workflow file is .github/workflows/build.yml. It triggers on:

- push to master (when source or build files change)
- pull_request (opened, synchronized, or reopened)
- workflow_dispatch

Concurrent runs for the same branch or PR are cancelled automatically via the workflow's concurrency group setting.
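The cancellation behaviour follows the standard GitHub Actions concurrency pattern; a sketch of its typical shape (the exact group expression in build.yml may differ):

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
  cancel-in-progress: true
```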
Diagram: GitHub Actions CI matrix (build.yml)
Sources: .github/workflows/build.yml:61-1200
Each job follows the same pattern: clone → ccache restore → install dependencies → cmake configure → cmake --build → ctest -L main.
The ccache GitHub Action (ggml-org/ccache-action) is used pervasively to speed up builds. Caches are keyed per job and are saved only on pushes to master.
.github/workflows/release.yml produces binary artifacts for every push to master that changes source files. Artifacts are named using the pattern llama-<tag>-bin-<platform>.tar.gz or .zip.
The tag name is computed by the .github/actions/get-tag-name composite action:
- on master: b<commit_count> (e.g. b4200)
- on other branches: <branch-name>-b<commit_count>-<short_hash>

Release artifacts are built for:
| Platform | Backend |
|---|---|
| macOS arm64 | Metal |
| macOS x64 | CPU |
| Ubuntu x64 | CPU (with GGML_CPU_ALL_VARIANTS) |
| Ubuntu x64 | Vulkan |
| Ubuntu s390x | CPU |
| Ubuntu x64 | ROCm (HIP) |
| Windows x64/arm64 | CPU |
| Windows x64 | Vulkan, CUDA 12.4, CUDA 13.1 |
| Windows x64 | SYCL (Intel) |
| Windows arm64 | OpenCL/Adreno |
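The tag-naming scheme described above can be sketched in Python (illustrative only; the real computation lives in the get-tag-name composite action and uses git directly):

```python
def tag_name(branch: str, commit_count: int, short_hash: str) -> str:
    """Mimic the release tag scheme: b<count> on master,
    <branch>-b<count>-<hash> everywhere else."""
    if branch == "master":
        return f"b{commit_count}"
    return f"{branch}-b{commit_count}-{short_hash}"

print(tag_name("master", 4200, "deadbee"))  # prints "b4200"
```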
Sources: .github/workflows/release.yml:1-614, .github/actions/get-tag-name/action.yml:1-23
ci/run.sh is a shell script for heavy-duty CI tasks run on self-hosted runners, covering hardware configurations not available on GitHub-hosted runners. It can also be run locally.
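A local invocation, following the usage described in ci/README.md (the output and mount directories here are illustrative):

```shell
mkdir -p tmp

# CPU-only run
bash ./ci/run.sh ./tmp/results ./tmp/mnt

# same run with the CUDA backend enabled
GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
```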
Supported build environment variables:
| Variable | Backend enabled |
|---|---|
| GG_BUILD_CUDA=1 | NVIDIA CUDA |
| GG_BUILD_ROCM=1 | AMD ROCm/HIP |
| GG_BUILD_SYCL=1 | Intel SYCL |
| GG_BUILD_VULKAN=1 | Vulkan |
| GG_BUILD_WEBGPU=1 | WebGPU/Dawn |
| GG_BUILD_MUSA=1 | Moore Threads MUSA |
| GG_BUILD_KLEIDIAI=1 | ARM KleidiAI |
| GG_BUILD_METAL=1 | Apple Metal |
The script runs the following CI functions in sequence (when applicable):
- gg_run_ctest_debug — debug build + ctest -L main
- gg_run_ctest_release — release build + ctest -L main,python
- gg_run_test_backend_ops_cpu (high-perf only) — runs test-backend-ops -b CPU
- gg_run_embd_bge_small — downloads and converts the BGE embedding model, runs inference
- gg_run_rerank_tiny — downloads and tests the Jina reranker
- gg_run_test_scripts — runs bash test scripts in tools/gguf-split/ and tools/quantize/
- gg_run_qwen3_0_6b — full pipeline: download → convert → quantize (10 types) → perplexity → imatrix → save/load state

Each function has a paired gg_sum_* function that writes a Markdown summary to $OUT/README.md.
Sources: ci/run.sh:48-709, ci/README.md:1-34
The GGML tensor library at ggml/ originates from a separate upstream repository (https://github.com/ggml-org/ggml). Changes flow into llama.cpp via one of two mechanisms:
Diagram: GGML sync workflow
scripts/sync-ggml-am.sh reads the last synced commit from scripts/sync-ggml.last, generates patches for all new commits using git format-patch, rewrites file paths to match the llama.cpp layout (e.g. src/ggml* → ggml/src/ggml*), rewrites PR number references (e.g. (#1234) → (ggml/1234)), and applies the patches with git am.
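The two rewrites can be sketched in Python (the script itself performs them with sed during patch generation; the function names here are hypothetical):

```python
import re

def rewrite_pr_reference(msg: str) -> str:
    # Rewrite upstream PR references: "(#1234)" -> "(ggml/1234)"
    return re.sub(r"\(#(\d+)\)", r"(ggml/\1)", msg)

def rewrite_path(path: str) -> str:
    # Map the upstream layout onto the llama.cpp layout,
    # e.g. "src/ggml-alloc.c" -> "ggml/src/ggml-alloc.c"
    if path.startswith(("src/ggml", "include/ggml", "include/gguf")):
        return "ggml/" + path
    return path

print(rewrite_pr_reference("ggml : fix rope (#1234)"))  # prints "ggml : fix rope (ggml/1234)"
```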
Files synced from upstream:
| Upstream path | llama.cpp path |
|---|---|
| CMakeLists.txt | ggml/CMakeLists.txt |
| src/CMakeLists.txt | ggml/src/CMakeLists.txt |
| cmake/*.cmake | ggml/cmake/ |
| cmake/ggml-config.cmake.in | ggml/cmake/ggml-config.cmake.in |
| src/ggml* | ggml/src/ggml* |
| include/ggml*.h | ggml/include/ggml*.h |
| include/gguf*.h | ggml/include/gguf*.h |
| tests/test-opt.cpp | tests/test-opt.cpp |
| tests/test-quantize-fns.cpp | tests/test-quantize-fns.cpp |
| tests/test-quantize-perf.cpp | tests/test-quantize-perf.cpp |
| tests/test-backend-ops.cpp | tests/test-backend-ops.cpp |
After a successful sync, scripts/sync-ggml.last is updated to the HEAD commit SHA of the upstream repo.
scripts/sync-ggml.sh performs a simple cp -rpv of files from ../ggml/ into ./ggml/. It is a simpler alternative for when patch application is not needed (e.g. bulk updates).
ggml/cmake/ggml-config.cmake.in is a CMake package config template that lets downstream CMake projects consume GGML as an installed library via find_package(ggml). It handles dependency discovery for all enabled backends (CUDA, Metal, Vulkan, HIP, SYCL, OpenCL, BLAS, OpenMP) and creates the ggml::ggml, ggml::ggml-base, and ggml::all imported targets.
Sources: scripts/sync-ggml-am.sh:1-158, scripts/sync-ggml.sh:1-21, scripts/sync-ggml.last:1-2, ggml/cmake/ggml-config.cmake.in:1-192
Dockerfiles for all supported backends live in .devops/. They all follow a multi-stage pattern: a build stage compiles the project, and then full, light, and server stages produce increasingly minimal runtime images.
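The multi-stage pattern can be sketched as follows (stage structure taken from the description above; base image, paths, and build flags are illustrative):

```dockerfile
FROM ubuntu:22.04 AS build
# install toolchain, then compile the project
COPY . /app
WORKDIR /app
RUN cmake -B build -DGGML_NATIVE=OFF && cmake --build build --config Release

FROM ubuntu:22.04 AS base
# minimal runtime dependencies only

FROM base AS light
COPY --from=build /app/build/bin/llama-cli /llama-cli
ENTRYPOINT [ "/llama-cli" ]

FROM base AS server
COPY --from=build /app/build/bin/llama-server /llama-server
ENTRYPOINT [ "/llama-server" ]
```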
| Dockerfile | Backend | Base image |
|---|---|---|
| .devops/cpu.Dockerfile | CPU | ubuntu:22.04 |
| .devops/cuda.Dockerfile | CUDA | nvidia/cuda:&lt;version&gt;-devel-ubuntu22.04 |
| .devops/rocm.Dockerfile | ROCm/HIP | rocm/dev-ubuntu-&lt;version&gt; |
| .devops/vulkan.Dockerfile | Vulkan | ubuntu:26.04 |
| .devops/musa.Dockerfile | MUSA | mthreads/musa:&lt;version&gt; |
| .devops/intel.Dockerfile | SYCL | intel/deep-learning-essentials:&lt;version&gt; |
| .devops/cann.Dockerfile | CANN (Ascend) | quay.io/ascend/cann:&lt;version&gt; |
| .devops/s390x.Dockerfile | CPU (s390x) | gcc:&lt;version&gt; |
| .devops/llama-cli-cann.Dockerfile | CANN (minimal) | ascendai/cann:&lt;version&gt; |
Diagram: Docker image target structure
The .devops/tools.sh script serves as the ENTRYPOINT for full images. It dispatches to the appropriate binary based on its first argument:
| Argument | Invokes |
|---|---|
| --run / -r | llama-cli |
| --run-legacy / -l | llama-completion |
| --convert / -c | convert_hf_to_gguf.py |
| --quantize / -q | llama-quantize |
| --bench / -b | llama-bench |
| --perplexity / -p | llama-perplexity |
| --server / -s | llama-server |
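The dispatch logic amounts to a case statement over the first argument; a simplified, self-contained sketch (this is the assumed shape, not the verbatim script, which execs the binaries rather than echoing their names):

```shell
dispatch() {
    # Map the CLI flag to the binary that would be invoked.
    case "$1" in
        --run|-r)        echo "llama-cli" ;;
        --quantize|-q)   echo "llama-quantize" ;;
        --server|-s)     echo "llama-server" ;;
        *)               echo "unknown" ;;
    esac
}

dispatch --server   # prints "llama-server"
```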
Pre-built Docker images are published to ghcr.io/ggml-org/llama.cpp with tags such as full, light, server, full-cuda, server-rocm, full-vulkan, etc.
To build a CUDA image locally:
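A sketch following the pattern in docs/docker.md (the image tag is illustrative; --target selects the full, light, or server stage):

```shell
docker build -t local/llama.cpp:full-cuda \
    --target full \
    -f .devops/cuda.Dockerfile .
```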
Sources: .devops/cpu.Dockerfile:1-89, .devops/cuda.Dockerfile:1-95, .devops/rocm.Dockerfile:1-114, .devops/vulkan.Dockerfile:1-91, .devops/musa.Dockerfile:1-102, .devops/intel.Dockerfile:1-95, .devops/cann.Dockerfile:1-131, .devops/tools.sh:1-53, docs/docker.md:1-134
The project is distributed under the MIT License, covering all original code. The copyright holder is listed as "The ggml authors."
Sources: LICENSE:1-22
The AUTHORS file is auto-generated by scripts/gen-authors.sh. The script runs git log over the master branch and deduplicates contributor names and email addresses. Contributors are not expected to edit this file manually.
Sources: scripts/gen-authors.sh:1-10