This page orients new users to llama.cpp: what it provides, how to install it, how to obtain a model, and how to run inference for the first time. It serves as a navigation hub into the more detailed child pages.
For in-depth build instructions, see Installation and Building. For full coverage of command-line flags and parameters, see Configuration and Parameters. For a technical overview of the library internals, see Core Library Architecture.
llama.cpp is a C/C++ library for running large language model (LLM) inference locally, with no mandatory external dependencies. Its primary goals are minimal setup and broad hardware coverage.
Key capabilities:
| Capability | Description |
|---|---|
| CPU inference | Optimized via SIMD (AVX, AVX2, AVX512, NEON, SVE, RVV) |
| GPU acceleration | CUDA (NVIDIA), Metal (Apple), Vulkan, HIP (AMD), SYCL (Intel), CANN (Ascend), OpenCL |
| Quantization | 1.5-bit through 8-bit integer quantization for reduced memory use |
| Hybrid inference | CPU+GPU offloading when the model exceeds VRAM |
| HTTP API | OpenAI-compatible REST server via llama-server |
| Model format | Reads GGUF files; provides conversion scripts from HuggingFace formats |
The project is also the primary development environment for the `ggml` tensor library that underlies it.
Sources: README.md:58-70
High-level flow from user command to token output:
Sources: README.md:56-70, CMakeLists.txt:192-215
There are four supported installation paths:
| Method | When to use |
|---|---|
| Package manager (brew, nix, winget) | Quickest for macOS/Linux/Windows desktop use |
| Docker | Reproducible environment; GPU support via Docker flags |
| Pre-built binaries | No compiler required; from the GitHub releases page |
| Build from source | Required for custom backends (CUDA, Metal, Vulkan, etc.) |
The old Makefile-based build has been removed. CMake is the only supported build system (Makefile:6-9).
For full source build instructions including CMake flags for each backend, see Installation and Building.
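A minimal CPU-only build from source looks like this (backend-specific flags are covered on the child page):

```shell
# Configure and build with CMake; binaries land in build/bin
cmake -B build
cmake --build build --config Release -j 8
```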
llama.cpp requires models in GGUF format. Models in HuggingFace format (.safetensors, .bin) must be converted first.
The `-hf` flag accepts a HuggingFace repository identifier and downloads the model automatically.
By default this downloads from HuggingFace. Set the environment variable `MODEL_ENDPOINT` to switch to an alternative host (e.g., `https://www.modelscope.cn/`).
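For example (the repository name below is illustrative; any GGUF repository works):

```shell
# Download (and cache) a GGUF model from HuggingFace, then start chatting
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Same command, but downloading from ModelScope instead
MODEL_ENDPOINT=https://www.modelscope.cn/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```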
Use `convert_hf_to_gguf.py` (installed to `$prefix/bin` by `cmake --install`) to convert HuggingFace model directories. See Model Conversion from HuggingFace for details.
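A typical conversion run might look like this (the paths are placeholders; `--outtype` also accepts `f16`, `bf16`, `f32`, and `auto`):

```shell
# Convert a HuggingFace model directory to GGUF, storing weights as Q8_0
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-q8_0.gguf --outtype q8_0
```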
The HuggingFace platform provides browser-based GGUF conversion and quantization at huggingface.co/spaces/ggml-org/gguf-my-repo.
Sources: README.md:293-319
Tool and file mapping:
Sources: CMakeLists.txt:192-215, examples/CMakeLists.txt:1-45
**`llama-cli`**: Models that carry a built-in chat template activate conversation mode automatically. Use `-cnv` to force it and `--chat-template NAME` to override the template.
**`llama-server`**: The server exposes an OpenAI-compatible REST API. Key endpoints:

- `POST /v1/chat/completions`
- `POST /completion`
- `POST /embeddings`

See llama-server HTTP API for the full endpoint reference.
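Assuming the server is running on its default port 8080, a chat request can be issued with curl:

```shell
# Send a chat completion request to a local llama-server instance
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```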
**`llama-bench`**: Reports prompt processing throughput (pp) and token generation speed (tg) in tokens per second.
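For instance (the model path is a placeholder; `-ngl` offloads layers to the GPU if a GPU backend was built):

```shell
# Benchmark prompt processing (pp) and token generation (tg) throughput
llama-bench -m model.gguf -ngl 99
```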
**`llama-simple`**: `examples/simple` contains a minimal application that directly calls the libllama C API. It is the recommended starting point for developers embedding llama.cpp in their own programs.
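After a source build, the example can be run directly (the model path and prompt are placeholders):

```shell
# Generate 32 tokens from a prompt using the minimal example binary
./build/bin/llama-simple -m model.gguf -n 32 "Hello my name is"
```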
Sources: README.md:321-502
How a chat request moves through the system:
Sources: README.md:56-70
The most commonly adjusted parameters across all tools:
| Parameter | Flag | Default | Effect |
|---|---|---|---|
| Model file | `-m <path>` | — | Path to `.gguf` file |
| HuggingFace model | `-hf <user>/<repo>` | — | Download and use from HuggingFace |
| Context size | `-c <n>` | 4096 | Max tokens in the KV cache |
| GPU layers | `-ngl <n>` | 0 | Layers offloaded to GPU |
| Threads | `-t <n>` | system | CPU threads for generation |
| Parallel slots | `-np <n>` | 1 | Concurrent decode sequences |
| Batch size | `-b <n>` | 2048 | Prompt processing batch size |
For the complete parameter reference including sampling parameters and library API equivalents, see Configuration and Parameters.
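Putting several of these together, a typical interactive invocation might be (model path is a placeholder):

```shell
# 8192-token context, offload up to 99 layers to the GPU, 8 CPU threads
llama-cli -m model.gguf -c 8192 -ngl 99 -t 8
```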
| Backend | Target |
|---|---|
| CPU (default) | All x86, ARM, RISC-V platforms |
| `GGML_METAL=ON` | Apple Silicon (M-series) |
| `GGML_CUDA=ON` | NVIDIA GPUs |
| `GGML_HIP=ON` | AMD GPUs |
| `GGML_VULKAN=ON` | Cross-platform GPU |
| `GGML_SYCL=ON` | Intel GPUs |
| `GGML_CANN=ON` | Huawei Ascend NPU |
| `GGML_OPENCL=ON` | Adreno (Qualcomm) GPUs |
Each backend is selected at CMake configure time. See Installation and Building for the exact flags and GPU and Accelerator Backends for implementation details.
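For example, a CUDA-enabled build is configured like this (other backends substitute their respective `GGML_*` flag):

```shell
# Configure with the CUDA backend enabled, then build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```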
Sources: README.md:272-291, CMakeLists.txt:156-166
| Goal | Page |
|---|---|
| Build from source with a specific backend | Installation and Building |
| Learn all CLI flags and usage patterns | Basic Usage and Examples |
| Tune context size, sampling, threading | Configuration and Parameters |
| Understand the library C API | libllama Public API |
| Convert or quantize a model | Model Pipeline |
| Use the HTTP server API | llama-server HTTP API |
| Use a language binding | Language Bindings and Integration |