This page orients new users to llama.cpp: what it provides, how to install it, how to obtain a model, and how to run inference for the first time. It serves as a navigation hub into the more detailed child pages.
For in-depth build instructions, see Installation and Building. For full coverage of command-line flags and parameters, see Configuration and Parameters. For a technical overview of the library internals, see Core Library Architecture.
llama.cpp is a C/C++ library for running large language model (LLM) inference locally, with no mandatory external dependencies. Its primary goals are minimal setup and broad hardware coverage.
Key capabilities:
| Capability | Description |
|---|---|
| CPU inference | Optimized via SIMD (AVX, AVX2, AVX512, NEON, SVE, RVV) |
| GPU acceleration | CUDA (NVIDIA), Metal (Apple), Vulkan, HIP (AMD), SYCL (Intel), CANN (Ascend), OpenCL |
| Quantization | 1.5-bit through 8-bit integer quantization for reduced memory use |
| Hybrid inference | CPU+GPU offloading when the model exceeds VRAM |
| HTTP API | OpenAI-compatible REST server via llama-server |
| Model format | Reads GGUF files; provides conversion scripts from HuggingFace formats |
The project is also the primary development environment for the `ggml` tensor library that underlies it.
Sources: README.md:58-70
High-level flow from user command to token output:
Sources: README.md:56-70, CMakeLists.txt:192-215
There are four supported installation paths:
| Method | When to use |
|---|---|
| Package manager (brew, nix, winget) | Quickest for macOS/Linux/Windows desktop use |
| Docker | Reproducible environment; GPU support via Docker flags |
| Pre-built binaries | No compiler required; from the GitHub releases page |
| Build from source | Required for custom backends (CUDA, Metal, Vulkan, etc.) |
The old Makefile-based build has been removed. CMake is the only supported build system (Makefile:6-9).
For full source build instructions including CMake flags for each backend, see Installation and Building.
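A minimal CPU-only build from source looks like this (backend-specific flags are covered on the child page):

```shell
# Configure and build with CMake; binaries land in build/bin
cmake -B build
cmake --build build --config Release -j 8
```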
llama.cpp requires models in GGUF format. Models in HuggingFace format (.safetensors, .bin) must be converted first.
The `-hf` flag accepts a HuggingFace repository identifier and downloads the model automatically.
By default this downloads from HuggingFace. Set the environment variable `MODEL_ENDPOINT` to switch to an alternative host (e.g., `https://www.modelscope.cn/`).
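For example (the repository name below is illustrative; any GGUF repository works):

```shell
# Download (and cache) a GGUF model from HuggingFace, then start chatting
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Same command, but downloading from ModelScope instead
MODEL_ENDPOINT=https://www.modelscope.cn/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```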
Use `convert_hf_to_gguf.py` (installed to `$prefix/bin` by `cmake --install`) to convert HuggingFace model directories. See Model Conversion from HuggingFace for details.
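A typical conversion run might look like this (the paths are placeholders; `--outtype` also accepts `f16`, `bf16`, `f32`, and `auto`):

```shell
# Convert a HuggingFace model directory to GGUF, storing weights as Q8_0
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-q8_0.gguf --outtype q8_0
```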
The HuggingFace platform provides browser-based GGUF conversion and quantization at huggingface.co/spaces/ggml-org/gguf-my-repo.
Sources: README.md:293-319
Tool and file mapping:
Sources: CMakeLists.txt:192-215, examples/CMakeLists.txt:1-45
**`llama-cli`**: Models that carry a built-in chat template activate conversation mode automatically. Use `-cnv` to force it and `--chat-template NAME` to override the template.
**`llama-server`**: The server exposes an OpenAI-compatible REST API. Key endpoints:

- `POST /v1/chat/completions`
- `POST /completion`
- `POST /embeddings`

See llama-server HTTP API for the full endpoint reference.
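Assuming the server is running on its default port 8080, a chat request can be issued with curl:

```shell
# Send a chat completion request to a local llama-server instance
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```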
**`llama-bench`**: Reports prompt processing throughput (pp) and token generation speed (tg) in tokens per second.
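For instance (the model path is a placeholder; `-ngl` offloads layers to the GPU if a GPU backend was built):

```shell
# Benchmark prompt processing (pp) and token generation (tg) throughput
llama-bench -m model.gguf -ngl 99
```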
**`llama-simple`**: `examples/simple` contains a minimal application that directly calls the libllama C API. It is the recommended starting point for developers embedding llama.cpp in their own programs.
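After a source build, the example can be run directly (the model path and prompt are placeholders):

```shell
# Generate 32 tokens from a prompt using the minimal example binary
./build/bin/llama-simple -m model.gguf -n 32 "Hello my name is"
```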
Sources: README.md:321-502
How a chat request moves through the system:
Sources: README.md:56-70
The most commonly adjusted parameters across all tools:
| Parameter | Flag | Default | Effect |
|---|---|---|---|
| Model file | `-m <path>` | — | Path to `.gguf` file |
| HuggingFace model | `-hf <user>/<repo>` | — | Download and use from HuggingFace |
| Context size | `-c <n>` | 4096 | Max tokens in the KV cache |
| GPU layers | `-ngl <n>` | 0 | Layers offloaded to GPU |
| Threads | `-t <n>` | system | CPU threads for generation |
| Parallel slots | `-np <n>` | 1 | Concurrent decode sequences |
| Batch size | `-b <n>` | 2048 | Prompt processing batch size |
For the complete parameter reference including sampling parameters and library API equivalents, see Configuration and Parameters.
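Putting several of these together, a typical interactive invocation might be (model path is a placeholder):

```shell
# 8192-token context, offload up to 99 layers to the GPU, 8 CPU threads
llama-cli -m model.gguf -c 8192 -ngl 99 -t 8
```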
| Backend | Target |
|---|---|
| CPU (default) | All x86, ARM, RISC-V platforms |
| `GGML_METAL=ON` | Apple Silicon (M-series) |
| `GGML_CUDA=ON` | NVIDIA GPUs |
| `GGML_HIP=ON` | AMD GPUs |
| `GGML_VULKAN=ON` | Cross-platform GPU |
| `GGML_SYCL=ON` | Intel GPUs |
| `GGML_CANN=ON` | Huawei Ascend NPU |
| `GGML_OPENCL=ON` | Adreno (Qualcomm) GPUs |
Each backend is selected at CMake configure time. See Installation and Building for the exact flags and GPU and Accelerator Backends for implementation details.
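For example, a CUDA-enabled build is configured like this (other backends substitute their respective `GGML_*` flag):

```shell
# Configure with the CUDA backend enabled, then build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```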
Sources: README.md:272-291, CMakeLists.txt:156-166
| Goal | Page |
|---|---|
| Build from source with a specific backend | Installation and Building |
| Learn all CLI flags and usage patterns | Basic Usage and Examples |
| Tune context size, sampling, threading | Configuration and Parameters |
| Understand the library C API | libllama Public API |
| Convert or quantize a model | Model Pipeline |
| Use the HTTP server API | llama-server HTTP API |
| Use a language binding | Language Bindings and Integration |