This page covers the llama_vocab subsystem: how vocabulary and tokenizer configuration are stored in GGUF files, the supported tokenizer algorithms, token attribute types, and the public API for text encoding and decoding. For how the vocabulary's embedding tensor is used during inference, see Model Loading and Representation. For the Jinja chat template system that operates above tokenization, see Chat Templates and Message Parsing.
The tokenization subsystem converts raw text strings into integer token IDs (encoding) and back (decoding). The central data structure is llama_vocab, defined in src/llama-vocab.h, which stores vocabulary data and owns the tokenizer implementation. llama_vocab is embedded directly in llama_model and loaded from the GGUF file's metadata section during model loading.
High-level text-to-token pipeline:
Code entity map — vocabulary subsystem:
Sources: src/llama-vocab.h src/llama-vocab.cpp include/llama.h
All tokenizer metadata is stored in the GGUF key-value section under the tokenizer.ggml.* and related namespaces. These keys are written during model conversion by helpers in gguf-py/gguf/gguf_writer.py and defined as constants in gguf-py/gguf/constants.py.
| GGUF Key | Type | Tokenizer | Description |
|---|---|---|---|
tokenizer.ggml.model | string | all | Tokenizer algorithm name (llama, gpt2, bert, t5, rwkv) |
tokenizer.ggml.pre | string | BPE | Pre-tokenizer variant name |
tokenizer.ggml.tokens | string[] | all | Token text, indexed by token ID |
tokenizer.ggml.token_type | int32[] | all | Token type per entry (NORMAL, CONTROL, BYTE, etc.) |
tokenizer.ggml.scores | float32[] | SPM, UGM | Per-token log-probabilities |
tokenizer.ggml.merges | string[] | BPE | BPE merge rules, each as "token_a token_b" |
tokenizer.ggml.bos_token_id | uint32 | all | Beginning-of-sequence token ID |
tokenizer.ggml.eos_token_id | uint32 | all | End-of-sequence token ID |
tokenizer.ggml.unknown_token_id | uint32 | all | Unknown (<unk>) token ID |
tokenizer.ggml.separator_token_id | uint32 | WPM | Separator token ID |
tokenizer.ggml.padding_token_id | uint32 | all | Padding token ID |
tokenizer.ggml.cls_token_id | uint32 | WPM | [CLS] token ID |
tokenizer.ggml.mask_token_id | uint32 | WPM | [MASK] token ID |
tokenizer.ggml.add_bos_token | bool | all | Prepend BOS during tokenization |
tokenizer.ggml.add_eos_token | bool | all | Append EOS during tokenization |
tokenizer.ggml.add_space_prefix | bool | SPM | Add space before first word |
tokenizer.ggml.remove_extra_whitespaces | bool | SPM, UGM | Collapse extra whitespace |
tokenizer.ggml.precompiled_charsmap | bytes | SPM, UGM | Precompiled normalization map |
tokenizer.ggml.eot_token_id | uint32 | BPE | End-of-turn token ID |
tokenizer.ggml.fim_pre_token_id | uint32 | BPE | Fill-in-the-middle prefix token |
tokenizer.ggml.fim_suf_token_id | uint32 | BPE | Fill-in-the-middle suffix token |
tokenizer.ggml.fim_mid_token_id | uint32 | BPE | Fill-in-the-middle middle token |
tokenizer.huggingface.json | string | BPE | Raw HuggingFace tokenizer.json (BPE fallback) |
tokenizer.rwkv.world | string | RWKV | RWKV world tokenizer data |
tokenizer.chat_template | string | — | Default Jinja2 chat template |
Sources: gguf-py/gguf/constants.py gguf-py/gguf/gguf_writer.py
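To illustrate the merges format, the following sketch shows how "token_a token_b" strings from tokenizer.ggml.merges can be turned into a rank map like bpe_ranks. The merge list here is invented for illustration, not data from any real model:

```python
# Each merges entry is "token_a token_b"; earlier entries have lower rank,
# i.e. higher merge priority. The merge list below is a toy example.
merges = ["l o", "lo w", "e r"]

bpe_ranks = {}
for rank, line in enumerate(merges):
    a, b = line.split(" ")
    bpe_ranks[(a, b)] = rank

print(bpe_ranks[("l", "o")])  # -> 0
```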
The llama_vocab_type enum (include/llama.h71-79) identifies the primary tokenizer algorithm. It is stored as the tokenizer.ggml.model string in GGUF.
| Enum Value | GGUF String | Algorithm | Example Models |
|---|---|---|---|
LLAMA_VOCAB_TYPE_NONE | — | No vocabulary | Embedding-only |
LLAMA_VOCAB_TYPE_SPM | llama | SentencePiece (score-based BPE + normalization) | LLaMA 1/2, Mistral v0.1 |
LLAMA_VOCAB_TYPE_BPE | gpt2 | Byte-level BPE (merge-table-based) | GPT-2, LLaMA 3, Qwen, Phi |
LLAMA_VOCAB_TYPE_WPM | bert | WordPiece (longest-prefix greedy) | BERT, RoBERTa |
LLAMA_VOCAB_TYPE_UGM | t5 | Unigram Language Model (Viterbi search) | T5 |
LLAMA_VOCAB_TYPE_RWKV | rwkv | Greedy longest-match via trie | RWKV |
LLAMA_VOCAB_TYPE_PLAMO2 | plamo2 | Aho-Corasick + dynamic programming | PLaMo-2 |
SPM is the original LLaMA tokenizer. The algorithm, implemented in llm_tokenizer_spm_session::tokenize (src/llama-vocab.cpp117-173):

- Build a linked list of llm_symbol nodes, one per UTF-8 character.
- Seed an llm_bigram_spm priority queue with every adjacent pair, scored from token_data::score.
- Repeatedly merge the best-scoring pair until no known bigram remains.

BPE models use a fixed merge table. A pre-tokenizer regex first splits text into chunks, then each chunk is segmented via rank-ordered merges from bpe_ranks. See Pre-tokenizer Types below.
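The rank-ordered BPE merge loop can be sketched as follows. The MERGES table here is an invented toy standing in for bpe_ranks; it is not llama.cpp's actual data:

```python
# Toy byte-level BPE: always merge the adjacent pair with the lowest rank.
# MERGES plays the role of bpe_ranks (pair -> rank); entries are illustrative.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_segment(word):
    symbols = list(word)  # start from single characters (bytes in the real tokenizer)
    while True:
        # find the adjacent pair with the best (lowest) rank
        best, best_rank = None, None
        for i in range(len(symbols) - 1):
            rank = MERGES.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:  # no known pair left -> done
            return symbols
        symbols[best:best + 2] = [symbols[best] + symbols[best + 1]]

print(bpe_segment("lower"))  # -> ['lower']
```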
BERT WordPiece greedily matches the longest known subword from the beginning of each word. Continuation subwords are prefixed with ##.
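A minimal sketch of this longest-prefix-first loop, using an invented toy vocabulary (the real vocabulary comes from the model's GGUF tokens array):

```python
# Toy WordPiece: greedy longest-match per word; continuation subwords
# carry a "##" prefix. VOCAB is an invented example.
VOCAB = {"un", "##aff", "##able", "hello"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:  # try the longest candidate subword first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no subword matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("unaffable"))  # -> ['un', '##aff', '##able']
```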
Unigram uses the per-token log-probabilities in tokenizer.ggml.scores and finds the most probable segmentation via a Viterbi-style search.
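The Viterbi-style search can be sketched like this; the per-piece scores are invented stand-ins for tokenizer.ggml.scores:

```python
import math

# Toy unigram LM: log-probabilities per piece (illustrative values only).
SCORES = {"h": -5.0, "e": -5.0, "he": -2.0, "l": -5.0,
          "ll": -3.0, "llo": -2.5, "o": -5.0, "hello": -6.0}

def viterbi(text):
    n = len(text)
    # best[i] = (best log-prob of any segmentation of text[:i], backpointer)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = text[i:j]
            if piece in SCORES:
                cand = best[i][0] + SCORES[piece]
                if cand > best[j][0]:
                    best[j] = (cand, i)
    # walk backpointers to recover the most probable segmentation
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

print(viterbi("hello"))  # -> ['he', 'llo']
```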
Greedy longest-match using naive_trie::get_longest_prefix() (src/llama-vocab.cpp28-68). No merge rules are involved.
Sources: include/llama.h71-79 src/llama-vocab.cpp28-68 src/llama-vocab.cpp117-173
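The greedy trie matching can be sketched as a simplified analogue of naive_trie and get_longest_prefix(), using a toy vocabulary:

```python
# Simplified analogue of naive_trie: character-keyed trie plus a greedy
# longest-match tokenization loop. Vocabulary is a toy example.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token = None  # token ID if a vocabulary entry ends here

def build_trie(vocab):
    root = TrieNode()
    for tok_id, text in enumerate(vocab):
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token = tok_id
    return root

def tokenize_greedy(root, text):
    ids, pos = [], 0
    while pos < len(text):
        node, best_end, best_tok = root, None, None
        i = pos
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.token is not None:  # remember the longest match so far
                best_end, best_tok = i, node.token
        if best_tok is None:
            raise ValueError("no match (real tokenizers fall back to byte tokens)")
        ids.append(best_tok)
        pos = best_end  # advance the cursor past the match
    return ids

root = build_trie(["a", "ab", "abc", "b", "c"])
print(tokenize_greedy(root, "abcab"))  # -> [2, 1]
```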
BPE models differ significantly in how raw text is split into word-like chunks before BPE merges are applied. These differences are captured by the llama_vocab_pre_type enum (src/llama-vocab.h10-46) and stored in GGUF as tokenizer.ggml.pre.
The pre-tokenizer type selects the model-specific regex pattern inside llm_tokenizer_bpe. A sampling of the currently defined types:
| Enum Value | Typical Model Family |
|---|---|
LLAMA_VOCAB_PRE_TYPE_DEFAULT | Generic GPT-2-style |
LLAMA_VOCAB_PRE_TYPE_LLAMA3 | LLaMA 3.x |
LLAMA_VOCAB_PRE_TYPE_LLAMA4 | LLaMA 4 |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM | DeepSeek LLM |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER | DeepSeek Coder |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK3_LLM | DeepSeek V3 |
LLAMA_VOCAB_PRE_TYPE_QWEN2 | Qwen 2/3 |
LLAMA_VOCAB_PRE_TYPE_GPT2 | GPT-2 |
LLAMA_VOCAB_PRE_TYPE_GPT4O | GPT-4o |
LLAMA_VOCAB_PRE_TYPE_TEKKEN | Mistral Tekken (v3+) |
LLAMA_VOCAB_PRE_TYPE_FALCON | Falcon |
LLAMA_VOCAB_PRE_TYPE_STARCODER | StarCoder |
LLAMA_VOCAB_PRE_TYPE_COMMAND_R | Command-R |
LLAMA_VOCAB_PRE_TYPE_BLOOM | BLOOM |
LLAMA_VOCAB_PRE_TYPE_CHATGLM3 | ChatGLM 3 |
LLAMA_VOCAB_PRE_TYPE_CHATGLM4 | ChatGLM 4 |
LLAMA_VOCAB_PRE_TYPE_SMOLLM | SmolLM |
LLAMA_VOCAB_PRE_TYPE_EXAONE | EXAONE |
LLAMA_VOCAB_PRE_TYPE_PLAMO2 | PLaMo-2 |
In total, 36 pre-tokenizer variants are currently defined.
Identification during conversion: get_vocab_base_pre() in convert_hf_to_gguf.py computes a SHA-256 hash of a sample tokenization output and maps the hash to the correct pre-type string. This mapping is maintained by convert_hf_to_gguf_update.py, which downloads reference tokenizers from HuggingFace to update the hash table.
Sources: src/llama-vocab.h10-46 convert_hf_to_gguf.py convert_hf_to_gguf_update.py
llama_token_type (legacy, GGUF storage) is defined at include/llama.h90-98 and is used when reading the tokenizer.ggml.token_type array from GGUF.
| Value | Meaning |
|---|---|
LLAMA_TOKEN_TYPE_NORMAL | Regular vocabulary token |
LLAMA_TOKEN_TYPE_UNKNOWN | <unk> token |
LLAMA_TOKEN_TYPE_CONTROL | Special / control token |
LLAMA_TOKEN_TYPE_USER_DEFINED | Added special token |
LLAMA_TOKEN_TYPE_UNUSED | Reserved placeholder |
LLAMA_TOKEN_TYPE_BYTE | Single-byte fallback |
llama_token_attr (internal bitfield) is defined at include/llama.h100-112 and is used in token_data::attr for runtime checks. Multiple bits may be set simultaneously.
| Bit | Name | Meaning |
|---|---|---|
1 << 0 | LLAMA_TOKEN_ATTR_UNKNOWN | Unknown token |
1 << 1 | LLAMA_TOKEN_ATTR_UNUSED | Unused placeholder |
1 << 2 | LLAMA_TOKEN_ATTR_NORMAL | Regular token |
1 << 3 | LLAMA_TOKEN_ATTR_CONTROL | Special/control token |
1 << 4 | LLAMA_TOKEN_ATTR_USER_DEFINED | User-added special token |
1 << 5 | LLAMA_TOKEN_ATTR_BYTE | Byte fallback |
1 << 6 | LLAMA_TOKEN_ATTR_NORMALIZED | Text is normalized |
1 << 7 | LLAMA_TOKEN_ATTR_LSTRIP | Strip leading whitespace on decode |
1 << 8 | LLAMA_TOKEN_ATTR_RSTRIP | Strip trailing whitespace on decode |
1 << 9 | LLAMA_TOKEN_ATTR_SINGLE_WORD | Must appear as a complete word |
Sources: include/llama.h90-112
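To illustrate how the attribute bits combine at runtime, a short sketch follows. The constants mirror the table above; the is_special helper is hypothetical, not an API of llama.cpp:

```python
# Attribute bits, mirroring llama_token_attr from include/llama.h.
ATTR_UNKNOWN      = 1 << 0
ATTR_UNUSED       = 1 << 1
ATTR_NORMAL       = 1 << 2
ATTR_CONTROL      = 1 << 3
ATTR_USER_DEFINED = 1 << 4
ATTR_BYTE         = 1 << 5
ATTR_NORMALIZED   = 1 << 6
ATTR_LSTRIP       = 1 << 7
ATTR_RSTRIP       = 1 << 8
ATTR_SINGLE_WORD  = 1 << 9

def is_special(attr):
    # hypothetical helper: treat control and user-defined tokens as "special"
    return bool(attr & (ATTR_CONTROL | ATTR_USER_DEFINED))

# a user-added special token that also strips leading whitespace on decode
attr = ATTR_USER_DEFINED | ATTR_LSTRIP
print(is_special(attr), bool(attr & ATTR_LSTRIP))  # -> True True
```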
The llama_vocab struct (src/llama-vocab.h) aggregates all vocabulary state. It is embedded by value in llama_model.
Class diagram:
Key field notes:
- id_to_token: primary token storage, one token_data per ID
- token_to_id: reverse string→ID lookup (excludes byte tokens, which are reconstructed)
- bpe_ranks: map<pair<string,string>, int> populated from tokenizer.ggml.merges; used by the BPE merge loop
- cache_special_tokens: sorted vector of special/control token IDs for fast is_eog() / is_control() checks
- cache_token_to_piece: pre-decoded string per token ID (avoids re-decoding on every call)
- special_eog_ids: map from string name to ID for model-specific end-of-generation tokens
- precompiled_charsmap: normalization table for SPM/UGM, from tokenizer.ggml.precompiled_charsmap
- tokenizer: a polymorphic llm_tokenizer subclass instantiated at load time based on type

Sources: src/llama-vocab.h src/llama-vocab.cpp
Each tokenizer type has a stateless configuration object (llm_tokenizer_*) and a per-call session object (llm_tokenizer_*_session) that holds intermediate state.
llm_tokenizer_spm_session (src/llama-vocab.cpp113-173):
- Splits the input into llm_symbol nodes (one per UTF-8 character).
- Pushes each adjacent pair as an llm_bigram_spm into the priority queue, scored by the merged token's score in id_to_token.
- Repeatedly pops the best bigram and merges it in place (extend the left symbol's n, relink next/prev).
- Symbols that never merged into a vocabulary token are handled by resegment().
- llm_bigram_spm::comparator orders by score descending; ties broken by left index ascending.
llm_tokenizer_bpe_session:
- Applies the pre-tokenizer regex (selected by type_pre) to split text into chunks.
- Within each chunk, repeatedly merges the adjacent pair with the lowest rank in bpe_ranks.
- Stops when no remaining adjacent pair appears in bpe_ranks.

The pre-tokenizer regex patterns are compiled at llm_tokenizer_bpe construction time.
The naive_trie (src/llama-vocab.cpp28-68) stores all vocabulary tokens in a character-keyed trie. Tokenization greedily calls get_longest_prefix() and advances the cursor past each match.
Sources: src/llama-vocab.cpp28-68 src/llama-vocab.cpp75-173
Declared in include/llama.h. Vocabulary functions operate on a const llama_vocab *, obtained via llama_model_get_vocab().
Sources: include/llama.h
common/common.h and common/common.cpp provide wrappers that handle automatic buffer resizing (both llama_tokenize and llama_token_to_piece return a negative size when the output buffer is too small; callers must retry with a larger buffer).
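The resize-and-retry pattern can be sketched with a stand-in tokenize function. Here fake_tokenize is hypothetical and only mimics the convention described above; the real wrappers call llama_tokenize:

```python
# Sketch of the resize-and-retry pattern used by the common/ wrappers.
# fake_tokenize mimics the llama_tokenize convention: it returns the
# negative of the required size when the output buffer is too small.
def fake_tokenize(text, buf):
    tokens = [ord(c) for c in text]  # pretend each character is a token
    if len(tokens) > len(buf):
        return -len(tokens)          # signal: caller needs this many slots
    buf[:len(tokens)] = tokens
    return len(tokens)

def tokenize_with_retry(text):
    buf = [0] * 4                    # deliberately small first attempt
    n = fake_tokenize(text, buf)
    if n < 0:                        # negative -> grow to required size, retry
        buf = [0] * (-n)
        n = fake_tokenize(text, buf)
        assert n >= 0
    return buf[:n]

print(tokenize_with_retry("hello world"))
```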
The context-taking overloads call llama_get_model(ctx) and llama_model_get_vocab() internally.
Sources: common/common.h52 common/common.cpp
convert_hf_to_gguf.py writes vocabulary data through GGUFWriter methods. Each TextModel subclass calls a _set_vocab_* helper, which reads the source tokenizer files and calls the appropriate gguf_writer.add_* methods.
| Helper | Tokenizer Type | Source Files Read |
|---|---|---|
_set_vocab_gpt2() | BPE | tokenizer.json (HuggingFace format) |
_set_vocab_spm() | SPM | tokenizer.model (SentencePiece protobuf) |
_set_vocab_bert() | WPM | vocab.txt |
_set_vocab_ugm() | UGM | spiece.model |
_set_vocab_rwkv_world() | RWKV | rwkv_vocab_v20230424.txt |
SentencePieceTokenTypes (convert_hf_to_gguf.py62-68) maps the protobuf token types to the integer values stored in tokenizer.ggml.token_type:
| SentencePieceTokenTypes | GGUF int value |
|---|---|
NORMAL | 1 |
UNKNOWN | 2 |
CONTROL | 3 |
USER_DEFINED | 4 |
UNUSED | 5 |
BYTE | 6 |
get_vocab_base_pre() in convert_hf_to_gguf.py identifies the correct pre-tokenizer by:
- Tokenizing a fixed sample text with the source HuggingFace tokenizer.
- Computing a SHA-256 hash of the resulting tokenization output.
- Looking up that hash in a table mapping fingerprints to the tokenizer.ggml.pre string.

This table is regenerated by convert_hf_to_gguf_update.py, which downloads tokenizers from HuggingFace and recomputes fingerprints when new model families are added.
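The fingerprinting idea can be sketched as follows; the hash-to-name table entries are invented for illustration and do not match the real table in convert_hf_to_gguf.py:

```python
import hashlib

# Hash a reference tokenization and map the digest to a pre-tokenizer name.
def fingerprint(token_ids):
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

# Invented mapping for illustration only.
KNOWN = {fingerprint([15496, 995]): "gpt-2"}

def identify_pre_tokenizer(token_ids):
    # unknown fingerprints fall through to a generic default
    return KNOWN.get(fingerprint(token_ids), "default")

print(identify_pre_tokenizer([15496, 995]))  # -> gpt-2
print(identify_pre_tokenizer([1, 2, 3]))     # -> default
```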
Sources: convert_hf_to_gguf.py62-68 convert_hf_to_gguf_update.py gguf-py/gguf/gguf_writer.py