This page covers the llama_vocab subsystem: how vocabulary and tokenizer configuration are stored in GGUF files, the supported tokenizer algorithms, token attribute types, and the public API for text encoding and decoding. For how the vocabulary's embedding tensor is used during inference, see Model Loading and Representation. For the Jinja chat template system that operates above tokenization, see Chat Templates and Message Parsing.
The tokenization subsystem converts raw text strings into integer token IDs (encoding) and back (decoding). The central data structure is llama_vocab, defined in src/llama-vocab.h, which stores vocabulary data and owns the tokenizer implementation. llama_vocab is embedded directly in llama_model and loaded from the GGUF file's metadata section during model loading.
High-level text-to-token pipeline:
Code entity map — vocabulary subsystem:
Sources: src/llama-vocab.h src/llama-vocab.cpp include/llama.h
All tokenizer metadata is stored in the GGUF key-value section under the tokenizer.ggml.* and related namespaces. These keys are written during model conversion by helpers in gguf-py/gguf/gguf_writer.py and defined as constants in gguf-py/gguf/constants.py.
| GGUF Key | Type | Tokenizer | Description |
|---|---|---|---|
tokenizer.ggml.model | string | all | Tokenizer algorithm name (llama, gpt2, bert, t5, rwkv) |
tokenizer.ggml.pre | string | BPE | Pre-tokenizer variant name |
tokenizer.ggml.tokens | string[] | all | Token text, indexed by token ID |
tokenizer.ggml.token_type | int32[] | all | Token type per entry (NORMAL, CONTROL, BYTE, etc.) |
tokenizer.ggml.scores | float32[] | SPM, UGM | Per-token log-probabilities |
tokenizer.ggml.merges | string[] | BPE | BPE merge rules, each as "token_a token_b" |
tokenizer.ggml.bos_token_id | uint32 | all | Beginning-of-sequence token ID |
tokenizer.ggml.eos_token_id | uint32 | all | End-of-sequence token ID |
tokenizer.ggml.unknown_token_id | uint32 | all | Unknown (<unk>) token ID |
tokenizer.ggml.separator_token_id | uint32 | WPM | Separator token ID |
tokenizer.ggml.padding_token_id | uint32 | all | Padding token ID |
tokenizer.ggml.cls_token_id | uint32 | WPM | [CLS] token ID |
tokenizer.ggml.mask_token_id | uint32 | WPM | [MASK] token ID |
tokenizer.ggml.add_bos_token | bool | all | Prepend BOS during tokenization |
tokenizer.ggml.add_eos_token | bool | all | Append EOS during tokenization |
tokenizer.ggml.add_space_prefix | bool | SPM | Add space before first word |
tokenizer.ggml.remove_extra_whitespaces | bool | SPM, UGM | Collapse extra whitespace |
tokenizer.ggml.precompiled_charsmap | bytes | SPM, UGM | Precompiled normalization map |
tokenizer.ggml.eot_token_id | uint32 | BPE | End-of-turn token ID |
tokenizer.ggml.fim_pre_token_id | uint32 | BPE | Fill-in-the-middle prefix token |
tokenizer.ggml.fim_suf_token_id | uint32 | BPE | Fill-in-the-middle suffix token |
tokenizer.ggml.fim_mid_token_id | uint32 | BPE | Fill-in-the-middle middle token |
tokenizer.huggingface.json | string | BPE | Raw HuggingFace tokenizer.json (BPE fallback) |
tokenizer.rwkv.world | string | RWKV | RWKV world tokenizer data |
tokenizer.chat_template | string | — | Default Jinja2 chat template |
Sources: gguf-py/gguf/constants.py gguf-py/gguf/gguf_writer.py
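To illustrate the merges format, the following sketch shows how "token_a token_b" strings from tokenizer.ggml.merges can be turned into a rank map like bpe_ranks. The merge list here is invented for illustration, not data from any real model:

```python
# Each merges entry is "token_a token_b"; earlier entries have lower rank,
# i.e. higher merge priority. The merge list below is a toy example.
merges = ["l o", "lo w", "e r"]

bpe_ranks = {}
for rank, line in enumerate(merges):
    a, b = line.split(" ")
    bpe_ranks[(a, b)] = rank

print(bpe_ranks[("l", "o")])  # -> 0
```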
The llama_vocab_type enum (include/llama.h71-79) identifies the primary tokenizer algorithm. It is stored as the tokenizer.ggml.model string in GGUF.
| Enum Value | GGUF String | Algorithm | Example Models |
|---|---|---|---|
LLAMA_VOCAB_TYPE_NONE | — | No vocabulary | Embedding-only |
LLAMA_VOCAB_TYPE_SPM | llama | SentencePiece (score-based BPE + normalization) | LLaMA 1/2, Mistral v0.1 |
LLAMA_VOCAB_TYPE_BPE | gpt2 | Byte-level BPE (merge-table-based) | GPT-2, LLaMA 3, Qwen, Phi |
LLAMA_VOCAB_TYPE_WPM | bert | WordPiece (longest-prefix greedy) | BERT, RoBERTa |
LLAMA_VOCAB_TYPE_UGM | t5 | Unigram Language Model (Viterbi search) | T5 |
LLAMA_VOCAB_TYPE_RWKV | rwkv | Greedy longest-match via trie | RWKV |
LLAMA_VOCAB_TYPE_PLAMO2 | plamo2 | Aho-Corasick + dynamic programming | PLaMo-2 |
SPM is the original LLaMA tokenizer. The algorithm, implemented in llm_tokenizer_spm_session::tokenize (src/llama-vocab.cpp117-173):

- Build a linked list of llm_symbol nodes, one per UTF-8 character.
- Seed an llm_bigram_spm priority queue with every adjacent pair, scored from token_data::score.
- Repeatedly merge the best-scoring pair until no known bigram remains.

BPE models use a fixed merge table. A pre-tokenizer regex first splits text into chunks, then each chunk is segmented via rank-ordered merges from bpe_ranks. See Pre-tokenizer Types below.
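The rank-ordered BPE merge loop can be sketched as follows. The MERGES table here is an invented toy standing in for bpe_ranks; it is not llama.cpp's actual data:

```python
# Toy byte-level BPE: always merge the adjacent pair with the lowest rank.
# MERGES plays the role of bpe_ranks (pair -> rank); entries are illustrative.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_segment(word):
    symbols = list(word)  # start from single characters (bytes in the real tokenizer)
    while True:
        # find the adjacent pair with the best (lowest) rank
        best, best_rank = None, None
        for i in range(len(symbols) - 1):
            rank = MERGES.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:  # no known pair left -> done
            return symbols
        symbols[best:best + 2] = [symbols[best] + symbols[best + 1]]

print(bpe_segment("lower"))  # -> ['lower']
```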
BERT WordPiece greedily matches the longest known subword from the beginning of each word. Continuation subwords are prefixed with ##.
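A minimal sketch of this longest-prefix-first loop, using an invented toy vocabulary (the real vocabulary comes from the model's GGUF tokens array):

```python
# Toy WordPiece: greedy longest-match per word; continuation subwords
# carry a "##" prefix. VOCAB is an invented example.
VOCAB = {"un", "##aff", "##able", "hello"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:  # try the longest candidate subword first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no subword matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("unaffable"))  # -> ['un', '##aff', '##able']
```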
Unigram uses the per-token log-probabilities in tokenizer.ggml.scores and finds the most probable segmentation via a Viterbi-style search.
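The Viterbi-style search can be sketched like this; the per-piece scores are invented stand-ins for tokenizer.ggml.scores:

```python
import math

# Toy unigram LM: log-probabilities per piece (illustrative values only).
SCORES = {"h": -5.0, "e": -5.0, "he": -2.0, "l": -5.0,
          "ll": -3.0, "llo": -2.5, "o": -5.0, "hello": -6.0}

def viterbi(text):
    n = len(text)
    # best[i] = (best log-prob of any segmentation of text[:i], backpointer)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = text[i:j]
            if piece in SCORES:
                cand = best[i][0] + SCORES[piece]
                if cand > best[j][0]:
                    best[j] = (cand, i)
    # walk backpointers to recover the most probable segmentation
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

print(viterbi("hello"))  # -> ['he', 'llo']
```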
Greedy longest-match using naive_trie::get_longest_prefix() (src/llama-vocab.cpp28-68). No merge rules are involved.
Sources: include/llama.h71-79 src/llama-vocab.cpp28-68 src/llama-vocab.cpp117-173
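The greedy trie matching can be sketched as a simplified analogue of naive_trie and get_longest_prefix(), using a toy vocabulary:

```python
# Simplified analogue of naive_trie: character-keyed trie plus a greedy
# longest-match tokenization loop. Vocabulary is a toy example.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token = None  # token ID if a vocabulary entry ends here

def build_trie(vocab):
    root = TrieNode()
    for tok_id, text in enumerate(vocab):
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token = tok_id
    return root

def tokenize_greedy(root, text):
    ids, pos = [], 0
    while pos < len(text):
        node, best_end, best_tok = root, None, None
        i = pos
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.token is not None:  # remember the longest match so far
                best_end, best_tok = i, node.token
        if best_tok is None:
            raise ValueError("no match (real tokenizers fall back to byte tokens)")
        ids.append(best_tok)
        pos = best_end  # advance the cursor past the match
    return ids

root = build_trie(["a", "ab", "abc", "b", "c"])
print(tokenize_greedy(root, "abcab"))  # -> [2, 1]
```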
BPE models differ significantly in how raw text is split into word-like chunks before BPE merges are applied. These differences are captured by the llama_vocab_pre_type enum (src/llama-vocab.h10-46) and stored in GGUF as tokenizer.ggml.pre.
The pre-tokenizer type selects the model-specific regex pattern inside llm_tokenizer_bpe. A sampling of the currently defined types:
| Enum Value | Typical Model Family |
|---|---|
LLAMA_VOCAB_PRE_TYPE_DEFAULT | Generic GPT-2-style |
LLAMA_VOCAB_PRE_TYPE_LLAMA3 | LLaMA 3.x |
LLAMA_VOCAB_PRE_TYPE_LLAMA4 | LLaMA 4 |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM | DeepSeek LLM |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER | DeepSeek Coder |
LLAMA_VOCAB_PRE_TYPE_DEEPSEEK3_LLM | DeepSeek V3 |
LLAMA_VOCAB_PRE_TYPE_QWEN2 | Qwen 2/3 |
LLAMA_VOCAB_PRE_TYPE_GPT2 | GPT-2 |
LLAMA_VOCAB_PRE_TYPE_GPT4O | GPT-4o |
LLAMA_VOCAB_PRE_TYPE_TEKKEN | Mistral Tekken (v3+) |
LLAMA_VOCAB_PRE_TYPE_FALCON | Falcon |
LLAMA_VOCAB_PRE_TYPE_STARCODER | StarCoder |
LLAMA_VOCAB_PRE_TYPE_COMMAND_R | Command-R |
LLAMA_VOCAB_PRE_TYPE_BLOOM | BLOOM |
LLAMA_VOCAB_PRE_TYPE_CHATGLM3 | ChatGLM 3 |
LLAMA_VOCAB_PRE_TYPE_CHATGLM4 | ChatGLM 4 |
LLAMA_VOCAB_PRE_TYPE_SMOLLM | SmolLM |
LLAMA_VOCAB_PRE_TYPE_EXAONE | EXAONE |
LLAMA_VOCAB_PRE_TYPE_PLAMO2 | PLaMo-2 |
In total, 36 pre-tokenizer variants are currently defined.
Identification during conversion: get_vocab_base_pre() in convert_hf_to_gguf.py computes a SHA-256 hash of a sample tokenization output and maps the hash to the correct pre-type string. This mapping is maintained by convert_hf_to_gguf_update.py, which downloads reference tokenizers from HuggingFace to update the hash table.
Sources: src/llama-vocab.h10-46 convert_hf_to_gguf.py convert_hf_to_gguf_update.py
llama_token_type (legacy, GGUF storage) is defined at include/llama.h90-98 and is used when reading the tokenizer.ggml.token_type array from GGUF.
| Value | Meaning |
|---|---|
LLAMA_TOKEN_TYPE_NORMAL | Regular vocabulary token |
LLAMA_TOKEN_TYPE_UNKNOWN | <unk> token |
LLAMA_TOKEN_TYPE_CONTROL | Special / control token |
LLAMA_TOKEN_TYPE_USER_DEFINED | Added special token |
LLAMA_TOKEN_TYPE_UNUSED | Reserved placeholder |
LLAMA_TOKEN_TYPE_BYTE | Single-byte fallback |
llama_token_attr (internal bitfield) is defined at include/llama.h100-112 and is used in token_data::attr for runtime checks. Multiple bits may be set simultaneously.
| Bit | Name | Meaning |
|---|---|---|
1 << 0 | LLAMA_TOKEN_ATTR_UNKNOWN | Unknown token |
1 << 1 | LLAMA_TOKEN_ATTR_UNUSED | Unused placeholder |
1 << 2 | LLAMA_TOKEN_ATTR_NORMAL | Regular token |
1 << 3 | LLAMA_TOKEN_ATTR_CONTROL | Special/control token |
1 << 4 | LLAMA_TOKEN_ATTR_USER_DEFINED | User-added special token |
1 << 5 | LLAMA_TOKEN_ATTR_BYTE | Byte fallback |
1 << 6 | LLAMA_TOKEN_ATTR_NORMALIZED | Text is normalized |
1 << 7 | LLAMA_TOKEN_ATTR_LSTRIP | Strip leading whitespace on decode |
1 << 8 | LLAMA_TOKEN_ATTR_RSTRIP | Strip trailing whitespace on decode |
1 << 9 | LLAMA_TOKEN_ATTR_SINGLE_WORD | Must appear as a complete word |
Sources: include/llama.h90-112
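To illustrate how the attribute bits combine at runtime, a short sketch follows. The constants mirror the table above; the is_special helper is hypothetical, not an API of llama.cpp:

```python
# Attribute bits, mirroring llama_token_attr from include/llama.h.
ATTR_UNKNOWN      = 1 << 0
ATTR_UNUSED       = 1 << 1
ATTR_NORMAL       = 1 << 2
ATTR_CONTROL      = 1 << 3
ATTR_USER_DEFINED = 1 << 4
ATTR_BYTE         = 1 << 5
ATTR_NORMALIZED   = 1 << 6
ATTR_LSTRIP       = 1 << 7
ATTR_RSTRIP       = 1 << 8
ATTR_SINGLE_WORD  = 1 << 9

def is_special(attr):
    # hypothetical helper: treat control and user-defined tokens as "special"
    return bool(attr & (ATTR_CONTROL | ATTR_USER_DEFINED))

# a user-added special token that also strips leading whitespace on decode
attr = ATTR_USER_DEFINED | ATTR_LSTRIP
print(is_special(attr), bool(attr & ATTR_LSTRIP))  # -> True True
```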
The llama_vocab struct (src/llama-vocab.h) aggregates all vocabulary state. It is embedded by value in llama_model.
Class diagram:
Key field notes:
- id_to_token: primary token storage, one token_data per ID
- token_to_id: reverse string→ID lookup (excludes byte tokens, which are reconstructed)
- bpe_ranks: map<pair<string,string>, int> populated from tokenizer.ggml.merges; used by the BPE merge loop
- cache_special_tokens: sorted vector of special/control token IDs for fast is_eog() / is_control() checks
- cache_token_to_piece: pre-decoded string per token ID (avoids re-decoding on every call)
- special_eog_ids: map from string name to ID for model-specific end-of-generation tokens
- precompiled_charsmap: normalization table for SPM/UGM, from tokenizer.ggml.precompiled_charsmap
- tokenizer: a polymorphic llm_tokenizer subclass instantiated at load time based on type

Sources: src/llama-vocab.h src/llama-vocab.cpp
Each tokenizer type has a stateless configuration object (llm_tokenizer_*) and a per-call session object (llm_tokenizer_*_session) that holds intermediate state.
llm_tokenizer_spm_session (src/llama-vocab.cpp113-173):
- Splits the input into llm_symbol nodes (one per UTF-8 character).
- Pushes each adjacent pair as an llm_bigram_spm into the priority queue, scored by the merged token's score in id_to_token.
- Repeatedly pops the best bigram and merges it in place (extend the left symbol's n, relink next/prev).
- Symbols that never merged into a vocabulary token are handled by resegment().
- llm_bigram_spm::comparator orders by score descending; ties broken by left index ascending.
llm_tokenizer_bpe_session:
- Applies the pre-tokenizer regex (selected by type_pre) to split text into chunks.
- Within each chunk, repeatedly merges the adjacent pair with the lowest rank in bpe_ranks.
- Stops when no remaining adjacent pair appears in bpe_ranks.

The pre-tokenizer regex patterns are compiled at llm_tokenizer_bpe construction time.
The naive_trie (src/llama-vocab.cpp28-68) stores all vocabulary tokens in a character-keyed trie. Tokenization greedily calls get_longest_prefix() and advances the cursor past each match.
Sources: src/llama-vocab.cpp28-68 src/llama-vocab.cpp75-173
Declared in include/llama.h. Vocabulary functions operate on a const llama_vocab *, obtained via llama_model_get_vocab().
Sources: include/llama.h
common/common.h and common/common.cpp provide wrappers that handle automatic buffer resizing (both llama_tokenize and llama_token_to_piece return a negative size when the output buffer is too small; callers must retry with a larger buffer).
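The resize-and-retry pattern can be sketched with a stand-in tokenize function. Here fake_tokenize is hypothetical and only mimics the convention described above; the real wrappers call llama_tokenize:

```python
# Sketch of the resize-and-retry pattern used by the common/ wrappers.
# fake_tokenize mimics the llama_tokenize convention: it returns the
# negative of the required size when the output buffer is too small.
def fake_tokenize(text, buf):
    tokens = [ord(c) for c in text]  # pretend each character is a token
    if len(tokens) > len(buf):
        return -len(tokens)          # signal: caller needs this many slots
    buf[:len(tokens)] = tokens
    return len(tokens)

def tokenize_with_retry(text):
    buf = [0] * 4                    # deliberately small first attempt
    n = fake_tokenize(text, buf)
    if n < 0:                        # negative -> grow to required size, retry
        buf = [0] * (-n)
        n = fake_tokenize(text, buf)
        assert n >= 0
    return buf[:n]

print(tokenize_with_retry("hello world"))
```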
The context-taking overloads call llama_get_model(ctx) and llama_model_get_vocab() internally.
Sources: common/common.h52 common/common.cpp
convert_hf_to_gguf.py writes vocabulary data through GGUFWriter methods. Each TextModel subclass calls a _set_vocab_* helper, which reads the source tokenizer files and calls the appropriate gguf_writer.add_* methods.
| Helper | Tokenizer Type | Source Files Read |
|---|---|---|
_set_vocab_gpt2() | BPE | tokenizer.json (HuggingFace format) |
_set_vocab_spm() | SPM | tokenizer.model (SentencePiece protobuf) |
_set_vocab_bert() | WPM | vocab.txt |
_set_vocab_ugm() | UGM | spiece.model |
_set_vocab_rwkv_world() | RWKV | rwkv_vocab_v20230424.txt |
SentencePieceTokenTypes (convert_hf_to_gguf.py62-68) maps the protobuf token types to the integer values stored in tokenizer.ggml.token_type:
| SentencePieceTokenTypes | GGUF int value |
|---|---|
NORMAL | 1 |
UNKNOWN | 2 |
CONTROL | 3 |
USER_DEFINED | 4 |
UNUSED | 5 |
BYTE | 6 |
get_vocab_base_pre() in convert_hf_to_gguf.py identifies the correct pre-tokenizer by:
- Tokenizing a fixed sample text with the source HuggingFace tokenizer.
- Computing a SHA-256 hash of the resulting tokenization output.
- Looking up that hash in a table mapping fingerprints to the tokenizer.ggml.pre string.

This table is regenerated by convert_hf_to_gguf_update.py, which downloads tokenizers from HuggingFace and recomputes fingerprints when new model families are added.
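The fingerprinting idea can be sketched as follows; the hash-to-name table entries are invented for illustration and do not match the real table in convert_hf_to_gguf.py:

```python
import hashlib

# Hash a reference tokenization and map the digest to a pre-tokenizer name.
def fingerprint(token_ids):
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

# Invented mapping for illustration only.
KNOWN = {fingerprint([15496, 995]): "gpt-2"}

def identify_pre_tokenizer(token_ids):
    # unknown fingerprints fall through to a generic default
    return KNOWN.get(fingerprint(token_ids), "default")

print(identify_pre_tokenizer([15496, 995]))  # -> gpt-2
print(identify_pre_tokenizer([1, 2, 3]))     # -> default
```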
Sources: convert_hf_to_gguf.py62-68 convert_hf_to_gguf_update.py gguf-py/gguf/gguf_writer.py