This page documents the libmtmd multimodal library (tools/mtmd/), which provides image and audio input support for llama.cpp. It covers the public C API, the internal CLIP encoder architecture, per-model computation graph builders, and the llama-mtmd-cli tool.
For the server-side multimodal endpoints (/chat/completions with image attachments), see 6.2. For the underlying GGML tensor computation system used by the encoders, see 4.1.
libmtmd sits between the user-facing application and the language model inference engine. It handles:
- Tokenizing prompts that mix text with image/audio placeholders
- Preprocessing raw bitmaps (images and audio) into encoder inputs
- Running the CLIP vision/audio encoders to produce embeddings
- Injecting those embeddings into the LLM via llama_decode at the correct sequence positions

The library exposes two headers: mtmd.h (public C API) and mtmd-helper.h (helpers for evaluation). Internal headers (clip.h, clip-impl.h, clip-model.h, clip-graph.h) are not part of the public API.
Component diagram: libmtmd in context
Sources: tools/mtmd/mtmd.h1-50 tools/mtmd/clip.h1-30 tools/mtmd/CMakeLists.txt1-50
| File | Role |
|---|---|
mtmd.h | Public C API |
mtmd-helper.h | Public helpers (eval, bitmap loading) |
mtmd.cpp | mtmd_context, tokenizer, mtmd_encode_chunk |
mtmd-helper.cpp | mtmd_helper_eval_chunks, batch construction, stb_image, miniaudio |
clip.h | Internal CLIP API used only by mtmd |
clip.cpp | clip_ctx, model loading, clip_image_build_graph dispatch |
clip-impl.h | Internal constants, projector_type enum, image structs |
clip-model.h | clip_model, clip_hparams, clip_layer data structs |
clip-graph.h | Abstract clip_graph base class |
models/models.h | Concrete clip_graph_* subclass declarations |
models/*.cpp | Per-architecture graph builder implementations |
mtmd-cli.cpp | llama-mtmd-cli command-line tool |
Sources: tools/mtmd/CMakeLists.txt1-55
## Core Types (mtmd.h)

| Type | Description |
|---|---|
mtmd_context | Opaque context holding vision (ctx_v) and audio (ctx_a) encoders |
mtmd_bitmap | Raw image (RGBRGB…) or audio (float32 PCM) input |
mtmd_input_chunk | One segment of a tokenized prompt: text, image, or audio |
mtmd_input_chunks | Ordered list of mtmd_input_chunk |
mtmd_image_tokens | Preprocessed image with token geometry (nx, ny) |
mtmd_context_params controls GPU offload, flash attention, thread count, image token limits, and an optional evaluation callback. Defaults are available via mtmd_context_params_default().
Sources: tools/mtmd/mtmd.h86-130
The input text must contain the media marker (default <__media__>, returned by mtmd_default_marker()) once per bitmap. The function splits the prompt at each marker, tokenizes text segments using the LLM vocabulary, and preprocesses bitmaps into image/audio token chunks. Special tokens for the model's image/audio delimiters are inserted automatically.
Sources: tools/mtmd/mtmd.h194-215 tools/mtmd/mtmd.cpp467-522
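The splitting step can be sketched as follows. This is a simplified stand-in, not the library's tokenizer: split_at_marker is a hypothetical helper that separates text segments from marker occurrences, where mtmd_tokenize would then tokenize the text pieces and substitute media chunks for the markers.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: split a prompt at each media marker. Each result
// element is either a text segment or the marker itself (which the real
// tokenizer would replace with an image/audio chunk).
std::vector<std::string> split_at_marker(const std::string & prompt,
                                         const std::string & marker) {
    std::vector<std::string> parts;
    size_t pos = 0;
    while (true) {
        size_t hit = prompt.find(marker, pos);
        if (hit == std::string::npos) {
            parts.push_back(prompt.substr(pos)); // trailing text (may be empty)
            break;
        }
        parts.push_back(prompt.substr(pos, hit - pos)); // text before marker
        parts.push_back(marker);                        // media placeholder
        pos = hit + marker.size();
    }
    return parts;
}
```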
mtmd_encode_chunk runs the CLIP encoder (vision or audio) on the preprocessed chunk and stores the result internally. The caller retrieves a pointer with mtmd_get_output_embd and then calls llama_decode to inject the embeddings into the LLM.
Sources: tools/mtmd/mtmd.h218-230
Query functions such as mtmd_decode_use_mrope(ctx) and mtmd_decode_use_non_causal(ctx) let callers adapt their logic for models using M-RoPE position encoding (Qwen2-VL family) or non-causal attention during image decoding.
Sources: tools/mtmd/mtmd.h117-131
mtmd.h also provides a namespace mtmd with RAII wrappers (context_ptr, bitmap_ptr, input_chunks_ptr, bitmap, bitmaps, input_chunks) built on top of the C API.
Sources: tools/mtmd/mtmd.h248-315
## mtmd_context Internals

mtmd_context (defined in mtmd.cpp) holds:
| Field | Type | Purpose |
|---|---|---|
ctx_v | clip_ctx * | Vision encoder context |
ctx_a | clip_ctx * | Audio encoder context |
text_model | llama_model * | Text LLM for vocabulary lookups |
image_embd_v | std::vector<float> | Output embedding buffer |
slice_tmpl | mtmd_slice_tmpl | Tiling layout for UHD models |
img_beg/img_end | std::string | Model-specific image delimiter tokens |
aud_beg/aud_end | std::string | Model-specific audio delimiter tokens |
audio_preproc | mtmd_audio_preprocessor* | Mel spectrogram preprocessor |
The constructor calls clip_init which returns a clip_init_result with both ctx_v and ctx_a (one may be null if the model only supports one modality).
Sources: tools/mtmd/mtmd.cpp121-220
## clip_ctx Internals

clip_ctx (defined in clip.cpp) is the encoder runtime context:
| Field | Type | Purpose |
|---|---|---|
model | clip_model | Weights and hyperparameters |
ctx_gguf | gguf_context_ptr | GGUF file context |
backend | ggml_backend_t | GPU or CPU backend |
backend_cpu | ggml_backend_t | Always-present CPU backend |
sched | ggml_backend_sched_ptr | Backend scheduler |
flash_attn_type | clip_flash_attn_type | Flash attention mode |
GPU backend selection respects the MTMD_BACKEND_DEVICE environment variable, falling back to the default GPU device or CPU.
Sources: tools/mtmd/clip.cpp141-221
## clip_model and clip_hparams

clip_model (in clip-model.h) stores all tensor weight pointers for a loaded encoder, including:

- Patch embeddings (patch_embeddings_0, patch_embeddings_1)
- Position embeddings (position_embeddings)
- Transformer layers (clip_layer)
- Projector weights (projection, mm_0_w/mm_2_w, etc.)

clip_hparams stores integer hyperparameters loaded from GGUF metadata keys like clip.vision.embedding_length, clip.vision.patch_size, clip.vision.image_grid_pinpoints, etc.
Sources: tools/mtmd/clip-model.h31-102 tools/mtmd/clip-model.h216-390 tools/mtmd/clip-impl.h20-64
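The hyperparameters directly determine how many tokens an image produces. A minimal sketch, assuming a square input and a square patch grid (the merge factor models projectors such as the Qwen2-VL merger, which fuses an n x n window of patches into one token):

```cpp
// Sketch (assumption): number of patch tokens for a square input, derived
// from clip.vision.image_size and clip.vision.patch_size. Mergers reduce
// the count by merge*merge.
int n_patch_tokens(int image_size, int patch_size, int merge = 1) {
    int side = image_size / patch_size; // patches per side
    int n    = side * side;             // raw ViT sequence length
    return n / (merge * merge);         // after optional patch merging
}
```

For example, a 336-pixel input with 14-pixel patches yields a 24 x 24 grid, i.e. 576 tokens.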
## projector_type and Supported Architectures

The projector_type enum (in clip-impl.h) maps to the GGUF key clip.projector_type. During model loading, clip_model_loader::load_hparams reads this string and converts it.
Vision projector types:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_MLP | mlp | LLaVA 1.5, base LLaVA models |
PROJECTOR_TYPE_MINICPMV | resampler | MiniCPM-V 2.5/2.6 |
PROJECTOR_TYPE_QWEN2VL | qwen2vl_merger | Qwen2-VL |
PROJECTOR_TYPE_QWEN25VL | qwen2.5vl_merger | Qwen2.5-VL |
PROJECTOR_TYPE_QWEN3VL | qwen3vl_merger | Qwen3-VL |
PROJECTOR_TYPE_GEMMA3 | gemma3 | Gemma 3 |
PROJECTOR_TYPE_GEMMA3NV | gemma3nv | Gemma 3n (MobileNetV5) |
PROJECTOR_TYPE_IDEFICS3 | idefics3 | SmolVLM / Idefics3 |
PROJECTOR_TYPE_PIXTRAL | pixtral | Pixtral / Mistral Small 3.1 |
PROJECTOR_TYPE_INTERNVL | internvl | InternVL 2.5/3 |
PROJECTOR_TYPE_LLAMA4 | llama4 | Llama 4 Scout |
PROJECTOR_TYPE_KIMIVL | kimivl | Kimi-VL |
PROJECTOR_TYPE_GLM4V | glm4v | GLM-4V |
PROJECTOR_TYPE_COGVLM | cogvlm | CogVLM |
PROJECTOR_TYPE_JANUS_PRO | janus_pro | Janus-Pro |
Audio projector types:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_ULTRAVOX | ultravox | Ultravox |
PROJECTOR_TYPE_QWEN2A | qwen2a | Qwen2-Audio |
PROJECTOR_TYPE_VOXTRAL | voxtral | Voxtral |
PROJECTOR_TYPE_GLMA | glma | GLM-Audio |
PROJECTOR_TYPE_LFM2A | lfm2a | LFM2-Audio (Conformer) |
Mixed modality:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_QWEN25O | qwen2.5o | Qwen2.5-Omni (splits into QWEN25VL + QWEN2A) |
Sources: tools/mtmd/clip-impl.h207-286
## Graph Builders (clip_graph)

clip_graph (clip-graph.h) is an abstract base class. All encoder computations are expressed as GGML computation graphs. The constructor captures the model reference, image dimensions, and initializes a ggml_context and ggml_cgraph. Subclasses override build() to return the completed ggml_cgraph *.
Shared utility methods on clip_graph:
| Method | Purpose |
|---|---|
build_inp() | Conv2D patch embedding (patch_embeddings_0) |
build_vit(...) | Standard Vision Transformer loop (layernorm + self-attention + FFN) |
build_attn(...) | Multi-head attention, with optional flash attention path |
build_ffn(...) | Feed-forward network (GELU, SILU, GELU_ERF, etc.) |
build_norm(...) | LayerNorm or RMSNorm |
build_rope_2d(...) | 2D rotary position embeddings |
build_patch_merge_permute(...) | Pixel shuffle / patch merger |
build_stack(...) | Frame stacking for audio (Ultravox) |
resize_position_embeddings(...) | Bilinear interpolation of position embeddings (SigLIP2 NaFlex) |
Sources: tools/mtmd/clip-graph.h14-117
clip_image_build_graph in clip.cpp dispatches to the correct concrete subclass based on proj_type:
Graph builder class hierarchy:
Sources: tools/mtmd/clip.cpp784-881 tools/mtmd/models/models.h1-129
The clip_graph_siglip subclass is shared by PROJECTOR_TYPE_GEMMA3, PROJECTOR_TYPE_IDEFICS3, PROJECTOR_TYPE_LFM2, and PROJECTOR_TYPE_JANUS_PRO. The clip_graph_whisper_enc subclass handles all Whisper-based audio encoders (ULTRAVOX, VOXTRAL, QWEN2A, GLMA, MUSIC_FLAMINGO).
Sources: tools/mtmd/clip.cpp790-880
Sequence: image input through to LLM decode
Sources: tools/mtmd/mtmd.cpp551-678 tools/mtmd/mtmd-helper.cpp229-305 tools/mtmd/clip.cpp784-881
clip_image_preprocess (called from mtmd_tokenize) handles:

- Resizing and normalizing the raw bitmap to the encoder's expected input
- Selecting the best resolution from clip_hparams::image_res_candidates and splitting into tiles
- Producing a clip_image_f32_batch with grid_x/grid_y set for tiled models

Models supporting high-resolution input split images into an "overview" image plus a grid of tiles. The mtmd_context has a slice_tmpl field (type mtmd_slice_tmpl) controlling how the tokens are arranged around these tiles:
| Enum value | Layout | Models |
|---|---|---|
MTMD_SLICE_TMPL_NONE | No tiling | Most basic models |
MTMD_SLICE_TMPL_MINICPMV_2_5 | <image>overview</image><slice>...</slice> | MiniCPM-V 2.5 |
MTMD_SLICE_TMPL_MINICPMV_2_6 | <image>overview</image><slice>tile</slice>... | MiniCPM-V 2.6+ |
MTMD_SLICE_TMPL_LLAMA4 | <|image_start|>tiles<|image|>overview<|image_end|> | Llama 4 |
MTMD_SLICE_TMPL_IDEFICS3 | Row/column templated tokens | SmolVLM / Idefics3 |
MTMD_SLICE_TMPL_LFM2 | <|img_row_R_col_C|> per tile | LFM2-VL |
Sources: tools/mtmd/mtmd.cpp82-90 tools/mtmd/mtmd.cpp221-332
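The idea behind picking among resolution candidates can be sketched as follows. This is a simplified stand-in, not the exact heuristic in clip.cpp: scale the image to fit each candidate while preserving aspect ratio, prefer the candidate that keeps the most image area, and break ties toward the smaller candidate (fewer wasted pixels).

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Simplified sketch (assumption: not the exact clip.cpp heuristic) of
// choosing the best tiling resolution from image_res_candidates.
std::pair<int,int> pick_resolution(int img_w, int img_h,
                                   const std::vector<std::pair<int,int>> & candidates) {
    std::pair<int,int> best = candidates.front();
    long best_usable = -1, best_waste = 0;
    for (auto [cw, ch] : candidates) {
        double scale = std::min((double)cw / img_w, (double)ch / img_h);
        long usable  = (long)(img_w * scale) * (long)(img_h * scale); // preserved area
        long waste   = (long)cw * ch - usable;                        // padding pixels
        if (usable > best_usable || (usable == best_usable && waste < best_waste)) {
            best_usable = usable;
            best_waste  = waste;
            best        = {cw, ch};
        }
    }
    return best;
}
```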
Qwen2-VL and related models use Multimodal Rotary Position Embeddings (M-RoPE). When mtmd_decode_use_mrope(ctx) returns true, the helper layer must provide a 4-component position vector per token (temporal, height, width, unused).
decode_embd_batch in mtmd-helper.cpp provides two methods for building this layout:

- set_position_mrope_2d(pos_0, nx, ny, seq_id) — for image chunks: height index in dim 1, width index in dim 2.
- set_position_mrope_1d(pos_0, seq_id) — for audio chunks: sequential index replicated across dims 0–2.

Sources: tools/mtmd/mtmd-helper.cpp156-193
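The 2D image layout can be sketched as follows. This is a hedged simplification, not the library's decode_embd_batch code; it assumes the position buffer is laid out dimension-major (all temporal positions, then all height positions, and so on), with an nx x ny token grid starting at pos_0.

```cpp
#include <vector>

// Sketch (assumption) of an M-RoPE 2D position layout for an image chunk:
// 4 * n_tokens values, dimension-major. Dim 0 holds the constant temporal
// position, dim 1 the row (height) index, dim 2 the column (width) index,
// and dim 3 is unused.
std::vector<int> mrope_positions_2d(int pos_0, int nx, int ny) {
    int n = nx * ny;
    std::vector<int> pos(4 * n, 0);
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            int i = y * nx + x;
            pos[0 * n + i] = pos_0;     // temporal (constant per image)
            pos[1 * n + i] = pos_0 + y; // height
            pos[2 * n + i] = pos_0 + x; // width
            // pos[3 * n + i] stays 0 (unused)
        }
    }
    return pos;
}
```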
Audio input follows the same mtmd_bitmap API as images, but with is_audio = true and data containing float32 PCM samples. The pipeline differs in preprocessing:
mtmd_audio_preprocessor (abstract) converts raw PCM to mel spectrograms:

- mtmd_audio_preprocessor_whisper — for Whisper-based models (Ultravox, Qwen2A, Voxtral, GlmA, MusicFlamingo).
- mtmd_audio_preprocessor_conformer — for LFM2-Audio.

The spectrogram is stored in a clip_image_f32 where nx = n_frames and ny = n_mel. clip_graph_whisper_enc::build() constructs the encoder graph: two 1D convolution layers followed by a Transformer loop, then a projector MLP. The audio_has_avgpool() and audio_has_stack_frames() flags on clip_model control whether average pooling or frame stacking is applied after the transformer.
Sources: tools/mtmd/mtmd.cpp335-376 tools/mtmd/models/whisper-enc.cpp1-115 tools/mtmd/clip-model.h378-388
## Helper API (mtmd-helper.h)

The helper layer bridges libmtmd and libllama to simplify the eval loop.
| Function | Description |
|---|---|
mtmd_helper_eval_chunks | Iterates chunks, calls mtmd_encode_chunk then llama_decode, returns new n_past |
mtmd_helper_eval_chunk_single | Evaluates one chunk (text or image/audio) |
mtmd_helper_decode_image_chunk | Builds decode_embd_batch, calls llama_decode in sub-batches |
mtmd_helper_get_n_tokens | Total token count across all chunks |
mtmd_helper_get_n_pos | Total position count (accounts for M-RoPE temporal positions) |
mtmd_helper_bitmap_init_from_file | Load image or audio file using stb_image / miniaudio |
mtmd_helper_bitmap_init_from_file is the primary entry point for file-based input. It uses stb_image.h for image decoding and miniaudio.h for audio decoding (both vendored in tools/mtmd/vendor/).
Sources: tools/mtmd/mtmd-helper.cpp99-115 tools/mtmd/mtmd-helper.cpp229-305 tools/mtmd/mtmd-helper.cpp307-380
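The sub-batching performed by mtmd_helper_decode_image_chunk can be sketched as follows. This is an illustrative simplification (not the helper's actual code): the chunk's n_tokens embeddings are fed to llama_decode in slices of at most n_batch tokens, each with its offset into the embedding buffer.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch (assumption): split n_tokens into (offset, count) sub-batch
// ranges of at most n_batch tokens each, as fed to successive
// llama_decode calls.
std::vector<std::pair<int,int>> make_sub_batches(int n_tokens, int n_batch) {
    std::vector<std::pair<int,int>> ranges;
    for (int off = 0; off < n_tokens; off += n_batch) {
        ranges.push_back({off, std::min(n_batch, n_tokens - off)});
    }
    return ranges;
}
```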
## llama-mtmd-cli Tool

llama-mtmd-cli (tools/mtmd/mtmd-cli.cpp) is the reference CLI for multimodal inference.
Modes:

- Single-turn (-p <prompt> + --image <path>): formats the prompt, loads media, evaluates, generates.
- Interactive chat: /image <path> and /audio <path> load media; the next text message sends them together.

The mtmd_cli_context struct manages all state: mtmd::context_ptr ctx_vision, llama_context, common_sampler, chat history, and a mtmd::bitmaps accumulator for media loaded before the next message.
Usage:

```shell
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf --image photo.jpg -p "<__media__> describe this"
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF --audio recording.wav -p "<__media__> transcribe"
```
By default the mmproj is GPU-offloaded. Pass --no-mmproj-offload to keep it on CPU.
Sources: tools/mtmd/mtmd-cli.cpp67-175 docs/multimodal.md1-40
The mmproj GGUF file carries all encoder configuration under the clip.* namespace. Key metadata fields:
| Key | Purpose |
|---|---|
clip.projector_type | Maps to projector_type enum |
clip.has_vision_encoder | Bool: file contains vision weights |
clip.has_audio_encoder | Bool: file contains audio weights |
clip.vision.image_size | Expected input resolution |
clip.vision.patch_size | Conv2D patch stride |
clip.vision.embedding_length | Hidden dimension n_embd |
clip.vision.block_count | Number of transformer layers |
clip.vision.projection_dim | Output embedding dimension (must equal text model's n_embd) |
clip.vision.image_grid_pinpoints | Candidate resolutions for tiled models |
clip.audio.num_mel_bins | Mel spectrogram height for audio models |
clip.audio.projector.stack_factor | Frame stacking factor (Ultravox) |
Sources: tools/mtmd/clip-impl.h20-65 tools/mtmd/clip.cpp959-1001
To add a new encoder architecture:

1. Add a PROJECTOR_TYPE_* entry to the projector_type enum in clip-impl.h and its string mapping in PROJECTOR_TYPE_NAMES.
2. Create a clip_graph_* subclass in tools/mtmd/models/, inheriting from clip_graph and overriding build().
3. Declare the new tensors in clip_model (clip-model.h) and load them in clip_model_loader::load_tensors (clip.cpp).
4. Add a dispatch case in clip_image_build_graph (clip.cpp).
5. Register any model-specific delimiter tokens in mtmd_context::init_vision() or init_audio() in mtmd.cpp.
6. Add the model to tests.sh for integration testing.

Sources: tools/mtmd/clip.cpp784-881 tools/mtmd/mtmd.cpp221-376 tools/mtmd/tests.sh46-96