This page documents the libmtmd multimodal library (tools/mtmd/), which provides image and audio input support for llama.cpp. It covers the public C API, the internal CLIP encoder architecture, per-model computation graph builders, and the llama-mtmd-cli tool.
For the server-side multimodal endpoints (/chat/completions with image attachments), see 6.2. For the underlying GGML tensor computation system used by the encoders, see 4.1.
libmtmd sits between the user-facing application and the language model inference engine. It handles:
- Tokenizing prompts that mix text with image/audio placeholders
- Preprocessing raw bitmaps (images and audio) into encoder inputs
- Running the CLIP vision/audio encoders to produce embeddings
- Injecting those embeddings into the LLM via llama_decode at the correct sequence positions

The library exposes two headers: mtmd.h (public C API) and mtmd-helper.h (helpers for evaluation). Internal headers (clip.h, clip-impl.h, clip-model.h, clip-graph.h) are not part of the public API.
Component diagram: libmtmd in context
Sources: tools/mtmd/mtmd.h1-50 tools/mtmd/clip.h1-30 tools/mtmd/CMakeLists.txt1-50
| File | Role |
|---|---|
mtmd.h | Public C API |
mtmd-helper.h | Public helpers (eval, bitmap loading) |
mtmd.cpp | mtmd_context, tokenizer, mtmd_encode_chunk |
mtmd-helper.cpp | mtmd_helper_eval_chunks, batch construction, stb_image, miniaudio |
clip.h | Internal CLIP API used only by mtmd |
clip.cpp | clip_ctx, model loading, clip_image_build_graph dispatch |
clip-impl.h | Internal constants, projector_type enum, image structs |
clip-model.h | clip_model, clip_hparams, clip_layer data structs |
clip-graph.h | Abstract clip_graph base class |
models/models.h | Concrete clip_graph_* subclass declarations |
models/*.cpp | Per-architecture graph builder implementations |
mtmd-cli.cpp | llama-mtmd-cli command-line tool |
Sources: tools/mtmd/CMakeLists.txt1-55
## Core Types (mtmd.h)

| Type | Description |
|---|---|
mtmd_context | Opaque context holding vision (ctx_v) and audio (ctx_a) encoders |
mtmd_bitmap | Raw image (RGBRGB…) or audio (float32 PCM) input |
mtmd_input_chunk | One segment of a tokenized prompt: text, image, or audio |
mtmd_input_chunks | Ordered list of mtmd_input_chunk |
mtmd_image_tokens | Preprocessed image with token geometry (nx, ny) |
mtmd_context_params controls GPU offload, flash attention, thread count, image token limits, and an optional evaluation callback. Defaults are available via mtmd_context_params_default().
Sources: tools/mtmd/mtmd.h86-130
The input text must contain the media marker (default <__media__>, returned by mtmd_default_marker()) once per bitmap. The function splits the prompt at each marker, tokenizes text segments using the LLM vocabulary, and preprocesses bitmaps into image/audio token chunks. Special tokens for the model's image/audio delimiters are inserted automatically.
Sources: tools/mtmd/mtmd.h194-215 tools/mtmd/mtmd.cpp467-522
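The splitting step can be sketched as follows. This is a simplified stand-in, not the library's tokenizer: split_at_marker is a hypothetical helper that separates text segments from marker occurrences, where mtmd_tokenize would then tokenize the text pieces and substitute media chunks for the markers.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: split a prompt at each media marker. Each result
// element is either a text segment or the marker itself (which the real
// tokenizer would replace with an image/audio chunk).
std::vector<std::string> split_at_marker(const std::string & prompt,
                                         const std::string & marker) {
    std::vector<std::string> parts;
    size_t pos = 0;
    while (true) {
        size_t hit = prompt.find(marker, pos);
        if (hit == std::string::npos) {
            parts.push_back(prompt.substr(pos)); // trailing text (may be empty)
            break;
        }
        parts.push_back(prompt.substr(pos, hit - pos)); // text before marker
        parts.push_back(marker);                        // media placeholder
        pos = hit + marker.size();
    }
    return parts;
}
```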
mtmd_encode_chunk runs the CLIP encoder (vision or audio) on the preprocessed chunk and stores the result internally. The caller retrieves a pointer with mtmd_get_output_embd and then calls llama_decode to inject the embeddings into the LLM.
Sources: tools/mtmd/mtmd.h218-230
Query functions such as mtmd_decode_use_mrope(ctx) and mtmd_decode_use_non_causal(ctx) let callers adapt their logic for models using M-RoPE position encoding (Qwen2-VL family) or non-causal attention during image decoding.
Sources: tools/mtmd/mtmd.h117-131
mtmd.h also provides a namespace mtmd with RAII wrappers (context_ptr, bitmap_ptr, input_chunks_ptr, bitmap, bitmaps, input_chunks) built on top of the C API.
Sources: tools/mtmd/mtmd.h248-315
## mtmd_context Internals

mtmd_context (defined in mtmd.cpp) holds:
| Field | Type | Purpose |
|---|---|---|
ctx_v | clip_ctx * | Vision encoder context |
ctx_a | clip_ctx * | Audio encoder context |
text_model | llama_model * | Text LLM for vocabulary lookups |
image_embd_v | std::vector<float> | Output embedding buffer |
slice_tmpl | mtmd_slice_tmpl | Tiling layout for UHD models |
img_beg/img_end | std::string | Model-specific image delimiter tokens |
aud_beg/aud_end | std::string | Model-specific audio delimiter tokens |
audio_preproc | mtmd_audio_preprocessor* | Mel spectrogram preprocessor |
The constructor calls clip_init which returns a clip_init_result with both ctx_v and ctx_a (one may be null if the model only supports one modality).
Sources: tools/mtmd/mtmd.cpp121-220
## clip_ctx Internals

clip_ctx (defined in clip.cpp) is the encoder runtime context:
| Field | Type | Purpose |
|---|---|---|
model | clip_model | Weights and hyperparameters |
ctx_gguf | gguf_context_ptr | GGUF file context |
backend | ggml_backend_t | GPU or CPU backend |
backend_cpu | ggml_backend_t | Always-present CPU backend |
sched | ggml_backend_sched_ptr | Backend scheduler |
flash_attn_type | clip_flash_attn_type | Flash attention mode |
GPU backend selection respects the MTMD_BACKEND_DEVICE environment variable, falling back to the default GPU device or CPU.
Sources: tools/mtmd/clip.cpp141-221
## clip_model and clip_hparams

clip_model (in clip-model.h) stores all tensor weight pointers for a loaded encoder, including:

- Patch embeddings (patch_embeddings_0, patch_embeddings_1)
- Position embeddings (position_embeddings)
- Transformer layers (clip_layer)
- Projector weights (projection, mm_0_w/mm_2_w, etc.)

clip_hparams stores integer hyperparameters loaded from GGUF metadata keys like clip.vision.embedding_length, clip.vision.patch_size, clip.vision.image_grid_pinpoints, etc.
Sources: tools/mtmd/clip-model.h31-102 tools/mtmd/clip-model.h216-390 tools/mtmd/clip-impl.h20-64
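The hyperparameters directly determine how many tokens an image produces. A minimal sketch, assuming a square input and a square patch grid (the merge factor models projectors such as the Qwen2-VL merger, which fuses an n x n window of patches into one token):

```cpp
// Sketch (assumption): number of patch tokens for a square input, derived
// from clip.vision.image_size and clip.vision.patch_size. Mergers reduce
// the count by merge*merge.
int n_patch_tokens(int image_size, int patch_size, int merge = 1) {
    int side = image_size / patch_size; // patches per side
    int n    = side * side;             // raw ViT sequence length
    return n / (merge * merge);         // after optional patch merging
}
```

For example, a 336-pixel input with 14-pixel patches yields a 24 x 24 grid, i.e. 576 tokens.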
## projector_type and Supported Architectures

The projector_type enum (in clip-impl.h) maps to the GGUF key clip.projector_type. During model loading, clip_model_loader::load_hparams reads this string and converts it.
Vision projector types:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_MLP | mlp | LLaVA 1.5, base LLaVA models |
PROJECTOR_TYPE_MINICPMV | resampler | MiniCPM-V 2.5/2.6 |
PROJECTOR_TYPE_QWEN2VL | qwen2vl_merger | Qwen2-VL |
PROJECTOR_TYPE_QWEN25VL | qwen2.5vl_merger | Qwen2.5-VL |
PROJECTOR_TYPE_QWEN3VL | qwen3vl_merger | Qwen3-VL |
PROJECTOR_TYPE_GEMMA3 | gemma3 | Gemma 3 |
PROJECTOR_TYPE_GEMMA3NV | gemma3nv | Gemma 3n (MobileNetV5) |
PROJECTOR_TYPE_IDEFICS3 | idefics3 | SmolVLM / Idefics3 |
PROJECTOR_TYPE_PIXTRAL | pixtral | Pixtral / Mistral Small 3.1 |
PROJECTOR_TYPE_INTERNVL | internvl | InternVL 2.5/3 |
PROJECTOR_TYPE_LLAMA4 | llama4 | Llama 4 Scout |
PROJECTOR_TYPE_KIMIVL | kimivl | Kimi-VL |
PROJECTOR_TYPE_GLM4V | glm4v | GLM-4V |
PROJECTOR_TYPE_COGVLM | cogvlm | CogVLM |
PROJECTOR_TYPE_JANUS_PRO | janus_pro | Janus-Pro |
Audio projector types:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_ULTRAVOX | ultravox | Ultravox |
PROJECTOR_TYPE_QWEN2A | qwen2a | Qwen2-Audio |
PROJECTOR_TYPE_VOXTRAL | voxtral | Voxtral |
PROJECTOR_TYPE_GLMA | glma | GLM-Audio |
PROJECTOR_TYPE_LFM2A | lfm2a | LFM2-Audio (Conformer) |
Mixed modality:
| Enum | String | Models |
|---|---|---|
PROJECTOR_TYPE_QWEN25O | qwen2.5o | Qwen2.5-Omni (splits into QWEN25VL + QWEN2A) |
Sources: tools/mtmd/clip-impl.h207-286
## Graph Builders (clip_graph)

clip_graph (clip-graph.h) is an abstract base class. All encoder computations are expressed as GGML computation graphs. The constructor captures the model reference, image dimensions, and initializes a ggml_context and ggml_cgraph. Subclasses override build() to return the completed ggml_cgraph *.
Shared utility methods on clip_graph:
| Method | Purpose |
|---|---|
build_inp() | Conv2D patch embedding (patch_embeddings_0) |
build_vit(...) | Standard Vision Transformer loop (layernorm + self-attention + FFN) |
build_attn(...) | Multi-head attention, with optional flash attention path |
build_ffn(...) | Feed-forward network (GELU, SILU, GELU_ERF, etc.) |
build_norm(...) | LayerNorm or RMSNorm |
build_rope_2d(...) | 2D rotary position embeddings |
build_patch_merge_permute(...) | Pixel shuffle / patch merger |
build_stack(...) | Frame stacking for audio (Ultravox) |
resize_position_embeddings(...) | Bilinear interpolation of position embeddings (SigLIP2 NaFlex) |
Sources: tools/mtmd/clip-graph.h14-117
clip_image_build_graph in clip.cpp dispatches to the correct concrete subclass based on proj_type:
Graph builder class hierarchy:
Sources: tools/mtmd/clip.cpp784-881 tools/mtmd/models/models.h1-129
The clip_graph_siglip subclass is shared by PROJECTOR_TYPE_GEMMA3, PROJECTOR_TYPE_IDEFICS3, PROJECTOR_TYPE_LFM2, and PROJECTOR_TYPE_JANUS_PRO. The clip_graph_whisper_enc subclass handles all Whisper-based audio encoders (ULTRAVOX, VOXTRAL, QWEN2A, GLMA, MUSIC_FLAMINGO).
Sources: tools/mtmd/clip.cpp790-880
Sequence: image input through to LLM decode
Sources: tools/mtmd/mtmd.cpp551-678 tools/mtmd/mtmd-helper.cpp229-305 tools/mtmd/clip.cpp784-881
clip_image_preprocess (called from mtmd_tokenize) handles:

- Resizing and normalizing the raw bitmap to the encoder's expected input
- Selecting the best resolution from clip_hparams::image_res_candidates and splitting into tiles
- Producing a clip_image_f32_batch with grid_x/grid_y set for tiled models

Models supporting high-resolution input split images into an "overview" image plus a grid of tiles. The mtmd_context has a slice_tmpl field (type mtmd_slice_tmpl) controlling how the tokens are arranged around these tiles:
| Enum value | Layout | Models |
|---|---|---|
MTMD_SLICE_TMPL_NONE | No tiling | Most basic models |
MTMD_SLICE_TMPL_MINICPMV_2_5 | <image>overview</image><slice>...</slice> | MiniCPM-V 2.5 |
MTMD_SLICE_TMPL_MINICPMV_2_6 | <image>overview</image><slice>tile</slice>... | MiniCPM-V 2.6+ |
MTMD_SLICE_TMPL_LLAMA4 | <|image_start|>tiles<|image|>overview<|image_end|> | Llama 4 |
MTMD_SLICE_TMPL_IDEFICS3 | Row/column templated tokens | SmolVLM / Idefics3 |
MTMD_SLICE_TMPL_LFM2 | <|img_row_R_col_C|> per tile | LFM2-VL |
Sources: tools/mtmd/mtmd.cpp82-90 tools/mtmd/mtmd.cpp221-332
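The idea behind picking among resolution candidates can be sketched as follows. This is a simplified stand-in, not the exact heuristic in clip.cpp: scale the image to fit each candidate while preserving aspect ratio, prefer the candidate that keeps the most image area, and break ties toward the smaller candidate (fewer wasted pixels).

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Simplified sketch (assumption: not the exact clip.cpp heuristic) of
// choosing the best tiling resolution from image_res_candidates.
std::pair<int,int> pick_resolution(int img_w, int img_h,
                                   const std::vector<std::pair<int,int>> & candidates) {
    std::pair<int,int> best = candidates.front();
    long best_usable = -1, best_waste = 0;
    for (auto [cw, ch] : candidates) {
        double scale = std::min((double)cw / img_w, (double)ch / img_h);
        long usable  = (long)(img_w * scale) * (long)(img_h * scale); // preserved area
        long waste   = (long)cw * ch - usable;                        // padding pixels
        if (usable > best_usable || (usable == best_usable && waste < best_waste)) {
            best_usable = usable;
            best_waste  = waste;
            best        = {cw, ch};
        }
    }
    return best;
}
```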
Qwen2-VL and related models use Multimodal Rotary Position Embeddings (M-RoPE). When mtmd_decode_use_mrope(ctx) returns true, the helper layer must provide a 4-component position vector per token (temporal, height, width, unused).
decode_embd_batch in mtmd-helper.cpp provides two methods for building this layout:

- set_position_mrope_2d(pos_0, nx, ny, seq_id) — for image chunks: height index in dim 1, width index in dim 2.
- set_position_mrope_1d(pos_0, seq_id) — for audio chunks: sequential index replicated across dims 0–2.

Sources: tools/mtmd/mtmd-helper.cpp156-193
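The 2D image layout can be sketched as follows. This is a hedged simplification, not the library's decode_embd_batch code; it assumes the position buffer is laid out dimension-major (all temporal positions, then all height positions, and so on), with an nx x ny token grid starting at pos_0.

```cpp
#include <vector>

// Sketch (assumption) of an M-RoPE 2D position layout for an image chunk:
// 4 * n_tokens values, dimension-major. Dim 0 holds the constant temporal
// position, dim 1 the row (height) index, dim 2 the column (width) index,
// and dim 3 is unused.
std::vector<int> mrope_positions_2d(int pos_0, int nx, int ny) {
    int n = nx * ny;
    std::vector<int> pos(4 * n, 0);
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            int i = y * nx + x;
            pos[0 * n + i] = pos_0;     // temporal (constant per image)
            pos[1 * n + i] = pos_0 + y; // height
            pos[2 * n + i] = pos_0 + x; // width
            // pos[3 * n + i] stays 0 (unused)
        }
    }
    return pos;
}
```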
Audio input follows the same mtmd_bitmap API as images, but with is_audio = true and data containing float32 PCM samples. The pipeline differs in preprocessing:
mtmd_audio_preprocessor (abstract) converts raw PCM to mel spectrograms:

- mtmd_audio_preprocessor_whisper — for Whisper-based models (Ultravox, Qwen2A, Voxtral, GlmA, MusicFlamingo).
- mtmd_audio_preprocessor_conformer — for LFM2-Audio.

The spectrogram is stored in a clip_image_f32 where nx = n_frames and ny = n_mel. clip_graph_whisper_enc::build() constructs the encoder graph: two 1D convolution layers followed by a Transformer loop, then a projector MLP. The audio_has_avgpool() and audio_has_stack_frames() flags on clip_model control whether average pooling or frame stacking is applied after the transformer.
Sources: tools/mtmd/mtmd.cpp335-376 tools/mtmd/models/whisper-enc.cpp1-115 tools/mtmd/clip-model.h378-388
## Helper API (mtmd-helper.h)

The helper layer bridges libmtmd and libllama to simplify the eval loop.
| Function | Description |
|---|---|
mtmd_helper_eval_chunks | Iterates chunks, calls mtmd_encode_chunk then llama_decode, returns new n_past |
mtmd_helper_eval_chunk_single | Evaluates one chunk (text or image/audio) |
mtmd_helper_decode_image_chunk | Builds decode_embd_batch, calls llama_decode in sub-batches |
mtmd_helper_get_n_tokens | Total token count across all chunks |
mtmd_helper_get_n_pos | Total position count (accounts for M-RoPE temporal positions) |
mtmd_helper_bitmap_init_from_file | Load image or audio file using stb_image / miniaudio |
mtmd_helper_bitmap_init_from_file is the primary entry point for file-based input. It uses stb_image.h for image decoding and miniaudio.h for audio decoding (both vendored in tools/mtmd/vendor/).
Sources: tools/mtmd/mtmd-helper.cpp99-115 tools/mtmd/mtmd-helper.cpp229-305 tools/mtmd/mtmd-helper.cpp307-380
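The sub-batching performed by mtmd_helper_decode_image_chunk can be sketched as follows. This is an illustrative simplification (not the helper's actual code): the chunk's n_tokens embeddings are fed to llama_decode in slices of at most n_batch tokens, each with its offset into the embedding buffer.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch (assumption): split n_tokens into (offset, count) sub-batch
// ranges of at most n_batch tokens each, as fed to successive
// llama_decode calls.
std::vector<std::pair<int,int>> make_sub_batches(int n_tokens, int n_batch) {
    std::vector<std::pair<int,int>> ranges;
    for (int off = 0; off < n_tokens; off += n_batch) {
        ranges.push_back({off, std::min(n_batch, n_tokens - off)});
    }
    return ranges;
}
```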
## llama-mtmd-cli Tool

llama-mtmd-cli (tools/mtmd/mtmd-cli.cpp) is the reference CLI for multimodal inference.
Modes:

- Single-turn (-p <prompt> + --image <path>): formats the prompt, loads media, evaluates, generates.
- Interactive chat: /image <path> and /audio <path> load media; the next text message sends them together.

The mtmd_cli_context struct manages all state: mtmd::context_ptr ctx_vision, llama_context, common_sampler, chat history, and a mtmd::bitmaps accumulator for media loaded before the next message.
Usage:

```shell
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf --image photo.jpg -p "<__media__> describe this"
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF --audio recording.wav -p "<__media__> transcribe"
```
By default the mmproj is GPU-offloaded. Pass --no-mmproj-offload to keep it on CPU.
Sources: tools/mtmd/mtmd-cli.cpp67-175 docs/multimodal.md1-40
The mmproj GGUF file carries all encoder configuration under the clip.* namespace. Key metadata fields:
| Key | Purpose |
|---|---|
clip.projector_type | Maps to projector_type enum |
clip.has_vision_encoder | Bool: file contains vision weights |
clip.has_audio_encoder | Bool: file contains audio weights |
clip.vision.image_size | Expected input resolution |
clip.vision.patch_size | Conv2D patch stride |
clip.vision.embedding_length | Hidden dimension n_embd |
clip.vision.block_count | Number of transformer layers |
clip.vision.projection_dim | Output embedding dimension (must equal text model's n_embd) |
clip.vision.image_grid_pinpoints | Candidate resolutions for tiled models |
clip.audio.num_mel_bins | Mel spectrogram height for audio models |
clip.audio.projector.stack_factor | Frame stacking factor (Ultravox) |
Sources: tools/mtmd/clip-impl.h20-65 tools/mtmd/clip.cpp959-1001
To add a new encoder architecture:

1. Add a PROJECTOR_TYPE_* entry to the projector_type enum in clip-impl.h and its string mapping in PROJECTOR_TYPE_NAMES.
2. Create a clip_graph_* subclass in tools/mtmd/models/, inheriting from clip_graph and overriding build().
3. Declare the new tensors in clip_model (clip-model.h) and load them in clip_model_loader::load_tensors (clip.cpp).
4. Add a dispatch case in clip_image_build_graph (clip.cpp).
5. Register any model-specific delimiter tokens in mtmd_context::init_vision() or init_audio() in mtmd.cpp.
6. Add the model to tests.sh for integration testing.

Sources: tools/mtmd/clip.cpp784-881 tools/mtmd/mtmd.cpp221-376 tools/mtmd/tests.sh46-96