This document describes the LLM (Large Language Model) integration system for generating image captions and descriptions in MarkItDown. The system enables multimodal LLMs to analyze images and produce descriptive text that is embedded in the Markdown output. This functionality is optional and activated by providing LLM client configuration to the MarkItDown class.
For information about external service integrations more broadly, see External Tool Integration. For details about Azure Document Intelligence integration, see Azure Document Intelligence Integration.
The LLM integration is configured at the MarkItDown class level and propagated to converters that support image captioning. Three parameters control the behavior:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
llm_client | OpenAI-compatible client | Yes | None | Client object with chat completions API |
llm_model | str | Yes | None | Model identifier (e.g., "gpt-4o") |
llm_prompt | str | No | "Write a detailed caption for this image." | Custom prompt for image description |
Configuration Example:
Sources: packages/markitdown/src/markitdown/_markitdown.py1-500 packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_pptx_converter.py61-200
Two converters currently integrate LLM captioning:
The ImageConverter class processes standalone image files (JPEG, PNG). When LLM parameters are present, it calls _get_llm_description() to generate image captions.
Accepted formats:
image/jpeg, image/png.jpg, .jpeg, .pngOutput structure:
exiftool available)# Description: headingpackages/markitdown/src/markitdown/converters/_image_converter.py1-139
The PptxConverter class processes PowerPoint presentations. For each embedded image shape, it optionally generates an LLM description that is combined with any existing alt text.
Image processing workflow:
shape.image.blob)StreamInfo with image metadatallm_caption() function!<FileRef file-url="https://github.com/microsoft/markitdown/blob/2b6ec9f3/combined_description" undefined file-path="combined_description">Hii</FileRef>packages/markitdown/src/markitdown/converters/_pptx_converter.py92-152
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-138 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152
| Component | File | Lines | Key Entities |
|---|---|---|---|
| ImageConverter LLM method | _image_converter.py | 87-138 | _get_llm_description() |
| PptxConverter LLM integration | _pptx_converter.py | 98-130 | get_shape_content(), llm_caption() |
| Shared caption utility | _llm_caption.py | N/A | llm_caption() |
| Test validation | test_module_misc.py | 415-483 | test_markitdown_llm_parameters(), test_markitdown_llm() |
Both converters follow a similar pattern for encoding images and calling the LLM API:
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py99-118
The converters construct messages in the OpenAI chat completions format:
packages/markitdown/src/markitdown/converters/_image_converter.py121-138
The _get_llm_description() method in ImageConverter performs the following steps:
"Write a detailed caption for this image."stream_info.mimetype, falls back to mimetypes.guess_type(), defaults to "application/octet-stream"file_stream.tell(), restores after reading with file_stream.seek(cur_pos) to allow multiple readsNone on exception during base64 encodingclient.chat.completions.create()packages/markitdown/src/markitdown/converters/_image_converter.py87-138
The PowerPoint converter extracts image metadata from the shape object:
Image Metadata Extraction:
Sources: packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152
Key implementation details:
io.BytesIO stream from shape.image.blob line 116StreamInfo with PPTX-specific metadata lines 106-114llm_caption() with prompt parameter lines 120-126[\r\n\[\]] and collapsing whitespace lines 140-141The test suite validates LLM integration through mock and live tests:
The test_markitdown_llm_parameters() test uses unittest.mock.MagicMock to verify parameter propagation without requiring actual API calls:
Sources: packages/markitdown/tests/test_module_misc.py415-457
Test validation points:
client.chat.completions.create was called line 437call_args to inspect message structure lines 438-441ImageConverter (image files) and PptxConverter (PPTX files) lines 434, 449The test_markitdown_llm() test requires OPENAI_API_KEY environment variable and validates actual LLM responses:
Test expectations:
"5bda1dd6" lines 84-86, 467-469["red", "circle", "blue", "square"] lines 473-474packages/markitdown/tests/test_module_misc.py459-483
Tests are conditionally skipped based on availability:
packages/markitdown/tests/test_module_misc.py28-462
The LLM integration includes multiple fallback mechanisms:
Both converters use try-except blocks that allow conversion to continue without LLM descriptions:
ImageConverter:
packages/markitdown/src/markitdown/converters/_image_converter.py111-113
PptxConverter:
packages/markitdown/src/markitdown/converters/_pptx_converter.py127-129
LLM captioning only activates when both required parameters are present:
packages/markitdown/src/markitdown/converters/_image_converter.py69-71 packages/markitdown/src/markitdown/converters/_pptx_converter.py102-104
This design ensures that:
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-85 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-130
Refresh this wiki