LLM Integration for Image Captioning

Relevant source files

Purpose and Scope

This document describes the LLM (Large Language Model) integration system for generating image captions and descriptions in MarkItDown. The system enables multimodal LLMs to analyze images and produce descriptive text that is embedded in the Markdown output. This functionality is optional and activated by providing LLM client configuration to the MarkItDown class.

For information about external service integrations more broadly, see External Tool Integration. For details about Azure Document Intelligence integration, see Azure Document Intelligence Integration.

Configuration and Initialization

The LLM integration is configured at the MarkItDown class level and propagated to converters that support image captioning. Three parameters control the behavior:

Parameter	Type	Required	Default	Description
`llm_client`	OpenAI-compatible client	Yes	None	Client object with chat completions API
`llm_model`	str	Yes	None	Model identifier (e.g., "gpt-4o")
`llm_prompt`	str	No	"Write a detailed caption for this image."	Custom prompt for image description

Configuration Example:

README.md167-177

Parameter Propagation Flow

Sources: packages/markitdown/src/markitdown/_markitdown.py1-500 packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_pptx_converter.py61-200

Supported Converters

Two converters currently integrate LLM captioning:

ImageConverter

The ImageConverter class processes standalone image files (JPEG, PNG). When LLM parameters are present, it calls _get_llm_description() to generate image captions.

Accepted formats:

MIME types: image/jpeg, image/png
Extensions: .jpg, .jpeg, .png

Output structure:

EXIF metadata (if exiftool available)
LLM-generated description under # Description: heading

packages/markitdown/src/markitdown/converters/_image_converter.py1-139

PptxConverter

The PptxConverter class processes PowerPoint presentations. For each embedded image shape, it optionally generates an LLM description that is combined with any existing alt text.

Image processing workflow:

Extract image blob from shape (shape.image.blob)
Create temporary StreamInfo with image metadata
Call llm_caption() function
Combine LLM description with existing alt text
Embed in Markdown as !<FileRef file-url="https://github.com/microsoft/markitdown/blob/2b6ec9f3/combined_description" undefined file-path="combined_description">Hii</FileRef>

packages/markitdown/src/markitdown/converters/_pptx_converter.py92-152

Implementation Architecture

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-138 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152

Code Entity Mapping

Component	File	Lines	Key Entities
ImageConverter LLM method	`_image_converter.py`	87-138	`_get_llm_description()`
PptxConverter LLM integration	`_pptx_converter.py`	98-130	`get_shape_content()`, `llm_caption()`
Shared caption utility	`_llm_caption.py`	N/A	`llm_caption()`
Test validation	`test_module_misc.py`	415-483	`test_markitdown_llm_parameters()`, `test_markitdown_llm()`

Image Encoding and API Integration

Both converters follow a similar pattern for encoding images and calling the LLM API:

Data URI Construction

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py99-118

OpenAI API Call Structure

The converters construct messages in the OpenAI chat completions format:

packages/markitdown/src/markitdown/converters/_image_converter.py121-138

ImageConverter Implementation Details

The _get_llm_description() method in ImageConverter performs the following steps:

Prompt defaulting: If no custom prompt provided, uses default: "Write a detailed caption for this image."
MIME type resolution: Attempts to get from stream_info.mimetype, falls back to mimetypes.guess_type(), defaults to "application/octet-stream"
Stream position management: Saves current position with file_stream.tell(), restores after reading with file_stream.seek(cur_pos) to allow multiple reads
Error handling: Returns None on exception during base64 encoding
API call: Synchronous call to client.chat.completions.create()

packages/markitdown/src/markitdown/converters/_image_converter.py87-138

PptxConverter Implementation Details

The PowerPoint converter extracts image metadata from the shape object:

Image Metadata Extraction:

Sources: packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152

Key implementation details:

Creates temporary io.BytesIO stream from shape.image.blob line 116
Constructs StreamInfo with PPTX-specific metadata lines 106-114
Calls llm_caption() with prompt parameter lines 120-126
Falls back silently on exception (no LLM description added) lines 127-129
Extracts embedded alt text from XML attributes lines 132-136
Combines LLM description and alt text with newline separator line 139
Sanitizes combined text by removing special characters [\r\n\[\]] and collapsing whitespace lines 140-141

Testing and Validation

The test suite validates LLM integration through mock and live tests:

Mock Testing Pattern

The test_markitdown_llm_parameters() test uses unittest.mock.MagicMock to verify parameter propagation without requiring actual API calls:

Sources: packages/markitdown/tests/test_module_misc.py415-457

Test validation points:

Verifies client.chat.completions.create was called line 437
Extracts call_args to inspect message structure lines 438-441
Validates prompt was passed as first content element line 441
Tests both ImageConverter (image files) and PptxConverter (PPTX files) lines 434, 449

Live API Testing

The test_markitdown_llm() test requires OPENAI_API_KEY environment variable and validates actual LLM responses:

Test expectations:

Image file contains specific test string: "5bda1dd6" lines 84-86, 467-469
LLM description includes contextually relevant terms: ["red", "circle", "blue", "square"] lines 473-474
PPTX files receive LLM captions for embedded images lines 477-482

packages/markitdown/tests/test_module_misc.py459-483

Conditional Test Execution

Tests are conditionally skipped based on availability:

packages/markitdown/tests/test_module_misc.py28-462

Error Handling and Fallbacks

The LLM integration includes multiple fallback mechanisms:

Silent Failure Pattern

Both converters use try-except blocks that allow conversion to continue without LLM descriptions:

ImageConverter:

packages/markitdown/src/markitdown/converters/_image_converter.py111-113

PptxConverter:

packages/markitdown/src/markitdown/converters/_pptx_converter.py127-129

Conditional Activation

LLM captioning only activates when both required parameters are present:

packages/markitdown/src/markitdown/converters/_image_converter.py69-71 packages/markitdown/src/markitdown/converters/_pptx_converter.py102-104

This design ensures that:

Converters function normally without LLM configuration
Missing API keys or network failures don't break document conversion
Users can incrementally adopt LLM features without code changes

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-85 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-130

LLM Integration for Image Captioning

Relevant source files

Purpose and Scope

Configuration and Initialization

The LLM integration is configured at the MarkItDown class level and propagated to converters that support image captioning. Three parameters control the behavior:

Parameter	Type	Required	Default	Description
`llm_client`	OpenAI-compatible client	Yes	None	Client object with chat completions API
`llm_model`	str	Yes	None	Model identifier (e.g., "gpt-4o")
`llm_prompt`	str	No	"Write a detailed caption for this image."	Custom prompt for image description

Configuration Example:

README.md167-177

Parameter Propagation Flow

Supported Converters

Two converters currently integrate LLM captioning:

ImageConverter

The ImageConverter class processes standalone image files (JPEG, PNG). When LLM parameters are present, it calls _get_llm_description() to generate image captions.

Accepted formats:

MIME types: image/jpeg, image/png
Extensions: .jpg, .jpeg, .png

Output structure:

EXIF metadata (if exiftool available)
LLM-generated description under # Description: heading

packages/markitdown/src/markitdown/converters/_image_converter.py1-139

PptxConverter

The PptxConverter class processes PowerPoint presentations. For each embedded image shape, it optionally generates an LLM description that is combined with any existing alt text.

Image processing workflow:

Extract image blob from shape (shape.image.blob)
Create temporary StreamInfo with image metadata
Call llm_caption() function
Combine LLM description with existing alt text
Embed in Markdown as !<FileRef file-url="https://github.com/microsoft/markitdown/blob/2b6ec9f3/combined_description" undefined file-path="combined_description">Hii</FileRef>

packages/markitdown/src/markitdown/converters/_pptx_converter.py92-152

Implementation Architecture

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-138 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152

Code Entity Mapping

Component	File	Lines	Key Entities
ImageConverter LLM method	`_image_converter.py`	87-138	`_get_llm_description()`
PptxConverter LLM integration	`_pptx_converter.py`	98-130	`get_shape_content()`, `llm_caption()`
Shared caption utility	`_llm_caption.py`	N/A	`llm_caption()`
Test validation	`test_module_misc.py`	415-483	`test_markitdown_llm_parameters()`, `test_markitdown_llm()`

Image Encoding and API Integration

Both converters follow a similar pattern for encoding images and calling the LLM API:

Data URI Construction

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py99-118

OpenAI API Call Structure

The converters construct messages in the OpenAI chat completions format:

packages/markitdown/src/markitdown/converters/_image_converter.py121-138

ImageConverter Implementation Details

The _get_llm_description() method in ImageConverter performs the following steps:

Prompt defaulting: If no custom prompt provided, uses default: "Write a detailed caption for this image."
MIME type resolution: Attempts to get from stream_info.mimetype, falls back to mimetypes.guess_type(), defaults to "application/octet-stream"
Stream position management: Saves current position with file_stream.tell(), restores after reading with file_stream.seek(cur_pos) to allow multiple reads
Error handling: Returns None on exception during base64 encoding
API call: Synchronous call to client.chat.completions.create()

packages/markitdown/src/markitdown/converters/_image_converter.py87-138

PptxConverter Implementation Details

The PowerPoint converter extracts image metadata from the shape object:

Image Metadata Extraction:

Sources: packages/markitdown/src/markitdown/converters/_pptx_converter.py98-152

Key implementation details:

Creates temporary io.BytesIO stream from shape.image.blob line 116
Constructs StreamInfo with PPTX-specific metadata lines 106-114
Calls llm_caption() with prompt parameter lines 120-126
Falls back silently on exception (no LLM description added) lines 127-129
Extracts embedded alt text from XML attributes lines 132-136
Combines LLM description and alt text with newline separator line 139
Sanitizes combined text by removing special characters [\r\n\[\]] and collapsing whitespace lines 140-141

Testing and Validation

The test suite validates LLM integration through mock and live tests:

Mock Testing Pattern

The test_markitdown_llm_parameters() test uses unittest.mock.MagicMock to verify parameter propagation without requiring actual API calls:

Sources: packages/markitdown/tests/test_module_misc.py415-457

Test validation points:

Verifies client.chat.completions.create was called line 437
Extracts call_args to inspect message structure lines 438-441
Validates prompt was passed as first content element line 441
Tests both ImageConverter (image files) and PptxConverter (PPTX files) lines 434, 449

Live API Testing

The test_markitdown_llm() test requires OPENAI_API_KEY environment variable and validates actual LLM responses:

Test expectations:

Image file contains specific test string: "5bda1dd6" lines 84-86, 467-469
LLM description includes contextually relevant terms: ["red", "circle", "blue", "square"] lines 473-474
PPTX files receive LLM captions for embedded images lines 477-482

packages/markitdown/tests/test_module_misc.py459-483

Conditional Test Execution

Tests are conditionally skipped based on availability:

packages/markitdown/tests/test_module_misc.py28-462

Error Handling and Fallbacks

The LLM integration includes multiple fallback mechanisms:

Silent Failure Pattern

Both converters use try-except blocks that allow conversion to continue without LLM descriptions:

ImageConverter:

packages/markitdown/src/markitdown/converters/_image_converter.py111-113

PptxConverter:

packages/markitdown/src/markitdown/converters/_pptx_converter.py127-129

Conditional Activation

LLM captioning only activates when both required parameters are present:

packages/markitdown/src/markitdown/converters/_image_converter.py69-71 packages/markitdown/src/markitdown/converters/_pptx_converter.py102-104

This design ensures that:

Converters function normally without LLM configuration
Missing API keys or network failures don't break document conversion
Users can incrementally adopt LLM features without code changes

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-85 packages/markitdown/src/markitdown/converters/_pptx_converter.py98-130

LLM Integration for Image Captioning

Purpose and Scope

Configuration and Initialization

Parameter Propagation Flow

Supported Converters

ImageConverter

PptxConverter

Implementation Architecture

Code Entity Mapping

Image Encoding and API Integration

Data URI Construction

OpenAI API Call Structure

ImageConverter Implementation Details

PptxConverter Implementation Details

Testing and Validation

Mock Testing Pattern

Live API Testing

Conditional Test Execution

Error Handling and Fallbacks

Silent Failure Pattern

Conditional Activation

On this page

LLM Integration for Image Captioning

Purpose and Scope

Configuration and Initialization

Parameter Propagation Flow

Supported Converters

ImageConverter

PptxConverter

Implementation Architecture

Code Entity Mapping

Image Encoding and API Integration

Data URI Construction

OpenAI API Call Structure

ImageConverter Implementation Details

PptxConverter Implementation Details

Testing and Validation

Mock Testing Pattern

Live API Testing

Conditional Test Execution

Error Handling and Fallbacks

Silent Failure Pattern

Conditional Activation

On this page