Media Converters

Relevant source files

This page documents MarkItDown's media converters for processing image and audio files. The ImageConverter handles JPEG and PNG images by extracting metadata and generating descriptions, while audio transcription functionality supports converting speech to text.

For web-based media including YouTube videos, see Web Content Converters. For detailed information on LLM integration for image captioning, see LLM Integration for Image Captioning. For configuration of external tools like exiftool and SpeechRecognition, see External Tool Integration.

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py

ImageConverter Overview

The ImageConverter class processes image files by extracting metadata and generating textual descriptions. It accepts JPEG and PNG formats and produces markdown output containing metadata fields and optional AI-generated descriptions.

Accepted Formats:

MIME Types	File Extensions	Processing Capabilities
`image/jpeg`	`.jpg`, `.jpeg`	Metadata extraction, LLM description
`image/png`	`.png`	Metadata extraction, LLM description

The converter operates in two stages: metadata extraction using exiftool (if available) and description generation using an LLM client (if configured).

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py8-13 packages/markitdown/src/markitdown/converters/_image_converter.py16-37

ImageConverter Processing Pipeline

Diagram: ImageConverter processing flow showing metadata extraction and LLM description stages

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_image_converter.py87-138

ImageConverter Code Structure

Diagram: ImageConverter class structure mapping code entities to their roles

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py16-138

Metadata Extraction

The ImageConverter extracts image metadata by calling exiftool_metadata() from the _exiftool module. The function requires exiftool to be installed and accessible via the exiftool_path parameter or the EXIFTOOL_PATH environment variable.

Extracted Metadata Fields:

ImageSize - Dimensions of the image
Title - Image title
Caption - Image caption
Description - Image description
Keywords - Associated keywords
Artist / Author - Creator information
DateTimeOriginal / CreateDate - Timestamp information
GPSPosition - Geolocation data

The converter iterates through these fields and includes any present values in the markdown output packages/markitdown/src/markitdown/converters/_image_converter.py48-66

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py4 packages/markitdown/src/markitdown/converters/_image_converter.py48-66

LLM-Based Image Description

When an LLM client is configured, ImageConverter generates textual descriptions of images using the _get_llm_description() method. This method:

Converts image to data URI: Reads the file stream, encodes to base64, and constructs a data URI with appropriate MIME type packages/markitdown/src/markitdown/converters/_image_converter.py100-118
Prepares API request: Creates a message array with the user prompt and image URL in OpenAI's chat completion format packages/markitdown/src/markitdown/converters/_image_converter.py120-134
Calls LLM API: Invokes client.chat.completions.create() with the configured model packages/markitdown/src/markitdown/converters/_image_converter.py137-138

Default Prompt: If no custom prompt is provided via llm_prompt, the default is "Write a detailed caption for this image." packages/markitdown/src/markitdown/converters/_image_converter.py96-97

The description is appended to the markdown output under a "Description:" heading packages/markitdown/src/markitdown/converters/_image_converter.py80-81

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-81 packages/markitdown/src/markitdown/converters/_image_converter.py87-138

Audio Transcription

The transcribe_audio() function provides audio-to-text conversion using the SpeechRecognition library with Google's speech recognition service. This function is typically called by an audio converter class.

Diagram: Audio transcription processing flow

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py23-49

Audio Format Support

The transcribe_audio() function handles multiple audio formats with different processing paths:

Native Formats (Direct Processing):

wav - Waveform Audio File Format
aiff - Audio Interchange File Format
flac - Free Lossless Audio Codec

Converted Formats (via pydub):

mp3 - MPEG Audio Layer III
mp4 - MPEG-4 audio container

For MP3 and MP4 files, the function uses pydub to convert the audio to WAV format before processing packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-41 Native formats are passed directly to the SpeechRecognition library packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-35

Unsupported formats raise a ValueError packages/markitdown/src/markitdown/converters/_transcribe_audio.py43

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-43

Speech Recognition Implementation

The transcription process uses the following SpeechRecognition API workflow:

Initialize recognizer: Creates sr.Recognizer() instance packages/markitdown/src/markitdown/converters/_transcribe_audio.py45
Load audio file: Opens audio source with sr.AudioFile() context manager packages/markitdown/src/markitdown/converters/_transcribe_audio.py46
Record audio data: Captures entire audio with recognizer.record(source) packages/markitdown/src/markitdown/converters/_transcribe_audio.py47
Recognize speech: Calls recognizer.recognize_google(audio) to transcribe using Google's service packages/markitdown/src/markitdown/converters/_transcribe_audio.py48
Handle empty results: Returns "[No speech detected]" if transcript is empty, otherwise returns the transcribed text packages/markitdown/src/markitdown/converters/_transcribe_audio.py49

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py45-49

Dependency Management

Both media converters implement deferred dependency checking to support optional installation:

Diagram: Dependency checking pattern for optional features

ImageConverter Dependencies:

exiftool (external binary): Optional, checked at runtime via exiftool_metadata()
LLM client: Optional, checked in convert() via kwarg presence

Audio Transcription Dependencies:

speech_recognition: Required, checked at function entry packages/markitdown/src/markitdown/converters/_transcribe_audio.py24-32
pydub: Required, checked at function entry packages/markitdown/src/markitdown/converters/_transcribe_audio.py24-32

Installation command for audio transcription: pip install markitdown[audio-transcription] or pip install markitdown[all]

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py8-32 packages/markitdown/src/markitdown/converters/_image_converter.py4

Integration with Azure Document Intelligence

Azure Document Intelligence can also process image files as an alternative to ImageConverter. The DocumentIntelligenceConverter supports these image formats:

Format	MIME Types	File Extensions
JPEG	`image/jpeg`	`.jpg`, `.jpeg`
PNG	`image/png`	`.png`
BMP	`image/bmp`	`.bmp`
TIFF	`image/tiff`	`.tiff`

When configured, the Document Intelligence converter uses OCR with features like:

DocumentAnalysisFeature.FORMULAS - Formula extraction
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION - High-resolution OCR
DocumentAnalysisFeature.STYLE_FONT - Font style extraction

These features are applied to image types but not to office document types packages/markitdown/src/markitdown/converters/_doc_intel_converter.py207-235

The converter registry's priority system determines whether ImageConverter or DocumentIntelligenceConverter processes an image file based on registration order.

Sources: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py55-69 packages/markitdown/src/markitdown/converters/_doc_intel_converter.py93-101 packages/markitdown/src/markitdown/converters/_doc_intel_converter.py207-235

Configuration Examples

ImageConverter with metadata extraction:

ImageConverter with LLM description:

Audio transcription (via converter using transcribe_audio):

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_transcribe_audio.py23

Media Converters

Relevant source files

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py

ImageConverter Overview

Accepted Formats:

MIME Types	File Extensions	Processing Capabilities
`image/jpeg`	`.jpg`, `.jpeg`	Metadata extraction, LLM description
`image/png`	`.png`	Metadata extraction, LLM description

The converter operates in two stages: metadata extraction using exiftool (if available) and description generation using an LLM client (if configured).

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py8-13 packages/markitdown/src/markitdown/converters/_image_converter.py16-37

ImageConverter Processing Pipeline

Diagram: ImageConverter processing flow showing metadata extraction and LLM description stages

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_image_converter.py87-138

ImageConverter Code Structure

Diagram: ImageConverter class structure mapping code entities to their roles

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py16-138

Metadata Extraction

Extracted Metadata Fields:

ImageSize - Dimensions of the image
Title - Image title
Caption - Image caption
Description - Image description
Keywords - Associated keywords
Artist / Author - Creator information
DateTimeOriginal / CreateDate - Timestamp information
GPSPosition - Geolocation data

The converter iterates through these fields and includes any present values in the markdown output packages/markitdown/src/markitdown/converters/_image_converter.py48-66

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py4 packages/markitdown/src/markitdown/converters/_image_converter.py48-66

LLM-Based Image Description

When an LLM client is configured, ImageConverter generates textual descriptions of images using the _get_llm_description() method. This method:

Converts image to data URI: Reads the file stream, encodes to base64, and constructs a data URI with appropriate MIME type packages/markitdown/src/markitdown/converters/_image_converter.py100-118
Prepares API request: Creates a message array with the user prompt and image URL in OpenAI's chat completion format packages/markitdown/src/markitdown/converters/_image_converter.py120-134
Calls LLM API: Invokes client.chat.completions.create() with the configured model packages/markitdown/src/markitdown/converters/_image_converter.py137-138

Default Prompt: If no custom prompt is provided via llm_prompt, the default is "Write a detailed caption for this image." packages/markitdown/src/markitdown/converters/_image_converter.py96-97

The description is appended to the markdown output under a "Description:" heading packages/markitdown/src/markitdown/converters/_image_converter.py80-81

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-81 packages/markitdown/src/markitdown/converters/_image_converter.py87-138

Audio Transcription

Diagram: Audio transcription processing flow

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py23-49

Audio Format Support

The transcribe_audio() function handles multiple audio formats with different processing paths:

Native Formats (Direct Processing):

wav - Waveform Audio File Format
aiff - Audio Interchange File Format
flac - Free Lossless Audio Codec

Converted Formats (via pydub):

mp3 - MPEG Audio Layer III
mp4 - MPEG-4 audio container

Unsupported formats raise a ValueError packages/markitdown/src/markitdown/converters/_transcribe_audio.py43

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-43

Speech Recognition Implementation

The transcription process uses the following SpeechRecognition API workflow:

Initialize recognizer: Creates sr.Recognizer() instance packages/markitdown/src/markitdown/converters/_transcribe_audio.py45
Load audio file: Opens audio source with sr.AudioFile() context manager packages/markitdown/src/markitdown/converters/_transcribe_audio.py46
Record audio data: Captures entire audio with recognizer.record(source) packages/markitdown/src/markitdown/converters/_transcribe_audio.py47
Recognize speech: Calls recognizer.recognize_google(audio) to transcribe using Google's service packages/markitdown/src/markitdown/converters/_transcribe_audio.py48
Handle empty results: Returns "[No speech detected]" if transcript is empty, otherwise returns the transcribed text packages/markitdown/src/markitdown/converters/_transcribe_audio.py49

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py45-49

Dependency Management

Both media converters implement deferred dependency checking to support optional installation:

Diagram: Dependency checking pattern for optional features

ImageConverter Dependencies:

exiftool (external binary): Optional, checked at runtime via exiftool_metadata()
LLM client: Optional, checked in convert() via kwarg presence

Audio Transcription Dependencies:

speech_recognition: Required, checked at function entry packages/markitdown/src/markitdown/converters/_transcribe_audio.py24-32
pydub: Required, checked at function entry packages/markitdown/src/markitdown/converters/_transcribe_audio.py24-32

Installation command for audio transcription: pip install markitdown[audio-transcription] or pip install markitdown[all]

Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py8-32 packages/markitdown/src/markitdown/converters/_image_converter.py4

Integration with Azure Document Intelligence

Azure Document Intelligence can also process image files as an alternative to ImageConverter. The DocumentIntelligenceConverter supports these image formats:

Format	MIME Types	File Extensions
JPEG	`image/jpeg`	`.jpg`, `.jpeg`
PNG	`image/png`	`.png`
BMP	`image/bmp`	`.bmp`
TIFF	`image/tiff`	`.tiff`

When configured, the Document Intelligence converter uses OCR with features like:

DocumentAnalysisFeature.FORMULAS - Formula extraction
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION - High-resolution OCR
DocumentAnalysisFeature.STYLE_FONT - Font style extraction

These features are applied to image types but not to office document types packages/markitdown/src/markitdown/converters/_doc_intel_converter.py207-235

The converter registry's priority system determines whether ImageConverter or DocumentIntelligenceConverter processes an image file based on registration order.

Configuration Examples

ImageConverter with metadata extraction:

ImageConverter with LLM description:

Audio transcription (via converter using transcribe_audio):

Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_transcribe_audio.py23

Media Converters

ImageConverter Overview

ImageConverter Processing Pipeline

ImageConverter Code Structure

Metadata Extraction

LLM-Based Image Description

Audio Transcription

Audio Format Support

Speech Recognition Implementation

Dependency Management

Integration with Azure Document Intelligence

Configuration Examples

On this page

Media Converters

ImageConverter Overview

ImageConverter Processing Pipeline

ImageConverter Code Structure

Metadata Extraction

LLM-Based Image Description

Audio Transcription

Audio Format Support

Speech Recognition Implementation

Dependency Management

Integration with Azure Document Intelligence

Configuration Examples

On this page