This page documents MarkItDown's media converters for processing image and audio files. The ImageConverter handles JPEG and PNG images by extracting metadata and generating descriptions, while audio transcription functionality supports converting speech to text.
For web-based media including YouTube videos, see Web Content Converters. For detailed information on LLM integration for image captioning, see LLM Integration for Image Captioning. For configuration of external tools like exiftool and SpeechRecognition, see External Tool Integration.
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py
The ImageConverter class processes image files by extracting metadata and generating textual descriptions. It accepts JPEG and PNG formats and produces markdown output containing metadata fields and optional AI-generated descriptions.
Accepted Formats:
| MIME Types | File Extensions | Processing Capabilities |
|---|---|---|
image/jpeg | .jpg, .jpeg | Metadata extraction, LLM description |
image/png | .png | Metadata extraction, LLM description |
The converter operates in two stages: metadata extraction using exiftool (if available) and description generation using an LLM client (if configured).
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py8-13 packages/markitdown/src/markitdown/converters/_image_converter.py16-37
Diagram: ImageConverter processing flow showing metadata extraction and LLM description stages
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_image_converter.py87-138
Diagram: ImageConverter class structure mapping code entities to their roles
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py16-138
The ImageConverter extracts image metadata by calling exiftool_metadata() from the _exiftool module. The function requires exiftool to be installed and accessible via the exiftool_path parameter or the EXIFTOOL_PATH environment variable.
Extracted Metadata Fields:
ImageSize - Dimensions of the imageTitle - Image titleCaption - Image captionDescription - Image descriptionKeywords - Associated keywordsArtist / Author - Creator informationDateTimeOriginal / CreateDate - Timestamp informationGPSPosition - Geolocation dataThe converter iterates through these fields and includes any present values in the markdown output packages/markitdown/src/markitdown/converters/_image_converter.py48-66
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py4 packages/markitdown/src/markitdown/converters/_image_converter.py48-66
When an LLM client is configured, ImageConverter generates textual descriptions of images using the _get_llm_description() method. This method:
Converts image to data URI: Reads the file stream, encodes to base64, and constructs a data URI with appropriate MIME type packages/markitdown/src/markitdown/converters/_image_converter.py100-118
Prepares API request: Creates a message array with the user prompt and image URL in OpenAI's chat completion format packages/markitdown/src/markitdown/converters/_image_converter.py120-134
Calls LLM API: Invokes client.chat.completions.create() with the configured model packages/markitdown/src/markitdown/converters/_image_converter.py137-138
Default Prompt: If no custom prompt is provided via llm_prompt, the default is "Write a detailed caption for this image." packages/markitdown/src/markitdown/converters/_image_converter.py96-97
The description is appended to the markdown output under a "Description:" heading packages/markitdown/src/markitdown/converters/_image_converter.py80-81
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py69-81 packages/markitdown/src/markitdown/converters/_image_converter.py87-138
The transcribe_audio() function provides audio-to-text conversion using the SpeechRecognition library with Google's speech recognition service. This function is typically called by an audio converter class.
Diagram: Audio transcription processing flow
Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py23-49
The transcribe_audio() function handles multiple audio formats with different processing paths:
Native Formats (Direct Processing):
wav - Waveform Audio File Formataiff - Audio Interchange File Formatflac - Free Lossless Audio CodecConverted Formats (via pydub):
mp3 - MPEG Audio Layer IIImp4 - MPEG-4 audio containerFor MP3 and MP4 files, the function uses pydub to convert the audio to WAV format before processing packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-41 Native formats are passed directly to the SpeechRecognition library packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-35
Unsupported formats raise a ValueError packages/markitdown/src/markitdown/converters/_transcribe_audio.py43
Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py34-43
The transcription process uses the following SpeechRecognition API workflow:
Initialize recognizer: Creates sr.Recognizer() instance packages/markitdown/src/markitdown/converters/_transcribe_audio.py45
Load audio file: Opens audio source with sr.AudioFile() context manager packages/markitdown/src/markitdown/converters/_transcribe_audio.py46
Record audio data: Captures entire audio with recognizer.record(source) packages/markitdown/src/markitdown/converters/_transcribe_audio.py47
Recognize speech: Calls recognizer.recognize_google(audio) to transcribe using Google's service packages/markitdown/src/markitdown/converters/_transcribe_audio.py48
Handle empty results: Returns "[No speech detected]" if transcript is empty, otherwise returns the transcribed text packages/markitdown/src/markitdown/converters/_transcribe_audio.py49
Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py45-49
Both media converters implement deferred dependency checking to support optional installation:
Diagram: Dependency checking pattern for optional features
ImageConverter Dependencies:
exiftool_metadata()convert() via kwarg presenceAudio Transcription Dependencies:
Installation command for audio transcription: pip install markitdown[audio-transcription] or pip install markitdown[all]
Sources: packages/markitdown/src/markitdown/converters/_transcribe_audio.py8-32 packages/markitdown/src/markitdown/converters/_image_converter.py4
Azure Document Intelligence can also process image files as an alternative to ImageConverter. The DocumentIntelligenceConverter supports these image formats:
| Format | MIME Types | File Extensions |
|---|---|---|
| JPEG | image/jpeg | .jpg, .jpeg |
| PNG | image/png | .png |
| BMP | image/bmp | .bmp |
| TIFF | image/tiff | .tiff |
When configured, the Document Intelligence converter uses OCR with features like:
DocumentAnalysisFeature.FORMULAS - Formula extractionDocumentAnalysisFeature.OCR_HIGH_RESOLUTION - High-resolution OCRDocumentAnalysisFeature.STYLE_FONT - Font style extractionThese features are applied to image types but not to office document types packages/markitdown/src/markitdown/converters/_doc_intel_converter.py207-235
The converter registry's priority system determines whether ImageConverter or DocumentIntelligenceConverter processes an image file based on registration order.
Sources: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py55-69 packages/markitdown/src/markitdown/converters/_doc_intel_converter.py93-101 packages/markitdown/src/markitdown/converters/_doc_intel_converter.py207-235
ImageConverter with metadata extraction:
ImageConverter with LLM description:
Audio transcription (via converter using transcribe_audio):
Sources: packages/markitdown/src/markitdown/converters/_image_converter.py39-85 packages/markitdown/src/markitdown/converters/_transcribe_audio.py23
Refresh this wiki