This page documents the specialized converters in MarkItDown that handle file formats outside the major document categories. These converters process email messages (Outlook MSG), compressed archives (ZIP), tabular data (CSV), and plain text files. Each converter implements the DocumentConverter interface and participates in the converter registry's priority-based selection system.
For converters handling office documents (DOCX, XLSX, PPTX), see Office Document Converters. For PDF processing, see PDF Converter. For web content (HTML, RSS, EPUB), see Web Content Converters. For image and audio processing, see Media Converters.
Sources: packages/markitdown/src/markitdown/converters/__init__.py1-49
The specialized converters handle four distinct file format categories:
| Converter | File Types | Primary Library | MIME Types | Purpose |
|---|---|---|---|---|
| OutlookMsgConverter | .msg | olefile | application/vnd.ms-outlook | Extract email metadata and body content |
| ZipConverter | .zip | zipfile (built-in) | application/zip | Recursively convert archive contents |
| CsvConverter | .csv | csv (built-in) | text/csv, application/csv | Convert tabular data to Markdown tables |
| PlainTextConverter | .txt, .json, etc. | charset_normalizer | text/* | Handle text files with charset detection |
The following diagram maps the specialized converters to their accepted input formats and conversion strategies:
Sources: packages/markitdown/src/markitdown/converters/__init__.py1-49 packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py1-150 packages/markitdown/src/markitdown/converters/_csv_converter.py1-78
The OutlookMsgConverter converts Outlook .msg email files to Markdown by extracting metadata (From, To, Subject) and body content. It uses the olefile package to parse the Microsoft OLE file structure underlying .msg files.
Class Definition: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py24-150
The converter implements multi-layered detection to identify Outlook MSG files:
Implementation: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py32-71
The acceptance logic checks in order:
1. File extension: .msg (lines 42-43)
2. MIME type: application/vnd.ms-outlook (lines 45-47)
3. OLE container check: olefile.isOleFile() (lines 52-53)
4. Presence of the __properties_version1.0 and __recip_version1.0 streams (lines 59-65)

This exhaustive approach ensures reliable identification even when metadata is incomplete or incorrect.
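The ordered checks above can be sketched with the standard library alone. This is a minimal illustration, not the converter's code: the function name `accepts_msg` and the `stream_names` parameter are hypothetical, a magic-byte comparison stands in for `olefile.isOleFile()`, and requiring both marker streams is an assumption based on the description above.

```python
from pathlib import Path
from typing import Iterable, Optional

# Signature found at the start of every OLE2 compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Marker streams named in the text; requiring both is an assumption.
MSG_STREAMS = {"__properties_version1.0", "__recip_version1.0"}


def accepts_msg(
    filename: Optional[str],
    mimetype: Optional[str],
    head: bytes,
    stream_names: Iterable[str] = (),
) -> bool:
    """Hypothetical layered check for Outlook MSG input."""
    # 1. Trust an explicit .msg extension.
    if filename and Path(filename).suffix.lower() == ".msg":
        return True
    # 2. Trust the Outlook MIME type.
    if mimetype and mimetype.lower().startswith("application/vnd.ms-outlook"):
        return True
    # 3. Reject anything that is not an OLE container
    #    (stand-in for olefile.isOleFile()).
    if not head.startswith(OLE_MAGIC):
        return False
    # 4. Other OLE-based formats share the signature,
    #    so require the MSG-specific streams as well.
    return MSG_STREAMS.issubset(stream_names)
```

Checking cheap metadata first avoids opening the container at all when the extension or MIME type already settles the question.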
Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py32-71
The converter extracts metadata from specific OLE streams within the MSG file:
| Header Field | OLE Stream Path | Encoding |
|---|---|---|
| From | __substg1.0_0C1F001F | UTF-16-LE (fallback UTF-8) |
| To | __substg1.0_0E04001F | UTF-16-LE (fallback UTF-8) |
| Subject | __substg1.0_0037001F | UTF-16-LE (fallback UTF-8) |
| Body | __substg1.0_1000001F | UTF-16-LE (fallback UTF-8) |
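The fallback listed in the Encoding column can be illustrated with a small helper. This is a sketch under assumptions: `decode_stream` is a hypothetical name, and stripping whitespace and the final lenient pass are illustrative choices rather than confirmed behavior of the converter.

```python
from typing import Optional


def decode_stream(data: Optional[bytes]) -> Optional[str]:
    """Decode raw stream bytes, preferring UTF-16-LE over UTF-8."""
    if data is None:
        # A missing stream yields no value rather than an error.
        return None
    for encoding in ("utf-16-le", "utf-8"):
        try:
            return data.decode(encoding).strip()
        except UnicodeDecodeError:
            continue
    # Lenient last pass: drop undecodable bytes instead of failing.
    return data.decode("utf-8", errors="ignore").strip()
```

Returning `None` for a missing stream lets the caller omit that header from the output instead of printing an empty field.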
Extraction Helper: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py127-149
The _get_stream_data() helper method handles:
- Decoding with errors="ignore" (lines 145-146)

The converter produces structured Markdown output:
Generation Logic: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py98-125
The DocumentConverterResult includes the email subject as the title field (line 124), enabling downstream processing to identify the document's purpose.
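To make the output shape concrete, the sketch below assembles headers and body into Markdown and returns the subject as the title. The heading text, the bold header labels, and the function name `render_email` are assumptions for illustration; the exact layout produced by the converter may differ.

```python
from typing import Dict, Optional, Tuple


def render_email(
    headers: Dict[str, Optional[str]], body: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (markdown, title) for an email; layout is illustrative."""
    lines = ["# Email Message", ""]
    # Emit only the headers that were actually extracted.
    for name in ("From", "To", "Subject"):
        value = headers.get(name)
        if value:
            lines.append(f"**{name}:** {value}")
    if body:
        lines += ["", "## Content", "", body]
    # The subject doubles as the document title.
    return "\n".join(lines).strip(), headers.get("Subject")
```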
Test Vectors: packages/markitdown/tests/_test_vectors.py75-89
Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py24-150 packages/markitdown/tests/_test_vectors.py75-89
The ZipConverter recursively processes ZIP archives by extracting and converting each contained file using the appropriate converter from the registry. This enables processing of document collections packaged in ZIP format.
Class Definition: Implementation not provided in source files, but exported from packages/markitdown/src/markitdown/converters/__init__.py19
The ZIP converter operates as a meta-converter that delegates to other converters:
This architecture allows the ZIP converter to handle heterogeneous archives containing different document types (DOCX, PDF, HTML, etc.) without implementing format-specific logic.
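Since the ZipConverter implementation is not shown in the source files, the following is only a sketch of the delegation idea using the standard `zipfile` module. The `convert_member` callback is a hypothetical stand-in for the registry lookup, and the `## File:` heading is an assumed output convention.

```python
import io
import zipfile
from typing import Callable, Optional


def convert_zip(
    data: bytes,
    convert_member: Callable[[str, bytes], Optional[str]],
) -> str:
    """Convert every file in a ZIP archive via a delegate callback."""
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content
            # The delegate returns None when no converter accepts the file.
            text = convert_member(info.filename, archive.read(info))
            if text is not None:
                sections.append(f"## File: {info.filename}\n\n{text}")
    return "\n\n".join(sections)
```

A delegate that calls `convert_zip` again for `.zip` members would give the recursive behavior described above.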
Test Coverage: packages/markitdown/tests/_test_vectors.py195-218
The test vector test_files.zip verifies that the converter extracts and processes multiple document types, producing combined output containing unique identifiers from DOCX, PPTX, XLSX, and HTML files.
Sources: packages/markitdown/src/markitdown/converters/__init__.py19 packages/markitdown/tests/_test_vectors.py195-218
The CsvConverter transforms CSV files into Markdown tables, handling multi-byte character encodings and malformed row structures.
Class Definition: packages/markitdown/src/markitdown/converters/_csv_converter.py15-78
The converter accepts CSV files based on extension or MIME type:
Accepted Extensions: .csv packages/markitdown/src/markitdown/converters/_csv_converter.py12
Accepted MIME Types: text/csv, application/csv packages/markitdown/src/markitdown/converters/_csv_converter.py8-11
Detection Logic: packages/markitdown/src/markitdown/converters/_csv_converter.py23-36
The converter implements robust charset detection for international content:
Implementation: packages/markitdown/src/markitdown/converters/_csv_converter.py45-48
When StreamInfo includes charset metadata (e.g., cp932 for Japanese Shift-JIS), the converter uses it directly. Otherwise, charset_normalizer.from_bytes() detects the encoding heuristically by trial-decoding the bytes against candidate encodings and scoring the results.
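The "declared charset first, detection second" rule can be sketched as below. To keep the example within the standard library, a fixed candidate list stands in for charset_normalizer; the function name `decode_csv_bytes` and the candidate order are assumptions, not the converter's behavior.

```python
from typing import Optional, Sequence


def decode_csv_bytes(
    data: bytes,
    declared_charset: Optional[str] = None,
    candidates: Sequence[str] = ("utf-8", "cp932", "latin-1"),
) -> str:
    """Decode CSV bytes, preferring an explicitly declared charset."""
    if declared_charset:
        # Metadata wins: no guessing when the caller knows the encoding.
        return data.decode(declared_charset)
    # Stand-in for real detection: first candidate that decodes cleanly.
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Trial decoding like this can pick a wrong but technically valid encoding, which is one reason a declared charset is preferred whenever it is available.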
Test Case: packages/markitdown/tests/_test_vectors.py141-154
The test vector test_mskanji.csv with charset="cp932" validates handling of Japanese characters:
The conversion process follows this structure:
Generation Logic: packages/markitdown/src/markitdown/converters/_csv_converter.py50-77
The converter ensures consistent table structure by:
- Emitting a --- separator for each column (line 64)

Example output:
Sources: packages/markitdown/src/markitdown/converters/_csv_converter.py1-78 packages/markitdown/tests/_test_vectors.py141-154
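The table generation described in this section can be sketched with the built-in csv module. This is an illustration under assumptions: the first row is treated as the header, short rows are padded with empty cells, and long rows are truncated to the header width. The function name `csv_to_markdown` is hypothetical.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render decoded CSV text as a Markdown table."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        # One --- separator per header column.
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Pad short rows, then cut long rows, so every row has `width` cells.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

For the input `a,b`, `1`, `2,3,4` (one row per line) this yields a two-column table in which the second row has an empty second cell and the third row drops its extra value.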
The PlainTextConverter serves as a fallback converter for text-based files that don't match more specific converters. It handles .txt, .json, and other text formats by performing charset detection and decoding.
Class Definition: Implementation not provided in source files, but exported from packages/markitdown/src/markitdown/converters/__init__.py5
The PlainTextConverter typically has low priority in the converter registry to ensure it only processes files after all specialized converters have declined:
This priority scheme prevents the plain text converter from prematurely accepting files that should be handled by format-specific converters (e.g., HTML, RSS, or JSON-based Jupyter notebooks).
Based on test vectors, the plain text converter handles:
| File Type | Extension | MIME Type | Test Vector |
|---|---|---|---|
| Plain Text | .txt | text/plain | N/A |
| JSON | .json | application/json | packages/markitdown/tests/_test_vectors.py155-165 |
| Generic Text | Various | text/* | N/A |
The JSON test vector verifies that JSON files are passed through with UUIDs preserved (lines 161-162), demonstrating that the converter does not parse structured formats—it treats them as plain text.
Sources: packages/markitdown/src/markitdown/converters/__init__.py5 packages/markitdown/tests/_test_vectors.py155-165
The specialized converters demonstrate different approaches to dependency handling:
OutlookMsgConverter: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py7-15
The converter defers dependency checking until convert() is called (lines 79-91), raising MissingDependencyException with installation guidance:
Missing dependency for OutlookMsgConverter:
Please install with: pip install markitdown[outlook]
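The deferred check can be sketched as follows: the import failure is captured at construction time and only surfaces when `convert()` is called. The names `try_import` and `LazyConverter` are hypothetical, and `MissingDependencyException` here is a local stand-in for MarkItDown's exception of the same name.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


class MissingDependencyException(Exception):
    """Local stand-in for MarkItDown's exception of the same name."""


def try_import(name: str) -> Tuple[Optional[ModuleType], Optional[ImportError]]:
    """Import a module, returning the error instead of raising it."""
    try:
        return importlib.import_module(name), None
    except ImportError as exc:
        return None, exc


class LazyConverter:
    """Hypothetical converter that tolerates a missing optional package."""

    def __init__(self, module_name: str, extra: str) -> None:
        self._module, self._import_error = try_import(module_name)
        self._extra = extra

    def convert(self, data: bytes) -> str:
        # Fail only when conversion is requested, with install guidance.
        if self._module is None:
            raise MissingDependencyException(
                f"Missing dependency for {type(self).__name__}: "
                f"please install with: pip install markitdown[{self._extra}]"
            ) from self._import_error
        return f"{len(data)} bytes read"
```

Deferring the failure keeps the converter registrable even when the optional package is absent, so unrelated formats continue to work.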
CsvConverter: packages/markitdown/src/markitdown/converters/_csv_converter.py1-2
The CSV converter relies only on Python's built-in csv module, requiring no optional dependencies. The charset_normalizer library (line 4) is a core MarkItDown dependency.
| Converter | Feature Group | Package |
|---|---|---|
| OutlookMsgConverter | [outlook] | olefile>=0.47 |
| ZipConverter | (built-in) | zipfile (stdlib) |
| CsvConverter | (built-in) | csv (stdlib) |
| PlainTextConverter | (built-in) | None |
Installation command: pip install markitdown[outlook] for Outlook MSG support.
Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py7-15 packages/markitdown/src/markitdown/converters/_csv_converter.py1-4
The specialized converters integrate with the core MarkItDown system through the converter registry:
Each converter:
- Implements accepts() to determine if it can handle the file based on StreamInfo metadata
- Implements convert() to transform the file content into Markdown
- Returns a DocumentConverterResult with markdown content and optional title

The converter registry iterates through converters by priority, calling accepts() until one returns True, then invoking its convert() method. For details on this selection process, see DocumentConverter Interface.
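The selection loop can be sketched in a few lines. Everything here is a simplified stand-in: `StreamInfo` and `Result` mimic only the fields mentioned on this page, the two converter classes are hypothetical, and the convention that a lower priority value is tried first is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamInfo:
    """Simplified stand-in for the real StreamInfo metadata."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


@dataclass
class Result:
    """Simplified stand-in for DocumentConverterResult."""
    markdown: str
    title: Optional[str] = None


class CsvLike:
    priority = 0.0  # specific format: tried early (assumed convention)

    def accepts(self, data: bytes, info: StreamInfo) -> bool:
        return info.extension == ".csv"

    def convert(self, data: bytes, info: StreamInfo) -> Result:
        return Result(markdown="(table)")


class PlainTextLike:
    priority = 10.0  # generic fallback: tried late (assumed convention)

    def accepts(self, data: bytes, info: StreamInfo) -> bool:
        return (info.mimetype or "").startswith("text/")

    def convert(self, data: bytes, info: StreamInfo) -> Result:
        return Result(markdown=data.decode("utf-8"))


def convert_with_registry(converters: Iterable, data: bytes, info: StreamInfo) -> Optional[Result]:
    """Try converters in priority order; the first that accepts wins."""
    for converter in sorted(converters, key=lambda c: c.priority):
        if converter.accepts(data, info):
            return converter.convert(data, info)
    return None  # no converter accepted the input
```

With this ordering, a `.csv` file served as `text/csv` reaches the CSV converter before the plain text fallback, even though both would accept it.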
Sources: packages/markitdown/src/markitdown/converters/__init__.py1-49
The specialized converters are included in the Docker image build:
Dockerfile Configuration: Dockerfile1-34
The [all] feature group (line 24) includes all optional dependencies, ensuring OutlookMsgConverter has access to olefile. The other specialized converters need only the Python standard library and MarkItDown's core dependencies, such as charset_normalizer.
Sources: Dockerfile1-34