Specialized Converters

Purpose and Scope

This page documents the specialized converters in MarkItDown that handle file formats outside the major document categories. These converters process email messages (Outlook MSG), compressed archives (ZIP), tabular data (CSV), and plain text files. Each converter implements the DocumentConverter interface and participates in the converter registry's priority-based selection system.

For converters handling office documents (DOCX, XLSX, PPTX), see Office Document Converters. For PDF processing, see PDF Converter. For web content (HTML, RSS, EPUB), see Web Content Converters. For image and audio processing, see Media Converters.

Sources: packages/markitdown/src/markitdown/converters/__init__.py:1-49

Specialized Converter Overview

The specialized converters handle four distinct file format categories:

Converter	File Types	Primary Library	MIME Types	Purpose
`OutlookMsgConverter`	`.msg`	`olefile`	`application/vnd.ms-outlook`	Extract email metadata and body content
`ZipConverter`	`.zip`	`zipfile` (built-in)	`application/zip`	Recursively convert archive contents
`CsvConverter`	`.csv`	`csv` (built-in)	`text/csv`, `application/csv`	Convert tabular data to Markdown tables
`PlainTextConverter`	`.txt`, `.json`, etc.	`charset_normalizer`	`text/*`, `application/json`	Handle text files with charset detection

[Diagram: specialized converters mapped to their accepted input formats and conversion strategies]

Sources: packages/markitdown/src/markitdown/converters/__init__.py:1-49, packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:1-150, packages/markitdown/src/markitdown/converters/_csv_converter.py:1-78

OutlookMsgConverter

Overview

The OutlookMsgConverter converts Outlook .msg email files to Markdown by extracting metadata (From, To, Subject) and body content. It uses the olefile package to parse the Microsoft OLE file structure underlying .msg files.

Class Definition: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:24-150

File Type Detection

The converter implements multi-layered detection to identify Outlook MSG files:

Implementation: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:32-71

The acceptance logic checks in order:

File extension: .msg (lines 42-43)
MIME type: Starts with application/vnd.ms-outlook (lines 45-47)
OLE file structure: Validates using olefile.isOleFile() (lines 52-53)
Outlook-specific streams: Verifies presence of __properties_version1.0 and __recip_version1.0 streams (lines 59-65)

This layered approach lets the converter recognize MSG files even when the extension or MIME type is missing or wrong, because the last two checks inspect the file content itself.
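The ordering of these checks can be sketched as follows. This is a minimal illustration, not the converter's code: `looks_like_msg` and the `list_streams` hook are hypothetical names, and a comparison against the OLE compound-file signature stands in for `olefile.isOleFile()` so the sketch needs no third-party package.

```python
import io

# Signature found at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Stream names the page lists as Outlook-specific.
REQUIRED_STREAMS = {"__properties_version1.0", "__recip_version1.0"}


def looks_like_msg(extension, mimetype, stream, list_streams=None):
    """Layered check, cheapest test first (hypothetical helper)."""
    # 1. File extension.
    if (extension or "").lower() == ".msg":
        return True
    # 2. MIME type prefix.
    if (mimetype or "").lower().startswith("application/vnd.ms-outlook"):
        return True

    # 3. Container structure: peek at the header, then restore the
    #    stream position so later readers see the full content.
    position = stream.tell()
    try:
        header = stream.read(len(OLE_MAGIC))
    finally:
        stream.seek(position)
    if header != OLE_MAGIC:
        return False

    # 4. Outlook-specific streams. list_streams is a hypothetical hook that
    #    returns the stream names stored inside the container.
    if list_streams is None:
        return False
    return REQUIRED_STREAMS <= set(list_streams(stream))
```

Restoring the stream position after peeking matters because the same stream is later handed to the conversion step.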

Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:32-71

Email Metadata Extraction

The converter reads the message headers and the body from specific OLE streams within the MSG file:

Field	OLE Stream Path	Encoding
From	`__substg1.0_0C1F001F`	UTF-16-LE (fallback UTF-8)
To	`__substg1.0_0E04001F`	UTF-16-LE (fallback UTF-8)
Subject	`__substg1.0_0037001F`	UTF-16-LE (fallback UTF-8)
Body	`__substg1.0_1000001F`	UTF-16-LE (fallback UTF-8)

Extraction Helper: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:127-149

The _get_stream_data() helper method handles:

Stream existence validation (line 135)
UTF-16-LE decoding (primary, lines 138-139)
UTF-8 fallback (lines 141-143)
Error recovery with errors="ignore" (lines 145-146)
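The decoding chain described above can be sketched like this. The function name `decode_stream_bytes` is hypothetical, and the assumption that the last-resort step uses UTF-8 with `errors="ignore"` is mine; the sketch works on raw bytes and leaves out the stream lookup itself.

```python
def decode_stream_bytes(data):
    """Decode raw stream bytes, trying progressively more lenient options."""
    if data is None:
        # Mirrors the "stream does not exist" case: nothing to decode.
        return None
    for encoding in ("utf-16-le", "utf-8"):
        try:
            return data.decode(encoding).strip()
        except UnicodeDecodeError:
            continue
    # Last resort (assumed codec): drop undecodable bytes instead of failing.
    return data.decode("utf-8", errors="ignore").strip()
```

Note that the UTF-8 fallback only runs when UTF-16-LE decoding raises an error, for example on odd-length input. Even-length byte strings usually decode as UTF-16-LE without complaint, whether or not that was the intended encoding.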

Markdown Output Structure

The converter produces structured Markdown output:

Generation Logic: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:98-125

The DocumentConverterResult includes the email subject as the title field (line 124), enabling downstream processing to identify the document's purpose.
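As a rough illustration of how such output could be assembled, the sketch below builds a header block followed by the body and returns the subject as the title. The heading text, the bold field labels, and the function name `render_email_markdown` are assumptions made for this example, not a statement of the converter's exact layout.

```python
def render_email_markdown(sender, recipient, subject, body):
    """Return (markdown, title) for an email; the layout is illustrative only."""
    lines = ["# Email Message", ""]
    for label, value in (("From", sender), ("To", recipient), ("Subject", subject)):
        if value:  # skip fields whose stream was missing or empty
            lines.append(f"**{label}:** {value}")
    lines += ["", "## Content", "", body or ""]
    markdown = "\n".join(lines).strip()
    # The subject doubles as the document title, as described above.
    return markdown, subject
```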

Test Vectors: packages/markitdown/tests/_test_vectors.py:75-89

Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:24-150, packages/markitdown/tests/_test_vectors.py:75-89

ZipConverter

Overview

The ZipConverter recursively processes ZIP archives by extracting and converting each contained file using the appropriate converter from the registry. This enables processing of document collections packaged in ZIP format.

Class Definition: Implementation not provided in source files, but exported from packages/markitdown/src/markitdown/converters/__init__.py:19

Recursive Conversion Strategy

The ZIP converter operates as a meta-converter that delegates to other converters:

This architecture allows the ZIP converter to handle heterogeneous archives containing different document types (DOCX, PDF, HTML, etc.) without implementing format-specific logic.
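Since the implementation is not among the source files, the delegation idea can only be sketched under assumptions. Here `convert_member` is a hypothetical callback standing in for a registry lookup, and the `## File:` section heading is an illustrative choice.

```python
import io
import zipfile


def convert_zip(data, convert_member):
    """Convert each archive member via a callback and join the results.

    convert_member(name, payload) is a hypothetical stand-in for the
    registry: it returns Markdown, or None if no converter accepts the file.
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content
            markdown = convert_member(info.filename, archive.read(info))
            if markdown is None:
                continue  # unsupported member: skip it
            sections.append(f"## File: {info.filename}\n\n{markdown}")
    return "\n\n".join(sections)
```

Because the callback may itself call `convert_zip` for nested archives, the same function covers recursive archives without any format-specific code.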

Test Coverage: packages/markitdown/tests/_test_vectors.py:195-218

The test vector test_files.zip verifies that the converter extracts and processes multiple document types, producing combined output containing unique identifiers from DOCX, PPTX, XLSX, and HTML files.

Sources: packages/markitdown/src/markitdown/converters/__init__.py:19, packages/markitdown/tests/_test_vectors.py:195-218

CsvConverter

Overview

The CsvConverter transforms CSV files into Markdown tables, handling multi-byte character encodings and malformed row structures.

Class Definition: packages/markitdown/src/markitdown/converters/_csv_converter.py:15-78

File Type Detection

The converter accepts CSV files based on extension or MIME type:

Accepted Extensions: .csv (packages/markitdown/src/markitdown/converters/_csv_converter.py:12)

Accepted MIME Types: text/csv, application/csv (packages/markitdown/src/markitdown/converters/_csv_converter.py:8-11)

Detection Logic: packages/markitdown/src/markitdown/converters/_csv_converter.py:23-36

Character Encoding Handling

The converter implements robust charset detection for international content:

Implementation: packages/markitdown/src/markitdown/converters/_csv_converter.py:45-48

When StreamInfo includes charset metadata (e.g., cp932, Microsoft's variant of Japanese Shift-JIS), the converter decodes with it directly. Otherwise, charset_normalizer.from_bytes() inspects the raw bytes and heuristically selects the most plausible encoding.
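The decision between a declared charset and detection can be sketched as below. `decode_csv_bytes` and its `detect` parameter are hypothetical; the detector is injected so the sketch runs with the standard library alone, and the UTF-8 default is a stand-in for the case where no detector is supplied.

```python
def decode_csv_bytes(data, charset=None, detect=None):
    """Prefer a declared charset; otherwise defer to a detector (hypothetical helper)."""
    if charset:
        # Declared metadata wins: no guessing needed.
        return data.decode(charset)
    if detect is not None:
        # detect(raw_bytes) -> str is an injected detection function.
        return detect(data)
    # Stand-in default for this sketch only.
    return data.decode("utf-8")
```

With charset_normalizer installed, one plausible detector is `lambda raw: str(charset_normalizer.from_bytes(raw).best())`, with the caveat that `best()` can return `None` when no candidate encoding fits.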

Test Case: packages/markitdown/tests/_test_vectors.py:141-154

The test vector test_mskanji.csv with charset="cp932" validates handling of Japanese characters:

Input: 名前 (name), 年齢 (age), 住所 (address)
Output: Properly formatted Markdown table with preserved characters

Markdown Table Generation

The conversion process follows this structure:

Generation Logic: packages/markitdown/src/markitdown/converters/_csv_converter.py:50-77

The converter ensures consistent table structure by:

Using the first row as headers (line 61)
Creating separator rows with --- for each column (line 64)
Padding short rows with empty strings (lines 69-70)
Truncating long rows to match header count (lines 72-73)
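The four rules above can be sketched in a few lines. This is an illustration under assumptions rather than the converter's code: `csv_to_markdown` is a hypothetical name and the exact cell spacing is a choice made for the example.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table with a fixed column count."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]          # the first row supplies the column headers
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Pad short rows with empty strings, then cut long rows to size.
        row = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


print(csv_to_markdown("name,age\nAda,36,extra\nBob\n"))
# | name | age |
# | --- | --- |
# | Ada | 36 |
# | Bob |  |
```

Normalizing every row to the header width keeps the table well-formed, at the cost of silently dropping cells beyond the last header column.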

Sources: packages/markitdown/src/markitdown/converters/_csv_converter.py:1-78, packages/markitdown/tests/_test_vectors.py:141-154

PlainTextConverter

Overview

The PlainTextConverter serves as a fallback converter for text-based files that don't match more specific converters. It handles .txt, .json, and other text formats by performing charset detection and decoding.

Class Definition: Implementation not provided in source files, but exported from packages/markitdown/src/markitdown/converters/__init__.py:5

Converter Priority

The PlainTextConverter typically has low priority in the converter registry to ensure it only processes files after all specialized converters have declined:

This priority scheme prevents the plain text converter from prematurely accepting files that should be handled by format-specific converters (e.g., HTML, RSS, or Jupyter notebooks, which are JSON-based).
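A small sketch of priority ordering follows. The constants, the `Registration` record, and the convention that lower numbers are tried first are assumptions for illustration; the page itself only states that the plain text converter is tried late.

```python
from dataclasses import dataclass

# Hypothetical priority values: lower numbers are tried first.
PRIORITY_SPECIFIC = 0.0
PRIORITY_GENERIC = 10.0


@dataclass
class Registration:
    name: str
    priority: float


def selection_order(registrations):
    """Return converter names in the order they would be consulted."""
    # sorted() is stable, so equal priorities keep their registration order.
    return [r.name for r in sorted(registrations, key=lambda r: r.priority)]


order = selection_order([
    Registration("PlainTextConverter", PRIORITY_GENERIC),
    Registration("CsvConverter", PRIORITY_SPECIFIC),
    Registration("OutlookMsgConverter", PRIORITY_SPECIFIC),
])
print(order)
# ['CsvConverter', 'OutlookMsgConverter', 'PlainTextConverter']
```

Even though the generic converter is registered first here, its priority value pushes it to the end of the consultation order.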

Supported Text Formats

Based on test vectors, the plain text converter handles:

File Type	Extension	MIME Type	Test Vector
Plain Text	`.txt`	`text/plain`	N/A
JSON	`.json`	`application/json`	packages/markitdown/tests/_test_vectors.py:155-165
Generic Text	Various	`text/*`	N/A

The JSON test vector verifies that JSON files are passed through with UUIDs preserved (lines 161-162), demonstrating that the converter does not parse structured formats—it treats them as plain text.

Sources: packages/markitdown/src/markitdown/converters/__init__.py:5, packages/markitdown/tests/_test_vectors.py:155-165

Dependency Management

The specialized converters demonstrate different approaches to dependency handling:

Optional Dependencies with Graceful Degradation

OutlookMsgConverter: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:7-15

The converter defers dependency checking until convert() is called (lines 79-91), raising MissingDependencyException with installation guidance:

Missing dependency for OutlookMsgConverter: 
Please install with: pip install markitdown[outlook]
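The deferred-check pattern can be sketched as follows. `MissingDependencyError`, `try_import`, and `LazyDependencyConverter` are hypothetical stand-ins written for this example; the point is that a failed import is recorded when the converter is set up and only surfaces when `convert()` is actually called.

```python
import importlib
import sys


class MissingDependencyError(Exception):
    """Hypothetical stand-in for MarkItDown's MissingDependencyException."""


def try_import(module_name):
    """Return (module, exc_info); exactly one of the two is None."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        return None, sys.exc_info()


class LazyDependencyConverter:
    """Records an import failure at setup and reports it only on use."""

    def __init__(self, module_name, feature):
        self._module, self._exc_info = try_import(module_name)
        self._feature = feature

    def convert(self, data):
        if self._exc_info is not None:
            # Chain the original ImportError so the root cause stays visible.
            raise MissingDependencyError(
                f"Please install with: pip install markitdown[{self._feature}]"
            ) from self._exc_info[1]
        return f"converted {len(data)} bytes"
```

Deferring the error this way keeps the package importable without the optional dependency, so users who never convert that format are unaffected.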

Built-in Library Dependencies

CsvConverter: packages/markitdown/src/markitdown/converters/_csv_converter.py:1-2

For parsing, the CSV converter relies on Python's built-in csv module and needs no optional dependencies. It also imports charset_normalizer (line 4), which is a core MarkItDown dependency rather than an optional one.

Feature Group Mapping

Converter	Feature Group	Package
`OutlookMsgConverter`	`[outlook]`	`olefile>=0.47`
`ZipConverter`	(built-in)	`zipfile` (stdlib)
`CsvConverter`	(built-in)	`csv` (stdlib)
`PlainTextConverter`	(built-in)	`charset_normalizer` (core dependency)

Installation command: pip install markitdown[outlook] for Outlook MSG support.

Sources: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py:7-15, packages/markitdown/src/markitdown/converters/_csv_converter.py:1-4

Integration with Core System

The specialized converters integrate with the core MarkItDown system through the converter registry:

Each converter:

Implements accepts() to determine if it can handle the file based on StreamInfo metadata
Implements convert() to transform the file content into Markdown
Returns DocumentConverterResult with markdown content and optional title

The converter registry iterates through converters by priority, calling accepts() until one returns True, then invoking its convert() method. For details on this selection process, see DocumentConverter Interface.
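The accept-then-convert handshake can be sketched with simplified stand-ins. None of the classes below are MarkItDown's own: `StreamInfo` and `ConverterResult` are reduced to the fields this example needs, the two converters are toys, and `convert_first_match` assumes the list is already sorted by priority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    """Simplified stand-in for the StreamInfo metadata."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


@dataclass
class ConverterResult:
    """Simplified stand-in for DocumentConverterResult."""
    markdown: str
    title: Optional[str] = None


class UpperCsvConverter:
    """Toy format-specific converter: accepts .csv, uppercases the text."""

    def accepts(self, data: bytes, info: StreamInfo) -> bool:
        return (info.extension or "").lower() == ".csv"

    def convert(self, data: bytes, info: StreamInfo) -> ConverterResult:
        return ConverterResult(markdown=data.decode("utf-8").upper())


class TextFallbackConverter:
    """Toy generic converter: accepts any text/* MIME type."""

    def accepts(self, data: bytes, info: StreamInfo) -> bool:
        return (info.mimetype or "").startswith("text/")

    def convert(self, data: bytes, info: StreamInfo) -> ConverterResult:
        return ConverterResult(markdown=data.decode("utf-8"))


def convert_first_match(converters, data, info):
    """Use the first converter, in priority order, whose accepts() is True."""
    for converter in converters:
        if converter.accepts(data, info):
            return converter.convert(data, info)
    raise ValueError("no converter accepted the input")
```

A CSV input carries a `text/csv` MIME type that the fallback would also accept, which is why the specific converter has to come first in the list.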

Sources: packages/markitdown/src/markitdown/converters/__init__.py:1-49

Docker Environment Support

The specialized converters are included in the Docker image build:

Dockerfile Configuration: Dockerfile:1-34

The [all] feature group (line 24) includes all optional dependencies, ensuring OutlookMsgConverter has access to olefile. The other specialized converters require nothing beyond the Python standard library and MarkItDown's core dependencies.

Sources: Dockerfile:1-34
