Overview

MarkItDown is a Python utility for converting various document formats to Markdown, specifically optimized for ingestion by Large Language Models (LLMs) and text analysis pipelines. The system prioritizes preserving document structure (headings, lists, tables, links) in Markdown format while maintaining token efficiency for LLM processing.

The conversion engine supports over 15 file formats including office documents (DOCX, XLSX, PPTX), PDFs, web content (HTML, RSS, YouTube), media files (images, audio), and specialized formats (MSG, ZIP, EPUB). The architecture employs a modular converter registry with priority-based selection, optional external service integrations (Azure Document Intelligence, LLM captioning), and a plugin system for third-party extensions.

For installation and setup instructions, see Installation and Setup. For CLI usage, see Command Line Interface. For architectural details, see Architecture.

Sources: README.md:1-41, packages/markitdown/pyproject.toml:1-34

System Purpose and Design Philosophy

MarkItDown converts documents to Markdown because mainstream LLMs natively "speak" Markdown, having been trained on vast amounts of Markdown-formatted text. The format is token-efficient and preserves structural semantics (headings, lists, tables) while remaining close to plain text. Unlike high-fidelity conversion tools designed for human consumption, MarkItDown prioritizes programmatic text analysis over visual presentation.

The system distinguishes itself from alternatives like textract by focusing on structure preservation rather than raw text extraction. Output maintains hierarchical organization, table layouts, and link relationships, making it suitable for downstream semantic analysis, retrieval-augmented generation (RAG) pipelines, and document understanding workflows.

Sources: README.md:33-40

Three-Tier Architecture

The system implements a three-tier architecture separating user interfaces, core orchestration, and format-specific conversion logic.

Figure 1: Three-Tier Architecture showing separation between interfaces, orchestration, and conversion logic

Sources: README.md:77-177, packages/markitdown/src/markitdown/_markitdown.py

User Interface Layer

The system provides three entry points for different use cases:

| Interface | Entry Point | Use Case |
| --- | --- | --- |
| CLI | `markitdown` command via `__main__.py` | Command-line batch processing, shell pipelines |
| Python API | `MarkItDown` class instantiation | Programmatic integration, custom workflows |
| MCP Server | `markitdown-mcp` package | AI assistant integration (Claude Desktop, etc.) |

All interfaces converge on the MarkItDown class, which serves as the central orchestrator for conversion operations.
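
As a sketch of the Python API entry point, the function below follows the basic usage shown in the project README. It assumes the `markitdown` package is installed; the helper name `convert_file` is illustrative, and the import is deferred into the function so the module still loads where the package is absent.

```python
def convert_file(path: str) -> str:
    """Convert one document to Markdown text via the MarkItDown class."""
    # Deferred import: assumes `pip install markitdown` has been run.
    from markitdown import MarkItDown

    md = MarkItDown(enable_plugins=False)  # plugins stay disabled, the default
    result = md.convert(path)              # accepts a path, URI, or binary stream
    return result.text_content             # the Markdown output string


# Example call (placeholder file name):
#     print(convert_file("test.xlsx"))
```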

Sources: README.md:79-89, packages/markitdown/pyproject.toml:71-72

Core Orchestration Layer

The MarkItDown class implements the conversion orchestration logic:

Figure 2: Conversion orchestration flow through MarkItDown class methods

Key orchestration components:

convert(): Entry point accepting paths, URIs, or streams
convert_stream(): Processes binary file-like objects
convert_local(): Core logic for converter selection and execution
_converters: Registry list of converter registrations, ordered by priority (lower priority values are tried first)
StreamInfo: Metadata container with mimetype, extension, filename, charset

Sources: packages/markitdown/src/markitdown/_markitdown.py

Converter Ecosystem Layer

Each converter implements the DocumentConverter abstract base class interface:

Figure 3: DocumentConverter class hierarchy with priority values

Converters are selected via two-phase matching:

Acceptance Phase: accepts(file_stream, stream_info) returns True if the converter can handle the file
Execution Phase: convert(file_stream, stream_info) performs the conversion and returns DocumentConverterResult

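Priority determines iteration order when multiple converters accept the same file. In the MarkItDown source, lower priority values are tried first: format-specific converters such as PdfConverter register at PRIORITY_SPECIFIC_FILE_FORMAT (0.0), while near-catch-all converters such as PlainTextConverter register at PRIORITY_GENERIC_FILE_FORMAT (10.0), so specific converters run before generic ones.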
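
The two-phase matching can be illustrated with a small standard-library sketch. It does not use the real MarkItDown classes: `StreamInfoSketch`, `CsvSketchConverter`, `PlainTextSketchConverter`, `Registration`, and `select_and_convert` are hypothetical stand-ins, and the ordering assumes that lower priority values are tried first, as in the MarkItDown source.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, List, Optional

# Values mirror the constant names used in the MarkItDown source.
PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


@dataclass
class StreamInfoSketch:
    """Simplified stand-in for StreamInfo."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


class CsvSketchConverter:
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfoSketch) -> bool:
        return stream_info.extension == ".csv"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfoSketch) -> str:
        rows = [line.split(",") for line in file_stream.read().decode("utf-8").splitlines()]
        header, body = rows[0], rows[1:]
        lines = ["| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)


class PlainTextSketchConverter:
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfoSketch) -> bool:
        return (stream_info.mimetype or "").startswith("text/")

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfoSketch) -> str:
        return file_stream.read().decode("utf-8")


@dataclass
class Registration:
    converter: object
    priority: float


def select_and_convert(registry: List[Registration], file_stream: BinaryIO,
                       stream_info: StreamInfoSketch) -> str:
    # Acceptance phase: walk the registry in priority order (lowest value first).
    for registration in sorted(registry, key=lambda r: r.priority):
        if registration.converter.accepts(file_stream, stream_info):
            # Execution phase: the first accepting converter does the work.
            return registration.converter.convert(file_stream, stream_info)
    raise ValueError("no converter accepted the input")


registry = [
    Registration(PlainTextSketchConverter(), PRIORITY_GENERIC_FILE_FORMAT),
    Registration(CsvSketchConverter(), PRIORITY_SPECIFIC_FILE_FORMAT),
]
info = StreamInfoSketch(extension=".csv", mimetype="text/csv")
markdown = select_and_convert(registry, io.BytesIO(b"a,b\n1,2"), info)
print(markdown)
```

Both sketch converters accept this input, but the CSV converter wins because it is registered with the format-specific priority.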

Sources: packages/markitdown/src/markitdown/_markitdown.py, packages/markitdown/src/markitdown/converters/_pptx_converter.py:34-60

Format Coverage Matrix

The converter ecosystem supports seven major format categories:

| Category | File Types | Converter Classes | Feature Groups | Primary Libraries |
| --- | --- | --- | --- | --- |
| Office Documents | `.docx`, `.pptx`, `.xlsx`, `.xls` | `DocxConverter`, `PptxConverter`, `XlsxConverter`, `XlsConverter` | `[docx]`, `[pptx]`, `[xlsx]`, `[xls]` | mammoth, python-pptx, pandas, openpyxl, xlrd |
| PDF Documents | `.pdf` | `PdfConverter` | `[pdf]` | pdfminer.six, pdfplumber |
| Web Content | `.html`, `.htm`, RSS, YouTube URLs, Wikipedia, `.epub` | `HtmlConverter`, `RssConverter`, `YouTubeConverter`, `WikipediaConverter`, `EpubConverter` | `[youtube-transcription]` | BeautifulSoup, youtube-transcript-api |
| Media Files | Images (`.jpg`, `.png`, etc.), Audio (`.wav`, `.mp3`) | `ImageConverter`, `AudioConverter` | `[audio-transcription]` | SpeechRecognition, pydub, exiftool |
| Email/Messaging | `.msg` (Outlook) | `OutlookMsgConverter` | `[outlook]` | olefile |
| Archives | `.zip` | `ZipConverter` | (core) | zipfile |
| Structured Data | `.csv`, `.json`, `.xml`, `.ipynb` | `CsvConverter`, `JsonConverter`, `XmlConverter`, `IpynbConverter` | (core) | defusedxml |

All converters produce DocumentConverterResult objects containing:

text_content: The Markdown output string
title: Optional document title
source: Original input identifier

Sources: README.md:18-31, packages/markitdown/pyproject.toml:36-61

Optional Dependency Management

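MarkItDown uses feature groups to organize optional dependencies, allowing users to install only the converters they need. For example, `pip install 'markitdown[pdf, docx, pptx]'` installs only PDF, Word, and PowerPoint support, while `pip install 'markitdown[all]'` installs every optional dependency.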

When a converter requires missing dependencies, it raises MissingDependencyException with installation instructions. For example, attempting to convert a PPTX file without the [pptx] feature group:
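
A minimal sketch of this deferred-dependency pattern, using only the standard library, is shown below. The exception class, the module name `hypothetical_pptx_backend`, and the message wording are illustrative assumptions rather than MarkItDown's actual code; the idea is that the optional import is attempted once at load time and the failure is only reported when a conversion is requested.

```python
import importlib
import sys


class MissingDependencyError(Exception):
    """Stand-in for MarkItDown's MissingDependencyException."""


# Attempt the optional import once and remember any failure instead of crashing.
_dependency_exc_info = None
try:
    importlib.import_module("hypothetical_pptx_backend")  # hypothetical module name
except ImportError:
    _dependency_exc_info = sys.exc_info()


def convert_pptx(path: str) -> str:
    # Report the missing dependency only when a conversion is actually attempted.
    if _dependency_exc_info is not None:
        raise MissingDependencyError(
            f"Recognized {path!r} as a .pptx file, but the dependencies needed "
            "to read it are not installed. Try: pip install 'markitdown[pptx]'"
        ) from _dependency_exc_info[1]
    raise NotImplementedError("the conversion itself is out of scope for this sketch")


try:
    convert_pptx("slides.pptx")
except MissingDependencyError as exc:
    message = str(exc)
    print(message)
```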

Sources: README.md:97-117, packages/markitdown/pyproject.toml:36-61, packages/markitdown/src/markitdown/converters/_pptx_converter.py:17-24

External Service Integration

MarkItDown integrates with optional external services for enhanced conversion capabilities:

Figure 4: External service integration architecture

Integration Modes

| Service | Configuration Parameters | Use Case | Relevant Converters |
| --- | --- | --- | --- |
| LLM API | `llm_client`, `llm_model`, `llm_prompt` | Image captioning and visual content description | `ImageConverter`, `PptxConverter` |
| Azure Document Intelligence | `docintel_endpoint`, `docintel_credential` (or `AZURE_API_KEY` env var) | Complex document layout analysis for PDFs and Office files | `DocumentIntelligenceConverter` |
| ExifTool | `exiftool_path` (or `EXIFTOOL_PATH` env var) | Image metadata extraction | `ImageConverter` |
| YouTube Transcript API | (no configuration) | Fetch video captions and transcripts | `YouTubeConverter` |
| SpeechRecognition | (no configuration) | Audio file transcription | `AudioConverter` |

Example configuration with LLM integration:
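
The sketch below follows the LLM example in the project README. It assumes both the `markitdown` and `openai` packages are installed and that the OpenAI client can read an API key from the environment; the helper name `describe_image`, the model name, and the file name are placeholders. Imports are deferred into the function so the snippet loads without those packages.

```python
def describe_image(path: str, model: str = "gpt-4o") -> str:
    """Convert an image to Markdown, letting an LLM write the description."""
    # Deferred imports: both packages are assumed to be installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    md = MarkItDown(llm_client=client, llm_model=model)
    result = md.convert(path)
    return result.text_content


# Example call (placeholder file name):
#     print(describe_image("example.jpg"))
```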

Sources: README.md:136-177, packages/markitdown/pyproject.toml:36-61, packages/markitdown/src/markitdown/converters/_pptx_converter.py:102-129

Deployment Models

MarkItDown supports four primary deployment patterns:

Figure 5: Four deployment patterns with their components

Deployment Pattern Details

1. Local Development Installation

Standard Python package installation via pip
Direct API access through from markitdown import MarkItDown
CLI access through markitdown command
Recommended for development, testing, and integration into existing Python applications

2. Docker Containerization

Multi-stage build with Python 3.13-slim base
Pre-installed system dependencies (ffmpeg for audio, exiftool for image metadata)
Stdin/stdout piping for file processing: docker run --rm -i markitdown:latest < input.pdf > output.md
Volume mounting for file access: docker run -v /local:/data markitdown:latest /data/file.pdf

3. MCP Server Integration

Implements Model Context Protocol for AI assistant integration
Two transport modes:
- STDIO: Direct process communication, ideal for Claude Desktop
- HTTP/SSE: HTTP server on port 3001 with Server-Sent Events, suitable for web integrations
Single tool: convert_to_markdown with file path/URI parameter
Configuration in Claude Desktop: ~/.config/claude/claude_desktop_config.json

4. Plugin Extension

Third-party converters register via Python entry points mechanism
Entry point group: markitdown.plugin
Plugins implement register_converters(md: MarkItDown) -> None function
Disabled by default, enabled with --use-plugins CLI flag or enable_plugins=True API parameter
Example: markitdown-sample-plugin provides RTF conversion support

Sources: README.md:45-184, packages/markitdown/pyproject.toml:1-14

Conversion Pipeline Flow

The following diagram illustrates the complete conversion pipeline from input to output:

Figure 6: Complete conversion pipeline with decision flow and error handling

Pipeline Stages

Stage 1: Input Normalization

File paths converted to binary streams
URIs fetched and converted to streams (supports data:, file:, http:, https: schemes)
Streams validated as binary mode (raises error for text mode streams)

Stage 2: File Type Detection

Primary: magika ML model identifies file types from byte patterns
Secondary: MIME type from HTTP headers or content sniffing
Tertiary: File extension analysis
Results aggregated in StreamInfo dataclass
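
The first two stages can be sketched with the standard library alone. This is a simplified illustration, not MarkItDown's implementation: it skips URI fetching and the magika content-based detection, and it derives metadata only from the file name. `StreamInfoSketch`, `info_from_filename`, and `normalize_input` are hypothetical names.

```python
import io
import mimetypes
import os
from dataclasses import dataclass
from typing import BinaryIO, Optional, Tuple


@dataclass(frozen=True)
class StreamInfoSketch:
    """Simplified stand-in for the StreamInfo dataclass."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    filename: Optional[str] = None
    charset: Optional[str] = None


def info_from_filename(filename: str) -> StreamInfoSketch:
    # Extension-based guess only; content-based detection is omitted here.
    extension = os.path.splitext(filename)[1].lower() or None
    mimetype, _encoding = mimetypes.guess_type(filename)
    return StreamInfoSketch(mimetype=mimetype, extension=extension, filename=filename)


def normalize_input(source) -> Tuple[BinaryIO, StreamInfoSketch]:
    # Local paths are opened as binary streams; the caller closes them.
    if isinstance(source, (str, os.PathLike)):
        path = os.fspath(source)
        return open(path, "rb"), info_from_filename(os.path.basename(path))
    # Text-mode streams are rejected because converters expect bytes.
    if isinstance(source, io.TextIOBase):
        raise TypeError("expected a binary stream, got a text-mode stream")
    if hasattr(source, "read"):
        return source, StreamInfoSketch()
    raise TypeError(f"unsupported source type: {type(source).__name__}")


pdf_info = info_from_filename("report.pdf")
stream, stream_info = normalize_input(io.BytesIO(b"%PDF-1.7"))

try:
    normalize_input(io.StringIO("text"))
    text_rejected = False
except TypeError:
    text_rejected = True
```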

Stage 3: Converter Selection

Registry _converters list sorted by priority, with lower priority values tried first
Each converter's accepts() method called with file_stream and stream_info
First accepting converter selected for conversion attempt

Stage 4: Conversion Execution

Selected converter's convert() method processes the stream
Returns DocumentConverterResult on success
Exceptions caught and recorded, next converter tried

Stage 5: Output Normalization

_normalize_markdown() cleans whitespace and line endings
Ensures consistent output formatting across converters
Returns final DocumentConverterResult to caller

Stage 6: Error Handling

If no converters accept: UnsupportedFormatException
If converters accept but all fail: FileConversionException with diagnostic details
Missing dependencies: MissingDependencyException with installation instructions
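
Stages 3 through 6 can be summarized in a short standard-library sketch. The exception classes, the `(name, accepts, convert)` tuples, and the normalization rules are simplified assumptions for illustration; the real converters receive a stream and a StreamInfo, and the real normalization may differ.

```python
import io
from typing import BinaryIO, Callable, List, Tuple


class UnsupportedFormatError(Exception):
    """Stand-in for UnsupportedFormatException."""


class FileConversionError(Exception):
    """Stand-in for FileConversionException; keeps every failed attempt."""

    def __init__(self, attempts: List[Tuple[str, Exception]]):
        self.attempts = attempts
        details = "; ".join(f"{name}: {exc!r}" for name, exc in attempts)
        super().__init__(f"all accepting converters failed ({details})")


# (name, accepts, convert) triples, assumed to be already sorted by priority.
ConverterSpec = Tuple[str, Callable[[bytes], bool], Callable[[bytes], str]]


def normalize_markdown(text: str) -> str:
    # Unify line endings, strip trailing spaces per line, trim outer blank space.
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip()


def run_pipeline(converters: List[ConverterSpec], file_stream: BinaryIO) -> str:
    failed: List[Tuple[str, Exception]] = []
    start = file_stream.tell()
    for name, accepts, convert in converters:
        file_stream.seek(start)  # every converter sees the stream from the start
        data = file_stream.read()
        if not accepts(data):
            continue
        try:
            return normalize_markdown(convert(data))
        except Exception as exc:  # record the failure and try the next converter
            failed.append((name, exc))
    if failed:
        raise FileConversionError(failed)
    raise UnsupportedFormatError("no converter accepted the input")


def broken_json(data: bytes) -> str:
    raise ValueError("malformed JSON")


converters = [
    ("JsonSketch", lambda d: d.lstrip().startswith(b"{"), broken_json),
    ("TextSketch", lambda d: True, lambda d: d.decode("utf-8")),
]
output = run_pipeline(converters, io.BytesIO(b"{not json}  \r\nsecond line\n"))
print(output)
```

Here the JSON sketch accepts the input but fails, so the pipeline falls through to the plain-text sketch and normalizes its output.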

Sources: packages/markitdown/src/markitdown/_markitdown.py

Plugin Architecture

MarkItDown supports third-party extensions through Python's entry points mechanism:

Figure 7: Plugin registration flow through entry points mechanism

Plugin Interface Requirements

Plugins must implement:

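Entry Point Declaration: an entry in the markitdown.plugin entry point group, declared in pyproject.toml under the [project.entry-points."markitdown.plugin"] table, that maps a plugin name to its module
Registration Function: a register_converters() function that receives the MarkItDown instance and registers the plugin's converters on it
Converter Implementation: one or more converter classes inheriting from DocumentConverter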

Plugins are discovered at runtime when enable_plugins=True or --use-plugins is specified. To list installed plugins: markitdown --list-plugins

Sources: README.md:119-133

Summary

MarkItDown provides a flexible, extensible architecture for document-to-Markdown conversion optimized for LLM processing. Key architectural features include:

Modular Design: Separation of interfaces, orchestration, and conversion logic
Priority-Based Selection: Automatic converter selection based on file type and priority
Optional Dependencies: Feature groups allow minimal installations for specific use cases
External Service Integration: Support for LLMs, Azure Document Intelligence, and external tools
Multiple Deployment Options: Local, containerized, MCP server, and plugin-extensible modes
Robust Error Handling: Graceful degradation with clear error messages and fallback behavior

The system balances extensibility with simplicity, allowing users to install only needed capabilities while maintaining a consistent API across all deployment modes.

Sources: README.md:1-249, packages/markitdown/pyproject.toml:1-114
