MarkItDown is a Python utility for converting various document formats to Markdown, specifically optimized for ingestion by Large Language Models (LLMs) and text analysis pipelines. The system prioritizes preserving document structure (headings, lists, tables, links) in Markdown format while maintaining token efficiency for LLM processing.
The conversion engine supports over 15 file formats including office documents (DOCX, XLSX, PPTX), PDFs, web content (HTML, RSS, YouTube), media files (images, audio), and specialized formats (MSG, ZIP, EPUB). The architecture employs a modular converter registry with priority-based selection, optional external service integrations (Azure Document Intelligence, LLM captioning), and a plugin system for third-party extensions.
For installation and setup instructions, see Installation and Setup. For CLI usage, see Command Line Interface. For architectural details, see Architecture.
Sources: README.md:1-41, packages/markitdown/pyproject.toml:1-34
MarkItDown converts documents to Markdown because mainstream LLMs natively "speak" Markdown, having been trained on vast amounts of Markdown-formatted text. The format is token-efficient and preserves structural semantics (headings, lists, tables) while remaining close to plain text. Unlike high-fidelity conversion tools designed for human consumption, MarkItDown prioritizes programmatic text analysis over visual presentation.
The system distinguishes itself from alternatives like textract by focusing on structure preservation rather than raw text extraction. Output maintains hierarchical organization, table layouts, and link relationships, making it suitable for downstream semantic analysis, retrieval-augmented generation (RAG) pipelines, and document understanding workflows.
Sources: README.md:33-40
The system implements a three-tier architecture separating user interfaces, core orchestration, and format-specific conversion logic.
Figure 1: Three-Tier Architecture showing separation between interfaces, orchestration, and conversion logic
Sources: README.md:77-177, packages/markitdown/src/markitdown/_markitdown.py
The system provides three entry points for different use cases:
| Interface | Entry Point | Use Case |
|---|---|---|
| CLI | markitdown command via __main__.py | Command-line batch processing, shell pipelines |
| Python API | MarkItDown class instantiation | Programmatic integration, custom workflows |
| MCP Server | markitdown-mcp package | AI assistant integration (Claude Desktop, etc.) |
All interfaces converge on the MarkItDown class, which serves as the central orchestrator for conversion operations.
Sources: README.md:79-89, packages/markitdown/pyproject.toml:71-72
The MarkItDown class implements the conversion orchestration logic:
Figure 2: Conversion orchestration flow through MarkItDown class methods
Key orchestration components:
- convert(): Entry point accepting paths, URIs, or streams
- convert_stream(): Processes binary file-like objects
- convert_local(): Core logic for converter selection and execution
- _converters: Registry list sorted by priority (highest first)
- StreamInfo: Metadata container with mimetype, extension, filename, charset

Sources: packages/markitdown/src/markitdown/_markitdown.py
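As a rough illustration of the metadata that travels with a stream, the sketch below models a StreamInfo-like container with the four fields named above. The names StreamInfoSketch and guess_stream_info are hypothetical, and the standard-library mimetypes lookup is only a stand-in for whatever detection the library actually performs.

```python
import mimetypes
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Hypothetical stand-in for StreamInfo; not the library's own class."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    filename: Optional[str] = None
    charset: Optional[str] = None


def guess_stream_info(path: str) -> StreamInfoSketch:
    """Derive metadata hints from a path string alone (the file is never opened)."""
    filename = os.path.basename(path)
    extension = os.path.splitext(filename)[1].lower() or None
    # Assumption: extension-based guessing replaces content-based detection here.
    mimetype, _encoding = mimetypes.guess_type(filename)
    return StreamInfoSketch(mimetype=mimetype, extension=extension, filename=filename)


info = guess_stream_info("/tmp/reports/quarterly.pdf")
print(info)
```

The charset field stays unset in this sketch because a path alone says nothing about text encoding.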
Each converter implements the DocumentConverter abstract base class interface:
Figure 3: DocumentConverter class hierarchy with priority values
Converters are selected via two-phase matching:
1. accepts(file_stream, stream_info) returns True if the converter can handle the file
2. convert(file_stream, stream_info) performs the conversion and returns DocumentConverterResult

Priority determines iteration order when multiple converters accept the same file. Higher-priority converters (e.g., PdfConverter at priority 5) are tried before lower-priority converters (e.g., PlainTextConverter at priority 1).
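The two-phase matching described here can be sketched with a toy registry. Everything below is illustrative: the classes Converter, PlainText, Csv and Result are hypothetical, stream_info is simplified to a plain dict, and the ordering (larger priority value first) simply follows the description on this page, so the numeric convention should be checked against _markitdown.py before relying on it.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for DocumentConverterResult."""
    text_content: str


class Converter:
    """Hypothetical base class mirroring the accepts()/convert() contract."""
    priority: float = 0.0

    def accepts(self, file_stream: BinaryIO, stream_info: Dict[str, str]) -> bool:
        raise NotImplementedError

    def convert(self, file_stream: BinaryIO, stream_info: Dict[str, str]) -> Result:
        raise NotImplementedError


class PlainText(Converter):
    priority = 1.0  # generic fallback, consulted last in this sketch

    def accepts(self, file_stream, stream_info):
        return True  # accepts any input

    def convert(self, file_stream, stream_info):
        return Result(file_stream.read().decode("utf-8"))


class Csv(Converter):
    priority = 5.0  # format-specific, consulted first in this sketch

    def accepts(self, file_stream, stream_info):
        return stream_info.get("extension") == ".csv"

    def convert(self, file_stream, stream_info):
        rows = [line.split(",") for line in file_stream.read().decode("utf-8").splitlines()]
        header, body = rows[0], rows[1:]
        lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return Result("\n".join(lines))


def select_and_convert(converters: List[Converter], data: bytes,
                       stream_info: Dict[str, str]) -> Optional[Result]:
    # Phase 1: ask each converter, in priority order, whether it accepts the input.
    for conv in sorted(converters, key=lambda c: c.priority, reverse=True):
        stream = io.BytesIO(data)  # fresh stream so every converter starts at offset 0
        if conv.accepts(stream, stream_info):
            # Phase 2: the first accepting converter performs the conversion.
            stream.seek(0)
            return conv.convert(stream, stream_info)
    return None


registry = [PlainText(), Csv()]
csv_out = select_and_convert(registry, b"a,b\n1,2", {"extension": ".csv"})
txt_out = select_and_convert(registry, b"hello", {"extension": ".txt"})
print(csv_out.text_content)
print(txt_out.text_content)
```

Because PlainText accepts everything, it only wins when no more specific converter claims the input, which is the reason for ordering by priority before calling accepts().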
Sources: packages/markitdown/src/markitdown/_markitdown.py, packages/markitdown/src/markitdown/converters/_pptx_converter.py:34-60
The converter ecosystem supports seven major format categories:
| Category | File Types | Converter Class | Feature Group | Primary Libraries |
|---|---|---|---|---|
| Office Documents | .docx, .pptx, .xlsx, .xls | DocxConverter, PptxConverter, XlsxConverter, XlsConverter | [docx], [pptx], [xlsx], [xls] | mammoth, python-pptx, pandas, openpyxl, xlrd |
| PDF Documents | .pdf | PdfConverter | [pdf] | pdfminer.six, pdfplumber |
| Web Content | .html, .htm, RSS, YouTube URLs, Wikipedia, .epub | HtmlConverter, RssConverter, YouTubeConverter, WikipediaConverter, EpubConverter | [youtube-transcription] | BeautifulSoup, youtube-transcript-api |
| Media Files | Images (.jpg, .png, etc.), Audio (.wav, .mp3) | ImageConverter, AudioConverter | [audio-transcription] | SpeechRecognition, pydub, exiftool |
| Email/Messaging | .msg (Outlook) | OutlookMsgConverter | [outlook] | olefile |
| Archives | .zip | ZipConverter | (core) | zipfile |
| Structured Data | .csv, .json, .xml, .ipynb | CsvConverter, JsonConverter, XmlConverter, IpynbConverter | (core) | defusedxml |
All converters produce DocumentConverterResult objects containing:
- text_content: The Markdown output string
- title: Optional document title
- source: Original input identifier

Sources: README.md:18-31, packages/markitdown/pyproject.toml:36-61
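The category table can also be read as a lookup from file extension to converter class and feature group. The sketch below transcribes a subset of its rows into a dictionary; the FORMATS mapping and the describe helper are hypothetical conveniences for illustration, not part of the library.

```python
from typing import Dict, Optional, Tuple

# extension -> (converter class name, feature group, or None for the core install)
# Transcribed from the table above; only a subset of rows is included.
FORMATS: Dict[str, Tuple[str, Optional[str]]] = {
    ".docx": ("DocxConverter", "docx"),
    ".pptx": ("PptxConverter", "pptx"),
    ".xlsx": ("XlsxConverter", "xlsx"),
    ".xls": ("XlsConverter", "xls"),
    ".pdf": ("PdfConverter", "pdf"),
    ".msg": ("OutlookMsgConverter", "outlook"),
    ".zip": ("ZipConverter", None),
    ".csv": ("CsvConverter", None),
}


def describe(extension: str) -> str:
    """Summarize which converter and feature group an extension maps to."""
    entry = FORMATS.get(extension.lower())
    if entry is None:
        return f"{extension}: not in this lookup"
    converter, group = entry
    where = f"feature group [{group}]" if group else "core install"
    return f"{extension}: {converter} ({where})"


for ext in (".pptx", ".zip", ".xyz"):
    print(describe(ext))
```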
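MarkItDown uses feature groups to organize optional dependencies, allowing users to install only the converters they need, for example pip install 'markitdown[pdf,docx,pptx]' for PDF and Office support, or pip install 'markitdown[all]' for everything.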
When a converter requires missing dependencies, it raises MissingDependencyException with installation instructions, for example when converting a PPTX file without the [pptx] feature group installed.
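A minimal sketch of this pattern, assuming the converter checks for its optional module at conversion time: the exception class MissingDependencyError and the helper require are hypothetical stand-ins, and the message wording is invented for illustration rather than copied from the library.

```python
import importlib


class MissingDependencyError(Exception):
    """Hypothetical stand-in for MissingDependencyException."""


def require(module_name: str, feature_group: str):
    """Import an optional module, or raise with an installation hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"Converter needs the optional module '{module_name}'. "
            f"Install it with: pip install 'markitdown[{feature_group}]'"
        ) from exc


# 'module_that_is_not_installed' is a deliberately fake name that triggers the error path.
try:
    require("module_that_is_not_installed", "pptx")
    message = None
except MissingDependencyError as err:
    message = str(err)

print(message)
```

Deferring the import until it is needed keeps the core install small while still giving the user an actionable message at the point of failure.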
Sources: README.md:97-117, packages/markitdown/pyproject.toml:36-61, packages/markitdown/src/markitdown/converters/_pptx_converter.py:17-24
MarkItDown integrates with optional external services for enhanced conversion capabilities:
Figure 4: External service integration architecture
| Service | Configuration Parameters | Use Case | Relevant Converters |
|---|---|---|---|
| LLM API | llm_client, llm_model, llm_prompt | Image captioning and visual content description | ImageConverter, PptxConverter |
| Azure Document Intelligence | docintel_endpoint, docintel_credential (or AZURE_API_KEY env var) | Complex document layout analysis for PDFs and Office files | DocumentIntelligenceConverter |
| ExifTool | exiftool_path (or EXIFTOOL_PATH env var) | Image metadata extraction | ImageConverter |
| YouTube Transcript API | (no configuration) | Fetch video captions and transcripts | YouTubeConverter |
| SpeechRecognition | (no configuration) | Audio file transcription | AudioConverter |
Example configuration with LLM integration:
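A minimal sketch of how the LLM parameters from the table might be assembled before constructing the converter. The helper llm_options is hypothetical, the placeholder object stands in for a real client, and "gpt-4o" is only an example model name; the commented lines show how the options would be passed on, assuming the library and a real client are installed.

```python
from typing import Any, Dict, Optional


def llm_options(client: Any, model: str, prompt: Optional[str] = None) -> Dict[str, Any]:
    """Collect the LLM-related keyword arguments listed in the table."""
    options: Dict[str, Any] = {"llm_client": client, "llm_model": model}
    if prompt is not None:
        options["llm_prompt"] = prompt  # only set when a custom prompt is wanted
    return options


# A placeholder object stands in for a real client (for example an OpenAI-style client).
placeholder_client = object()
options = llm_options(placeholder_client, "gpt-4o", prompt="Describe this image.")

# With the library installed and a real client, the options would be passed as:
#   from markitdown import MarkItDown
#   md = MarkItDown(**options)
#   result = md.convert("example.jpg")
#   print(result.text_content)
print(sorted(options))
```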
Sources: README.md:136-177, packages/markitdown/pyproject.toml:36-61, packages/markitdown/src/markitdown/converters/_pptx_converter.py:102-129
MarkItDown supports four primary deployment patterns:
Figure 5: Four deployment patterns with their components
1. Local Development Installation
   - Python API: from markitdown import MarkItDown
   - CLI: markitdown command
2. Docker Containerization
   - Piped input: docker run --rm -i markitdown:latest < input.pdf > output.md
   - Mounted volume: docker run -v /local:/data markitdown:latest /data/file.pdf
3. MCP Server Integration
   - Tool: convert_to_markdown with file path/URI parameter
   - Configuration file: ~/.config/claude/claude_desktop_config.json
4. Plugin Extension
   - Entry point group: markitdown.plugin
   - register_converters(md: MarkItDown) -> None function
   - --use-plugins CLI flag or enable_plugins=True API parameter
   - markitdown-sample-plugin provides RTF conversion support

Sources: README.md:45-184, packages/markitdown/pyproject.toml:1-14
The following diagram illustrates the complete conversion pipeline from input to output:
Figure 6: Complete conversion pipeline with decision flow and error handling
Stage 1: Input Normalization
- URI inputs (data:, file:, http:, https: schemes)

Stage 2: File Type Detection
- magika ML model identifies file types from byte patterns
- Detected metadata is carried in the StreamInfo dataclass

Stage 3: Converter Selection
- _converters list sorted by descending priority
- accepts() method called with file_stream and stream_info

Stage 4: Conversion Execution
- convert() method processes the stream
- Returns DocumentConverterResult on success

Stage 5: Output Normalization
- _normalize_markdown() cleans whitespace and line endings
- Returns DocumentConverterResult to caller

Stage 6: Error Handling
- UnsupportedFormatException when no converter accepts the input
- FileConversionException with diagnostic details
- MissingDependencyException with installation instructions

Sources: packages/markitdown/src/markitdown/_markitdown.py
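Stages 3 through 6 can be sketched as a single loop. This is an illustrative model only: the exception classes and the functions normalize_markdown and run_pipeline are hypothetical stand-ins, converters are reduced to (name, accepts, convert) tuples, and falling through to the next converter after a failure is an assumption rather than something stated on this page.

```python
import re
from typing import Callable, List, Tuple


class UnsupportedFormatError(Exception):
    """Hypothetical stand-in for UnsupportedFormatException."""


class ConversionFailedError(Exception):
    """Hypothetical stand-in for FileConversionException."""


def normalize_markdown(text: str) -> str:
    """Strip trailing whitespace, unify line endings, collapse runs of blank lines."""
    lines = [line.rstrip() for line in re.split(r"\r?\n", text)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


# Each entry: (name, accepts(extension) -> bool, convert(data) -> str)
Entry = Tuple[str, Callable[[str], bool], Callable[[bytes], str]]


def run_pipeline(entries: List[Entry], data: bytes, extension: str) -> str:
    failures = []
    for name, accepts, convert in entries:  # assumed to be sorted by priority already
        if not accepts(extension):
            continue
        try:
            return normalize_markdown(convert(data))
        except Exception as exc:  # record the failure and try the next converter
            failures.append(f"{name}: {exc}")
    if failures:
        # At least one converter accepted the input, but every attempt failed.
        raise ConversionFailedError("; ".join(failures))
    raise UnsupportedFormatError(f"no converter accepts '{extension}'")


def broken(data: bytes) -> str:
    raise ValueError("corrupt file")


entries: List[Entry] = [
    ("broken-md", lambda ext: ext == ".md", broken),
    ("text", lambda ext: ext in (".md", ".txt"), lambda data: data.decode("utf-8")),
]

output = run_pipeline(entries, b"# Title  \r\n\r\n\r\n\r\nBody\n", ".md")
print(repr(output))
```

Keeping the two failure modes separate lets a caller distinguish "nothing can read this format" from "a converter tried and failed", which call for different remedies.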
MarkItDown supports third-party extensions through Python's entry points mechanism:
Figure 7: Plugin registration flow through entry points mechanism
Plugins must implement:
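- An entry point declared in pyproject.toml under the markitdown.plugin group
- A register_converters(md: MarkItDown) -> None function
- One or more converters implementing DocumentConverter

Plugins are discovered at runtime when enable_plugins=True or --use-plugins is specified. To list installed plugins: markitdown --list-plugins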
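The registration flow can be sketched without installing any plugin. In the toy model below, Registry, FakeEntryPoint, RtfPluginModule and load_plugins are all hypothetical; a real environment would obtain entry points from something like importlib.metadata.entry_points(group="markitdown.plugin"), and the actual registration call on the MarkItDown instance may take different arguments.

```python
from typing import Iterable, List


class Registry:
    """Hypothetical stand-in for the MarkItDown instance handed to plugins."""

    def __init__(self) -> None:
        self.converters: List[str] = []

    def register_converter(self, name: str) -> None:
        self.converters.append(name)


class FakeEntryPoint:
    """Mimics the .name / .load() surface of importlib.metadata.EntryPoint."""

    def __init__(self, name: str, module: object) -> None:
        self.name = name
        self._module = module

    def load(self) -> object:
        return self._module


class RtfPluginModule:
    """Plays the role of a plugin module exposing register_converters()."""

    @staticmethod
    def register_converters(md: Registry) -> None:
        md.register_converter("RtfConverter")


def load_plugins(md: Registry, entry_points: Iterable, enable_plugins: bool) -> List[str]:
    """Call register_converters() on every discovered plugin, if plugins are enabled."""
    loaded: List[str] = []
    if not enable_plugins:
        return loaded
    for ep in entry_points:
        plugin = ep.load()
        plugin.register_converters(md)
        loaded.append(ep.name)
    return loaded


discovered = [FakeEntryPoint("sample_plugin", RtfPluginModule)]
md = Registry()
loaded = load_plugins(md, discovered, enable_plugins=True)
print(loaded, md.converters)
```

Gating the loop on an explicit flag mirrors the opt-in behavior described above: installed plugins have no effect until the caller enables them.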
Sources: README.md:119-133
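MarkItDown provides a flexible, extensible architecture for document-to-Markdown conversion optimized for LLM processing. Key architectural features include a modular converter registry with priority-based selection, feature groups for optional dependencies, optional external service integrations, and a plugin system for third-party extensions.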
The system balances extensibility with simplicity, allowing users to install only needed capabilities while maintaining a consistent API across all deployment modes.
Sources: README.md:1-249, packages/markitdown/pyproject.toml:1-114