This document gives an overview of the core components of the MarkItDown system. These components form the foundational layer that orchestrates document conversion, manages converter registration, detects file types, and processes the various input sources.
The core components covered include:
- MarkItDown orchestrator class
- StreamInfo metadata handling

For details on specific converters and their implementations, see Conversion System. For deployment and integration patterns, see Integration and Deployment.
The MarkItDown system is organized around several key components that work together to convert documents to Markdown. The following diagram shows the primary components and their relationships:
Sources: packages/markitdown/src/markitdown/_markitdown.py1-784
The MarkItDown class serves as the main orchestrator for the conversion system. It manages the converter registry, handles various input sources (local files, streams, URIs), and coordinates file type detection.
The MarkItDown class is initialized with optional parameters for enabling built-in converters and plugins:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_builtins | Union[None, bool] | None (treated as True) | Enable built-in converters |
| enable_plugins | Union[None, bool] | None (treated as False) | Enable plugin converters |
| Additional kwargs | Various | - | Configuration for LLM client, Azure services, exiftool path, etc. |
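The defaulting rules in the table can be illustrated with a small sketch. It is not the library's code: `OrchestratorSketch` and its `enabled` list are hypothetical stand-ins that mirror only how `None` is interpreted for each flag.

```python
from typing import Any, List, Optional


class OrchestratorSketch:
    """Hypothetical stand-in that mirrors only the flag defaulting rules."""

    def __init__(
        self,
        *,
        enable_builtins: Optional[bool] = None,
        enable_plugins: Optional[bool] = None,
        **kwargs: Any,
    ) -> None:
        self.enabled: List[str] = []  # records which converter groups are on
        self.options = kwargs  # extra configuration is kept for later use

        # None is treated as True: built-ins are on unless explicitly disabled.
        if enable_builtins is None or enable_builtins:
            self.enabled.append("builtins")

        # None is treated as False: plugins are off unless explicitly enabled.
        if enable_plugins:
            self.enabled.append("plugins")


default_groups = OrchestratorSketch().enabled
plugins_only = OrchestratorSketch(enable_builtins=False, enable_plugins=True).enabled
```

With no arguments only the built-in group is switched on; plugins require an explicit opt-in.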
When built-ins are enabled via enable_builtins() packages/markitdown/src/markitdown/_markitdown.py140-230 the orchestrator:
The MarkItDown class provides multiple entry points for conversion, all routing through the internal _convert() method:
Sources: packages/markitdown/src/markitdown/_markitdown.py252-536
For detailed information on the MarkItDown class methods and usage patterns, see MarkItDown Class.
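As a rough illustration of how a single entry point might route different source types, the sketch below classifies a source as a URI, a local path, or a stream. The function name `classify_source`, the returned labels, and the rule that any string without a known scheme is a local path are assumptions of this sketch, not the library's implementation.

```python
import io
from pathlib import Path
from typing import Any

# Schemes named in this document; treating every other string as a local
# path is an assumption of this sketch.
URI_PREFIXES = ("http:", "https:", "file:", "data:")


def classify_source(source: Any) -> str:
    """Return which kind of entry point a source would be routed to."""
    if isinstance(source, str):
        if source.lower().startswith(URI_PREFIXES):
            return "uri"
        return "local"
    if isinstance(source, Path):
        return "local"
    if hasattr(source, "read"):
        # Any file-like object with a read() method is handled as a stream.
        return "stream"
    raise TypeError(f"Unsupported source type: {type(source).__name__}")


kinds = [
    classify_source(s)
    for s in ("https://example.com/a.pdf", "notes.txt", io.BytesIO(b"data"))
]
```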
The StreamInfo dataclass serves as a metadata container for file streams. It stores:
- mimetype: MIME type of the content
- charset: Character encoding
- extension: File extension (with leading dot)
- filename: Full filename
- local_path: Path to local file
- url: URL or URI of the source

File type detection uses a multi-layered approach combining multiple sources of information:
Sources: packages/markitdown/src/markitdown/_markitdown.py673-772
The _get_stream_info_guesses() method packages/markitdown/src/markitdown/_markitdown.py673-772 coordinates this detection process, producing a prioritized list of StreamInfo guesses that the converter registry will attempt in order.
For more details on stream handling, see Stream Handling and File Detection.
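The metadata container and one layer of the detection process can be sketched as follows. `StreamInfoSketch` and `guess_stream_info` are simplified, hypothetical stand-ins: only name-based detection through the standard `mimetypes` module is shown, any content-based detection is omitted, and the ordering of the returned guesses is an assumption of the sketch.

```python
import mimetypes
import os
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Simplified stand-in with the fields listed above."""

    mimetype: Optional[str] = None
    charset: Optional[str] = None
    extension: Optional[str] = None  # includes the leading dot, e.g. ".pdf"
    filename: Optional[str] = None
    local_path: Optional[str] = None
    url: Optional[str] = None


def guess_stream_info(base: StreamInfoSketch) -> List[StreamInfoSketch]:
    """Return guesses in priority order, filling gaps from the file name."""
    enhanced = base
    name = base.filename
    if name is None and base.local_path:
        name = os.path.basename(base.local_path)

    # Derive the extension from the file name when the caller gave none.
    if name and enhanced.extension is None:
        _, ext = os.path.splitext(name)
        if ext:
            enhanced = replace(enhanced, filename=name, extension=ext.lower())

    # Derive the MIME type from the extension when the caller gave none.
    if enhanced.mimetype is None and enhanced.extension is not None:
        guessed, _ = mimetypes.guess_type("placeholder" + enhanced.extension)
        if guessed:
            enhanced = replace(enhanced, mimetype=guessed)

    # Enriched guess first; the caller's original hints remain as a fallback.
    return [enhanced] if enhanced == base else [enhanced, base]


guesses = guess_stream_info(StreamInfoSketch(local_path="/tmp/report.PDF"))
```

Making the dataclass frozen means each refinement produces a new object, so earlier guesses stay intact in the list.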
The converter registry is a core component that manages the selection and ordering of document converters. Each converter is wrapped in a ConverterRegistration dataclass:
Sources: packages/markitdown/src/markitdown/_markitdown.py85-91
Two priority constants define the ordering:
| Constant | Value | Purpose |
|---|---|---|
| PRIORITY_SPECIFIC_FILE_FORMAT | 0.0 | Specific format converters (DOCX, PDF, etc.) |
| PRIORITY_GENERIC_FILE_FORMAT | 10.0 | Generic/catch-all converters (PlainText, HTML, ZIP) |
Sources: packages/markitdown/src/markitdown/_markitdown.py54-59
Lower priority values are tried first, so specific converters take precedence over generic ones.
Sources: packages/markitdown/src/markitdown/_markitdown.py538-631
The built-in converters are registered in a specific order within enable_builtins() packages/markitdown/src/markitdown/_markitdown.py181-226:
| Order | Converter | Priority | Notes |
|---|---|---|---|
| 1 | PlainTextConverter | 10.0 | Generic, registered first |
| 2 | ZipConverter | 10.0 | Generic |
| 3 | HtmlConverter | 10.0 | Generic |
| 4 | RssConverter | 0.0 | Specific formats below |
| 5 | WikipediaConverter | 0.0 | |
| 6 | YouTubeConverter | 0.0 | |
| 7 | BingSerpConverter | 0.0 | |
| 8 | DocxConverter | 0.0 | |
| 9 | XlsxConverter | 0.0 | |
| 10 | XlsConverter | 0.0 | |
| 11 | PptxConverter | 0.0 | |
| 12 | AudioConverter | 0.0 | |
| 13 | ImageConverter | 0.0 | |
| 14 | IpynbConverter | 0.0 | |
| 15 | PdfConverter | 0.0 | |
| 16 | OutlookMsgConverter | 0.0 | |
| 17 | EpubConverter | 0.0 | |
| 18 | CsvConverter | 0.0 | |
| Last | DocumentIntelligenceConverter | 0.0 | Optional, if endpoint provided |
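Because each registration is inserted at position 0 and the sort by priority is stable, converters registered later are tried earlier among converters with the same priority value. Among the specific format converters (priority 0.0), DocumentIntelligenceConverter (if registered) and CsvConverter are therefore tried before earlier registrations such as RssConverter. Among the generic converters (priority 10.0), HtmlConverter is tried before ZipConverter and PlainTextConverter. Every specific format converter is tried before any generic one.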
Sources: packages/markitdown/src/markitdown/_markitdown.py181-226
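The ordering rule can be reproduced with a minimal sketch. `RegistrySketch` is a hypothetical stand-in, the field names of `ConverterRegistration` are assumed, and converters are represented by their names only; the point is the interaction between insertion at position 0 and a stable sort.

```python
from dataclasses import dataclass
from typing import Any, List

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


@dataclass
class ConverterRegistration:
    """Pairs a converter with its priority (field names are assumed)."""

    converter: Any
    priority: float


class RegistrySketch:
    def __init__(self) -> None:
        self._registrations: List[ConverterRegistration] = []

    def register(
        self, converter: Any, *, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT
    ) -> None:
        # Insert at position 0 so that later registrations come first.
        self._registrations.insert(0, ConverterRegistration(converter, priority))

    def ordered(self) -> List[Any]:
        # sorted() is stable: equal priorities keep their current order.
        ranked = sorted(self._registrations, key=lambda r: r.priority)
        return [r.converter for r in ranked]


registry = RegistrySketch()
registry.register("PlainTextConverter", priority=PRIORITY_GENERIC_FILE_FORMAT)
registry.register("HtmlConverter", priority=PRIORITY_GENERIC_FILE_FORMAT)
registry.register("RssConverter")
registry.register("CsvConverter")
attempt_order = registry.ordered()
# attempt_order == ["CsvConverter", "RssConverter",
#                   "HtmlConverter", "PlainTextConverter"]
```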
Plugins can register converters with custom priorities via the register_converters() function called during plugin loading packages/markitdown/src/markitdown/_markitdown.py240-247. This allows plugins to:
For details on the plugin system, see Plugin System.
The exception hierarchy provides structured error reporting for conversion failures. The core exceptions are:
Sources: packages/markitdown/src/markitdown/_markitdown.py46-50
The _convert() method tracks failed conversion attempts and raises appropriate exceptions:
- FileConversionException when one or more converters accepted the file but failed packages/markitdown/src/markitdown/_markitdown.py625-626
- UnsupportedFormatException when no converter accepted the file packages/markitdown/src/markitdown/_markitdown.py629-631

For complete exception documentation, see Exception Handling.
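The decision between the two exceptions can be sketched as below. The exception classes here are local stand-ins that share the names used above, and `convert_with_tracking` with its `(name, accepts, convert)` tuples is a hypothetical simplification of the converter interface.

```python
from typing import Callable, List, Tuple


class UnsupportedFormatException(Exception):
    """Stand-in: no converter accepted the input."""


class FileConversionException(Exception):
    """Stand-in: at least one converter accepted the input but failed."""

    def __init__(self, attempts: List[Tuple[str, Exception]]) -> None:
        names = ", ".join(name for name, _ in attempts)
        super().__init__(f"Conversion attempted and failed by: {names}")
        self.attempts = attempts


# Hypothetical converter shape: (name, accepts check, convert function).
Converter = Tuple[str, Callable[[str], bool], Callable[[str], str]]


def convert_with_tracking(data: str, converters: List[Converter]) -> str:
    failed: List[Tuple[str, Exception]] = []
    for name, accepts, convert in converters:
        if not accepts(data):
            continue  # a declined input is not counted as a failure
        try:
            return convert(data)
        except Exception as exc:
            failed.append((name, exc))  # remember it, then try the next one
    if failed:
        raise FileConversionException(failed)
    raise UnsupportedFormatException("No converter accepted the input")
```

Keeping the list of failed attempts lets the caller see which converters were tried and why each one failed, instead of only the last error.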
The URI handling utilities support multiple URI schemes:
| Scheme | Handler | Description |
|---|---|---|
| file: | file_uri_to_path() | Convert file URIs to local paths |
| data: | parse_data_uri() | Parse data URIs into mimetype and bytes |
| http: / https: | requests library | Fetch remote content |
Sources: packages/markitdown/src/markitdown/_markitdown.py405-464
The convert_uri() method packages/markitdown/src/markitdown/_markitdown.py405-464 serves as the entry point for all URI processing, dispatching to the appropriate handler based on the scheme.
For detailed URI handling documentation, see URI Handling.
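A scheme-based dispatch of this kind might look like the following sketch. `dispatch_uri` and `parse_data_uri_sketch` are illustrative stand-ins built on the standard library; they are not the library's helpers, and their return shapes are assumptions.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes, urlparse


def parse_data_uri_sketch(uri: str) -> Tuple[str, bytes]:
    """Split a data: URI into (mimetype, payload bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("Not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("Malformed data URI: missing comma")
    parts = header.split(";")
    mimetype = parts[0] or "text/plain"  # RFC 2397 default media type
    if "base64" in parts[1:]:
        return mimetype, base64.b64decode(payload)
    return mimetype, unquote_to_bytes(payload)  # percent-encoded payload


def dispatch_uri(uri: str) -> str:
    """Name the handler a URI would be routed to, based on its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme == "file":
        return "file_uri_to_path"
    if scheme == "data":
        return "parse_data_uri"
    if scheme in ("http", "https"):
        return "requests"
    raise ValueError(f"Unsupported URI scheme: {scheme or '(none)'}")


mimetype, payload = parse_data_uri_sketch("data:text/plain;base64,SGVsbG8=")
```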
The plugin system enables third-party extensions to register custom converters without modifying the core codebase. Plugins are discovered using Python's entry_points mechanism.
Sources: packages/markitdown/src/markitdown/_markitdown.py65-82 packages/markitdown/src/markitdown/_markitdown.py232-250
Plugins must implement a register_converters() function with the signature:
The function receives the MarkItDown instance and any keyword arguments passed during initialization. Plugins typically:
- Call markitdown.register_converter() with appropriate priorities

For a complete example and detailed plugin documentation, see Plugin System and Sample RTF Converter Plugin.
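A plugin entry point following this description could be sketched as below. `HostSketch` and `RtfConverterSketch` are hypothetical stand-ins so the example runs on its own, and the keyword-only `priority` parameter is an assumption; the sketch shows only the shape of the call, with the host instance and the initialization keyword arguments passed in.

```python
from typing import Any, List, Tuple


class HostSketch:
    """Hypothetical stand-in for the MarkItDown instance passed to plugins."""

    def __init__(self) -> None:
        self.registered: List[Tuple[Any, float]] = []

    def register_converter(self, converter: Any, *, priority: float = 0.0) -> None:
        self.registered.append((converter, priority))


class RtfConverterSketch:
    """Placeholder converter; a real one would implement the conversion."""


def register_converters(markitdown: HostSketch, **kwargs: Any) -> None:
    """Plugin entry point: receives the host instance and its init kwargs."""
    # 0.0 corresponds to PRIORITY_SPECIFIC_FILE_FORMAT in the priority table.
    markitdown.register_converter(RtfConverterSketch(), priority=0.0)


host = HostSketch()
register_converters(host, llm_client=None)
```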
After successful conversion, the _convert() method normalizes the output Markdown packages/markitdown/src/markitdown/_markitdown.py617-622:
- Splits the text on \r?\n and removes trailing whitespace from each line

This ensures consistent Markdown formatting regardless of the source document's line endings or spacing.
Sources: packages/markitdown/src/markitdown/_markitdown.py617-622
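The normalization can be sketched with the standard `re` module. `normalize_markdown` is a hypothetical name, and the second step, which collapses runs of blank lines, is an assumption of this sketch rather than something stated above.

```python
import re


def normalize_markdown(text: str) -> str:
    # Split on \r?\n, strip trailing whitespace per line, rejoin with "\n".
    text = "\n".join(line.rstrip() for line in re.split(r"\r?\n", text))
    # Assumption of this sketch: three or more newlines collapse to two.
    return re.sub(r"\n{3,}", "\n\n", text)


cleaned = normalize_markdown("# Title  \r\n\r\n\r\n\r\nBody\t\r\n")
```

Splitting on `\r?\n` handles both Windows and Unix line endings in one pass, so the output always uses `\n`.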
The core components work together to provide a flexible, extensible document conversion system:
These components form the foundation upon which all document converters are built. For information on specific converters, see Conversion System.