The DOCX Converter handles conversion of Microsoft Word DOCX files to Markdown format. This converter preserves document structure including headings, tables, and formatting where possible. It is part of the Office Document Converters subsystem (see Office Document Converters for overview) and is one of several format-specific converters in the MarkItDown conversion system.
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py1-84
The DocxConverter class inherits from HtmlConverter, leveraging its HTML-to-Markdown conversion capabilities. This design pattern allows DOCX conversion to be implemented as a two-stage process: DOCX → HTML → Markdown.
Diagram: DocxConverter Class Hierarchy
In addition to inheriting from HtmlConverter, the DocxConverter maintains an internal _html_converter instance (packages/markitdown/src/markitdown/converters/_docx_converter.py38), which it uses to perform the final HTML-to-Markdown transformation. This composition pattern allows reuse of HTML conversion logic without reimplementing Markdown generation.
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py31-38
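The relationship described above, inheritance plus an internal helper instance, can be sketched as follows. This is a minimal illustration, not the library's code: the HtmlConverter here is a stand-in stub, and its convert_string() body is a placeholder assumption.

```python
class HtmlConverter:
    """Stand-in stub for the real HtmlConverter (assumption for illustration)."""

    def convert_string(self, html: str) -> str:
        # Placeholder logic only; the real class performs full
        # HTML-to-Markdown conversion.
        return html.replace("<p>", "").replace("</p>", "\n")


class DocxConverter(HtmlConverter):
    """Inherits from HtmlConverter and also holds its own HtmlConverter instance."""

    def __init__(self) -> None:
        super().__init__()
        # Composition: a dedicated instance performs the final
        # HTML-to-Markdown step.
        self._html_converter = HtmlConverter()


converter = DocxConverter()
```

Holding a separate instance keeps the HTML-to-Markdown step independent of whatever the subclass overrides.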
The accepts() method determines whether a file stream is a valid DOCX document based on MIME type prefixes and file extensions.
| Criteria Type | Values | Code Reference |
|---|---|---|
| MIME Type Prefix | application/vnd.openxmlformats-officedocument.wordprocessingml.document | _docx_converter.py24-26 |
| File Extension | .docx | _docx_converter.py28 |
Diagram: File Acceptance Decision Flow
The converter first checks the file extension for an exact match, then falls back to MIME type prefix matching. Both the extension and MIME type are normalized to lowercase before comparison (packages/markitdown/src/markitdown/converters/_docx_converter.py46-47).
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py40-56
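The acceptance check described above can be sketched as a small standalone function. The function name, the constant names, and the plain-string parameters are assumptions for illustration; the real accepts() method receives a file stream and stream metadata rather than bare strings.

```python
from typing import Optional

# Values taken from the acceptance criteria table; the constant names are illustrative.
ACCEPTED_MIME_TYPE_PREFIXES = [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
ACCEPTED_FILE_EXTENSIONS = [".docx"]


def accepts_docx(extension: Optional[str] = None, mimetype: Optional[str] = None) -> bool:
    # Normalize to lowercase; treat missing values as empty strings.
    extension = (extension or "").lower()
    mimetype = (mimetype or "").lower()

    # Step 1: exact match on the file extension.
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True

    # Step 2: fall back to prefix matching on the MIME type.
    return any(mimetype.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Prefix matching means a MIME type carrying extra parameters after the base type is still accepted.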
The DOCX conversion follows a three-stage pipeline: pre-processing, mammoth conversion to HTML, and HTML-to-Markdown transformation.
Diagram: DOCX Conversion Pipeline
The pre_process_docx() function prepares the DOCX stream for mammoth conversion (packages/markitdown/src/markitdown/converters/_docx_converter.py79). This utility function is imported from converter_utils.docx.pre_process (packages/markitdown/src/markitdown/converters/_docx_converter.py8) and performs any necessary transformations to ensure compatibility with mammoth.
The mammoth library converts the DOCX document to HTML (packages/markitdown/src/markitdown/converters/_docx_converter.py81). Mammoth is the primary external dependency for this converter and provides the core DOCX parsing and HTML generation functionality.
The conversion uses an optional style_map parameter (packages/markitdown/src/markitdown/converters/_docx_converter.py78), which allows customization of how DOCX styles are mapped to HTML elements. If provided via kwargs, this style map is passed directly to mammoth's convert_to_html() function.
The resulting HTML string (accessed via the mammoth result's .value attribute) is passed to the convert_string() method of the internal _html_converter instance (packages/markitdown/src/markitdown/converters/_docx_converter.py80-82). This delegates the HTML-to-Markdown conversion to the HtmlConverter class (see Web Content Converters for details on HTML conversion logic).
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py58-83
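The three stages above can be sketched with each stage passed in as a callable, so the flow runs without mammoth installed. Every name here (convert_docx and the fake_* helpers) is hypothetical. In the real converter the stages would be pre_process_docx(), mammoth's convert_to_html() (with the HTML read from the result's .value attribute), and the HtmlConverter's convert_string().

```python
import io
from typing import BinaryIO, Callable, Optional


def convert_docx(
    stream: BinaryIO,
    pre_process: Callable[[BinaryIO], BinaryIO],
    to_html: Callable[..., str],
    html_to_markdown: Callable[[str], str],
    style_map: Optional[str] = None,
) -> str:
    prepared = pre_process(stream)                  # stage 1: pre-processing
    html = to_html(prepared, style_map=style_map)   # stage 2: DOCX -> HTML
    return html_to_markdown(html)                   # stage 3: HTML -> Markdown


# Toy stand-ins for the three stages, used only to exercise the flow.
def fake_pre_process(stream: BinaryIO) -> BinaryIO:
    return stream


def fake_to_html(stream: BinaryIO, style_map: Optional[str] = None) -> str:
    return "<h1>" + stream.read().decode("utf-8") + "</h1>"


def fake_html_to_markdown(html: str) -> str:
    return html.replace("<h1>", "# ").replace("</h1>", "")


markdown = convert_docx(
    io.BytesIO(b"Title"), fake_pre_process, fake_to_html, fake_html_to_markdown
)
```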
The style_map parameter enables customization of how DOCX paragraph and character styles are converted to HTML elements. This is a pass-through parameter that is forwarded directly to mammoth's conversion engine.
The converter extracts the style_map from kwargs with a default of None (packages/markitdown/src/markitdown/converters/_docx_converter.py78), which makes the parameter optional. When no style map is provided, mammoth uses its default style mappings.
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py78-82
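As an illustration of the style_map handling above, the sketch below builds a style map string in mammoth's style-map syntax and reads it from keyword arguments with a None default. The style names and the helper extract_style_map() are hypothetical; real style names depend on the source document.

```python
from typing import Any, Optional

# One mapping per line: a DOCX style matcher on the left, an HTML path on the right.
# "Section Title" and "Subsection Title" are example style names, not fixed values.
STYLE_MAP = "\n".join(
    [
        "p[style-name='Section Title'] => h1:fresh",
        "p[style-name='Subsection Title'] => h2:fresh",
    ]
)


def extract_style_map(**kwargs: Any) -> Optional[str]:
    # Optional parameter: absent means None, in which case mammoth's
    # default style mappings would apply.
    return kwargs.get("style_map", None)
```

Assuming the caller's keyword arguments reach the converter unchanged, the string would be supplied as style_map=STYLE_MAP at the call site.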
The DocxConverter requires the mammoth library to function. If mammoth is missing, the converter defers the resulting error until a conversion is attempted and reports it with an informative message.
Diagram: Mammoth Dependency Check Flow
The import of mammoth is attempted at module import time (packages/markitdown/src/markitdown/converters/_docx_converter.py16-21), and any ImportError details are stored in _dependency_exc_info rather than raised. This deferred exception pattern allows the module to load even when mammoth is not installed, enabling MarkItDown to function with other converters.
When convert() is called, the converter checks whether _dependency_exc_info is set (packages/markitdown/src/markitdown/converters/_docx_converter.py65-76). If so, it raises a MissingDependencyException with a formatted message indicating:
- The converter name (DocxConverter)
- The file extension (.docx)
- The feature name (docx)

The error message is generated using MISSING_DEPENDENCY_MESSAGE.format() (packages/markitdown/src/markitdown/converters/_docx_converter.py67-71), which prompts users to install the required dependency via pip install markitdown[docx].
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py13-76
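The deferred exception pattern described above can be sketched in isolation. Several parts are assumptions made so the sketch runs anywhere: the exception class is a local stand-in, the message wording and its placeholder names are hypothetical, and a deliberately nonexistent module replaces the mammoth import so the failure path is always taken.

```python
import sys

# Hypothetical wording; only the pip install hint is taken from the text above.
MISSING_DEPENDENCY_MESSAGE = (
    "{converter} recognized the input as a potential {extension} file, but the "
    "required dependencies are not installed. "
    "Try: pip install markitdown[{feature}]"
)


class MissingDependencyException(Exception):
    """Local stand-in for the library's exception of the same name."""


_dependency_exc_info = None
try:
    # Stand-in for `import mammoth`; this module is assumed not to exist.
    import hypothetical_docx_backend_that_is_not_installed  # noqa: F401
except ImportError:
    # Record the failure instead of raising, so this module still loads.
    _dependency_exc_info = sys.exc_info()


def convert() -> str:
    if _dependency_exc_info is not None:
        # Raise an actionable error, chained to the original ImportError
        # with its original traceback attached.
        raise MissingDependencyException(
            MISSING_DEPENDENCY_MESSAGE.format(
                converter="DocxConverter", extension=".docx", feature="docx"
            )
        ) from _dependency_exc_info[1].with_traceback(_dependency_exc_info[2])
    return "converted"
```

Chaining with `from` keeps the original ImportError available as the new exception's `__cause__`, so the root cause remains visible in the traceback.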
The DocxConverter is automatically discovered and registered by the MarkItDown system through the converter registry. Its priority in the converter selection process is determined by its specificity (format-specific converters have higher priority than generic converters).
Diagram: DocxConverter Registration and Selection
The converter is imported through the converters package's __init__.py and added to the list of available converters. During file processing, the converter registry iterates through converters in priority order, calling each converter's accepts() method until one returns True (see DocumentConverter Interface for details on the converter selection algorithm).
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py31-56
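The selection loop described above, which tries converters in priority order until one accepts the input, can be sketched as follows. The Registration type, the numeric priorities, and the convention that lower values are tried first are all assumptions for illustration, not the registry's confirmed design.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    name: str
    priority: float  # assumption: lower values are tried first
    accepts: Callable[[str], bool]


def select_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    # Walk the converters in priority order and stop at the first one that accepts.
    for registration in sorted(registrations, key=lambda r: r.priority):
        if registration.accepts(extension):
            return registration.name
    return None


registry = [
    # Generic fallback: accepts anything, tried last.
    Registration("PlainTextConverter", 10.0, lambda ext: True),
    # Format-specific: accepts only .docx, tried first.
    Registration("DocxConverter", 0.0, lambda ext: ext.lower() == ".docx"),
]

chosen_for_docx = select_converter(registry, ".docx")
chosen_for_txt = select_converter(registry, ".txt")
```

Because the format-specific entry is tried before the generic fallback, a .docx input never reaches the catch-all converter in this sketch.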
The converter raises a dedicated exception when its dependency is missing and lets other conversion failures propagate to the caller.
| Exception Type | Trigger Condition | Code Location |
|---|---|---|
| MissingDependencyException | mammoth not installed | _docx_converter.py66-76 |
| Conversion errors | Propagated from mammoth or HtmlConverter | _docx_converter.py58-83 |
When mammoth is not installed and a conversion is attempted:
1. convert() is called with a DOCX file stream.
2. The converter checks _dependency_exc_info (packages/markitdown/src/markitdown/converters/_docx_converter.py65).
3. It raises a MissingDependencyException with context.
4. The original ImportError traceback is attached via .with_traceback() (packages/markitdown/src/markitdown/converters/_docx_converter.py74-76).

This pattern ensures that missing dependencies result in actionable error messages rather than cryptic import failures.
Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py64-76