DOCX Converter

Relevant source files

packages/markitdown/src/markitdown/converters/_docx_converter.py

Purpose and Scope

The DOCX Converter handles conversion of Microsoft Word DOCX files to Markdown format. This converter preserves document structure including headings, tables, and formatting where possible. It is part of the Office Document Converters subsystem (see Office Document Converters for overview) and is one of several format-specific converters in the MarkItDown conversion system.

For other Office format converters, see:

- PPTX Converter for PowerPoint presentations
- Specialized Converters for Excel spreadsheets (XLSX/XLS)

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:1-84

Class Architecture

The DocxConverter class inherits from HtmlConverter, leveraging its HTML-to-Markdown conversion capabilities. This design pattern allows DOCX conversion to be implemented as a two-stage process: DOCX → HTML → Markdown.

Diagram: DocxConverter Class Hierarchy

In addition to inheriting from HtmlConverter, the DocxConverter maintains an internal _html_converter instance (packages/markitdown/src/markitdown/converters/_docx_converter.py:38), which it uses to perform the final HTML-to-Markdown transformation. This composition lets the converter reuse the HTML conversion logic without reimplementing Markdown generation.
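The combination of inheritance and composition described above can be sketched as follows. This is a minimal illustration, not the project's code: the HtmlConverter here is a stand-in stub, and html_to_markdown is a hypothetical helper added only to make the delegation visible.

```python
class HtmlConverter:
    """Stand-in for the real HtmlConverter (illustrative only)."""

    def convert_string(self, html: str) -> str:
        # The real method produces full Markdown; this stub only rewrites
        # <h1> headings so that the delegation can be observed.
        return html.replace("<h1>", "# ").replace("</h1>", "")


class DocxConverter(HtmlConverter):
    """Subclass of HtmlConverter that also holds its own HtmlConverter."""

    def __init__(self) -> None:
        super().__init__()
        # Composition: a dedicated instance performs the final HTML step.
        self._html_converter = HtmlConverter()

    def html_to_markdown(self, html: str) -> str:
        # Hypothetical helper: delegate to the internal instance.
        return self._html_converter.convert_string(html)


converter = DocxConverter()
print(isinstance(converter, HtmlConverter))
print(converter.html_to_markdown("<h1>Title</h1>"))
```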

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:31-38

File Acceptance Criteria

The accepts() method decides whether the converter should handle a file stream. It does not parse the document; it only checks the stream's file extension and MIME type against the accepted values.

Accepted Formats

| Criteria Type | Values | Code Reference |
| --- | --- | --- |
| MIME Type Prefix | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | _docx_converter.py:24-26 |
| File Extension | `.docx` | _docx_converter.py:28 |

Acceptance Logic

Diagram: File Acceptance Decision Flow

The converter first checks the file extension for an exact match, then falls back to MIME type prefix matching. Both the extension and the MIME type are normalized to lowercase before comparison (packages/markitdown/src/markitdown/converters/_docx_converter.py:46-47).
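The decision logic can be sketched as a small standalone function. This is a simplified illustration: the real accepts() method receives a file stream and stream metadata, whereas this version takes the extension and MIME type as plain strings, and the constant names are assumptions.

```python
from typing import Optional

# Assumed constant names; the values come from the table above.
ACCEPTED_MIME_TYPE_PREFIXES = [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
ACCEPTED_FILE_EXTENSIONS = [".docx"]


def accepts(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True if the extension or MIME type identifies a DOCX file."""
    # Normalize both values to lowercase; treat missing values as empty.
    extension = (extension or "").lower()
    mimetype = (mimetype or "").lower()

    # 1. Exact match on the file extension.
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True

    # 2. Fallback: prefix match on the MIME type.
    for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
        if mimetype.startswith(prefix):
            return True

    return False


print(accepts(".DOCX", None))
print(accepts(".pdf", "application/pdf"))
```

Prefix matching means a MIME type with trailing parameters still matches, as long as it starts with the accepted value.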

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:40-56

Conversion Process

The DOCX conversion follows a three-stage pipeline: pre-processing, mammoth conversion to HTML, and HTML-to-Markdown transformation.

Diagram: DOCX Conversion Pipeline

Stage 1: Pre-processing

The pre_process_docx() function prepares the DOCX stream for mammoth conversion (packages/markitdown/src/markitdown/converters/_docx_converter.py:79). This utility function is imported from converter_utils.docx.pre_process (packages/markitdown/src/markitdown/converters/_docx_converter.py:8) and performs any transformations needed to ensure compatibility with mammoth.

Stage 2: Mammoth Conversion

The mammoth library converts the DOCX document to HTML (packages/markitdown/src/markitdown/converters/_docx_converter.py:81). Mammoth is the primary external dependency for this converter and provides the core DOCX parsing and HTML generation functionality.

The conversion accepts an optional style_map parameter (packages/markitdown/src/markitdown/converters/_docx_converter.py:78), which customizes how DOCX styles are mapped to HTML elements. If provided via kwargs, this style map is passed directly to mammoth's convert_to_html() function.

Stage 3: HTML to Markdown

The resulting HTML string (accessed via the mammoth result's .value attribute) is passed to the convert_string() method of the internal _html_converter instance (packages/markitdown/src/markitdown/converters/_docx_converter.py:80-82). This delegates the HTML-to-Markdown conversion to the HtmlConverter class (see Web Content Converters for details on HTML conversion logic).
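The three stages can be sketched as a single function that chains them. To keep the sketch runnable without mammoth installed, each stage is passed in as a callable; convert_docx and the fake_* functions are hypothetical stand-ins, not part of MarkItDown.

```python
import io
from typing import BinaryIO, Callable, Optional


def convert_docx(
    stream: BinaryIO,
    pre_process: Callable[[BinaryIO], BinaryIO],
    to_html: Callable[..., str],
    html_to_markdown: Callable[[str], str],
    style_map: Optional[str] = None,
) -> str:
    """Chain the three stages: pre-process, DOCX to HTML, HTML to Markdown."""
    prepared = pre_process(stream)                 # Stage 1
    html = to_html(prepared, style_map=style_map)  # Stage 2
    return html_to_markdown(html)                  # Stage 3


# Stand-in stages. In the real converter, stage 2 is described as
# mammoth's convert_to_html(...) followed by reading the .value attribute.
def fake_pre_process(stream: BinaryIO) -> BinaryIO:
    return stream


def fake_to_html(stream: BinaryIO, style_map: Optional[str] = None) -> str:
    return "<p>" + stream.read().decode("utf-8") + "</p>"


def fake_html_to_markdown(html: str) -> str:
    return html.replace("<p>", "").replace("</p>", "")


result = convert_docx(
    io.BytesIO(b"hello"), fake_pre_process, fake_to_html, fake_html_to_markdown
)
print(result)
```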

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:58-83

Style Mapping

The style_map parameter enables customization of how DOCX paragraph and character styles are converted to HTML elements. This is a pass-through parameter that is forwarded directly to mammoth's conversion engine.

Style Map Usage

The converter extracts the style_map from kwargs with a default of None (packages/markitdown/src/markitdown/converters/_docx_converter.py:78), which makes the parameter optional. When no style map is provided, mammoth uses its default style mappings.
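As an illustration, a style map is a newline-separated string of rules in mammoth's style map syntax, where each rule maps a DOCX style matcher to an HTML element. The sketch below builds such a string and shows the optional-kwarg lookup; build_style_map and extract_style_map are hypothetical helpers, and the style names "Section Title" and "Code" are assumed to exist in the source document.

```python
from typing import Mapping, Optional


def build_style_map(rules: Mapping[str, str]) -> str:
    """Join {DOCX matcher: HTML target} pairs into one style map string."""
    return "\n".join(f"{matcher} => {target}" for matcher, target in rules.items())


def extract_style_map(**kwargs: object) -> Optional[object]:
    # Mirrors the described behavior: an optional kwarg defaulting to None.
    return kwargs.get("style_map", None)


style_map = build_style_map(
    {
        "p[style-name='Section Title']": "h1:fresh",
        "r[style-name='Code']": "code",
    }
)
print(style_map)
print(extract_style_map())                     # no style map supplied
print(extract_style_map(style_map=style_map))  # style map forwarded as-is
```

The resulting string would then be supplied as the style_map keyword argument when requesting a conversion, assuming the caller's keyword arguments reach the converter unchanged.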

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:78-82

Dependency Management

The DocxConverter requires the mammoth library to function. The import is attempted once when the module loads, but a failure is reported only when a conversion is requested, together with an informative error message.

Dependency Loading Pattern

Diagram: Mammoth Dependency Check Flow

The mammoth import is attempted at module import time (packages/markitdown/src/markitdown/converters/_docx_converter.py:16-21), and the details of any ImportError are stored in _dependency_exc_info. This deferred exception pattern allows the module to load even when mammoth is not installed, so MarkItDown can still function with its other converters.

When convert() is called, the converter checks whether _dependency_exc_info is set (packages/markitdown/src/markitdown/converters/_docx_converter.py:65-76). If so, it raises a MissingDependencyException with a formatted message indicating:

- The converter name (DocxConverter)
- The file extension (.docx)
- The feature group to install (docx)

The error message is generated using MISSING_DEPENDENCY_MESSAGE.format() (packages/markitdown/src/markitdown/converters/_docx_converter.py:67-71), which prompts users to install the required dependency via pip install markitdown[docx].

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:13-76

Integration with Converter Registry

The DocxConverter is registered with the MarkItDown converter registry as one of the built-in converters. Its position in the selection order is determined by its priority: format-specific converters such as this one are tried before generic converters.

Registration Flow

Diagram: DocxConverter Registration and Selection

The converter is imported through the converters package's __init__.py and added to the list of available converters. During file processing, the converter registry iterates through converters in priority order, calling each converter's accepts() method until one returns True (see DocumentConverter Interface for details on the converter selection algorithm).
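The selection loop can be sketched as follows. This is an illustrative model only: the Registration dataclass, the numeric priorities, and the convention that lower values are tried first are assumptions, and accepts is reduced to a check on the file extension.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    name: str
    priority: float  # assumption: lower values are tried first
    accepts: Callable[[str], bool]


def select_converter(
    registrations: List[Registration], extension: str
) -> Optional[str]:
    """Return the name of the first converter, in priority order, that accepts."""
    for registration in sorted(registrations, key=lambda r: r.priority):
        if registration.accepts(extension):
            return registration.name
    return None


registry = [
    # A generic catch-all converter with a less specific priority.
    Registration("PlainTextConverter", 10.0, lambda ext: True),
    # A format-specific converter that is consulted first.
    Registration("DocxConverter", 0.0, lambda ext: ext == ".docx"),
]

print(select_converter(registry, ".docx"))
print(select_converter(registry, ".txt"))
```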

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:31-56

Error Handling

The converter handles a missing mammoth dependency explicitly. Other failures raised during conversion are not caught and propagate to the caller.

Error Types

| Exception Type | Trigger Condition | Code Location |
| --- | --- | --- |
| `MissingDependencyException` | `mammoth` not installed | _docx_converter.py:66-76 |
| Conversion errors | Propagated from mammoth or HtmlConverter | _docx_converter.py:58-83 |

Missing Dependency Flow

When mammoth is not installed and a conversion is attempted:

1. convert() is called with a DOCX file stream.
2. The method checks _dependency_exc_info (packages/markitdown/src/markitdown/converters/_docx_converter.py:65).
3. If it is set, the method constructs a MissingDependencyException with context.
4. The exception is chained from the original ImportError, whose traceback is reattached via .with_traceback() (packages/markitdown/src/markitdown/converters/_docx_converter.py:74-76).
5. Users receive a clear error message with installation instructions.

This pattern ensures that missing dependencies result in actionable error messages rather than cryptic import failures.

Sources: packages/markitdown/src/markitdown/converters/_docx_converter.py:64-76
