Testing Framework

Purpose and Scope

This document describes the testing framework for MarkItDown, including the test suite organization, test vector system, parameterized testing patterns, and test file structure. The framework uses pytest with custom test vectors to validate document conversion across multiple input formats and access patterns.

For information about CI/CD workflows and quality assurance processes, see CI/CD and Quality Assurance. For general project configuration including test environments, see Project Configuration.

Test Suite Architecture

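The MarkItDown test suite is organized into two main test modules with distinct responsibilities: test_module_vectors.py, which runs the vector-driven parameterized conversion tests, and test_module_misc.py, which holds integration, utility, and edge-case tests.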

Test Module Organization

Sources: packages/markitdown/tests/test_module_vectors.py:1-235, packages/markitdown/tests/test_module_misc.py:1-505

FileTestVector Dataclass

The test framework uses a FileTestVector dataclass to define test cases systematically. Each vector specifies a test file, expected content assertions, and metadata:

| Field | Type | Purpose |
| --- | --- | --- |
| `filename` | `str` | Name of the file in the `test_files/` directory |
| `mimetype` | `str` | Expected MIME type for content detection |
| `charset` | `str` | Expected character encoding |
| `url` | `str` (optional) | Mock URL for testing URL context |
| `must_include` | `List[str]` | Strings that must appear in the conversion result |
| `must_not_include` | `List[str]` | Strings that must not appear in the result |

The vectors are organized into two collections imported from _test_vectors:

- `GENERAL_TEST_VECTORS`: Standard format conversion test cases
- `DATA_URI_TEST_VECTORS`: Cases for testing data URI preservation with `keep_data_uris=True`
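As a rough illustration of the fields listed above, a vector could be modeled along these lines. This is a minimal sketch, not the definition from `_test_vectors`: the frozen dataclass, the default values, and the example entry (file name and strings) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FileTestVector:
    """One conversion test case: an input file plus expectations about the output."""

    filename: str                # file name inside test_files/
    mimetype: str                # expected MIME type
    charset: str                 # expected character encoding
    url: Optional[str] = None    # mock URL supplying URL context, if any
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)


# Hypothetical entry: the file name and strings are placeholders, not real test data.
EXAMPLE_VECTORS = [
    FileTestVector(
        filename="example.html",
        mimetype="text/html",
        charset="utf-8",
        must_include=["Example heading"],
        must_not_include=["<script>"],
    ),
]
```

Keeping the expectations next to the file name means adding a new format is mostly a matter of appending one entry to a list.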

Sources: packages/markitdown/tests/test_module_vectors.py:10-12

Parameterized Testing Pattern

Test Execution Flow

Parameterized Test Examples

Each test function is decorated with @pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS) to run against all vectors:

packages/markitdown/tests/test_module_vectors.py:27-55

packages/markitdown/tests/test_module_vectors.py:57-69

packages/markitdown/tests/test_module_vectors.py:71-90
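The general shape of such a parameterized test can be sketched as follows. The `Vector` class, the `VECTORS` list, and `fake_convert` are hypothetical stand-ins so the example runs without MarkItDown or any test files; only `pytest.mark.parametrize` itself is real API.

```python
from dataclasses import dataclass
from typing import List

import pytest


@dataclass(frozen=True)
class Vector:
    """Hypothetical stand-in for FileTestVector with only the fields used here."""

    filename: str
    must_include: List[str]
    must_not_include: List[str]


VECTORS = [
    Vector("a.txt", ["alpha"], ["beta"]),
    Vector("b.txt", ["beta"], ["alpha"]),
]

# Placeholder for a real conversion call: maps file names to canned output.
FAKE_OUTPUT = {"a.txt": "alpha text", "b.txt": "beta text"}


def fake_convert(filename: str) -> str:
    return FAKE_OUTPUT[filename]


# pytest generates one test case per vector; ids make failures easy to attribute.
@pytest.mark.parametrize("test_vector", VECTORS, ids=lambda v: v.filename)
def test_convert_local(test_vector: Vector) -> None:
    text = fake_convert(test_vector.filename)
    for expected in test_vector.must_include:
        assert expected in text
    for forbidden in test_vector.must_not_include:
        assert forbidden not in text
```

Because the decorator only attaches metadata, the function can still be called directly with a single vector, which is what makes running the tests without pytest possible.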

Sources: packages/markitdown/tests/test_module_vectors.py:27-160

Test Categories

Format Conversion Tests

These tests validate conversion accuracy across different input methods:

| Test Function | Input Method | Validates |
| --- | --- | --- |
| `test_convert_local` | File path | Local file conversion |
| `test_convert_stream_with_hints` | Stream + `StreamInfo` | Conversion with full metadata |
| `test_convert_stream_without_hints` | Stream only | File type detection from content |
| `test_convert_http_uri` | HTTP/HTTPS URL | Remote resource fetching |
| `test_convert_file_uri` | `file://` URI | File URI parsing |
| `test_convert_data_uri` | `data:` URI | Embedded content handling |
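The `file://` and `data:` rows imply that the same test file is presented to the converter in different URI forms. A minimal sketch of how such URIs can be built with the standard library is shown below; the helper names `to_file_uri` and `to_data_uri` are hypothetical and are not taken from the test suite.

```python
import base64
from pathlib import Path


def to_file_uri(path: Path) -> str:
    """Return a file:// URI for a local path (as_uri requires an absolute path)."""
    return path.resolve().as_uri()


def to_data_uri(payload: bytes, mimetype: str) -> str:
    """Return a base64-encoded data: URI carrying the raw bytes."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mimetype};base64,{encoded}"
```

For example, `to_data_uri(b"hi", "text/plain")` yields `data:text/plain;base64,aGk=`.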

Sources: packages/markitdown/tests/test_module_vectors.py:57-160

Stream Info Detection Tests

The test_guess_stream_info function validates the file type detection system:

The test constructs a StreamInfo object with known values and verifies the system correctly identifies file types using magika and MIME type inference.
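The MIME type inference half of that check can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in: it uses `mimetypes.guess_type` on the file name and does not involve magika or MarkItDown's own detection code, and the function name `guess_from_name` is hypothetical.

```python
import mimetypes
import os
from typing import Optional, Tuple


def guess_from_name(filename: str) -> Tuple[str, Optional[str]]:
    """Return (extension, MIME type) inferred from the file name alone."""
    extension = os.path.splitext(filename)[1].lower()
    mimetype, _encoding = mimetypes.guess_type(filename)
    return extension, mimetype


# A detection test then compares the guess with the vector's expectation.
extension, mimetype = guess_from_name("report.pdf")
assert extension == ".pdf"
assert mimetype == "application/pdf"
```

Name-based guessing fails for streams without a file name, which is why content-based detection matters for the stream-only tests.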

Sources: packages/markitdown/tests/test_module_vectors.py:28-55

Integration Tests

The test_module_misc.py file contains integration tests for external dependencies:

LLM Integration Tests

packages/markitdown/tests/test_module_misc.py:459-483

Tests image captioning with OpenAI integration:

- Validates LLM client configuration
- Verifies caption generation for JPG files
- Tests PPTX embedded image captioning

ExifTool Integration Tests

packages/markitdown/tests/test_module_misc.py:385-413

Tests metadata extraction:

- Validates `exiftool_path` configuration
- Tests the `EXIFTOOL_PATH` environment variable
- Verifies metadata extraction from images and audio

Speech Transcription Tests

packages/markitdown/tests/test_module_misc.py:350-368

Tests audio transcription for WAV, MP3, and M4A formats.

Sources: packages/markitdown/tests/test_module_misc.py:350-483

Test Execution Control

Conditional Test Skipping

The test suite uses environment-based skipping for tests requiring external resources or credentials:

packages/markitdown/tests/test_module_misc.py:22-35

Skip Decorators

Tests use pytest's skipif marker to conditionally skip:

packages/markitdown/tests/test_module_vectors.py:105-124
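Environment-based skipping of this kind usually computes module-level flags once and feeds them to `skipif`. The following is a sketch under assumptions: the flag names, the environment variables consulted (`GITHUB_ACTIONS`, `OPENAI_API_KEY`), and the example test are illustrative and may differ from what the referenced lines contain.

```python
import os
import shutil

import pytest

# Assumption: network-dependent tests are skipped when running on a CI runner.
skip_remote = bool(os.environ.get("GITHUB_ACTIONS"))

# Assumption: LLM tests need an API key to be present in the environment.
skip_llm = not os.environ.get("OPENAI_API_KEY")

# ExifTool tests only make sense when the binary is on PATH.
skip_exiftool = shutil.which("exiftool") is None


@pytest.mark.skipif(skip_remote, reason="do not query external URLs in CI")
def test_remote_example() -> None:
    """Hypothetical test body; a real test would fetch and convert a URL."""
    assert True
```

Evaluating the conditions at import time keeps the decorators declarative, and the `reason` string appears in pytest's skip summary.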

Sources: packages/markitdown/tests/test_module_misc.py:22-35, packages/markitdown/tests/test_module_vectors.py:19-21, packages/markitdown/tests/test_module_vectors.py:105-109

Test Utilities and Helpers

String Validation Helper

The validate_strings function provides assertion logic for content verification:

packages/markitdown/tests/test_module_misc.py:100-108

This helper is used throughout tests to verify conversion results contain expected content markers.
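A helper of this kind can be very small. The sketch below assumes the conversion result exposes its Markdown as a `text_content` attribute and that the exclusion list is optional; both the signature and the attribute name are assumptions rather than a copy of the referenced code.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def validate_strings(
    result,
    expected_strings: Iterable[str],
    exclude_strings: Optional[Iterable[str]] = None,
) -> None:
    """Assert that every expected string is present and every excluded one is absent."""
    text = result.text_content  # assumed attribute holding the converted Markdown
    for expected in expected_strings:
        assert expected in text, f"missing expected string: {expected!r}"
    for excluded in exclude_strings or ():
        assert excluded not in text, f"found excluded string: {excluded!r}"


# Usage with a stand-in result object.
fake_result = SimpleNamespace(text_content="# Title\n\nHello world")
validate_strings(fake_result, ["Title", "Hello"], ["<html>"])
```

Attaching a message to each assertion makes it clear which marker failed when a vector lists many strings.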

Test Constants

Test files define constant dictionaries with expected metadata:

packages/markitdown/tests/test_module_misc.py:39-45

Similar constants exist for PDF URLs, YouTube URLs, DOCX content, and other test scenarios.

Sources: packages/markitdown/tests/test_module_misc.py:39-97, packages/markitdown/tests/test_module_misc.py:100-108

Test File Organization

Directory Structure

Sample Test Files

The test_files/ directory contains representative samples for each supported format:

| Category | Example Files | Purpose |
| --- | --- | --- |
| PDF | `_borderless_table.pdf`, `_multipage.pdf` | Table extraction, multi-page handling |
| Office | `test.docx`, `test.xlsx`, `test.pptx` | Standard Office conversion |
| Media | `test.jpg`, `test.mp3`, `test.wav` | Metadata extraction, transcription |
| Web | `test.html`, `test.rss` | HTML parsing, feed processing |
| Special | `test_outlook_msg.msg`, `test.zip` | Format-specific features |

Files use descriptive naming with patterns like:

- `[TYPE]-[YEAR]-[CATEGORY]-[ID]_[description].[ext]` for business documents
- `test.[ext]` for basic format tests
- `test_[feature].[ext]` for specific feature tests

Sources: packages/markitdown/tests/test_files/ (multiple files)

Test Execution Patterns

Direct Execution Support

Both test modules support direct execution with if __name__ == "__main__": blocks:

packages/markitdown/tests/test_module_vectors.py:202-234

This allows running tests without pytest:

Pytest Execution

Running pytest against the packages/markitdown/tests directory collects both modules and expands each parameterized test into one test case per vector.

Sources: packages/markitdown/tests/test_module_vectors.py:202-234, packages/markitdown/tests/test_module_misc.py:485-505

Edge Case and Exception Testing

Exception Handling Tests

packages/markitdown/tests/test_module_misc.py:370-383

This test validates:

- `UnsupportedFormatException` for unknown formats
- `FileConversionException` with conversion attempts tracking
- Converter selection based on forced extension
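The three behaviors above can be illustrated with a toy dispatcher. Everything in this sketch is a stand-in: the exception classes, the `CONVERTERS` registry, and `convert` are hypothetical and deliberately named differently from MarkItDown's own classes, so the example runs without the library.

```python
from typing import Callable, Dict, List


class UnsupportedFormatError(Exception):
    """Stand-in: no converter is registered for the input."""


class ConversionFailedError(Exception):
    """Stand-in: a converter was selected, but the attempt failed."""

    def __init__(self, attempts: List[str]) -> None:
        super().__init__(f"conversion failed after {len(attempts)} attempt(s)")
        self.attempts = attempts  # names of the converters that were tried


def convert_text(data: bytes) -> str:
    return data.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes


# Hypothetical registry mapping a file extension to a converter function.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {".txt": convert_text}


def convert(data: bytes, extension: str) -> str:
    """Pick a converter from the (forced) extension and record failed attempts."""
    converter = CONVERTERS.get(extension)
    if converter is None:
        raise UnsupportedFormatError(extension)
    try:
        return converter(data)
    except Exception as exc:
        raise ConversionFailedError([converter.__name__]) from exc
```

A test in this style passes random bytes once without an extension hint and once with a forced one, then inspects the recorded attempts to confirm which converter was chosen.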

Special Case Tests

| Test | Purpose | Key Validation |
| --- | --- | --- |
| `test_stream_info_operations` | `StreamInfo.copy_and_update()` | Attribute merging logic |
| `test_data_uris` | Data URI parsing | Base64 decoding, MIME type extraction |
| `test_file_uris` | File URI parsing | `file://` URL handling |
| `test_docx_comments` | DOCX comment extraction | Style map configuration |
| `test_docx_equations` | Math equation rendering | LaTeX formatting |
| `test_input_as_strings` | Stream input handling | HTML string conversion |
| `test_doc_rlink` | CVE-2025-11849 security | Rlink reference validation |
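To make the data URI row concrete, the sketch below shows the kind of parsing such a test exercises: splitting off the MIME type and attributes, then decoding the payload as base64 or percent-encoded text. The function `split_data_uri` is a hypothetical helper written for this example and is not the implementation in `_uri_utils.py`.

```python
import base64
from typing import Dict, Optional, Tuple
from urllib.parse import unquote_to_bytes


def split_data_uri(uri: str) -> Tuple[Optional[str], Dict[str, str], bytes]:
    """Return (mimetype, attributes, payload bytes) for a data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, data = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("malformed data URI: missing comma")

    parts = [part for part in header.split(";") if part]
    is_base64 = bool(parts) and parts[-1] == "base64"
    if is_base64:
        parts = parts[:-1]

    # The MIME type is optional; when present it is the first segment.
    mimetype = parts.pop(0) if parts and "/" in parts[0] else None

    attributes: Dict[str, str] = {}
    for part in parts:
        key, _, value = part.partition("=")
        attributes[key] = value

    payload = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
    return mimetype, attributes, payload
```

A parsing test would feed in URIs with and without a MIME type, with and without `;base64`, and compare all three returned values.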

Sources: packages/markitdown/tests/test_module_misc.py:110-330, packages/markitdown/tests/test_module_misc.py:370-383

URI Handling Test Coverage

URI Test Matrix

Sources: packages/markitdown/tests/test_module_misc.py:182-253, packages/markitdown/tests/test_module_vectors.py:110-160, packages/markitdown/src/markitdown/_uri_utils.py:1-53
