This document describes the testing framework for MarkItDown, including the test suite organization, test vector system, parameterized testing patterns, and test file structure. The framework uses pytest with custom test vectors to validate document conversion across multiple input formats and access patterns.
For information about CI/CD workflows and quality assurance processes, see CI/CD and Quality Assurance. For general project configuration including test environments, see Project Configuration.
The MarkItDown test suite is organized into two main test modules with distinct responsibilities:
Sources: packages/markitdown/tests/test_module_vectors.py1-235 packages/markitdown/tests/test_module_misc.py1-505
The test framework uses a FileTestVector dataclass to define test cases systematically. Each vector specifies a test file, expected content assertions, and metadata:
| Field | Type | Purpose |
|---|---|---|
| filename | str | Name of file in test_files/ directory |
| mimetype | str | Expected MIME type for content detection |
| charset | str | Expected character encoding |
| url | str (optional) | Mock URL for testing URL context |
| must_include | List[str] | Strings that must appear in conversion result |
| must_not_include | List[str] | Strings that must not appear in result |
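The fields in the table can be pictured as a small dataclass. The following is a minimal sketch, not the definition from `_test_vectors.py`: the `frozen=True` setting, the default values, and the sample vector are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FileTestVector:
    """Illustrative stand-in for the vector type described in the table."""

    filename: str                  # file name inside test_files/
    mimetype: str                  # expected MIME type
    charset: str                   # expected character encoding
    url: Optional[str] = None      # mock URL, only when URL context matters
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)


# Hypothetical vector: the strings are placeholders, not real expectations.
example_vector = FileTestVector(
    filename="test.html",
    mimetype="text/html",
    charset="utf-8",
    must_include=["expected heading"],
    must_not_include=["<script>"],
)
```

Keeping the expectations as plain data means one vector can be reused by every test function that accepts it.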
The vectors are organized into two collections imported from _test_vectors:
- GENERAL_TEST_VECTORS: Standard format conversion test cases
- DATA_URI_TEST_VECTORS: Cases for testing data URI preservation with keep_data_uris=True

Sources: packages/markitdown/tests/test_module_vectors.py10-12
Each test function is decorated with @pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS) to run against all vectors:
packages/markitdown/tests/test_module_vectors.py27-55
packages/markitdown/tests/test_module_vectors.py57-69
packages/markitdown/tests/test_module_vectors.py71-90
Sources: packages/markitdown/tests/test_module_vectors.py27-160
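The parameterization pattern can be sketched as follows, assuming pytest is installed. `SketchVector`, `SKETCH_VECTORS`, and `fake_convert` are hypothetical stand-ins for the real vector type, `GENERAL_TEST_VECTORS`, and the MarkItDown conversion call.

```python
from dataclasses import dataclass
from typing import List

import pytest


@dataclass(frozen=True)
class SketchVector:
    """Hypothetical, trimmed-down vector used only in this example."""

    filename: str
    must_include: List[str]


# Placeholder collection standing in for GENERAL_TEST_VECTORS.
SKETCH_VECTORS = [
    SketchVector("a.txt", ["alpha"]),
    SketchVector("b.txt", ["beta"]),
]


def fake_convert(filename: str) -> str:
    """Hypothetical converter stub; the real tests call MarkItDown instead."""
    return {"a.txt": "alpha text", "b.txt": "beta text"}[filename]


@pytest.mark.parametrize("test_vector", SKETCH_VECTORS)
def test_convert_sketch(test_vector):
    # pytest invokes this function once per entry in SKETCH_VECTORS.
    markdown = fake_convert(test_vector.filename)
    for expected in test_vector.must_include:
        assert expected in markdown
```

Under pytest, each vector is reported as a separate test case, so a failure identifies the vector that caused it.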
These tests validate conversion accuracy across different input methods:
| Test Function | Input Method | Validates |
|---|---|---|
| test_convert_local | File path | Local file conversion |
| test_convert_stream_with_hints | Stream + StreamInfo | Conversion with full metadata |
| test_convert_stream_without_hints | Stream only | File type detection from content |
| test_convert_http_uri | HTTP/HTTPS URL | Remote resource fetching |
| test_convert_file_uri | file:// URI | File URI parsing |
| test_convert_data_uri | data: URI | Embedded content handling |
Sources: packages/markitdown/tests/test_module_vectors.py57-160
The test_guess_stream_info function validates the file type detection system:
The test constructs a StreamInfo object with known values and verifies the system correctly identifies file types using magika and MIME type inference.
Sources: packages/markitdown/tests/test_module_vectors.py28-55
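The MIME type inference half of this check can be illustrated with the standard library alone. This sketch uses `mimetypes.guess_type` as a stand-in; it is not how MarkItDown or magika detect types, and the helper name `guess_mimetype_from_name` is hypothetical.

```python
import mimetypes
from typing import Optional


def guess_mimetype_from_name(filename: str) -> Optional[str]:
    """Infer a MIME type from the file extension only (no content sniffing)."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    return mimetype


# Compare each guess with the value a test vector would declare.
declared = {"test.html": "text/html", "test.pdf": "application/pdf"}
for name, expected_mimetype in declared.items():
    assert guess_mimetype_from_name(name) == expected_mimetype
```

Extension-based guessing cannot help when a stream arrives without a file name, which is the case a content-based detector is meant to cover.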
The test_module_misc.py file contains integration tests for external dependencies:
packages/markitdown/tests/test_module_misc.py459-483
Tests image captioning with OpenAI integration:
packages/markitdown/tests/test_module_misc.py385-413
Tests metadata extraction:
- exiftool_path configuration
- EXIFTOOL_PATH environment variable

packages/markitdown/tests/test_module_misc.py350-368
Tests audio transcription for WAV, MP3, and M4A formats.
Sources: packages/markitdown/tests/test_module_misc.py350-483
The test suite uses environment-based skipping for tests requiring external resources or credentials:
packages/markitdown/tests/test_module_misc.py22-35
Tests use pytest's skipif marker to conditionally skip:
packages/markitdown/tests/test_module_vectors.py105-124
Sources: packages/markitdown/tests/test_module_misc.py22-35 packages/markitdown/tests/test_module_vectors.py19-21 packages/markitdown/tests/test_module_vectors.py105-109
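One way to derive such skip conditions is sketched below. The flag names and the choice of `GITHUB_ACTIONS` and `OPENAI_API_KEY` as signals are assumptions for illustration, and `compute_skip_flags` is a hypothetical helper rather than a function from the test modules.

```python
import importlib.util
import os
import shutil


def compute_skip_flags(env=None):
    """Return booleans suitable as pytest.mark.skipif conditions."""
    env = os.environ if env is None else env
    return {
        # Assumed convention: avoid network access on CI runners.
        "skip_remote": bool(env.get("GITHUB_ACTIONS")),
        # LLM tests need both the client package and a credential.
        "skip_llm": importlib.util.find_spec("openai") is None
        or not env.get("OPENAI_API_KEY"),
        # Metadata tests need the exiftool binary on PATH.
        "skip_exiftool": shutil.which("exiftool") is None,
    }


flags = compute_skip_flags()
# A test would then be guarded like this:
#   @pytest.mark.skipif(flags["skip_exiftool"], reason="exiftool not installed")
```

Computing the flags once at import time keeps the decorators short and makes the skip reasons easy to audit.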
The validate_strings function provides assertion logic for content verification:
packages/markitdown/tests/test_module_misc.py100-108
This helper is used throughout tests to verify conversion results contain expected content markers.
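A helper of this kind might look like the sketch below. It assumes the conversion result exposes its Markdown as a `text_content` attribute; the name `validate_strings_sketch` and the `SimpleNamespace` stand-in are hypothetical, and the actual helper may normalize the text differently.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def validate_strings_sketch(
    result,
    expected_strings: Iterable[str],
    exclude_strings: Optional[Iterable[str]] = None,
) -> None:
    """Assert that required markers are present and forbidden ones absent."""
    text = result.text_content
    for expected in expected_strings:
        assert expected in text, f"missing: {expected!r}"
    for excluded in exclude_strings or ():
        assert excluded not in text, f"unexpected: {excluded!r}"


# Stand-in for a conversion result; only text_content is used here.
fake_result = SimpleNamespace(text_content="# Title\n\nHello, world")
validate_strings_sketch(fake_result, ["# Title", "Hello"], ["<html>"])
```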
Test files define constant dictionaries with expected metadata:
packages/markitdown/tests/test_module_misc.py39-45
Similar constants exist for PDF URLs, YouTube URLs, DOCX content, and other test scenarios.
Sources: packages/markitdown/tests/test_module_misc.py39-97 packages/markitdown/tests/test_module_misc.py100-108
The test_files/ directory contains representative samples for each supported format:
| Category | Example Files | Purpose |
|---|---|---|
| PDF | *_borderless_table.pdf, *_multipage.pdf | Table extraction, multi-page handling |
| Office | test.docx, test.xlsx, test.pptx | Standard Office conversion |
| Media | test.jpg, test.mp3, test.wav | Metadata extraction, transcription |
| Web | test.html, test.rss | HTML parsing, feed processing |
| Special | test_outlook_msg.msg, test.zip | Format-specific features |
Files use descriptive naming with patterns like:
- [TYPE]-[YEAR]-[CATEGORY]-[ID]_[description].[ext] for business documents
- test.[ext] for basic format tests
- test_[feature].[ext] for specific feature tests

Sources: packages/markitdown/tests/test_files/ (multiple files)
Both test modules support direct execution with if __name__ == "__main__": blocks:
packages/markitdown/tests/test_module_vectors.py202-234
This allows running tests without pytest:
Standard pytest execution runs all parameterized tests:
Sources: packages/markitdown/tests/test_module_vectors.py202-234 packages/markitdown/tests/test_module_misc.py485-505
packages/markitdown/tests/test_module_misc.py370-383
This test validates:
- UnsupportedFormatException for unknown formats
- FileConversionException with conversion attempts tracking

| Test | Purpose | Key Validation |
|---|---|---|
| test_stream_info_operations | StreamInfo.copy_and_update() | Attribute merging logic |
| test_data_uris | Data URI parsing | Base64 decoding, MIME type extraction |
| test_file_uris | File URI parsing | file:// URL handling |
| test_docx_comments | DOCX comment extraction | Style map configuration |
| test_docx_equations | Math equation rendering | LaTeX formatting |
| test_input_as_strings | Stream input handling | HTML string conversion |
| test_doc_rlink | CVE-2025-11849 security | Rlink reference validation |
Sources: packages/markitdown/tests/test_module_misc.py110-330 packages/markitdown/tests/test_module_misc.py370-383
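To make the data URI row concrete, the sketch below parses a `data:` URI into its MIME type, attributes, and decoded payload using only the standard library. `parse_data_uri_sketch` is a hypothetical helper written for illustration; it is not the implementation in `_uri_utils.py` and may differ from it in edge cases.

```python
import base64
from typing import Dict, Optional, Tuple
from urllib.parse import unquote_to_bytes


def parse_data_uri_sketch(uri: str) -> Tuple[Optional[str], Dict[str, str], bytes]:
    """Split a data: URI into (mimetype, attributes, decoded payload)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")

    parts = header.split(";")
    is_base64 = parts[-1] == "base64"
    if is_base64:
        parts = parts[:-1]

    # The MIME type is optional; when present it is the first header part.
    mimetype = None
    if parts and "/" in parts[0]:
        mimetype, parts = parts[0], parts[1:]
    attributes = dict(p.split("=", 1) for p in parts if "=" in p)

    data = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return mimetype, attributes, data


mimetype, attributes, data = parse_data_uri_sketch(
    "data:text/plain;charset=utf-8;base64,SGVsbG8="
)
```

A test in the style of the table would then assert on each returned element, for example that the payload decodes to the expected bytes.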
Sources: packages/markitdown/tests/test_module_misc.py182-253 packages/markitdown/tests/test_module_vectors.py110-160 packages/markitdown/src/markitdown/_uri_utils.py1-53