This document explains how MarkItDown processes Uniform Resource Identifiers (URIs) as input sources for document conversion. Four URI schemes are supported: file:// for local files, data: for inline data, and http:// and https:// for web resources. URI handling lets the conversion pipeline accept local, inline, and remote input through a single dispatcher.
For information about how streams are processed after URI resolution, see Stream Handling and File Detection. For details on the overall conversion workflow, see MarkItDown Class.
MarkItDown supports the following URI schemes, detected by prefix matching in the convert() method:
| Scheme | Description | Example | Handler Method |
|---|---|---|---|
| file:// | Local file system paths | file:///path/to/doc.pdf | convert_uri() → convert_local() |
| data: | Inline data with MIME type | data:text/plain;base64,SGVsbG8= | convert_uri() → convert_stream() |
| http:// | Web resources via HTTP | http://example.com/doc.pdf | convert_uri() → convert_response() |
| https:// | Web resources via HTTPS | https://example.com/doc.pdf | convert_uri() → convert_response() |
Sources: packages/markitdown/src/markitdown/_markitdown.py268-281
Sources: packages/markitdown/src/markitdown/_markitdown.py252-300 packages/markitdown/src/markitdown/_markitdown.py405-464
The convert_uri() method packages/markitdown/src/markitdown/_markitdown.py405-464 serves as the central dispatcher for all URI types:
Sources: packages/markitdown/src/markitdown/_markitdown.py405-464
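The dispatch described above can be sketched as a prefix check that maps a URI to the handler named in the scheme table. This is a minimal illustration, not MarkItDown's implementation: route_uri is a hypothetical helper, and the strings it returns are only the handler names from the table.

```python
def route_uri(uri: str) -> str:
    """Return the name of the handler the scheme table associates with `uri`.

    Hypothetical helper for illustration only: it classifies the URI by
    prefix and does not perform any conversion.
    """
    candidate = uri.strip()
    if candidate.startswith("file:"):
        return "convert_local"
    if candidate.startswith("data:"):
        return "convert_stream"
    if candidate.startswith(("http:", "https:")):
        return "convert_response"
    # Anything else is outside the four supported schemes.
    raise ValueError(f"Unsupported URI scheme: {uri!r}")
```

Keeping the classification separate from the conversion makes the supported schemes easy to test without touching the file system or the network.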
The file_uri_to_path() utility packages/markitdown/src/markitdown/_uri_utils.py8-16 converts file:// URIs to local file system paths:
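Function Signature: takes a file:// URI string and returns a (netloc, path) tuple.

Return Values:

- netloc: Network location (hostname), or None if empty
- path: Absolute file system path

Processing Steps:

1. Parse the URI with urlparse() from urllib.parse
2. Check that the scheme is "file"
3. Convert the URI path to a local path with url2pathname()
4. Make the path absolute with os.path.abspath()

Example:

- Input: file:///home/user/document.pdf
- Output: (None, "/home/user/document.pdf")

Sources: packages/markitdown/src/markitdown/_uri_utils.py8-16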
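The processing steps above can be sketched with the standard library alone. The function name file_uri_to_path_sketch, its parameter name, and the error message are assumptions made for illustration; the referenced source file remains the authority on the actual behavior.

```python
import os
from typing import Optional, Tuple
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_uri_to_path_sketch(file_uri: str) -> Tuple[Optional[str], str]:
    """Convert a file:// URI into (netloc, absolute_path).

    Illustrative sketch of the steps listed above, not the library's code.
    """
    parsed = urlparse(file_uri)
    if parsed.scheme != "file":
        raise ValueError(f"Not a file URI: {file_uri}")

    # An empty network location is reported as None.
    netloc = parsed.netloc or None

    # url2pathname() decodes percent-escapes and applies platform path rules.
    path = os.path.abspath(url2pathname(parsed.path))
    return netloc, path
```

On a POSIX system, passing file:///home/user/document.pdf should yield the (None, "/home/user/document.pdf") pair shown in the example; on Windows the path component follows that platform's conventions.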
The parse_data_uri() utility packages/markitdown/src/markitdown/_uri_utils.py19-52 parses data: URIs according to RFC 2397:
Function Signature: takes a data: URI string and returns a (mimetype, attributes, content) tuple.

Return Values:

- mimetype: MIME type (e.g., "text/plain") or None
- attributes: Dictionary of key-value parameters (e.g., {"charset": "utf-8"})
- content: Decoded binary content

Data URI Format:
data:[<mimetype>][;<attribute>=<value>]*[;base64],<data>
Sources: packages/markitdown/src/markitdown/_uri_utils.py19-52
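To make the format concrete, here is a minimal sketch of a parser for it. The name parse_data_uri_sketch and the error messages are assumptions; the sketch follows the format line above and may differ from the library's implementation in edge cases.

```python
import base64
from typing import Dict, Optional, Tuple
from urllib.parse import unquote_to_bytes


def parse_data_uri_sketch(uri: str) -> Tuple[Optional[str], Dict[str, str], bytes]:
    """Split a data: URI into (mimetype, attributes, content)."""
    if not uri.startswith("data:"):
        raise ValueError("Not a data URI")

    # Everything before the first comma is the header; the rest is the payload.
    header, sep, data = uri.partition(",")
    if not sep:
        raise ValueError("Malformed data URI: missing ',' separator")

    parts = header[len("data:"):].split(";")

    # A trailing "base64" token selects base64 decoding of the payload.
    is_base64 = parts[-1] == "base64"
    if is_base64:
        parts.pop()

    # The first token, when present and not a key=value pair, is the MIME type.
    mimetype: Optional[str] = None
    if parts and parts[0] and "=" not in parts[0]:
        mimetype = parts.pop(0)

    # Remaining tokens are attributes such as charset=utf-8.
    attributes: Dict[str, str] = {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key] = value
        elif part:
            attributes[part] = ""

    # Without base64, the payload is percent-encoded.
    content = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
    return mimetype, attributes, content
```

The sketch leaves mimetype as None when the URI omits it, matching the return values described above, rather than substituting a default.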
File URIs are processed by extracting the local path and delegating to convert_local():
Key Implementation Details:
- The stream_info, file_extension, and url parameters are forwarded to convert_local() packages/markitdown/src/markitdown/_markitdown.py425-431
- The url parameter can be used to mock the source location

Sources: packages/markitdown/src/markitdown/_markitdown.py418-431
Data URIs are parsed and converted as in-memory streams:
Key Implementation Details:
- Builds a StreamInfo from the parsed URI header packages/markitdown/src/markitdown/_markitdown.py436-439
- Merges in the stream_info parameter for additional metadata packages/markitdown/src/markitdown/_markitdown.py440-441
- Wraps the decoded content in io.BytesIO for stream processing packages/markitdown/src/markitdown/_markitdown.py444
- Accepts the file_extension and mock_url parameters packages/markitdown/src/markitdown/_markitdown.py446-448

Sources: packages/markitdown/src/markitdown/_markitdown.py433-449
Web resources are fetched using the requests library:
Key Implementation Details:
- Uses self._requests_session, configured with a Markdown-preferring Accept header packages/markitdown/src/markitdown/_markitdown.py109-116
- Streams the response (stream=True) for efficient large file handling packages/markitdown/src/markitdown/_markitdown.py452
- Delegates to convert_response() for header parsing and conversion packages/markitdown/src/markitdown/_markitdown.py454-460

Sources: packages/markitdown/src/markitdown/_markitdown.py451-460
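The fetch pattern can be sketched with the requests library as follows. The exact Accept header value, the helper names build_session and fetch, and the call to raise_for_status() are assumptions for illustration; the text above only states that the session prefers Markdown and that the response is streamed.

```python
import requests

# Assumed header value: Markdown first, then progressively less preferred types.
ACCEPT_HEADER = "text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1"


def build_session() -> requests.Session:
    """Create a session whose requests advertise a preference for Markdown."""
    session = requests.Session()
    session.headers.update({"Accept": ACCEPT_HEADER})
    return session


def fetch(session: requests.Session, url: str) -> requests.Response:
    """Fetch `url` without loading the whole body into memory up front."""
    # stream=True defers the body download until the content is read.
    response = session.get(url, stream=True)
    # Surface HTTP errors (4xx/5xx) before any conversion is attempted.
    response.raise_for_status()
    return response
```

Reusing one session across requests also lets a caller attach authentication or proxy settings in a single place.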
All URI types build a StreamInfo object to carry metadata through the conversion pipeline:
| URI Type | StreamInfo Fields Populated |
|---|---|
| File URI | local_path, extension, filename from extracted path |
| Data URI | mimetype, charset from URI header |
| HTTP/HTTPS URI | mimetype, charset, filename, extension, url from response headers |
Sources: packages/markitdown/src/markitdown/_markitdown.py315-323 packages/markitdown/src/markitdown/_markitdown.py436-441 packages/markitdown/src/markitdown/_markitdown.py508-524
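As a rough illustration of the table above, the sketch below models the metadata as a small dataclass and fills it from a local path and from a Content-Type header. StreamInfoSketch and both helper functions are hypothetical stand-ins, and parsing the header with email.message is a choice made for this example, not necessarily what the library does.

```python
import os
from dataclasses import dataclass
from email.message import Message
from typing import Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Hypothetical stand-in carrying the fields named in the table."""
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None
    extension: Optional[str] = None
    local_path: Optional[str] = None
    url: Optional[str] = None


def info_from_local_path(path: str) -> StreamInfoSketch:
    """File URI case: derive filename and extension from the extracted path."""
    filename = os.path.basename(path)
    extension = os.path.splitext(filename)[1] or None
    return StreamInfoSketch(local_path=path, filename=filename, extension=extension)


def info_from_content_type(content_type: str, url: str) -> StreamInfoSketch:
    """HTTP case: derive mimetype and charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return StreamInfoSketch(
        mimetype=msg.get_content_type(),
        charset=msg.get_content_charset(),
        url=url,
    )
```

Fields that a given URI type cannot supply stay None, so later pipeline stages can tell known metadata from guesses.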
The URI handling methods support several deprecated parameters for backward compatibility:
| Parameter | Replacement | Deprecated In |
|---|---|---|
| file_extension | StreamInfo(extension=...) | All URI methods |
| url | StreamInfo(url=...) | File and Data URI methods |
| mock_url | StreamInfo(url=...) | HTTP/HTTPS URI methods |
Migration Example:
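One way to picture the migration is as a mapping from the deprecated keyword arguments to StreamInfo field names, following the table above. The helper migrate_deprecated_kwargs is hypothetical and exists only for this sketch; the commented calls assume a MarkItDown instance named md and a URI string named uri.

```python
from typing import Dict, Optional


def migrate_deprecated_kwargs(
    file_extension: Optional[str] = None,
    url: Optional[str] = None,
    mock_url: Optional[str] = None,
) -> Dict[str, str]:
    """Translate deprecated keyword arguments into StreamInfo field names.

    Hypothetical helper: returns a dict suitable for StreamInfo(**fields).
    """
    fields: Dict[str, str] = {}
    if file_extension is not None:
        fields["extension"] = file_extension
    # Both url and mock_url map to the same StreamInfo field.
    source_url = url if url is not None else mock_url
    if source_url is not None:
        fields["url"] = source_url
    return fields


# Before (deprecated keyword arguments):
#   md.convert_uri(uri, file_extension=".pdf", mock_url="https://example.com/doc.pdf")
# After (metadata carried by StreamInfo):
#   md.convert_uri(uri, stream_info=StreamInfo(**fields))
fields = migrate_deprecated_kwargs(
    file_extension=".pdf", mock_url="https://example.com/doc.pdf"
)
```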
Sources: packages/markitdown/src/markitdown/_markitdown.py307-309 packages/markitdown/src/markitdown/_markitdown.py410-413
The following tests cover URI handling:
| Test Function | Purpose | Source Line |
|---|---|---|
| test_convert_file_uri | Validates file:// URI conversion using Path.as_uri() | packages/markitdown/tests/test_module_vectors.py127-138 |
| test_convert_data_uri | Validates data: URI with base64-encoded content | packages/markitdown/tests/test_module_vectors.py142-159 |
| test_convert_http_uri | Validates http:// and https:// URI fetching (skipped in CI) | packages/markitdown/tests/test_module_vectors.py110-123 |
| test_convert_keep_data_uris | Validates data URI preservation in output | packages/markitdown/tests/test_module_vectors.py163-178 |
Sources: packages/markitdown/tests/test_module_vectors.py105-200
Sources: packages/markitdown/src/markitdown/_markitdown.py461-464 packages/markitdown/src/markitdown/_markitdown.py421-424 packages/markitdown/src/markitdown/_uri_utils.py20-25
Sources: packages/markitdown/tests/test_module_vectors.py110-159