This document covers the converters that process web-based content formats in MarkItDown. These converters handle HTML pages, RSS/Atom feeds, EPUB books, and YouTube videos, transforming them into Markdown. All web content converters leverage BeautifulSoup for HTML parsing and a customized version of the markdownify library for HTML-to-Markdown conversion.
For PDF documents, see PDF Converter. For media files (images, audio), see Media Converters. For the base converter interface, see DocumentConverter Interface.
MarkItDown includes four primary web content converters:
| Converter | File Path | Accepted Formats | Dependencies |
|---|---|---|---|
| HtmlConverter | converters/_html_converter.py | .html, .htm, text/html, application/xhtml | BeautifulSoup |
| RssConverter | converters/_rss_converter.py | .rss, .atom, .xml (if RSS/Atom), RSS/Atom MIME types | defusedxml, BeautifulSoup |
| YouTubeConverter | converters/_youtube_converter.py | YouTube URLs (https://www.youtube.com/watch?) | BeautifulSoup, youtube-transcript-api (optional) |
| EpubConverter | converters/_epub_converter.py | .epub, application/epub+zip | zipfile, defusedxml, HtmlConverter |
All converters extend the DocumentConverter base class and produce DocumentConverterResult objects containing Markdown output.
The HtmlConverter class (packages/markitdown/src/markitdown/converters/_html_converter.py20-91) is the foundational web content converter. It accepts HTML content identified by MIME type or file extension and converts it to Markdown using BeautifulSoup and _CustomMarkdownify.
Acceptance Criteria (_html_converter.py23-39):
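- File extensions: .html, .htm
- MIME types: text/html, application/xhtml*

Conversion Process (_html_converter.py41-71):
The converter:

1. Parses the HTML with BeautifulSoup, using the charset from StreamInfo (defaulting to UTF-8)
2. Removes `<script>` and `<style>` tags (_html_converter.py52-53)
3. Selects the `<body>` element if present, otherwise processes the entire document
4. Converts the result to Markdown with _CustomMarkdownify (see Custom Markdownify Integration)
5. Uses the `<title>` tag content for the result's title field

Convenience Method: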
HtmlConverter.convert_string() (_html_converter.py73-90) provides a non-standard convenience method for converting HTML strings directly, used internally by other converters that generate HTML as intermediate output.
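The cleanup and title steps of the conversion process can be illustrated with a small standalone sketch. It deliberately swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without dependencies, and it stops short of Markdown generation. `TitleAndTextExtractor` is a hypothetical name and is not part of MarkItDown.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Hypothetical helper: keeps <title> and visible text, drops script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = None
        self.text_parts = []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in self.SKIPPED for tag in self._stack):
            return  # text inside <script> or <style> is ignored
        text = data.strip()
        if not text:
            return
        if self._stack and self._stack[-1] == "title":
            self.title = text
        else:
            self.text_parts.append(text)


def extract(html):
    """Return (title, list of visible text fragments)."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    return parser.title, parser.text_parts
```

In the converter itself, the remaining markup would be handed to _CustomMarkdownify rather than reduced to plain text fragments.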
The RssConverter class (packages/markitdown/src/markitdown/converters/_rss_converter.py29-193) handles RSS 2.0 and Atom feed formats. It parses XML feed structures and converts them to hierarchical Markdown documents.
Acceptance Criteria:
The converter uses a two-tier acceptance strategy:
Precise Indicators (_rss_converter.py10-17):
- File extensions: .rss, .atom
- MIME types: application/rss+xml, application/atom+xml

Candidate Indicators with Content Inspection (_rss_converter.py19-26):

- File extension: .xml
- MIME types: text/xml, application/xml

For candidate files, the _check_xml() method (_rss_converter.py63-72) parses the XML and calls _feed_type() (_rss_converter.py74-82) to verify the presence of RSS (`<rss>` tag) or Atom (`<feed>` with `<entry>`) elements.
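The content-inspection tier can be sketched roughly as follows. This is an illustration under assumptions, not the converter's code: `feed_type` is a hypothetical standalone function, and it uses the standard library's `xml.dom.minidom` so it runs as-is, whereas the converter depends on defusedxml (which offers a `minidom` module with the same `parseString` call and is the safer choice for untrusted input).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def feed_type(xml_text):
    """Return "rss", "atom", or None for a candidate XML document."""
    try:
        doc = minidom.parseString(xml_text)
    except ExpatError:
        return None  # not well-formed XML, so it cannot be a feed
    if doc.getElementsByTagName("rss"):
        return "rss"
    feeds = doc.getElementsByTagName("feed")
    # Atom is only accepted when the <feed> actually contains an <entry>.
    if feeds and feeds[0].getElementsByTagName("entry"):
        return "atom"
    return None
```

A generic `.xml` file such as a sitemap would return `None` here and be left for other converters.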
The _parse_rss_type() method (_rss_converter.py133-168) extracts:
| RSS Element | Markdown Rendering | Code Reference |
|---|---|---|
| `<channel><title>` | # {title} | Lines 146-147 |
| `<channel><description>` | Plain text below title | Lines 148-149 |
| `<item><title>` | ## {title} | Lines 156-157 |
| `<item><pubDate>` | Published on: {date} | Lines 158-159 |
| `<item><description>` | HTML-parsed content | Lines 160-161 |
| `<item><content:encoded>` | HTML-parsed content | Lines 162-163 |
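To make the RSS mapping in the table concrete, here is a minimal sketch that renders a feed along those lines. It is an approximation under assumptions: `rss_to_markdown` and `_text` are hypothetical names, the standard library's `minidom` stands in for defusedxml, `<content:encoded>` is omitted, and item descriptions are emitted as plain text where the converter would run them through HTML parsing.

```python
from xml.dom import minidom


def _text(parent, tag):
    """Text of the first <tag> that is a direct child of parent, or None."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            parts = [
                child.data
                for child in node.childNodes
                if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
            ]
            return "".join(parts).strip()
    return None


def rss_to_markdown(xml_text):
    doc = minidom.parseString(xml_text)
    channel = doc.getElementsByTagName("channel")[0]
    lines = []
    title = _text(channel, "title")
    if title:
        lines.append(f"# {title}")
    description = _text(channel, "description")
    if description:
        lines.append(description)
    for item in channel.getElementsByTagName("item"):
        item_title = _text(item, "title")
        if item_title:
            lines.append(f"## {item_title}")
        pub_date = _text(item, "pubDate")
        if pub_date:
            lines.append(f"Published on: {pub_date}")
        body = _text(item, "description")
        if body:
            lines.append(body)  # plain text here; HTML parsing is skipped
    return "\n".join(lines)
```

Looking only at direct children in `_text` keeps the channel title from being confused with an item title, since both use the same tag name.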
The _parse_atom_type() method (_rss_converter.py101-131) extracts:
| Atom Element | Markdown Rendering | Code Reference |
|---|---|---|
| `<feed><title>` | # {title} | Line 110 |
| `<feed><subtitle>` | Plain text below title | Lines 111-112 |
| `<entry><title>` | ## {title} | Lines 119-120 |
| `<entry><updated>` | Updated on: {date} | Lines 121-122 |
| `<entry><summary>` | HTML-parsed content | Lines 123-124 |
| `<entry><content>` | HTML-parsed content | Lines 125-126 |
The _parse_content() method (_rss_converter.py170-177) handles RSS/Atom content, which often contains HTML. It parses the content with BeautifulSoup and converts it to Markdown with _CustomMarkdownify.
The YouTubeConverter class (packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239) provides specialized handling for YouTube video pages, extracting video metadata, descriptions, and transcripts.
Acceptance Criteria (_youtube_converter.py40-68):
The converter accepts only YouTube watch URLs:
- stream_info.url starts with https://www.youtube.com/watch?
- The stream is otherwise identified as HTML (for example, by the extensions .html, .htm)
Meta Tag Processing (_youtube_converter.py80-96):
The converter iterates through all <meta> tags, checking for attributes itemprop, property, or name, and storing the content attribute value. Common metadata includes:
- title / og:title / name - Video title
- interactionCount - View count
- keywords - Video keywords/tags
- duration - Video runtime
- description / og:description - Video description

ytInitialData Extraction (_youtube_converter.py99-116):
YouTube embeds structured data in JavaScript variables within <script> tags. The converter:
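1. Finds the `<script>` tag that contains ytInitialData
2. Extracts the JSON object with the pattern `var ytInitialData = ({.*?});`
3. Locates attributedDescriptionBodyText using _findKey() (_youtube_converter.py211-224)
4. Uses its content field for a more complete description

Optional Dependency (_youtube_converter.py12-23):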
The youtube-transcript-api library is optional. If available, IS_YOUTUBE_TRANSCRIPT_CAPABLE is set to True.
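Returning to the ytInitialData extraction described earlier in this section, the regex-plus-recursive-search approach might look roughly like this. It is a sketch under assumptions: `find_key` and `description_from_script` are hypothetical standalone functions (the converter's own helper is `_findKey()`), and the input is taken to be the text of a single `<script>` tag.

```python
import json
import re


def find_key(data, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        children = data.values()
    elif isinstance(data, list):
        children = data
    else:
        return None  # scalars cannot contain the key
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def description_from_script(script_text):
    """Pull the description text out of a ytInitialData assignment, if any."""
    match = re.search(r"var ytInitialData = ({.*?});", script_text)
    if not match:
        return None
    data = json.loads(match.group(1))
    body = find_key(data, "attributedDescriptionBodyText")
    return body.get("content") if isinstance(body, dict) else None
```

The non-greedy pattern stops at the first `};`, so a payload that contains that sequence inside a string value would be cut short and fail to parse; a caller would need to handle the resulting `json.JSONDecodeError`.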
Transcript Retrieval Process (_youtube_converter.py147-189):
1. Extracts the video ID from the URL's v query parameter
2. Calls YouTubeTranscriptApi.list(video_id) to get available transcripts
3. Selects languages from the youtube_transcript_languages kwarg (defaults to ["en"] plus first available language)
4. Uses _retry_operation() (_youtube_converter.py226-238) to fetch the transcript with retry logic (3 attempts, 2-second delay)

Helper Methods:

- _get() (_youtube_converter.py199-209): Retrieves the first non-empty value from metadata for the given keys
- _findKey() (_youtube_converter.py211-224): Recursively searches nested JSON structures for a specific key
- _retry_operation() (_youtube_converter.py226-238): Retries an operation a fixed number of times, pausing between attempts
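The video ID lookup and the retry behaviour can be sketched with the standard library alone. This is an illustration, not the converter's code: `video_id_from_url` and `retry_operation` are hypothetical names, and the fixed pause between attempts is an assumption based on the "3 attempts, 2-second delay" description above.

```python
import time
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the value of the `v` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def retry_operation(operation, retries=3, delay=2.0):
    """Call operation() up to `retries` times, sleeping `delay` seconds after a failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)
    raise RuntimeError(f"Operation failed after {retries} attempts") from last_error
```

A transcript fetch would then be wrapped as `retry_operation(lambda: fetch(video_id))`, where `fetch` stands for whatever call retrieves the transcript.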
The EpubConverter class (packages/markitdown/src/markitdown/converters/_epub_converter.py26-147) handles EPUB (Electronic Publication) files, which are ZIP archives containing HTML/XHTML content and metadata. The converter extends HtmlConverter and uses an instance of HtmlConverter internally to process individual HTML files.
Acceptance Criteria (_epub_converter.py35-51):
- File extension: .epub
- MIME types: application/epub+zip, application/x-epub+zip
The OPF (Open Packaging Format) file contains Dublin Core metadata elements (_epub_converter.py69-78):
| OPF Element | Extracted Field | Method |
|---|---|---|
| `<dc:title>` | title | _get_text_from_node() |
| `<dc:creator>` (multiple) | authors | _get_all_texts_from_nodes() |
| `<dc:language>` | language | _get_text_from_node() |
| `<dc:publisher>` | publisher | _get_text_from_node() |
| `<dc:date>` | date | _get_text_from_node() |
| `<dc:description>` | description | _get_text_from_node() |
| `<dc:identifier>` | identifier | _get_text_from_node() |
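As a rough illustration of the table above, Dublin Core fields can be read from an OPF document like this. The sketch makes some assumptions: `first_text`, `all_texts`, and `read_metadata` are hypothetical stand-ins for the converter's helper methods, only a few fields are shown, `OPF_SAMPLE` is made-up data, and the standard library's `minidom` is used in place of defusedxml.

```python
from xml.dom import minidom

# Made-up OPF fragment for illustration only.
OPF_SAMPLE = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:title>Example Book</dc:title>
    <dc:creator>First Author</dc:creator>
    <dc:creator>Second Author</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>"""


def first_text(dom, tag):
    """Text of the first element named `tag`, or None if missing or empty."""
    nodes = dom.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild:
        return nodes[0].firstChild.nodeValue.strip()
    return None


def all_texts(dom, tag):
    """Text of every non-empty element named `tag`."""
    return [
        node.firstChild.nodeValue.strip()
        for node in dom.getElementsByTagName(tag)
        if node.firstChild
    ]


def read_metadata(opf_xml):
    dom = minidom.parseString(opf_xml)
    return {
        "title": first_text(dom, "dc:title"),
        "authors": all_texts(dom, "dc:creator"),
        "language": first_text(dom, "dc:language"),
        "publisher": first_text(dom, "dc:publisher"),
    }
```

Fields that are absent from the OPF file, such as the publisher in the sample, simply come back as `None`.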
Manifest and Spine (_epub_converter.py80-98):
Content Conversion Loop (_epub_converter.py100-116):
For each file in the spine:
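1. Determines the MIME type from MIME_TYPE_MAPPING (_epub_converter.py20-23)
   - .html → text/html
   - .xhtml → application/xhtml+xml
2. Creates a StreamInfo with filename, extension, and mimetype
3. Converts the file with the _html_converter instance

Output Formatting (_epub_converter.py118-130):

- Metadata fields are rendered as bold labels followed by their values (**Title:** value)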
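The spine-driven ordering and the metadata formatting can be sketched as follows. This is a simplified illustration under assumptions: `spine_files` and `format_metadata` are hypothetical functions, ZIP handling and the per-file HTML conversion are left out, and the standard library's `minidom` replaces defusedxml. Only `MIME_TYPE_MAPPING` and the `**Title:** value` format come from the description above.

```python
import os
from xml.dom import minidom

# Mirrors the extension-to-MIME mapping described above.
MIME_TYPE_MAPPING = {".html": "text/html", ".xhtml": "application/xhtml+xml"}


def spine_files(opf_xml, base_path=""):
    """Return (path, mimetype) pairs for spine items, in reading order."""
    dom = minidom.parseString(opf_xml)
    manifest = {
        item.getAttribute("id"): item.getAttribute("href")
        for item in dom.getElementsByTagName("item")
    }
    ordered = []
    for itemref in dom.getElementsByTagName("itemref"):
        href = manifest.get(itemref.getAttribute("idref"))
        if href is None:
            continue  # spine entry without a matching manifest item
        path = f"{base_path}/{href}" if base_path else href
        extension = os.path.splitext(path)[1].lower()
        ordered.append((path, MIME_TYPE_MAPPING.get(extension)))
    return ordered


def format_metadata(metadata):
    """Render non-empty metadata fields as bold label/value lines."""
    lines = []
    for key, value in metadata.items():
        if isinstance(value, list):
            value = ", ".join(value)
        if value:
            lines.append(f"**{key.capitalize()}:** {value}")
    return "\n".join(lines)
```

The spine, not the manifest, decides the output order, which is why the manifest is first turned into an id-to-href lookup.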
The _CustomMarkdownify class (packages/markitdown/src/markitdown/converters/_markdownify.py8-127) extends markdownify.MarkdownConverter to provide customized HTML-to-Markdown conversion for MarkItDown. All web content converters use this class for final HTML-to-Markdown transformation.
Customizations:
| Feature | Default Behavior | Custom Behavior | Method |
|---|---|---|---|
| Heading style | Mixed | ATX (#, ##, etc.) | __init__() line 19 |
| Heading newlines | Variable | Always start with newline | convert_hn() line 24-37 |
| JavaScript links | Preserved | Removed | convert_a() line 39-83 |
| URI escaping | Basic | Full URL encoding | convert_a() line 57-66 |
| Data URIs in images | Preserved | Truncated | convert_img() line 85-110 |
| Checkboxes | Not handled | [x] / [ ] syntax | convert_input() line 112-123 |
The convert_a() method (_markdownify.py39-83) implements several safety and normalization features:
URL Scheme Filtering (_markdownify.py58-65):
1. Parse URL with urlparse()
2. Check if scheme exists and is in allowed list: ["http", "https", "file"]
3. If not allowed, return plain text (no link)
4. URL encode the path component while preserving other parts
This prevents JavaScript URIs (javascript:alert()) and other potentially dangerous schemes from being included in Markdown links.
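The four filtering steps can be sketched with `urllib.parse`. This is a minimal standalone illustration, not the method itself: `render_link` and `ALLOWED_SCHEMES` are hypothetical names, and link titles and the surrounding markdownify machinery are left out.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse

ALLOWED_SCHEMES = ["http", "https", "file"]


def render_link(text, href):
    """Return a Markdown link, or plain text when the scheme is not allowed."""
    try:
        parsed = urlparse(href)
    except ValueError:
        return text  # unparseable URL: keep only the link text
    if parsed.scheme and parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return text
    # Decode first so already-encoded paths are not double-encoded.
    safe_path = quote(unquote(parsed.path))
    safe_href = urlunparse(parsed._replace(path=safe_path))
    return f"[{text}]({safe_href})"
```

Relative links have no scheme, so they pass the check and are kept as links; only the path component is re-encoded, leaving the query string and fragment untouched.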
URI Normalization:
- Handles escaped underscores (`\_`)

The convert_img() method (_markdownify.py85-110) handles image conversion with special attention to data URIs:
Data URI Truncation (_markdownify.py106-108):
This prevents extremely long base64-encoded images from bloating the Markdown output. The MIME type prefix is preserved (e.g., data:image/png;base64...) for reference.
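One way to get the truncated form shown in the example is to cut the URI at its first comma. The following is a sketch under that assumption; `truncate_data_uri` is a hypothetical function, with `keep_data_uris` modelled as a plain keyword argument.

```python
def truncate_data_uri(src, keep_data_uris=False):
    """Drop the payload of a data URI, keeping only the prefix before the comma."""
    if src.startswith("data:") and not keep_data_uris:
        return src.split(",")[0] + "..."
    return src  # regular URLs, or data URIs the caller asked to keep
```

For `data:image/png;base64,iVBORw0KGgo=` this yields `data:image/png;base64...`, while ordinary image URLs are returned unchanged.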
Alt Text Processing:
- Uses the alt attribute (or an empty string if missing)
- Falls back to the data-src attribute if src is missing (common in lazy-loaded images)

The convert_input() method (_markdownify.py112-123) adds support for HTML checkboxes:
This is particularly useful when converting HTML forms or task lists to Markdown.
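The checkbox mapping can be shown with a small standalone function. This is a sketch, not the method itself: `render_checkbox` is a hypothetical name, and the element is modelled as a plain attribute dictionary rather than a parsed tag.

```python
def render_checkbox(attrs):
    """Map an <input> element's attributes to Markdown task-list syntax."""
    if attrs.get("type") != "checkbox":
        return ""  # other input types produce no output in this sketch
    # `checked` is a boolean attribute: its presence matters, not its value.
    return "[x] " if "checked" in attrs else "[ ] "
```

Placed at the start of a list item, the returned marker gives the familiar `- [x] done` and `- [ ] pending` task-list form.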
The constructor accepts options passed through **kwargs (_markdownify.py18-22):
- heading_style: Overridden to markdownify.ATX (defaults to # style)
- keep_data_uris: Controls data URI truncation (default: False)
- Other markdownify options are passed through (autolinks, default_title, keep_inline_images_in)
Inheritance Relationships:
- YouTubeConverter accepts HTML but doesn't explicitly inherit from HtmlConverter (focuses on metadata extraction)
- EpubConverter extends HtmlConverter (_epub_converter.py26) and instantiates an internal HtmlConverter (_epub_converter.py33)
- RssConverter independently uses _CustomMarkdownify for content conversion (_rss_converter.py6)

Common Dependencies:

- BeautifulSoup for HTML parsing
- _CustomMarkdownify for HTML-to-Markdown conversion

Optional Dependencies:

- youtube-transcript-api for transcript retrieval in YouTubeConverter