Web Content Converters

This document covers the converters that process web-based content formats in MarkItDown. These converters handle HTML pages, RSS/Atom feeds, EPUB books, and YouTube videos, transforming them into Markdown. All web content converters leverage BeautifulSoup for HTML parsing and a customized version of the markdownify library for HTML-to-Markdown conversion.

For PDF documents, see PDF Converter. For media files (images, audio), see Media Converters. For the base converter interface, see DocumentConverter Interface.

Overview

MarkItDown includes four primary web content converters:

Converter	File Path	Accepted Formats	Dependencies
`HtmlConverter`	`converters/_html_converter.py`	`.html`, `.htm`, `text/html`, `application/xhtml`	BeautifulSoup
`RssConverter`	`converters/_rss_converter.py`	`.rss`, `.atom`, `.xml` (if RSS/Atom), RSS/Atom MIME types	defusedxml, BeautifulSoup
`YouTubeConverter`	`converters/_youtube_converter.py`	YouTube URLs (`https://www.youtube.com/watch?`)	BeautifulSoup, youtube-transcript-api (optional)
`EpubConverter`	`converters/_epub_converter.py`	`.epub`, `application/epub+zip`	zipfile, defusedxml, HtmlConverter

All converters extend the DocumentConverter base class and produce DocumentConverterResult objects containing Markdown output.

Converter Selection and Priority

HTML Converter

HtmlConverter Class

The HtmlConverter class (packages/markitdown/src/markitdown/converters/_html_converter.py20-91) is the foundational web content converter. It accepts HTML content identified by MIME type or file extension and converts it to Markdown using BeautifulSoup and _CustomMarkdownify.

Acceptance Criteria (_html_converter.py23-39):

File extensions: .html, .htm
MIME types: text/html, application/xhtml*
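The acceptance check can be sketched as follows. This is a minimal illustration rather than the library's code: the function name `accepts_html` and the constant names are assumptions, and only the extension and MIME type values come from the criteria above.

```python
from typing import Optional

# Hypothetical stand-ins for the converter's acceptance lists.
ACCEPTED_FILE_EXTENSIONS = [".html", ".htm"]
ACCEPTED_MIME_TYPE_PREFIXES = ["text/html", "application/xhtml"]


def accepts_html(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True if the extension or MIME type identifies HTML content."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_FILE_EXTENSIONS:
        return True
    # Prefix matching lets "application/xhtml+xml" and
    # "text/html; charset=utf-8" both pass.
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Matching on prefixes is what the trailing `*` in `application/xhtml*` suggests: any MIME type that begins with the listed string is accepted.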

Conversion Process (_html_converter.py41-71):

The converter:

Parses the HTML stream using BeautifulSoup with the charset from StreamInfo (defaulting to UTF-8)
Removes <script> and <style> tags (_html_converter.py52-53)
Extracts the <body> element if present, otherwise processes the entire document
Converts HTML to Markdown using _CustomMarkdownify (see Custom Markdownify Integration)
Strips leading/trailing whitespace
Extracts the <title> tag content for the result's title field

Convenience Method:

HtmlConverter.convert_string() (_html_converter.py73-90) is a non-standard convenience method that converts an HTML string directly. Other converters that generate HTML as intermediate output use it internally.

Sources:

packages/markitdown/src/markitdown/converters/_html_converter.py20-91

RSS and Atom Feed Converter

RssConverter Class

The RssConverter class (packages/markitdown/src/markitdown/converters/_rss_converter.py29-193) handles RSS 2.0 and Atom feed formats. It parses XML feed structures and converts them to hierarchical Markdown documents.

Acceptance Criteria:

The converter uses a two-tier acceptance strategy:

Precise Indicators (_rss_converter.py10-17):

File extensions: .rss, .atom
MIME types: application/rss+xml, application/atom+xml

Candidate Indicators with Content Inspection (_rss_converter.py19-26):

File extensions: .xml
MIME types: text/xml, application/xml

For candidate files, the _check_xml() method (_rss_converter.py63-72) parses the XML and calls _feed_type() (_rss_converter.py74-82) to verify the presence of RSS (<rss> tag) or Atom (<feed> with <entry>) elements.
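The content inspection for candidate files could look roughly like this. The sketch uses the standard library's `xml.dom.minidom` as a stand-in for defusedxml so that it runs without extra packages; the function name `feed_type` and its return values are assumptions for illustration.

```python
from xml.dom import minidom  # stand-in; the converter relies on defusedxml


def feed_type(xml_text):
    """Return "rss", "atom", or None depending on the document's elements."""
    try:
        doc = minidom.parseString(xml_text)
    except Exception:
        return None  # not well-formed XML, so it cannot be a feed
    if doc.getElementsByTagName("rss"):
        return "rss"
    feeds = doc.getElementsByTagName("feed")
    # Atom requires a <feed> element that contains at least one <entry>.
    if feeds and feeds[0].getElementsByTagName("entry"):
        return "atom"
    return None
```

A generic `.xml` file is accepted only when this kind of check returns a feed type, which keeps arbitrary XML documents from being routed to the RSS converter.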

Feed Type Detection and Parsing

Sources:

packages/markitdown/src/markitdown/converters/_rss_converter.py63-99

RSS Feed Structure

The _parse_rss_type() method (_rss_converter.py133-168) extracts:

RSS Element	Markdown Rendering	Code Reference
`<channel><title>`	`# {title}`	Line 146-147
`<channel><description>`	Plain text below title	Line 148-149
`<item><title>`	`## {title}`	Line 156-157
`<item><pubDate>`	`Published on: {date}`	Line 158-159
`<item><description>`	HTML-parsed content	Line 160-161
`<item><content:encoded>`	HTML-parsed content	Line 162-163
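A simplified sketch of this mapping is shown below. It again uses `xml.dom.minidom` in place of defusedxml, omits `<content:encoded>`, and appends item descriptions as raw text instead of passing them through HTML parsing. The helper names `first_text` and `rss_to_markdown` are hypothetical, and the exact blank-line layout of the real output is not reproduced.

```python
from xml.dom import minidom  # stand-in; the converter relies on defusedxml


def first_text(node, tag):
    """Return the text of the first <tag> descendant of node, or None.

    The lookup is by document order, so a channel-level field is assumed
    to appear before any <item> elements.
    """
    found = node.getElementsByTagName(tag)
    if found and found[0].firstChild:
        return found[0].firstChild.data
    return None


def rss_to_markdown(xml_text):
    doc = minidom.parseString(xml_text)
    channel = doc.getElementsByTagName("channel")[0]
    lines = []
    title = first_text(channel, "title")
    if title:
        lines.append(f"# {title}")
    description = first_text(channel, "description")
    if description:
        lines.append(description)
    for item in channel.getElementsByTagName("item"):
        item_title = first_text(item, "title")
        if item_title:
            lines.append(f"## {item_title}")
        pub_date = first_text(item, "pubDate")
        if pub_date:
            lines.append(f"Published on: {pub_date}")
        body = first_text(item, "description")
        if body:
            lines.append(body)  # raw text here; no HTML-to-Markdown step
    return "\n".join(lines)
```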

Atom Feed Structure

The _parse_atom_type() method (_rss_converter.py101-131) extracts:

Atom Element	Markdown Rendering	Code Reference
`<feed><title>`	`# {title}`	Line 110
`<feed><subtitle>`	Plain text below title	Line 111-112
`<entry><title>`	`## {title}`	Line 119-120
`<entry><updated>`	`Updated on: {date}`	Line 121-122
`<entry><summary>`	HTML-parsed content	Line 123-124
`<entry><content>`	HTML-parsed content	Line 125-126

Content Parsing

The _parse_content() method (_rss_converter.py170-177) handles RSS/Atom content that often contains HTML:

Parses content with BeautifulSoup
Converts to Markdown using _CustomMarkdownify
Falls back to raw content if parsing fails

Sources:

packages/markitdown/src/markitdown/converters/_rss_converter.py29-193

YouTube Converter

YouTubeConverter Class

The YouTubeConverter class (packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239) provides specialized handling for YouTube video pages, extracting video metadata, descriptions, and transcripts.

Acceptance Criteria (_youtube_converter.py40-68):

The converter accepts only YouTube watch URLs:

Checks if stream_info.url starts with https://www.youtube.com/watch?
Unquotes and normalizes the URL (_youtube_converter.py53-54)
Verifies HTML MIME type or extension (.html, .htm)
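The URL portion of this check can be sketched as below. The function name `is_youtube_watch_url` is an assumption, and the sketch covers only unquoting and the prefix test, not the MIME type or extension check.

```python
from urllib.parse import unquote

YOUTUBE_WATCH_PREFIX = "https://www.youtube.com/watch?"


def is_youtube_watch_url(url):
    """Return True when the (possibly percent-encoded) URL is a watch page."""
    normalized = unquote(url or "")  # decode %3F, %3D, and similar escapes
    return normalized.startswith(YOUTUBE_WATCH_PREFIX)
```

Unquoting before the comparison means a URL such as `https://www.youtube.com/watch%3Fv%3Dabc123` is still recognized, while short links and other YouTube pages are rejected.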

YouTube-Specific Extraction Pipeline

Sources:

packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239

Metadata Extraction

Meta Tag Processing (_youtube_converter.py80-96):

The converter iterates through all <meta> tags, checking for attributes itemprop, property, or name, and storing the content attribute value. Common metadata includes:

title / og:title / name - Video title
interactionCount - View count
keywords - Video keywords/tags
duration - Video runtime
description / og:description - Video description

ytInitialData Extraction (_youtube_converter.py99-116):

YouTube embeds structured data in JavaScript variables within <script> tags. The converter:

Searches for scripts containing ytInitialData
Uses regex to extract the JSON object: var ytInitialData = ({.*?});
Parses the JSON and recursively searches for attributedDescriptionBodyText using _findKey() (_youtube_converter.py211-224)
Extracts the content field for a more complete description
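The extraction steps above can be sketched with the standard library alone. The function names `find_key` and `extract_description` are illustrative, and the input is assumed to be the text of a single `<script>` element; braces in the pattern are escaped for clarity, which does not change what it matches.

```python
import json
import re


def find_key(data, key):
    """Depth-first search of nested dicts and lists; first match or None."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                return v
            found = find_key(v, key)
            if found is not None:
                return found
    elif isinstance(data, list):
        for item in data:
            found = find_key(item, key)
            if found is not None:
                return found
    return None


def extract_description(script_text):
    """Pull the description text out of a ytInitialData assignment."""
    match = re.search(r"var ytInitialData = (\{.*?\});", script_text)
    if not match:
        return None
    data = json.loads(match.group(1))
    body = find_key(data, "attributedDescriptionBodyText")
    return body.get("content") if isinstance(body, dict) else None
```

The recursive search avoids hard-coding the path to `attributedDescriptionBodyText`, which matters because the surrounding JSON layout is not under the converter's control.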

Transcript Fetching

Optional Dependency (_youtube_converter.py12-23):

The youtube-transcript-api library is optional. If available, IS_YOUTUBE_TRANSCRIPT_CAPABLE is set to True.

Transcript Retrieval Process (_youtube_converter.py147-189):

Parse the video ID from URL query parameter v
Call YouTubeTranscriptApi.list(video_id) to get available transcripts
Determine preferred languages from youtube_transcript_languages kwarg (defaults to ["en"] plus first available language)
Use _retry_operation() (_youtube_converter.py226-238) to fetch transcript with retry logic (3 attempts, 2-second delay)
If fetching fails for preferred language, attempt translation (_youtube_converter.py182-187)
Join transcript parts into continuous text
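Two of the steps above, parsing the video ID and retrying a failing call, can be sketched without the transcript library. The names `video_id_from_url` and `retry_operation` and the choice of `RuntimeError` are assumptions; the three attempts and two-second delay come from the description above.

```python
import time
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the value of the `v` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def retry_operation(operation, retries=3, delay=2.0):
    """Call `operation` until it succeeds, pausing `delay` seconds between tries."""
    last_error = None
    for attempt in range(retries):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)  # same delay every time, no backoff
    raise RuntimeError(f"Operation failed after {retries} attempts") from last_error
```

In use, the transcript fetch would be wrapped in a zero-argument callable, for example `retry_operation(lambda: fetch(video_id))`, where `fetch` stands for whatever call retrieves the transcript.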

Helper Methods:

_get() (_youtube_converter.py199-209): Retrieves first non-empty value from metadata for given keys
_findKey() (_youtube_converter.py211-224): Recursively searches nested JSON structures for a specific key
_retry_operation() (_youtube_converter.py226-238): Retries an operation a fixed number of times with a fixed delay between attempts (3 attempts and 2 seconds by default)

Sources:

packages/markitdown/src/markitdown/converters/_youtube_converter.py70-197

EPUB Converter

EpubConverter Class

The EpubConverter class (packages/markitdown/src/markitdown/converters/_epub_converter.py26-147) handles EPUB (Electronic Publication) files, which are ZIP archives containing HTML/XHTML content and metadata. The converter extends HtmlConverter and uses an instance of HtmlConverter internally to process individual HTML files.

Acceptance Criteria (_epub_converter.py35-51):

File extension: .epub
MIME types: application/epub+zip, application/x-epub+zip

EPUB Structure and Processing

Sources:

packages/markitdown/src/markitdown/converters/_epub_converter.py53-130

Metadata Extraction

The OPF (Open Packaging Format) file contains Dublin Core metadata elements (_epub_converter.py69-78):

OPF Element	Extracted Field	Method
`<dc:title>`	`title`	`_get_text_from_node()`
`<dc:creator>` (multiple)	`authors`	`_get_all_texts_from_nodes()`
`<dc:language>`	`language`	`_get_text_from_node()`
`<dc:publisher>`	`publisher`	`_get_text_from_node()`
`<dc:date>`	`date`	`_get_text_from_node()`
`<dc:description>`	`description`	`_get_text_from_node()`
`<dc:identifier>`	`identifier`	`_get_text_from_node()`

Content Organization

Manifest and Spine (_epub_converter.py80-98):

Manifest: Maps item IDs to file hrefs (relative paths)
Spine: Defines reading order as a list of item ID references
Path Resolution: Combines base path from OPF location with manifest hrefs
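The manifest, spine, and path resolution steps can be sketched as one function. It takes the OPF XML and the OPF file's path inside the archive as plain strings; the function name `spine_paths` is hypothetical, and `xml.dom.minidom` stands in for defusedxml.

```python
from xml.dom import minidom  # stand-in; the converter relies on defusedxml


def spine_paths(opf_xml, opf_path):
    """Return archive paths of the content files in reading order."""
    dom = minidom.parseString(opf_xml)
    # Manifest: item ID -> href relative to the OPF file.
    manifest = {
        item.getAttribute("id"): item.getAttribute("href")
        for item in dom.getElementsByTagName("item")
    }
    # Spine: ordered list of item ID references.
    order = [ref.getAttribute("idref") for ref in dom.getElementsByTagName("itemref")]
    base = "/".join(opf_path.split("/")[:-1])  # directory containing the OPF file
    return [
        f"{base}/{manifest[item_id]}" if base else manifest[item_id]
        for item_id in order
        if item_id in manifest  # ignore spine entries with no manifest item
    ]
```

Note that the result follows spine order, not manifest order, so chapters come out in reading sequence even if the manifest lists them differently.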

Content Conversion Loop (_epub_converter.py100-116):

For each file in the spine:

Open file from ZIP archive
Determine MIME type from extension using MIME_TYPE_MAPPING (_epub_converter.py20-23)
- .html → text/html
- .xhtml → application/xhtml+xml
Create StreamInfo with filename, extension, and mimetype
Convert using internal _html_converter instance
Strip and append resulting Markdown to content list

Output Formatting (_epub_converter.py118-130):

Format metadata as bold key-value pairs (**Title:** value)
Join multiple values (e.g., authors) with commas
Prepend metadata section to content
Join all sections with double newlines
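A rough sketch of the formatting step follows. The function name `format_epub_output`, the capitalization of keys, and the skipping of empty fields are assumptions made for illustration; the bold key-value layout, comma joining, and double-newline separation come from the list above.

```python
def format_epub_output(metadata, sections):
    """Prepend bold key-value metadata lines to the converted sections."""
    lines = []
    for key, value in metadata.items():
        if isinstance(value, list):
            value = ", ".join(value)  # e.g. several authors
        if value:  # assumed: missing or empty fields are left out
            lines.append(f"**{key.capitalize()}:** {value}")
    # The metadata block becomes the first section of the document.
    return "\n\n".join(["\n".join(lines)] + list(sections))
```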

Sources:

packages/markitdown/src/markitdown/converters/_epub_converter.py26-147

Custom Markdownify Integration

_CustomMarkdownify Class

The _CustomMarkdownify class (packages/markitdown/src/markitdown/converters/_markdownify.py8-127) extends markdownify.MarkdownConverter to provide customized HTML-to-Markdown conversion for MarkItDown. All web content converters use this class for final HTML-to-Markdown transformation.

Customizations:

Feature	Default Behavior	Custom Behavior	Method
Heading style	Mixed	ATX (`#`, `##`, etc.)	`__init__()` line 19
Heading newlines	Variable	Always start with newline	`convert_hn()` line 24-37
JavaScript links	Preserved	Removed	`convert_a()` line 39-83
URI escaping	Basic	Full URL encoding	`convert_a()` line 57-66
Data URIs in images	Preserved	Truncated	`convert_img()` line 85-110
Checkboxes	Not handled	`[x]` / `[ ]` syntax	`convert_input()` line 112-123

Link Processing

The convert_a() method (_markdownify.py39-83) implements several safety and normalization features:

URL Scheme Filtering (_markdownify.py58-65):

1. Parse URL with urlparse()
2. Check if scheme exists and is in allowed list: ["http", "https", "file"]
3. If not allowed, return plain text (no link)
4. URL encode the path component while preserving other parts

This prevents JavaScript URIs (javascript:alert()) and other potentially dangerous schemes from being included in Markdown links.
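The four steps can be sketched with `urllib.parse`. The function name `render_link` and the simplified link format are assumptions; the real method also handles titles, prefixes, and autolinks, which are left out here.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse

ALLOWED_SCHEMES = ["http", "https", "file"]


def render_link(text, href):
    """Return a Markdown link, or plain text when the scheme is not allowed."""
    try:
        parsed = urlparse(href)
    except ValueError:
        return text  # unparseable URL: keep only the link text
    if parsed.scheme and parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return text
    # Unquote first so already-encoded paths are not double-encoded.
    safe_path = quote(unquote(parsed.path))
    href = urlunparse(parsed._replace(path=safe_path))
    return f"[{text}]({href})"
```

Relative URLs have no scheme, so they pass the filter; only URLs that name a scheme outside the allowed list are reduced to plain text.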

URI Normalization:

Unquote and re-quote URL paths for consistent encoding
Handle edge case where underscores in text are escaped (\_)
Escape double quotes in title attributes

Image Handling

The convert_img() method (_markdownify.py85-110) handles image conversion with special attention to data URIs:

Data URI Truncation (_markdownify.py106-108):

Unless keep_data_uris is enabled, everything after the header of a data URI is dropped. This prevents extremely long base64-encoded images from bloating the Markdown output. The MIME type prefix is preserved (e.g., data:image/png;base64...) for reference.
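A minimal sketch of the truncation, assuming it keeps everything before the first comma and appends an ellipsis, which matches the example above. The function name `truncate_data_uri` is hypothetical.

```python
def truncate_data_uri(src, keep_data_uris=False):
    """Shorten a data: URI to its header unless the caller opts to keep it."""
    if src.startswith("data:") and not keep_data_uris:
        # "data:image/png;base64,iVBOR..." -> "data:image/png;base64..."
        return src.split(",")[0] + "..."
    return src
```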

Alt Text Processing:

Extract from alt attribute (or empty string if missing)
Remove line breaks from alt text (_markdownify.py99)
Fall back to the data-src attribute if src is missing (common in lazy-loaded images)

Checkbox Conversion

The convert_input() method (_markdownify.py112-123) adds support for HTML checkboxes, rendering a checked box as [x] and an unchecked box as [ ].

This is particularly useful when converting HTML forms or task lists to Markdown.
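The checkbox mapping can be sketched on a plain attribute dictionary. The function name `checkbox_markdown`, the trailing space, and the empty result for other input types are assumptions; only the `[x]` / `[ ]` syntax is taken from the text.

```python
def checkbox_markdown(attrs):
    """Map an <input> element's attributes to Markdown task-list syntax."""
    if attrs.get("type") == "checkbox":
        # HTML treats the mere presence of `checked` as true.
        return "[x] " if "checked" in attrs else "[ ] "
    return ""  # assumed: other input types produce no output
```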

Configuration Options

The constructor accepts options passed through **kwargs (_markdownify.py18-22):

heading_style: Defaults to markdownify.ATX (# style) when the caller does not supply a value
keep_data_uris: Controls data URI truncation (default: False)
All standard markdownify options (e.g., autolinks, default_title, keep_inline_images_in)

Sources:

packages/markitdown/src/markitdown/converters/_markdownify.py8-127

Integration with Core System

Usage Pattern Across Converters

Inheritance Relationships:

YouTubeConverter accepts HTML but doesn't explicitly inherit from HtmlConverter (focuses on metadata extraction)
EpubConverter extends HtmlConverter (_epub_converter.py26) and instantiates an internal HtmlConverter (_epub_converter.py33)
RssConverter independently uses _CustomMarkdownify for content conversion (_rss_converter.py6)

Common Dependencies:

BeautifulSoup 4 for HTML/XML parsing
defusedxml for secure XML parsing (RSS, EPUB)
markdownify library as base for _CustomMarkdownify

Optional Dependencies:

youtube-transcript-api for YouTube transcript fetching (gracefully degrades if missing)

