This page documents the MagicModel class implementations that transform raw model inference outputs into structured, hierarchical block representations. MagicModel is a critical component in the data transformation pipeline, responsible for extracting blocks and spans from backend-specific inference results.
MagicModel has three backend-specific implementations, each processing different input formats but producing a common output structure:
Sources: mineru/backend/pipeline/pipeline_magic_model.py6-8 mineru/backend/vlm/vlm_magic_model.py12-14 mineru/backend/hybrid/hybrid_magic_model.py15-27
All three implementations expose the same getter methods:
| Method | Returns | Description |
|---|---|---|
| get_image_blocks() | list | Image blocks with captions/footnotes |
| get_table_blocks() | list | Table blocks with captions/footnotes |
| get_text_blocks() | list | Regular text blocks |
| get_title_blocks() | list | Title/heading blocks |
| get_interline_equation_blocks() | list | Display equation blocks |
| get_code_blocks() | list | Code and algorithm blocks |
| get_list_blocks() | list | List item blocks |
| get_ref_text_blocks() | list | Reference text blocks |
| get_phonetic_blocks() | list | Phonetic annotation blocks |
| get_discarded_blocks() | list | Discarded blocks (headers, footers) |
| get_all_spans() | list | All atomic content spans |
Sources: mineru/backend/vlm/vlm_magic_model.py240-271 mineru/backend/hybrid/hybrid_magic_model.py315-346 mineru/backend/pipeline/pipeline_magic_model.py246-318
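Because the getter names are shared, downstream code can treat any backend's MagicModel the same way. The sketch below illustrates that idea only. `BlockSource`, `collect_blocks`, and `FakeMagicModel` are hypothetical names that do not exist in MinerU, and only three of the getters are modeled.

```python
from typing import Protocol


class BlockSource(Protocol):
    """Hypothetical structural type covering a few of the shared getters."""

    def get_text_blocks(self) -> list: ...
    def get_title_blocks(self) -> list: ...
    def get_discarded_blocks(self) -> list: ...


def collect_blocks(model: BlockSource) -> dict:
    """Gather blocks from any object that exposes the common getters."""
    return {
        "text": model.get_text_blocks(),
        "title": model.get_title_blocks(),
        "discarded": model.get_discarded_blocks(),
    }


class FakeMagicModel:
    """Stand-in for this example; the real classes are built from inference output."""

    def get_text_blocks(self) -> list:
        return [{"type": "text", "bbox": [0, 40, 100, 60]}]

    def get_title_blocks(self) -> list:
        return [{"type": "title", "bbox": [0, 0, 100, 30]}]

    def get_discarded_blocks(self) -> list:
        return []


blocks = collect_blocks(FakeMagicModel())
```

Any of the three real implementations could be passed to a function shaped like `collect_blocks`, since the function relies only on the getter names listed in the table.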
MagicModel works with two hierarchical enumerations that define the structure of parsed documents:
Sources: mineru/utils/enum_class.py3-31
ContentType defines the atomic span types that appear within blocks:
| ContentType | Description | Example Usage |
|---|---|---|
| TEXT | Regular text content | Paragraphs, captions |
| INLINE_EQUATION | Inline math formula | $x^2 + y^2$ |
| INTERLINE_EQUATION | Display math formula | $$\int_0^\infty e^{-x}dx$$ |
| IMAGE | Image content | Cropped image references |
| TABLE | Table content | HTML table structure |
Sources: mineru/utils/enum_class.py33-40
The pipeline backend's MagicModel processes layout detection results with CategoryId labels:
Sources: mineru/backend/pipeline/pipeline_magic_model.py8-11
Sources: mineru/backend/pipeline/pipeline_magic_model.py8-21
The pipeline MagicModel uses distance-based association to link captions and footnotes with their parent elements:
The __tie_up_category_by_distance_v3 method uses spatial proximity to match captions and footnotes to their parent blocks, ensuring structural relationships are preserved.
Sources: mineru/backend/pipeline/pipeline_magic_model.py246-283 mineru/utils/magic_model_utils.py31-171
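To make the idea of distance-based association concrete, here is a minimal sketch that assigns each caption to its nearest body block. It is a simplified greedy nearest-neighbor match, not the actual `__tie_up_category_by_distance_v3` logic. The helper names `bbox_edge_distance` and `attach_by_distance` are hypothetical, and boxes are assumed to be `[x0, y0, x1, y1]` lists.

```python
import math


def bbox_edge_distance(a, b):
    """Gap between two [x0, y0, x1, y1] boxes; 0 when they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def attach_by_distance(bodies, captions):
    """Map each body index to the captions whose nearest body it is."""
    assignment = {i: [] for i in range(len(bodies))}
    if not bodies:
        return assignment  # nothing to attach to
    for cap in captions:
        nearest = min(
            range(len(bodies)),
            key=lambda i: bbox_edge_distance(bodies[i], cap),
        )
        assignment[nearest].append(cap)
    return assignment


bodies = [[0, 0, 100, 100], [0, 300, 100, 400]]
captions = [[0, 105, 100, 120], [0, 280, 100, 295]]
result = attach_by_distance(bodies, captions)
```

In this example the first caption sits 5 pixels below the first body and the second caption sits 5 pixels above the second body, so each is attached to the body it is closest to.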
The pipeline backend maps CategoryId values to BlockType through the extraction process:
| CategoryId | BlockType Output |
|---|---|
| CategoryId.Title | BlockType.TITLE |
| CategoryId.Text | BlockType.TEXT |
| CategoryId.ImageBody | BlockType.IMAGE_BODY |
| CategoryId.ImageCaption | BlockType.IMAGE_CAPTION |
| CategoryId.ImageFootnote | BlockType.IMAGE_FOOTNOTE |
| CategoryId.TableBody | BlockType.TABLE_BODY |
| CategoryId.TableCaption | BlockType.TABLE_CAPTION |
| CategoryId.TableFootnote | BlockType.TABLE_FOOTNOTE |
| CategoryId.InterlineEquation_Layout | BlockType.INTERLINE_EQUATION |
| CategoryId.InlineEquation | ContentType.INLINE_EQUATION span |
| CategoryId.Abandon | BlockType.DISCARDED |
Sources: mineru/utils/enum_class.py68-84 mineru/backend/pipeline/pipeline_magic_model.py284-318
The VLM backend's MagicModel receives fully parsed blocks with content from the vision-language model:
Sources: mineru/backend/vlm/vlm_magic_model.py12-45
The VLM MagicModel performs content-based processing:
Sources: mineru/backend/vlm/vlm_magic_model.py47-183
The VLM MagicModel extracts inline equations from text content using regex:
Sources: mineru/backend/vlm/vlm_magic_model.py106-148
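As an illustration of regex-based extraction, the following sketch splits a text string into text and inline-equation spans, assuming equations are delimited by `\(...\)`. The function name `split_inline_equations`, the exact pattern, and the `"text"` / `"inline_equation"` type strings are assumptions made for this example and may differ from the real implementation.

```python
import re

# Assumed delimiter convention: \( ... \) marks an inline equation.
INLINE_EQ = re.compile(r"\\\((.+?)\\\)")


def split_inline_equations(text):
    """Split text into alternating text and inline-equation spans."""
    spans = []
    pos = 0
    for match in INLINE_EQ.finditer(text):
        if match.start() > pos:
            spans.append({"type": "text", "content": text[pos:match.start()]})
        spans.append({"type": "inline_equation", "content": match.group(1).strip()})
        pos = match.end()
    if pos < len(text):
        spans.append({"type": "text", "content": text[pos:]})
    return spans


spans = split_inline_equations(r"Energy \(E = mc^2\) is conserved.")
```

The delimiters themselves are dropped from the equation span, so only the LaTeX body is kept as `content`.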
VLM MagicModel processes code blocks with special logic:
If a code block contains inline equations (indicated by \(...\) patterns), it is automatically reclassified as an algorithm block, since algorithms often contain mathematical notation.
Sources: mineru/backend/vlm/vlm_magic_model.py76-80 mineru/backend/vlm/vlm_magic_model.py282-295 mineru/backend/vlm/vlm_magic_model.py86-108
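The reclassification rule can be sketched as a single check on the block content. This is an illustrative approximation: `classify_code_block` is a hypothetical helper, and the `"code"` and `"algorithm"` labels are assumed names for the two sub-types.

```python
import re

# Assumed marker for inline math inside a code block: \( ... \)
INLINE_EQ = re.compile(r"\\\(.+?\\\)", re.DOTALL)


def classify_code_block(content: str) -> str:
    """Return 'algorithm' when the content contains inline math, else 'code'."""
    return "algorithm" if INLINE_EQ.search(content) else "code"


kind_with_math = classify_code_block(r"for i in 1..n: sum += \(x_i^2\)")
kind_plain = classify_code_block("print('hello')")
```

Ordinary parentheses such as those in `print('hello')` do not trigger the rule, because the pattern requires the backslash-escaped delimiters.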
The hybrid backend's MagicModel combines VLM results with OCR-based span extraction:
Sources: mineru/backend/hybrid/hybrid_magic_model.py15-27
The hybrid MagicModel uses two different processing paths based on the _vlm_ocr_enable flag:
Sources: mineru/backend/hybrid/hybrid_magic_model.py38-242
The hybrid backend defines certain block types that should NOT be extracted from VLM content when in span-fill mode:
These blocks are filled with OCR-derived spans instead of VLM content to ensure higher text extraction accuracy.
Sources: mineru/utils/enum_class.py120-132 mineru/backend/hybrid/hybrid_magic_model.py13 mineru/backend/hybrid/hybrid_magic_model.py135-242
When not in VLM OCR mode, the hybrid backend fills text blocks with OCR-derived spans:
The fix_text_block function organizes loose spans into structured lines within the block.
Sources: mineru/backend/hybrid/hybrid_magic_model.py224-242 mineru/utils/span_block_fix.py1-50
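One plausible way to organize loose spans into lines is to group spans whose vertical extents overlap and then sort each group left to right. The sketch below shows that approach under stated assumptions; it does not reproduce `fix_text_block`. The name `group_spans_into_lines` and the 0.5 overlap threshold are hypothetical.

```python
def group_spans_into_lines(spans, overlap_ratio=0.5):
    """Group spans with overlapping vertical extents into lines, left to right."""
    lines = []
    for span in sorted(spans, key=lambda s: (s["bbox"][1], s["bbox"][0])):
        y0, y1 = span["bbox"][1], span["bbox"][3]
        for line in lines:
            ly0, ly1 = line["bbox"][1], line["bbox"][3]
            overlap = min(y1, ly1) - max(y0, ly0)
            shorter = min(y1 - y0, ly1 - ly0)
            if shorter > 0 and overlap / shorter >= overlap_ratio:
                line["spans"].append(span)
                b, sb = line["bbox"], span["bbox"]
                # Grow the line box to cover the newly added span.
                line["bbox"] = [min(b[0], sb[0]), min(b[1], sb[1]),
                                max(b[2], sb[2]), max(b[3], sb[3])]
                break
        else:
            # No existing line overlaps enough: start a new line.
            lines.append({"bbox": list(span["bbox"]), "spans": [span]})
    for line in lines:
        line["spans"].sort(key=lambda s: s["bbox"][0])
    return lines


loose = [
    {"bbox": [60, 10, 100, 20], "content": "world"},
    {"bbox": [10, 11, 50, 21], "content": "Hello"},
    {"bbox": [10, 30, 80, 40], "content": "Next line"},
]
lines = group_spans_into_lines(loose)
```

Here the first two spans share most of their vertical range and end up in one line, while the third starts a new line.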
All three MagicModel implementations construct two-layer structures for images, tables, and code blocks:
The fix_two_layer_blocks function (used by the VLM and hybrid backends) implements index-based association, unlike the pipeline backend, which uses tie_up_category_by_distance_v3:
VLM/Hybrid Backend Association Flow (Index-Based)
Pipeline Backend Association (Distance-Based)
The pipeline backend uses tie_up_category_by_distance_v3 which matches components by spatial proximity rather than index:
This spatial matching uses bbox_distance to find nearest neighbors rather than relying on reading order indices.
Sources: mineru/backend/vlm/vlm_magic_model.py373-502 mineru/backend/hybrid/hybrid_magic_model.py449-577 mineru/utils/magic_model_utils.py173-299 mineru/backend/pipeline/pipeline_magic_model.py212-244 mineru/utils/magic_model_utils.py31-171
The fix_two_layer_blocks function validates position constraints and index continuity:
Position Constraints (Lines 461-472 in vlm_magic_model.py)
Index Continuity Validation (Lines 428-451 for captions, 454-466 for footnotes)
Captions must form a continuous sequence working backward from the body:
Footnotes must form a continuous sequence working forward from the body:
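The two continuity rules above can be sketched as follows. This is a simplified illustration that assumes indices are integer reading-order positions and that a valid sequence starts immediately next to the body. `contiguous_captions` and `contiguous_footnotes` are hypothetical helpers, not functions from the codebase.

```python
def contiguous_captions(body_index, caption_indices):
    """Keep captions at body_index-1, body_index-2, ... until the first gap."""
    available = set(caption_indices)
    kept = []
    expected = body_index - 1
    while expected in available:
        kept.append(expected)
        expected -= 1
    return kept


def contiguous_footnotes(body_index, footnote_indices):
    """Keep footnotes at body_index+1, body_index+2, ... until the first gap."""
    available = set(footnote_indices)
    kept = []
    expected = body_index + 1
    while expected in available:
        kept.append(expected)
        expected += 1
    return kept


captions_kept = contiguous_captions(5, [4, 3, 1])
footnotes_kept = contiguous_footnotes(5, [6, 8])
```

In the example, caption 1 and footnote 8 are rejected because indices 2 and 7 break the sequence.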
tie_up_category_by_index Matching Logic
The underlying tie_up_category_by_index function uses a three-tier priority system:
| Priority | Criterion | Code Reference |
|---|---|---|
| 1 (Highest) | Effective index difference | calc_effective_index_diff() at line 219-237 |
| 2 | Bbox edge distance | bbox_distance() at line 265 |
| 3 (Lowest) | Bbox center distance | bbox_center_distance() at line 285 |
Special rules when edge distance diff <= 2:
- table_caption: match to later subject (line 276-278)
- *_footnote: match to earlier subject (line 279-282)

Sources: mineru/backend/vlm/vlm_magic_model.py379-502 mineru/backend/hybrid/hybrid_magic_model.py456-577 mineru/utils/magic_model_utils.py173-299
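A three-tier priority like the one in the table can be expressed as a tuple sort key, where earlier elements dominate later ones. The sketch below makes several simplifications: a plain absolute index difference stands in for `calc_effective_index_diff`, the special rules for near-equal edge distances are not modeled, and `edge_distance`, `center_distance`, and `pick_subject` are hypothetical names.

```python
import math


def edge_distance(a, b):
    """Gap between two [x0, y0, x1, y1] boxes; 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def center_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def pick_subject(obj, subjects):
    """Choose the subject with the best (index, edge, center) priority tuple."""
    def priority(subject):
        return (
            abs(subject["index"] - obj["index"]),                # tier 1
            edge_distance(subject["bbox"], obj["bbox"]),        # tier 2
            center_distance(subject["bbox"], obj["bbox"]),      # tier 3
        )
    return min(subjects, key=priority)


caption = {"index": 3, "bbox": [0, 110, 100, 120]}
subjects = [
    {"name": "A", "index": 2, "bbox": [0, 0, 100, 100]},
    {"name": "B", "index": 4, "bbox": [0, 125, 100, 200]},
    {"name": "C", "index": 9, "bbox": [0, 121, 100, 124]},
]
chosen = pick_subject(caption, subjects)
```

Subject C is spatially closest but loses on index difference. A and B tie on index difference, so the smaller edge distance decides in favor of B.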
The fix_list_blocks function associates list items with their container using overlap detection:
List Block Processing Flow
Sub-Type Determination (Lines 529-541)
The sub_type field is determined by counting block types within the list:
This allows distinguishing between text lists, reference lists, or mixed content lists.
Sources: mineru/backend/vlm/vlm_magic_model.py505-543 mineru/backend/hybrid/hybrid_magic_model.py580-618 mineru/utils/boxbase.py174-191
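The overlap test and the type counting can be sketched together. The 0.8 threshold, the majority-vote rule for `sub_type`, the `"unknown"` fallback, and the helper names `overlap_ratio_in_first` and `build_list_block` are all assumptions for illustration. The real function may use different criteria.

```python
from collections import Counter


def overlap_ratio_in_first(bbox1, bbox2):
    """Fraction of bbox1's area that is covered by bbox2."""
    x0, y0 = max(bbox1[0], bbox2[0]), max(bbox1[1], bbox2[1])
    x1, y1 = min(bbox1[2], bbox2[2]), min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    return (x1 - x0) * (y1 - y0) / area1 if area1 > 0 else 0.0


def build_list_block(list_block, candidates, threshold=0.8):
    """Attach candidates mostly inside the list box and derive a sub_type."""
    items = [
        c for c in candidates
        if overlap_ratio_in_first(c["bbox"], list_block["bbox"]) >= threshold
    ]
    counts = Counter(item["type"] for item in items)
    # Assumed rule: the most frequent child type names the list.
    sub_type = counts.most_common(1)[0][0] if counts else "unknown"
    return {**list_block, "blocks": items, "sub_type": sub_type}


list_block = {"type": "list", "bbox": [0, 0, 100, 100]}
candidates = [
    {"type": "ref_text", "bbox": [0, 0, 100, 30]},
    {"type": "ref_text", "bbox": [0, 35, 100, 65]},
    {"type": "text", "bbox": [0, 70, 100, 100]},
    {"type": "text", "bbox": [0, 200, 100, 230]},  # outside the list box
]
result = build_list_block(list_block, candidates)
```

Three candidates fall inside the list box, two of them reference text, so this sketch labels the list as `ref_text`.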
Spans are atomic content units stored in both individual blocks and the aggregated all_spans list:
| Field | Type | Present When | Description |
|---|---|---|---|
| bbox | [x0, y0, x1, y1] | Always | Pixel coordinates |
| type | ContentType enum | Always | TEXT, INLINE_EQUATION, INTERLINE_EQUATION, IMAGE, TABLE |
| content | string | TEXT or EQUATION types | Extracted or recognized text/LaTeX |
| html | string | TABLE type | HTML table structure from VLM/pipeline |
| latex | string | TABLE type (pipeline) | OTSL or LaTeX table format |
| score | float | Pipeline/Hybrid | Model confidence (0.0-1.0) |
| image_path | string | IMAGE type (added later) | Path to cropped image file |
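
The field table can be read as a loosely typed dictionary schema. The sketch below models it with a `TypedDict` and a small checker for the fields that the table marks as expected per type. The lowercase type strings and the names `Span` and `missing_fields` are assumptions for this example, not definitions from the codebase.

```python
from typing import List, TypedDict


class Span(TypedDict, total=False):
    """Hypothetical schema mirroring the span field table."""
    bbox: List[float]
    type: str
    content: str
    html: str
    latex: str
    score: float
    image_path: str


def missing_fields(span: Span) -> list:
    """List fields the table says should be present for this span type."""
    required = {"bbox", "type"}
    span_type = span.get("type")
    if span_type in {"text", "inline_equation", "interline_equation"}:
        required.add("content")
    if span_type == "table":
        required.add("html")
    return sorted(required - set(span))


text_span: Span = {"bbox": [10, 10, 200, 30], "type": "text", "content": "Hello"}
broken_span: Span = {"bbox": [10, 40, 200, 60], "type": "text"}
image_span: Span = {"bbox": [0, 0, 50, 50], "type": "image"}
```

An image span without `image_path` passes the check, since the table notes that this field is added later in processing.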
The pipeline backend generates spans in its get_all_spans() method (lines 308-352):
Pipeline get_all_spans() Processing
Downstream Processing (model_json_to_middle_json.py)
Sources: mineru/backend/pipeline/pipeline_magic_model.py308-352 mineru/backend/pipeline/model_json_to_middle_json.py136-173 mineru/utils/span_pre_proc.py
The VLM backend generates spans directly from block content:
Sources: mineru/backend/vlm/vlm_magic_model.py86-164
The hybrid backend uses a conditional dual-path approach for span generation:
Span Source Selection (Lines 38-62)
Block Content vs Span Filling (Lines 135-242)
For each block during processing:
The not_extract_list includes: TEXT, TITLE, HEADER, FOOTER, PAGE_NUMBER, PAGE_FOOTNOTE, REF_TEXT, and all caption/footnote types. These blocks use span-filling for better accuracy.
Sources: mineru/backend/hybrid/hybrid_magic_model.py13 mineru/backend/hybrid/hybrid_magic_model.py38-62 mineru/backend/hybrid/hybrid_magic_model.py135-242 mineru/utils/enum_class.py120-132 mineru/utils/span_block_fix.py
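The dual-path decision can be summarized in a small function: when VLM OCR is enabled, content comes from the VLM. Otherwise, block types in the not-extract list are filled from OCR spans. This is a sketch under assumptions. The lowercase type strings, the exact set of caption and footnote types, and the names `NOT_EXTRACT` and `content_source` are hypothetical.

```python
# Assumed string forms of the block types named in the text above.
NOT_EXTRACT = {
    "text", "title", "header", "footer", "page_number",
    "page_footnote", "ref_text",
    # "all caption/footnote types": the members below are assumed.
    "image_caption", "image_footnote",
    "table_caption", "table_footnote",
}


def content_source(block_type: str, vlm_ocr_enable: bool) -> str:
    """Return which source supplies the text for a block."""
    if vlm_ocr_enable:
        return "vlm"  # VLM content is used directly
    if block_type in NOT_EXTRACT:
        return "ocr_spans"  # filled later from OCR-derived spans
    return "vlm"


title_source = content_source("title", vlm_ocr_enable=False)
table_source = content_source("table", vlm_ocr_enable=False)
title_source_vlm = content_source("title", vlm_ocr_enable=True)
```

Under these assumptions, a title is span-filled only when VLM OCR is disabled, while a table body always takes its content from the VLM output.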
The following diagram shows how blocks and spans are collected after MagicModel processing:
Sources: mineru/backend/pipeline/model_json_to_middle_json.py36-169 mineru/utils/block_pre_proc.py11-31
Sources: mineru/backend/pipeline/model_json_to_middle_json.py28-56
VLM and Hybrid backends use MagicModel in their respective result_to_middle_json functions:
Sources: mineru/backend/vlm/vlm_analyze.py1-200 mineru/backend/hybrid/hybrid_analyze.py1-300
After MagicModel extraction, blocks undergo further processing:
Sources: mineru/backend/pipeline/model_json_to_middle_json.py176-253 mineru/backend/pipeline/para_split.py1-50 mineru/utils/table_merge.py537-589
MagicModel serves as the critical transformation layer that:
The three implementations (pipeline, VLM, hybrid) use different input formats and processing strategies but converge on a unified output structure that enables consistent downstream processing in the document parsing pipeline.
Sources: mineru/backend/pipeline/pipeline_magic_model.py6-318 mineru/backend/vlm/vlm_magic_model.py12-543 mineru/backend/hybrid/hybrid_magic_model.py15-698