MinerU VL

Vision-language model for layout-aware document extraction with specialized table and equation recognition.


Overview

Tasks: Text Extraction, Layout Analysis
Backends: PyTorch, VLLM, MLX, API
Speed: 3-6s/page
Quality: Excellent (especially tables/equations)
VRAM: 3-4GB

MinerU VL performs extraction in two steps (a combined sketch follows the list):

  1. Layout Detection - Detects 22+ element types (text, tables, equations, figures, code, etc.)
  2. Content Recognition - Extracts text/table/equation content from each region
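
Both steps run inside a single extract() call on the text extractor, so you normally never invoke them separately. The minimal sketch below uses the classes documented in the following sections to run the two steps back to back on one page, just to make the split visible.

from omnidocs import Document
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

page = Document.from_pdf("document.pdf").get_page(0)

# Step 1: layout detection -- which regions exist and what kind are they?
layout = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
).extract(page)

# Step 2: content recognition -- read text/tables/equations into markdown
text = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
).extract(page, output_format="markdown")

print(len(layout.bboxes), "regions detected")
print(text.content[:200])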

Text Extraction

from omnidocs import Document
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

doc = Document.from_pdf("document.pdf")

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)

result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)

With Detailed Blocks

# Get both text output and detailed block information
result, blocks = extractor.extract_with_blocks(image, output_format="markdown")

for block in blocks:
    print(f"{block.type}: {block.bbox}")
    print(f"  Content: {block.content[:50]}...")

Layout Analysis

from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)

result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.bbox}")

Detected Element Types

MinerU VL detects 22+ element types:

| Category | Types |
|----------|-------|
| Text | text, title, header, footer, page_number |
| Tables | table, table_caption, table_footnote |
| Math | equation, equation_block |
| Code | code, algorithm, code_caption |
| Images | image, image_caption, image_footnote |
| Other | list, ref_text, aside_text, phonetic |
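
A common follow-up is to filter layout results down to specific element types, for example to post-process only tables and equations. A minimal sketch, assuming box.label uses the type names listed above:

# Keep only table and equation regions from a layout result
wanted = {"table", "equation", "equation_block"}
regions = [box for box in result.bboxes if box.label in wanted]
for box in regions:
    print(box.label, box.bbox)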

Model Caching

MinerU VL caches loaded models. When you create both a text extractor and a layout detector, they share the same underlying model:

from omnidocs import get_cache_info
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutMLXConfig
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

# First extractor loads the model (~4s)
text_extractor = MinerUVLTextExtractor(backend=MinerUVLTextMLXConfig())

# Second extractor reuses cached model (instant)
layout_detector = MinerUVLLayoutDetector(backend=MinerUVLLayoutMLXConfig())

# Check cache status
print(get_cache_info())
# {'num_entries': 1, 'entries': {..., 'ref_count': 2}}

Configure cache behavior:

from omnidocs import set_cache_config

# Limit cached models (default: 10)
set_cache_config(max_entries=5)

# Clear cache to free memory
from omnidocs import clear_cache
clear_cache()

Backend Configs

PyTorch (Local GPU)

from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

config = MinerUVLTextPyTorchConfig(
    model="opendatalab/MinerU2.5-2509-1.2B",
    device="cuda",              # "cuda", "cpu", "auto"
    torch_dtype="float16",      # "float16", "bfloat16", "float32"
    use_flash_attention=False,  # Use SDPA by default
)

VLLM (High Throughput)

from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig

config = MinerUVLTextVLLMConfig(
    model="opendatalab/MinerU2.5-2509-1.2B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)

MLX (Apple Silicon)

from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

config = MinerUVLTextMLXConfig(
    model="opendatalab/MinerU2.5-2509-1.2B",
    max_tokens=4096,
)

API (VLLM Server)

from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig

config = MinerUVLTextAPIConfig(
    server_url="http://localhost:8000",
    model_name="opendatalab/MinerU2.5-2509-1.2B",
    max_tokens=4096,
)

Output Formats

Markdown (Default)

result = extractor.extract(image, output_format="markdown")
# # Title
#
# This is paragraph text.
#
# | Header 1 | Header 2 |
# |----------|----------|
# | Cell 1   | Cell 2   |
#
# $$E = mc^2$$

HTML

result = extractor.extract(image, output_format="html")
# <h1>Title</h1>
# <p>This is paragraph text.</p>
# <table>...</table>
# <math>E = mc^2</math>

Comparison with Qwen

| Feature | MinerU VL | Qwen VL |
|---------|-----------|---------|
| Model Size | 1.2B | 2B-32B |
| VRAM | 3-4GB | 4-64GB |
| Table Quality | Excellent (OTSL format) | Good |
| Equation Quality | Excellent (LaTeX) | Good |
| General Text | Good | Excellent |
| Speed | 3-6s/page | 2-3s/page |

Recommendation: Use MinerU VL for documents with complex tables and equations. Use Qwen for general-purpose text extraction.


Troubleshooting

CUDA out of memory

# MinerU VL is already small (1.2B); free other GPU memory, lower
# gpu_memory_utilization on the VLLM backend, or use MLX on Apple Silicon
config = MinerUVLTextMLXConfig()

Tables not rendering correctly

# Use HTML output format for better table rendering
result = extractor.extract(image, output_format="html")

Slow inference

# Use VLLM backend for production
config = MinerUVLTextVLLMConfig(
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)


Attribution

MinerU VL utilities include code adapted from mineru-vl-utils, licensed under AGPL-3.0.