MinerU VL¶
Vision-language model for layout-aware document extraction with specialized table and equation recognition.
Overview¶
| Tasks | Text Extraction, Layout Analysis |
| Backends | PyTorch, VLLM, MLX, API |
| Speed | 3-6s/page |
| Quality | Excellent (especially tables/equations) |
| VRAM | 3-4GB |
MinerU VL performs two-step extraction:
- Layout Detection - Detects 22+ element types (text, tables, equations, figures, code, etc.)
- Content Recognition - Extracts text/table/equation content from each region
Text Extraction¶
from omnidocs import Document
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
doc = Document.from_pdf("document.pdf")
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)
With Detailed Blocks¶
# Get both text output and detailed block information
result, blocks = extractor.extract_with_blocks(image, output_format="markdown")
for block in blocks:
print(f"{block.type}: {block.bbox}")
print(f" Content: {block.content[:50]}...")
Layout Analysis¶
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
detector = MinerUVLLayoutDetector(
backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
print(f"{box.label}: {box.bbox}")
Detected Element Types¶
MinerU VL detects 22+ element types:
| Category | Types |
|---|---|
| Text | text, title, header, footer, page_number |
| Tables | table, table_caption, table_footnote |
| Math | equation, equation_block |
| Code | code, algorithm, code_caption |
| Images | image, image_caption, image_footnote |
| Other | list, ref_text, aside_text, phonetic |
Model Caching¶
MinerU VL supports intelligent model caching. When you create both a text extractor and layout detector, they share the same underlying model:
from omnidocs import get_cache_info
# First extractor loads the model (~4s)
text_extractor = MinerUVLTextExtractor(backend=MinerUVLTextMLXConfig())
# Second extractor reuses cached model (instant)
layout_detector = MinerUVLLayoutDetector(backend=MinerUVLLayoutMLXConfig())
# Check cache status
print(get_cache_info())
# {'num_entries': 1, 'entries': {..., 'ref_count': 2}}
Configure cache behavior:
from omnidocs import set_cache_config
# Limit cached models (default: 10)
set_cache_config(max_entries=5)
# Clear cache to free memory
from omnidocs import clear_cache
clear_cache()
Backend Configs¶
PyTorch (Local GPU)¶
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
config = MinerUVLTextPyTorchConfig(
model="opendatalab/MinerU2.5-2509-1.2B",
device="cuda", # "cuda", "cpu", "auto"
torch_dtype="float16", # "float16", "bfloat16", "float32"
use_flash_attention=False, # Use SDPA by default
)
VLLM (High Throughput)¶
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig
config = MinerUVLTextVLLMConfig(
model="opendatalab/MinerU2.5-2509-1.2B",
tensor_parallel_size=1,
gpu_memory_utilization=0.85,
enforce_eager=True,
)
MLX (Apple Silicon)¶
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig
config = MinerUVLTextMLXConfig(
model="opendatalab/MinerU2.5-2509-1.2B",
max_tokens=4096,
)
API (VLLM Server)¶
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig
config = MinerUVLTextAPIConfig(
server_url="http://localhost:8000",
model_name="opendatalab/MinerU2.5-2509-1.2B",
max_tokens=4096,
)
Output Formats¶
Markdown (Default)¶
result = extractor.extract(image, output_format="markdown")
# # Title
#
# This is paragraph text.
#
# | Header 1 | Header 2 |
# |----------|----------|
# | Cell 1 | Cell 2 |
#
# $$E = mc^2$$
HTML¶
result = extractor.extract(image, output_format="html")
# <h1>Title</h1>
# <p>This is paragraph text.</p>
# <table>...</table>
# <math>E = mc^2</math>
Comparison with Qwen¶
| Feature | MinerU VL | Qwen VL |
|---|---|---|
| Model Size | 1.2B | 2B-32B |
| VRAM | 3-4GB | 4-64GB |
| Table Quality | Excellent (OTSL format) | Good |
| Equation Quality | Excellent (LaTeX) | Good |
| General Text | Good | Excellent |
| Speed | 3-6s/page | 2-3s/page |
Recommendation: Use MinerU VL for documents with complex tables and equations. Use Qwen for general-purpose text extraction.
Troubleshooting¶
CUDA out of memory
# MinerU VL is already small (1.2B), try reducing batch size
# or use MLX on Apple Silicon
config = MinerUVLTextMLXConfig()
Tables not rendering correctly
# Use HTML output format for better table rendering
result = extractor.extract(image, output_format="html")
Slow inference
# Use VLLM backend for production
config = MinerUVLTextVLLMConfig(
gpu_memory_utilization=0.9,
enforce_eager=True,
)
Attribution¶
MinerU VL utilities include code adapted from mineru-vl-utils, licensed under AGPL-3.0.