# DotsOCR Text Extraction

## Model Overview
DotsOCR (Deep Object Text Segmentation OCR) is a specialized Vision-Language Model designed specifically for document understanding with built-in layout analysis. Unlike general-purpose VLMs, DotsOCR outputs structured information about document layout while extracting text content.
- **Model ID**: `rednote-hilab/dots.ocr`
- **Repository**: DotsOCR on HuggingFace
- **Architecture**: Vision Encoder + Language Model
- **Training Focus**: Academic papers, technical documents, PDFs
## Key Capabilities
- Layout-Aware Extraction: Detects 11 document element categories with bounding boxes
- Multi-Format Text: Different formats per category (Markdown, LaTeX, HTML)
- Fast Inference: 50-100% faster than general-purpose VLMs
- Normalized Coordinates: All bboxes in 0-1024 range (scale-independent)
- Reading Order: Maintains document reading order
- Format-Specific Output (see the rendering sketch after this list):
    - Text/Title/Section-header: Markdown
    - Formula: LaTeX
    - Table: HTML
    - Figure: Bounding box only (no text)
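Because each category carries its own output format, downstream rendering typically dispatches on `category`. A minimal sketch, assuming the `result.layout` / `LayoutElement` API documented later on this page:

```python
# Minimal sketch: stitch layout elements into one Markdown string,
# dispatching on the per-category formats listed above.
# Assumes the LayoutElement API shown later on this page.
def render_markdown(layout):
    parts = []
    for elem in layout:
        if elem.category == "Figure":
            # Figures carry a bbox but no text
            parts.append(f"<!-- figure at {elem.bbox} -->")
        elif elem.category == "Formula":
            # LaTeX, wrapped as display math
            parts.append(f"$$\n{elem.text}\n$$")
        else:
            # Markdown and HTML content passes through as-is
            parts.append(elem.text)
    return "\n\n".join(parts)
```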
## Limitations
- PyTorch and VLLM backends only (no MLX, no API)
- Optimized for academic/technical documents (less suited to forms and invoices)
- Fixed layout categories (cannot add custom categories)
- Requires GPU (minimum 16GB VRAM for 8B variant)
- Output is JSON-focused (not raw markdown like Qwen)
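Given the 16 GB VRAM floor above, it can save a failed load to check the GPU first. A quick sketch with PyTorch (the threshold is the figure from this list, not a limit enforced by the library):

```python
import torch

# Sanity-check available VRAM against the ~16 GB floor noted above
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < 16:
        print(f"GPU has {total_gb:.1f} GB VRAM; the 8B model may not fit")
else:
    print("No CUDA GPU detected; only CPU inference is possible")
```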
## Supported Backends

DotsOCR supports two inference backends:
| Backend | Use Case | Performance | Setup |
|---|---|---|---|
| PyTorch | Single document, development | 50-100 tok/s | Simple GPU setup |
| VLLM | Batch processing, production | 150-300 tok/s | Multi-GPU cluster |
No MLX or API backends available.
## Installation & Configuration

### Basic Installation

```bash
# Install with PyTorch backend
pip install omnidocs[pytorch]

# Or with VLLM for batching
pip install omnidocs[vllm]
```
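A quick smoke test after installing confirms the import path works and shows whether a GPU is visible (the import is the one used throughout this page):

```python
# Smoke test: verify the install and check GPU visibility
import torch
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible:   {torch.cuda.device_count()}")
```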
### PyTorch Backend Configuration

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
    torch_dtype="bfloat16",
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Recommended
)

extractor = DotsOCRTextExtractor(backend=config)
```
**PyTorch Config Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | str | `"rednote-hilab/dots.ocr"` | HuggingFace model ID |
| `device` | str | `"cuda"` | Device: `"cuda"`, `"mps"`, `"cpu"` |
| `torch_dtype` | str | `"bfloat16"` | Data type: `"float16"`, `"bfloat16"`, `"float32"` |
| `trust_remote_code` | bool | `True` | Allow custom model code from HuggingFace |
| `device_map` | str | `"auto"` | Model parallelism: `"auto"`, `"balanced"`, `"sequential"` |
| `attn_implementation` | str | `"flash_attention_2"` | Attention type: `"eager"`, `"flash_attention_2"`, `"sdpa"` |
### VLLM Backend Configuration

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=1,       # Use 2+ for large models
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

extractor = DotsOCRTextExtractor(backend=config)
```
**VLLM Config Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | str | Required | HuggingFace model ID |
| `tensor_parallel_size` | int | `1` | Number of GPUs for parallelism |
| `gpu_memory_utilization` | float | `0.85` | Fraction of GPU memory to use (0.1-1.0) |
| `max_model_len` | int | `None` | Max context length in tokens |
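`tensor_parallel_size` can also be sized to whatever GPUs are actually visible rather than hard-coded; a minimal sketch:

```python
import torch
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

# Match tensor parallelism to the GPUs actually visible
num_gpus = max(torch.cuda.device_count(), 1)

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=num_gpus,
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)
```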
## Layout Categories (11 Fixed)
DotsOCR recognizes exactly 11 layout element categories:
| Category | Description | Text Format | Typical Content |
|---|---|---|---|
| Title | Document/section title | Markdown | "Introduction", "Chapter 2" |
| Section-header | Subsection heading | Markdown | "3.1 Method Overview" |
| Text | Body paragraph | Markdown | Main content paragraphs |
| List-item | Bulleted/numbered item | Markdown | "1. First point", "• Item" |
| Table | Tabular data | HTML | `<table><tr><td>...</td>...` |
| Formula | Mathematical equation | LaTeX | `$E=mc^2$` or display math |
| Figure | Image/figure/diagram | None | Bounding box only |
| Caption | Figure/table caption | Markdown | "Fig 1: System Overview" |
| Footnote | Footer note | Markdown | Explanatory footnotes |
| Page-header | Page header text | Markdown | Page number, document title |
| Page-footer | Page footer text | Markdown | Page number, author name |
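A quick way to see how a page breaks down across these categories is to tally the `result.layout` list returned by the extraction examples below:

```python
from collections import Counter

# Tally detected elements per category
counts = Counter(elem.category for elem in result.layout)
for category, count in counts.most_common():
    print(f"{category:15s} {count}")
```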
## Usage Examples

### Basic Layout-Aware Extraction

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
from PIL import Image

# Initialize extractor
config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
)
extractor = DotsOCRTextExtractor(backend=config)

# Load document
image = Image.open("paper.png")

# Extract with layout information
result = extractor.extract(
    image,
    include_layout=True,  # Returns DotsOCRTextOutput
)

# Access layout elements (figures have no text, so guard against None)
print(f"Found {result.num_layout_elements} layout elements")
for elem in result.layout:
    preview = elem.text[:50] if elem.text else ""
    print(f"  {elem.category} @ {elem.bbox}: {preview}...")
```
### Output Format Examples

```python
# Default: DotsOCRTextOutput with layout
result = extractor.extract(image)

# Access structured layout
for elem in result.layout:
    category = elem.category      # "Title", "Text", "Table", etc.
    bbox = elem.bbox              # [x1, y1, x2, y2] (0-1024 normalized)
    text = elem.text              # Content (formatted per category)
    confidence = elem.confidence  # Detection confidence

print(result.content)         # Full text (Markdown)
print(result.format)          # "markdown" (fixed)
print(result.has_layout)      # True
print(result.content_length)  # Total character count
print(result.image_width)     # Source image width
print(result.image_height)    # Source image height
```
### Category-Specific Processing

```python
# Extract only formulas (as LaTeX)
formulas = [
    elem for elem in result.layout
    if elem.category == "Formula"
]
for formula in formulas:
    print(f"Formula @ {formula.bbox}:")
    print(formula.text)  # LaTeX format
    print()

# Extract tables (as HTML)
tables = [
    elem for elem in result.layout
    if elem.category == "Table"
]
for table in tables:
    print(f"Table @ {table.bbox}:")
    print(table.text)  # HTML table
    print()

# Extract all text content (skip figures and page furniture)
text_elements = [
    elem for elem in result.layout
    if elem.category not in ["Figure", "Page-header", "Page-footer"]
]
full_text = "\n".join(elem.text for elem in text_elements)
print(full_text)  # Cleaned text without layout markers
```
### Bounding Box Operations

```python
# Access normalized bounding boxes (0-1024 scale)
for elem in result.layout:
    x1, y1, x2, y2 = elem.bbox
    width = x2 - x1
    height = y2 - y1
    area = width * height
    print(f"{elem.category}: {width}x{height} at ({x1}, {y1})")

# Filter elements by region (e.g., top half of page)
top_half = [
    elem for elem in result.layout
    if elem.bbox[1] < 512  # y1 above the vertical midpoint (1024 / 2)
]

# Filter by size (102400 is ~10% of the 1024x1024 grid area)
large_elements = [
    elem for elem in result.layout
    if (elem.bbox[2] - elem.bbox[0]) * (elem.bbox[3] - elem.bbox[1]) > 102400
]
```
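The layout list comes back in reading order (see Key Capabilities), but once elements have been filtered or regrouped, a rough top-to-bottom, left-to-right sort on the normalized coordinates restores it:

```python
# Re-sort filtered elements into rough reading order:
# top-to-bottom first, then left-to-right
reading_order = sorted(
    large_elements,
    key=lambda elem: (elem.bbox[1], elem.bbox[0]),
)
for elem in reading_order:
    print(elem.category, elem.bbox)
```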
### Batch Processing with VLLM

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig
from PIL import Image
import json

# Initialize with VLLM
config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
extractor = DotsOCRTextExtractor(backend=config)

# Process multiple documents
documents = ["doc1.png", "doc2.png", "doc3.png"]
results = []

for doc_path in documents:
    image = Image.open(doc_path)
    result = extractor.extract(image, include_layout=True)
    results.append({
        "file": doc_path,
        "elements": len(result.layout),
        "content_length": result.content_length,
        "layout": [
            {
                "category": elem.category,
                "bbox": elem.bbox,
                "text_length": len(elem.text) if elem.text else 0,
            }
            for elem in result.layout
        ],
    })

# Save results
with open("extraction_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
## Performance Characteristics

### Memory Requirements
| Model | Framework | VRAM | Batch Size |
|---|---|---|---|
| DotsOCR | PyTorch | 16 GB | 1 (single doc) |
| DotsOCR | VLLM | 20 GB | 2-4 |
| DotsOCR | VLLM (2-GPU) | 20 GB (per GPU) | 6-10 |
### Inference Speed
| Setup | Speed | Throughput |
|---|---|---|
| PyTorch (single A10) | 50-80 tok/s | ~400-600 chars/s |
| VLLM (single A10) | 150-200 tok/s | ~1200-1600 chars/s |
| VLLM (2x A10) | 250-350 tok/s | ~2000-2800 chars/s |
### Typical Processing Times
| Document | Tokens | Time (PyTorch) | Time (VLLM) |
|---|---|---|---|
| Single page | 1000-2000 | 12-25s | 5-10s |
| 5 pages | 5000-10000 | 60-130s | 20-40s |
| 10 pages | 10000-20000 | 130-260s | 40-80s |
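To replace these estimates with numbers for your own hardware, wall-clock timing around a single `extract()` call is enough:

```python
import time
from PIL import Image

# Measure real throughput on your own hardware
image = Image.open("paper.png")

start = time.perf_counter()
result = extractor.extract(image, include_layout=True)
elapsed = time.perf_counter() - start

print(f"Elapsed:    {elapsed:.1f}s")
print(f"Throughput: {result.content_length / elapsed:.0f} chars/s")
```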
## Troubleshooting

### Memory Errors

**Symptom:** `RuntimeError: CUDA out of memory`

**Solutions:**
```python
# 1. Use CPU (slow but works)
config = DotsOCRPyTorchConfig(device="cpu")

# 2. Reduce image size before processing
from PIL import Image

image = Image.open("document.png")
image.thumbnail((2048, 2048))  # Resizes in place if larger

# 3. Use VLLM with tighter memory limits
config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",  # model is required for the VLLM config
    gpu_memory_utilization=0.7,      # Reduced from 0.85
    max_model_len=2048,              # Reduced from 4096
)
```
### Layout Parsing Errors

**Symptom:** `ValueError: Invalid layout JSON structure`

**Solution:**
```python
# Check raw output for issues
result = extractor.extract(image, include_layout=True)
if result.error:
    print(f"Extraction error: {result.error}")
    print(f"Raw output: {result.raw_output[:500]}...")

# Ensure the image is large enough
if image.size[0] < 256 or image.size[1] < 256:
    print("Image too small for reliable layout detection")
```
### Missing Layout Categories

**Symptom:** Some expected elements are not detected.

**Solutions:**
```python
# Check what was detected
detected_categories = set(
    elem.category for elem in result.layout
)
print(f"Found: {detected_categories}")

# The element may be below the confidence threshold;
# inspect the raw output for low-confidence detections
print(result.raw_output)

# Retry with higher-contrast preprocessing
from PIL import Image, ImageEnhance

image = Image.open("document.png")
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.3)
result = extractor.extract(image)
```
## DotsOCR vs Other Models

### DotsOCR vs Qwen3-VL
| Feature | DotsOCR | Qwen3-VL |
|---|---|---|
| Layout Info | Detailed (11 cats) | Basic |
| Output Format | JSON + Markdown | Markdown/HTML |
| Speed | Fast | Medium |
| Text Quality | Good | Excellent |
| Multilingual | Limited | Excellent (25+ langs) |
| Backends | PyTorch, VLLM | PyTorch, VLLM, MLX, API |
| Best For | Layout analysis | Text quality |
**Choose DotsOCR if:** You need precise layout information for post-processing.

**Choose Qwen3-VL if:** You need high-quality text in multiple languages.
## When to Use DotsOCR

Ideal scenarios:

- Academic papers with structured layouts
- Technical documents with formulas and tables
- Batch processing with layout analysis
- When you need bounding boxes for each element

Not ideal for:

- Handwritten documents (use Surya)
- Forms with complex fields (use a specialized form parser)
- Real-time single-document processing (the layout overhead outweighs the benefit)
- Documents that need custom layout categories
## API Reference

### DotsOCRTextExtractor.extract()

```python
def extract(
    image: Union[Image.Image, np.ndarray, str, Path],
    include_layout: bool = True,
    output_format: str = "markdown",
) -> DotsOCRTextOutput:
    """Extract text with layout from a document image.

    Args:
        image: Input image (PIL Image, numpy array, or path).
        include_layout: Include layout elements with bboxes (default: True).
        output_format: "markdown" or "json".

    Returns:
        DotsOCRTextOutput with layout elements and text.
    """
```
### DotsOCRTextOutput Properties

```python
result = extractor.extract(image)

# Layout information
result.layout               # List[LayoutElement]
result.has_layout           # True
result.num_layout_elements  # int

# Text content
result.content         # Full text (Markdown)
result.format          # "markdown"
result.content_length  # Character count

# Element categories
result.layout_categories  # List of the 11 categories

# Metadata
result.image_width   # Source image width
result.image_height  # Source image height
result.truncated     # True if output hit max tokens
result.error         # Error message, if any
result.raw_output    # Raw model JSON
```
### LayoutElement Properties

```python
for elem in result.layout:
    elem.category    # "Title", "Text", "Table", etc.
    elem.bbox        # [x1, y1, x2, y2] (0-1024)
    elem.text        # Content (Markdown/LaTeX/HTML); None for figures
    elem.confidence  # float (0-1), detection confidence
```
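Since every element carries a `confidence` score, thresholding is a common first post-processing step. A minimal sketch (0.5 is an arbitrary example value, not a library default):

```python
# Drop low-confidence detections; 0.5 is an example threshold,
# not a library default
MIN_CONFIDENCE = 0.5
reliable = [
    elem for elem in result.layout
    if elem.confidence >= MIN_CONFIDENCE
]
print(f"Kept {len(reliable)} of {len(result.layout)} elements")
```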
## Advanced Usage

### Post-Processing: Extract Figures

```python
# Pair each figure with its nearest caption
figures = []
captions = []
for elem in result.layout:
    if elem.category == "Figure":
        figures.append({"bbox": elem.bbox, "caption": None})
    elif elem.category == "Caption":
        captions.append(elem)

def center_y(bbox):
    """Vertical center of a [x1, y1, x2, y2] box."""
    return (bbox[1] + bbox[3]) / 2

# Simple spatial matching: assign each caption to the figure
# whose vertical center is closest to the caption's
for caption in captions:
    if figures:
        nearest = min(
            figures,
            key=lambda fig: abs(center_y(fig["bbox"]) - center_y(caption.bbox)),
        )
        nearest["caption"] = caption.text

for fig in figures:
    print(f"Figure @ {fig['bbox']}")
    print(f"  Caption: {fig['caption']}")
```
### Export to Structured Format

```python
import json

# Convert to a JSON-serializable structure
output_data = {
    "document": {
        "width": result.image_width,
        "height": result.image_height,
    },
    "elements": [
        {
            "category": elem.category,
            "bbox": {
                "x1": elem.bbox[0],
                "y1": elem.bbox[1],
                "x2": elem.bbox[2],
                "y2": elem.bbox[3],
            },
            "text": elem.text,
            "confidence": elem.confidence,
        }
        for elem in result.layout
    ],
}

# Save
with open("layout_analysis.json", "w") as f:
    json.dump(output_data, f, indent=2)
```
### Visualization with Bounding Boxes

```python
from PIL import Image, ImageDraw

# Load original image
image = Image.open("document.png")
img_w, img_h = image.size

# Create visualization
viz = image.copy()
draw = ImageDraw.Draw(viz)

# Color map for categories (anything else falls back to gray)
colors = {
    "Title": "red",
    "Text": "blue",
    "Table": "orange",
    "Formula": "purple",
    "Figure": "green",
}

# Draw bounding boxes
for elem in result.layout:
    # Convert from the 0-1024 grid to pixel coordinates
    bbox = [
        (elem.bbox[0] / 1024) * img_w,
        (elem.bbox[1] / 1024) * img_h,
        (elem.bbox[2] / 1024) * img_w,
        (elem.bbox[3] / 1024) * img_h,
    ]
    color = colors.get(elem.category, "gray")
    draw.rectangle(bbox, outline=color, width=3)
    draw.text((bbox[0], bbox[1] - 15), elem.category, fill=color)

# Save
viz.save("layout_visualization.png")
```
## See Also
- Qwen3-VL Text Extraction - For pure text quality
- DotsOCR Repository
- Comparison Guide - Model selection matrix