Skip to content

DotsOCR Text Extraction

Model Overview

DotsOCR (Deep Object Text Segmentation OCR) is a specialized Vision-Language Model designed specifically for document understanding with built-in layout analysis. Unlike general-purpose VLMs, DotsOCR outputs structured information about document layout while extracting text content.

Model ID: rednote-hilab/dots.ocr Repository: DotsOCR on HuggingFace Architecture: Vision Encoder + Language Model Training Focus: Academic papers, technical documents, PDFs

Key Capabilities

  • Layout-Aware Extraction: Detects 11 document element categories with bounding boxes
  • Multi-Format Text: Different formats per category (Markdown, LaTeX, HTML)
  • Fast Inference: 50-100% faster than general-purpose VLMs
  • Normalized Coordinates: All bboxes in 0-1024 range (scale-independent)
  • Reading Order: Maintains document reading order
  • Format-Specific Output:
  • Text/Title/Section-header: Markdown
  • Formula: LaTeX
  • Table: HTML
  • Picture: Bounding box only (no text)

Limitations

  • PyTorch and VLLM backends only (no MLX, no API)
  • Optimized for academic/technical documents (less good for forms, invoices)
  • Fixed layout categories (cannot add custom categories)
  • Requires GPU (minimum 16GB VRAM for 8B variant)
  • Output is JSON-focused (not raw markdown like Qwen)

Supported Backends

DotsOCR supports 2 inference backends:

Backend Use Case Performance Setup
PyTorch Single document, development 50-100 tok/s Simple GPU setup
VLLM Batch processing, production 150-300 tok/s Multi-GPU cluster

No MLX or API backends available.


Installation & Configuration

Basic Installation

# Install with PyTorch backend
pip install omnidocs[pytorch]

# Or with VLLM for batching
pip install omnidocs[vllm]

PyTorch Backend Configuration

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
    torch_dtype="bfloat16",
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Recommended
)

extractor = DotsOCRTextExtractor(backend=config)

PyTorch Config Parameters:

Parameter Type Default Description
model str "rednote-hilab/dots.ocr" HuggingFace model ID
device str "cuda" Device: "cuda", "mps", "cpu"
torch_dtype str "bfloat16" Data type: "float16", "bfloat16", "float32"
trust_remote_code bool True Allow custom model code from HuggingFace
device_map str "auto" Model parallelism: "auto", "balanced", "sequential"
attn_implementation str "flash_attention_2" Attention type: "eager", "flash_attention_2", "sdpa"

VLLM Backend Configuration

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=1,  # Use 2+ for large models
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

extractor = DotsOCRTextExtractor(backend=config)

VLLM Config Parameters:

Parameter Type Default Description
model str Required HuggingFace model ID
tensor_parallel_size int 1 Number of GPUs for parallelism
gpu_memory_utilization float 0.85 GPU memory usage (0.1-1.0)
max_model_len int None Max context length in tokens

Layout Categories (11 Fixed)

DotsOCR recognizes exactly 11 layout element categories:

Category Description Text Format Typical Content
Title Document/section title Markdown "Introduction", "Chapter 2"
Section-header Subsection heading Markdown "3.1 Method Overview"
Text Body paragraph Markdown Main content paragraphs
List-item Bulleted/numbered item Markdown "1. First point", "• Item"
Table Tabular data HTML <table><tr><td>...</td>...
Formula Mathematical equation LaTeX $E=mc^2$ or display math
Figure Image/figure/diagram None Bounding box only
Caption Figure/table caption Markdown "Fig 1: System Overview"
Footnote Footer note Markdown Explanatory footnotes
Page-header Page header text Markdown Page number, document title
Page-footer Page footer text Markdown Page number, author name

Usage Examples

Basic Layout-Aware Extraction

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
from PIL import Image

# Initialize extractor
config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
)
extractor = DotsOCRTextExtractor(backend=config)

# Load document
image = Image.open("paper.png")

# Extract with layout information
result = extractor.extract(
    image,
    include_layout=True,  # Returns DotsOCRTextOutput
)

# Access layout elements
print(f"Found {result.num_layout_elements} layout elements")
for elem in result.layout:
    print(f"  {elem.category} @ {elem.bbox}: {elem.text[:50]}...")

Output Format Examples

# Default: DotsOCRTextOutput with layout
result = extractor.extract(image)

# Access structured layout
for elem in result.layout:
    category = elem.category  # "Title", "Text", "Table", etc.
    bbox = elem.bbox          # [x1, y1, x2, y2] (0-1024 normalized)
    text = elem.text          # Content (formatted per category)
    confidence = elem.confidence  # Detection confidence

print(result.content)        # Full text (Markdown)
print(result.format)         # "markdown" (fixed)
print(result.has_layout)     # True
print(result.content_length) # Total character count
print(result.image_width)    # Source image width
print(result.image_height)   # Source image height

Category-Specific Processing

# Extract only formulas (as LaTeX)
formulas = [
    elem for elem in result.layout
    if elem.category == "Formula"
]

for formula in formulas:
    print(f"Formula @ {formula.bbox}:")
    print(formula.text)  # LaTeX format
    print()

# Extract tables (as HTML)
tables = [
    elem for elem in result.layout
    if elem.category == "Table"
]

for table in tables:
    print(f"Table @ {table.bbox}:")
    print(table.text)  # HTML table
    print()

# Extract all text content (non-figure)
text_elements = [
    elem for elem in result.layout
    if elem.category not in ["Figure", "Page-header", "Page-footer"]
]

full_text = "\n".join(elem.text for elem in text_elements)
print(full_text)  # Cleaned text without layout markers

Bounding Box Operations

# Access normalized bounding boxes (0-1024 scale)
for elem in result.layout:
    x1, y1, x2, y2 = elem.bbox
    width = x2 - x1
    height = y2 - y1
    area = width * height

    print(f"{elem.category}: {width}x{height} at ({x1}, {y1})")

# Filter elements by region (e.g., top half of page)
top_half = [
    elem for elem in result.layout
    if elem.bbox[1] < 512  # y1 < midpoint
]

# Filter by size
large_elements = [
    elem for elem in result.layout
    if (elem.bbox[2] - elem.bbox[0]) * (elem.bbox[3] - elem.bbox[1]) > 102400
]

Batch Processing with VLLM

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig
from PIL import Image
import json

# Initialize with VLLM
config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
extractor = DotsOCRTextExtractor(backend=config)

# Process multiple documents
documents = ["doc1.png", "doc2.png", "doc3.png"]
results = []

for doc_path in documents:
    image = Image.open(doc_path)
    result = extractor.extract(image, include_layout=True)

    results.append({
        "file": doc_path,
        "elements": len(result.layout),
        "content_length": result.content_length,
        "layout": [
            {
                "category": elem.category,
                "bbox": elem.bbox,
                "text_length": len(elem.text) if elem.text else 0,
            }
            for elem in result.layout
        ]
    })

# Save results
with open("extraction_results.json", "w") as f:
    json.dump(results, f, indent=2)

Performance Characteristics

Memory Requirements

Model Framework VRAM Batch Size
DotsOCR PyTorch 16 GB 1 (single doc)
DotsOCR VLLM 20 GB 2-4
DotsOCR VLLM (2-GPU) 20 GB (per GPU) 6-10

Inference Speed

Setup Speed Throughput
PyTorch (single A10) 50-80 tok/s ~400-600 chars/s
VLLM (single A10) 150-200 tok/s ~1200-1600 chars/s
VLLM (2x A10) 250-350 tok/s ~2000-2800 chars/s

Typical Processing Times

Document Tokens Time (PyTorch) Time (VLLM)
Single page 1000-2000 12-25s 5-10s
5 pages 5000-10000 60-130s 20-40s
10 pages 10000-20000 130-260s 40-80s

Troubleshooting

Memory Errors

Symptom: RuntimeError: CUDA out of memory

Solutions:

# 1. Use CPU (slow but works)
config = DotsOCRPyTorchConfig(device="cpu")

# 2. Reduce image size before processing
from PIL import Image
image = Image.open("document.png")
image.thumbnail((2048, 2048))  # Resize if larger

# 3. Use VLLM with memory management
config = DotsOCRVLLMConfig(
    gpu_memory_utilization=0.7,  # Reduced from 0.85
    max_model_len=2048,  # Reduced from 4096
)

Layout Parsing Errors

Symptom: ValueError: Invalid layout JSON structure

Solution:

# Check raw output for issues
result = extractor.extract(image, include_layout=True)

if result.error:
    print(f"Extraction error: {result.error}")
    print(f"Raw output: {result.raw_output[:500]}...")

# Ensure image is valid
if image.size[0] < 256 or image.size[1] < 256:
    print("Image too small for reliable layout detection")

Missing Layout Categories

Symptom: Some expected elements not detected

Solutions:

# Check what was detected
detected_categories = set(
    elem.category for elem in result.layout
)
print(f"Found: {detected_categories}")

# Element may be below confidence threshold
# Access raw output to see low-confidence detections
print(result.raw_output)

# Try with different preprocessing
from PIL import ImageEnhance
image = Image.open("document.png")
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.3)
result = extractor.extract(image)

DotsOCR vs Other Models

DotsOCR vs Qwen3-VL

Feature DotsOCR Qwen3-VL
Layout Info Detailed (11 cats) Basic
Output Format JSON + Markdown Markdown/HTML
Speed Fast Medium
Text Quality Good Excellent
Multilingual Limited Excellent (25+ langs)
Backends PyTorch, VLLM PyTorch, VLLM, MLX, API
Best For Layout analysis Text quality

Choose DotsOCR if: You need precise layout information for post-processing Choose Qwen if: You need high-quality text in multiple languages

When to Use DotsOCR

Ideal scenarios: - Academic papers with structured layouts - Technical documents with formulas and tables - Batch processing with layout analysis - When you need bounding boxes for each element

Not ideal for: - Handwritten documents (use Surya) - Forms with complex fields (use specialized form parser) - Real-time single-document processing (overhead > benefit) - Custom layout categories needed


API Reference

DotsOCRTextExtractor.extract()

def extract(
    image: Union[Image.Image, np.ndarray, str, Path],
    include_layout: bool = True,
    output_format: str = "markdown",
) -> DotsOCRTextOutput:
    """
    Extract text with layout from document image.

    Args:
        image: Input image (PIL Image, numpy array, or path)
        include_layout: Include layout elements with bboxes (default: True)
        output_format: "markdown" or "json" (fixed)

    Returns:
        DotsOCRTextOutput with layout elements and text
    """

DotsOCRTextOutput Properties

result = extractor.extract(image)

# Layout information
result.layout                   # List[LayoutElement]
result.has_layout              # True
result.num_layout_elements     # int

# Text content
result.content                 # Full text (Markdown)
result.format                  # "markdown"
result.content_length          # Characters

# Element categories
result.layout_categories       # List of 11 categories

# Metadata
result.image_width            # Source image width
result.image_height           # Source image height
result.truncated              # Output hit max tokens
result.error                  # Error message if any
result.raw_output             # Raw model JSON

LayoutElement Properties

for elem in result.layout:
    elem.category        # "Title", "Text", "Table", etc.
    elem.bbox            # [x1, y1, x2, y2] (0-1024)
    elem.text            # Content (Markdown/LaTeX/HTML)
    elem.confidence      # float (0-1) - detection confidence

Advanced Usage

Post-Processing: Extract Figures

# Get all figures with their captions
figures = {}
for elem in result.layout:
    if elem.category == "Figure":
        bbox = elem.bbox
        figures[str(bbox)] = {
            "bbox": bbox,
            "caption": None,
        }
    elif elem.category == "Caption":
        # Find nearest figure
        # (could implement spatial matching here)
        pass

for fig_bbox, fig_data in figures.items():
    print(f"Figure @ {fig_data['bbox']}")
    print(f"  Caption: {fig_data['caption']}")

Export to Structured Format

import json
from dataclasses import asdict

# Convert to JSON-serializable format
output_data = {
    "document": {
        "width": result.image_width,
        "height": result.image_height,
    },
    "elements": [
        {
            "category": elem.category,
            "bbox": {
                "x1": elem.bbox[0],
                "y1": elem.bbox[1],
                "x2": elem.bbox[2],
                "y2": elem.bbox[3],
            },
            "text": elem.text,
            "confidence": elem.confidence,
        }
        for elem in result.layout
    ]
}

# Save
with open("layout_analysis.json", "w") as f:
    json.dump(output_data, f, indent=2)

Visualization with Bounding Boxes

from PIL import Image, ImageDraw

# Load original image
image = Image.open("document.png")
img_w, img_h = image.size

# Create visualization
viz = image.copy()
draw = ImageDraw.Draw(viz)

# Color map for categories
colors = {
    "Title": "red",
    "Text": "blue",
    "Table": "orange",
    "Formula": "purple",
    "Figure": "green",
}

# Draw bounding boxes
for elem in result.layout:
    # Convert from 0-1024 to pixel coordinates
    bbox = [
        (elem.bbox[0] / 1024) * img_w,
        (elem.bbox[1] / 1024) * img_h,
        (elem.bbox[2] / 1024) * img_w,
        (elem.bbox[3] / 1024) * img_h,
    ]

    color = colors.get(elem.category, "gray")
    draw.rectangle(bbox, outline=color, width=3)
    draw.text((bbox[0], bbox[1] - 15), elem.category, fill=color)

# Save
viz.save("layout_visualization.png")

See Also