Model Comparison & Selection Guide¶
A comprehensive comparison of all models available in OmniDocs to help you choose the right tool for your use case.
Text Extraction Models¶
Models for extracting text content with optional formatting (Markdown/HTML/JSON).
Feature Comparison¶
| Feature | Qwen3-VL | DotsOCR | Nanonuts |
|---|---|---|---|
| Model Size | 2B-32B | ~7B | ~7B |
| Text Quality | Excellent | Very Good | Very Good |
| Layout Info | Basic | Detailed (11 categories) | Not included |
| Speed | Medium | Fast | Fast |
| Memory | 4-40 GB | 16 GB | 12 GB |
| Multilingual | Yes (25+) | Limited | English-focused |
| Backends | PyTorch, VLLM, MLX, API | PyTorch, VLLM | PyTorch, VLLM |
| Output Formats | Markdown, HTML | Markdown (with JSON layout) | Markdown |
| License | Apache 2.0 | Open | Apache 2.0 |
Decision Matrix: Text Extraction¶
| Use Case | Best Choice | Why |
|---|---|---|
| High-quality multilingual docs | Qwen3-VL | Best text quality, many languages |
| Need layout + text | DotsOCR | Detailed layout categories with text |
| Fast, English docs | Nanonuts | Fastest, good quality for English |
| Batch processing | DotsOCR + VLLM | Good speed with detailed output |
| Cloud/API deployment | Qwen3-VL (API) | Only option with API backend |
| Apple Silicon only | Qwen3-VL (MLX) | Only VLM with MLX support |
| High-throughput batch processing | DeepSeek-OCR-2 + VLLM | ~2500 tok/s on A100, official VLLM support |
Performance Comparison¶
| Model | Speed (tok/s) | Quality | Cost |
|---|---|---|---|
| Qwen3-VL-2B | 100-150 | Good | Low (small) |
| Qwen3-VL-8B | 50-100 | Excellent | Medium |
| Qwen3-VL-32B | 20-40 | Outstanding | High |
| DotsOCR | 80-120 | Very Good | Medium |
| Nanonuts | 150-200 | Good | Medium |
| DeepSeek-OCR-2 (VLLM) | ~2500 | Excellent | Low-Medium |
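Token throughput translates into page throughput once you assume an average output length per page. A rough conversion helper (the ~800 tokens/page figure is an illustrative assumption, not an OmniDocs measurement):

```python
def pages_per_hour(tokens_per_second, tokens_per_page=800):
    """Rough page throughput implied by a token generation rate."""
    return tokens_per_second * 3600 / tokens_per_page

# At ~100 tok/s (Qwen3-VL-8B upper range): ~450 pages/hour
rate = pages_per_hour(100)
```

Plug in your own observed token rate and typical page length to size a deployment.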
Layout Analysis Models¶
Models for detecting document structure and element regions.
Feature Comparison¶
| Feature | DocLayout-YOLO | RT-DETR | Qwen Layout |
|---|---|---|---|
| Architecture | YOLOv10 | DETR | Vision-Language |
| Speed | Very Fast | Fast | Medium |
| Categories | 10 (fixed) | 12+ (fixed) | Unlimited (custom) |
| Accuracy | Good | Excellent | Excellent |
| Memory | 2-4 GB | 4-8 GB | 8-16 GB |
| Backends | PyTorch | PyTorch | PyTorch, VLLM, MLX, API |
| Custom Labels | No | No | Yes |
| GPU Required | Yes | Yes | Yes (practical) |
| Best For | Speed | Accuracy | Flexibility |
Fixed Categories Comparison¶
DocLayout-YOLO (10): Title, Plain text, Figure, Figure caption, Table, Table caption, Table footnote, Formula, Formula caption, Abandon
RT-DETR (12+): Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title, (+ Extended: Document Index, Code, Checkboxes, Forms)
Qwen Layout: Standard labels (10) + unlimited custom labels per use case
Decision Matrix: Layout Analysis¶
| Use Case | Best Choice | Why |
|---|---|---|
| Batch processing, speed critical | DocLayout-YOLO | Fastest (0.1-0.2s/page) |
| Academic papers, high precision | RT-DETR | Excellent accuracy on papers |
| Custom layout categories needed | Qwen Layout | Only option for custom labels |
| Web page layout | Qwen Layout | Better understanding of semantic regions |
| Form field detection | Qwen Layout | Can detect custom field types |
| Production pipeline | DocLayout-YOLO | Proven, fast, stable |
Speed Comparison¶
| Model | Per-Page Speed | Device |
|---|---|---|
| DocLayout-YOLO | 0.1-0.2s | Single A10 GPU |
| RT-DETR | 0.3-0.5s | Single A10 GPU |
| Qwen Layout | 2-5s | Single A10 GPU |
| Qwen Layout (VLLM) | 0.5-1.5s | 2x A10 GPU |
OCR Models¶
Models for extracting text with character/word-level bounding boxes.
Feature Comparison¶
| Feature | Tesseract | EasyOCR | PaddleOCR | Surya |
|---|---|---|---|---|
| Type | Traditional | Deep Learning | Deep Learning | Deep Learning |
| Speed | Slow (CPU) | Medium (GPU) | Very Fast | Medium |
| Languages | 100+ | 80+ | 80+ | Multi |
| Handwriting | Poor | Medium | Medium | Excellent |
| GPU Required | No | Yes | Yes | Yes |
| Memory | CPU | 4-6 GB | 2-4 GB | 6-8 GB |
| Setup | System install | Python | Python | Python |
| Accuracy | High (printed) | Good | Excellent (CJK) | Best overall |
Character Detection Accuracy¶
| Model | Latin | Asian | Handwriting |
|---|---|---|---|
| Tesseract | 95-99% | 70-80% | 30-50% |
| EasyOCR | 90-96% | 85-92% | 60-70% |
| PaddleOCR | 92-97% | 94-99% | 70-80% |
| Surya | 94-98% | 88-95% | 85-90% |
Decision Matrix: OCR¶
| Use Case | Best Choice | Why |
|---|---|---|
| Printed English docs | Tesseract | Fastest (CPU), excellent accuracy |
| Mixed scripts/languages | PaddleOCR | Best for Asian languages |
| Handwritten documents | Surya | Best handwriting support |
| Cloud deployment | EasyOCR or PaddleOCR | Easier setup than Tesseract |
| No GPU available | Tesseract | Only CPU option |
| Real-time processing | PaddleOCR | Fastest GPU inference |
Performance Comparison¶
| Model | Speed | Accuracy | Cost |
|---|---|---|---|
| Tesseract | 2-5s (CPU) | 95-99% (printed) | Free |
| EasyOCR | 1-2s (GPU) | 90-96% | Free |
| PaddleOCR | 0.3-1s (GPU) | 92-99% | Free |
| Surya | 1-3s (GPU) | 94-98% | Free |
Task-Specific Recommendations¶
Use Case: Academic Paper Processing¶
Goal: Extract text and layout from research papers
Recommended Pipeline:

1. Layout: DocLayout-YOLO (fast, accurate for papers)
2. Text: Qwen3-VL-8B (high quality, multilingual)
3. Optional: DotsOCR if detailed layout is needed
Configuration:
```python
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Fast layout detection
layout = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))

# High-quality text extraction
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Process
layout_result = layout.extract(image)
text_result = extractor.extract(image)
```
Estimated Performance: 2-3 seconds per page, high accuracy
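Merging layout and text results usually requires putting detected regions into reading order. A minimal column-aware sort, assuming each region carries an `(x0, y0, x1, y1)` bounding box (the exact OmniDocs result schema may differ, so treat this helper as a hypothetical sketch):

```python
def reading_order(boxes, page_width, columns=2):
    """Sort (x0, y0, x1, y1) boxes top-to-bottom within left-to-right columns.

    Assumes a simple multi-column page; complex layouts may need a
    more robust XY-cut or graph-based ordering.
    """
    col_width = page_width / columns

    def key(box):
        x0, y0, _, _ = box
        col = min(int(x0 // col_width), columns - 1)
        return (col, y0, x0)

    return sorted(boxes, key=key)

# Two-column page: left-column boxes come before right-column ones
boxes = [(450, 50, 800, 120), (30, 300, 400, 380), (30, 40, 400, 110)]
ordered = reading_order(boxes, page_width=850, columns=2)
```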
Use Case: Document Batch Processing¶
Goal: Extract text from 1000s of documents quickly
Recommended Pipeline:

1. Layout: DocLayout-YOLO (for batching)
2. Text: DotsOCR with VLLM (good quality, fast)
Configuration:
```python
# VLLM for batching
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRVLLMConfig(
        tensor_parallel_size=2,  # 2 GPUs
        gpu_memory_utilization=0.85,
    )
)
# Batch-process thousands of documents per hour
```
Estimated Performance: 5-10k documents/hour on 2x A10 GPU
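VLLM throughput comes from feeding requests in batches rather than one at a time. A small stdlib helper for chunking a document list into batches (how a batch is then submitted depends on the backend's API, which is not shown here):

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk

docs = [f"doc_{i}.png" for i in range(10)]
batches = list(batched(docs, batch_size=4))  # sizes: 4, 4, 2
```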
Use Case: Handwritten Document OCR¶
Goal: Extract text from handwritten documents
Recommended Pipeline:

1. OCR: Surya (best for handwriting)
2. Layout: Qwen Layout with custom labels (if needed)
Configuration:
```python
from omnidocs.tasks.ocr_extraction import SuryaOCR, SuryaOCRConfig

ocr = SuryaOCR(config=SuryaOCRConfig(
    languages=["en"],
    det_model="en",
))
result = ocr.extract(image)
```
Estimated Performance: 85%+ handwriting accuracy
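Handwriting OCR output is noisier than printed text, so filtering low-confidence words before downstream use is common. A sketch assuming each recognized word exposes `text` and `confidence` fields (field names vary by library, so this is an illustrative shape, not the Surya result schema):

```python
def filter_words(words, min_confidence=0.6):
    """Keep only words at or above the confidence threshold."""
    return [w for w in words if w["confidence"] >= min_confidence]

words = [
    {"text": "invoice", "confidence": 0.93},
    {"text": "t0tal", "confidence": 0.41},  # likely misread
    {"text": "amount", "confidence": 0.78},
]
kept = filter_words(words)
```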
Use Case: Form Field Extraction¶
Goal: Extract text from form documents with custom field types
Recommended Pipeline:

1. Layout: Qwen Layout with custom labels for field types
2. OCR: Tesseract or EasyOCR per field
Configuration:
```python
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

custom_labels = [
    CustomLabel(name="text_field"),
    CustomLabel(name="checkbox"),
    CustomLabel(name="signature_line"),
]

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image, custom_labels=custom_labels)
```
Estimated Performance: Form processing in 5-10 seconds
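After layout detection, each OCR word still has to be matched to the field box that contains it. A minimal center-point containment check, assuming `(x0, y0, x1, y1)` boxes (a hypothetical helper, not part of the OmniDocs API):

```python
def assign_words_to_fields(fields, words):
    """Map field name -> words whose center falls inside the field box."""
    result = {name: [] for name, _ in fields}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for name, (fx0, fy0, fx1, fy1) in fields:
            if fx0 <= cx <= fx1 and fy0 <= cy <= fy1:
                result[name].append(text)
                break
    return result

fields = [("name", (0, 0, 200, 40)), ("date", (210, 0, 400, 40))]
words = [("Alice", (10, 10, 60, 30)), ("2024-01-15", (220, 8, 340, 32))]
assigned = assign_words_to_fields(fields, words)
```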
Use Case: Multilingual Document Processing¶
Goal: Process documents in 20+ languages
Recommended Pipeline:

1. Text: Qwen3-VL-8B (25+ languages)
2. Fallback: PaddleOCR (80+ languages) for Asian scripts
Configuration:
```python
# Qwen handles most languages
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
```
Supported: English, French, German, Spanish, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Portuguese, Dutch, Polish, Turkish, Greek, Thai, Vietnamese, and more.
Use Case: Real-Time Document Processing¶
Goal: Process documents with <1 second latency
Recommended Pipeline:

1. Layout: DocLayout-YOLO (0.1-0.2s)
2. Text: Fast OCR or a small VLM
Configuration:
```python
# Fastest layout detection
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(
    device="cuda",
    img_size=768,  # Smaller for speed
))
result = extractor.extract(image)  # <200ms
```
Estimated Performance: 0.2-1 second per page with layout only
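Sub-second budgets are easiest to enforce with an explicit timer around each pipeline stage. A stdlib sketch (the 200 ms budget mirrors the layout figure above; the actual extract call is stubbed out):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(budget_ms, timings, name):
    """Record a stage's latency and whether it stayed within budget."""
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    timings[name] = (elapsed_ms, elapsed_ms <= budget_ms)

timings = {}
with stage_timer(200, timings, "layout"):
    pass  # layout.extract(image) would run here
```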
Use Case: Cloud Deployment (No GPU)¶
Goal: Deploy document processing in serverless environment
Recommended Pipeline:

1. Layout: Not practical (needs GPU)
2. Text: Use the API backend or Tesseract (CPU)
3. OCR: Tesseract (CPU-friendly)
Configuration:
```python
import os

# API-based for cloud
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig

extractor = QwenTextExtractor(
    backend=QwenTextAPIConfig(
        model="qwen3-vl-8b",
        api_key=os.getenv("QWEN_API_KEY"),
    )
)

# CPU-based OCR
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))
```
Estimated Cost: $0.01-0.10 per document via API
Performance Summary Table¶
Text Extraction Speed (batch workloads)¶
| Model | 1 GPU (tok/s) | 2 GPUs, VLLM (tok/s) |
|---|---|---|
| Qwen3-VL-2B | 100-150 | 250-350 |
| Qwen3-VL-8B | 50-100 | 150-250 |
| DotsOCR | 80-120 | 200-300 |
| Nanonuts | 150-200 | 400-500 |
Layout Detection Speed¶
| Model | Speed | Device |
|---|---|---|
| DocLayout-YOLO | 0.1-0.2s | A10 GPU |
| RT-DETR | 0.3-0.5s | A10 GPU |
| Qwen Layout (PyTorch) | 2-5s | A10 GPU |
| Qwen Layout (VLLM) | 0.5-1.5s | 2x A10 GPU |
OCR Speed¶
| Model | Speed (1024x768) | Device |
|---|---|---|
| Tesseract | 2-3s | CPU |
| EasyOCR | 1-2s | GPU |
| PaddleOCR | 0.3-1s | GPU |
| Surya | 1-3s | GPU |
Memory Requirements Summary¶
VRAM Requirements¶
| Task | Minimal | Recommended | Optimal |
|---|---|---|---|
| Text (Qwen-8B) | 8 GB | 16 GB | 24 GB |
| Layout (DocLayout) | 2 GB | 4 GB | 8 GB |
| Layout (Qwen) | 8 GB | 16 GB | 24 GB |
| OCR (GPU-based) | 2 GB | 4 GB | 8 GB |
| Multi-task pipeline | 16 GB | 32 GB | 40 GB |
CPU Requirements¶
| Model | CPU Load | Parallelization |
|---|---|---|
| Tesseract | Medium | Thread-based (4+ cores) |
| EasyOCR | Light | Not parallelizable |
| DotsOCR | Light | GPU-bound |
Cost Analysis¶
Deployment Costs (per million documents)¶
| Strategy | GPU Cost | Model Cost | Total |
|---|---|---|---|
| Self-hosted (PyTorch) | $500/month | Free | $6k/year |
| Self-hosted (VLLM batch) | $1000/month | Free | $12k/year |
| API-based | None | $1-2/doc | $1-2M |
| Hybrid (API + cached) | Minimal | $0.1-0.5/doc | $100k-500k |
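The self-hosted vs. API trade-off hinges on monthly volume. A quick break-even calculation using illustrative figures (plug in your own GPU and API prices; the document quotes a range of per-document API costs):

```python
def break_even_docs(gpu_cost_per_month, api_cost_per_doc):
    """Monthly document count above which self-hosting is cheaper."""
    return gpu_cost_per_month / api_cost_per_doc

# With a $500/month GPU and $1 per API document,
# self-hosting wins past 500 documents/month.
threshold = break_even_docs(500.0, 1.0)
```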
Development Time¶
| Task | Effort | Models Needed |
|---|---|---|
| Simple extraction | 1 hour | 1 (any VLM) |
| Layout + text | 2-4 hours | 2 (layout + text) |
| Custom layout | 4-8 hours | Qwen layout + fine-tuning |
| Production pipeline | 1-2 weeks | 3+ with batching, caching |
Frequently Asked Questions¶
Q: Which model is fastest?¶
A: DocLayout-YOLO for layout (0.1-0.2s), PaddleOCR for OCR (0.3-1s), Nanonuts for text (150-200 tok/s)
Q: Which is most accurate?¶
A: Qwen3-VL-32B for text, Surya for handwriting, RT-DETR for layout
Q: Which requires least GPU?¶
A: DocLayout-YOLO (2-4 GB), Tesseract (CPU-only)
Q: Which supports most languages?¶
A: Tesseract (100+), PaddleOCR and EasyOCR (80+), Qwen (25+)
Q: Which is cheapest to run?¶
A: Tesseract (free, CPU), DocLayout-YOLO (small GPU model)
Q: Best for real-time (sub-second)?¶
A: DocLayout-YOLO for layout only, or PaddleOCR for OCR
Q: Best for batch processing?¶
A: DotsOCR or Qwen with VLLM (2-4 GPUs)
Q: Can I run without GPU?¶
A: Yes - Tesseract (OCR) and API backends (text)
Q: Which is easiest to set up?¶
A: Qwen with PyTorch (single pip install)
Q: Production recommendation?¶
A: DocLayout-YOLO + Qwen3-VL-8B on 2x A10 GPU
Migration Guide¶
From Tesseract to Modern OCR¶
```python
# Old: Tesseract only
from omnidocs.tasks.ocr_extraction import TesseractOCR

# New: choose based on use case
from omnidocs.tasks.ocr_extraction import (
    TesseractOCR,  # Printed text, many languages
    PaddleOCR,     # Speed, Asian languages
    SuryaOCR,      # Handwriting
)
```
From Single-Model to Pipeline¶
```python
# Old: text extraction only
text = extract_text(image)

# New: layout + text pipeline
layout = detect_layout(image)  # Understand structure
text = extract_text(image)     # Extract content
# Combine results for better processing
```
From CPU to GPU¶
```python
# Old: CPU-based
ocr = TesseractOCR()  # 2-3s per page

# New: GPU-accelerated
ocr = PaddleOCR()  # 0.3-1s per page (~10x faster)
```