OmniDocs Development Roadmap¶
📦 Target Model Support¶
- Research Date: February 2026
- Status: Comprehensive model research completed
- Models Ordered By: Release date (newest first within each size category)
🎯 Quick Reference: Model Capabilities & Backend Support¶
Comprehensive Model Comparison Table¶
| Model | Size | PyTorch | VLLM | MLX | OpenAI API | Tasks | Release |
|---|---|---|---|---|---|---|---|
| DeepSeek-OCR-2 | 3B | ✅ | ✅ | ✅ | ✅ | T, O, Tab, F | Jan 2026 |
| LightOnOCR-2-1B | 1B | ✅ | ✅ | ❌ | ❌ | T, O | Jan 2026 |
| LightOnOCR-2-1B-bbox | 1B | ✅ | ✅ | ❌ | ❌ | T, L, O | Jan 2026 |
| OCRFlux-3B | 3B | ✅ | ✅ | ❌ | ❌ | T, O | Jan 2026 |
| Qwen3-VL-2B | 2B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | Oct 2025 |
| Qwen3-VL-4B | 4B | ✅ | ✅ | ✅ | ❌ | T, L, S, O, Tab | Oct 2025 |
| Qwen3-VL-8B | 8B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab, F | Oct 2025 |
| Qwen3-VL-32B | 32B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab, F | Oct 2025 |
| olmOCR-2-7B | 7B | ✅ | ✅ | ❌ | ✅ | T, O, Tab, F | Oct 2025 |
| PaddleOCR-VL | 900M | ✅ | ⚠️ | ❌ | ❌ | T, L, O, Tab, F | Oct 2025 |
| LightOnOCR-1B | 1B | ✅ | ❌ | ❌ | ❌ | T, O | Oct 2025 |
| Granite-Vision-3.3-2B | 2B | ✅ | ❌ | ❌ | ❌ | T, L, Tab, Chart | Jun 2025 |
| Gemma-3-4B-IT | 4B | ✅ | ❌ | ❌ | ✅ | T, S, O | 2025 |
| Granite-Docling-258M | 258M | ✅ | ⚠️ | ✅ | ❌ | T, L, Tab, F | Dec 2024 |
| dots.ocr | 1.7B | ✅ | ✅ | ❌ | ❌ | T, L, Tab, F, O | Dec 2024 |
| DeepSeek-OCR | 3B | ✅ | ✅ | ✅ | ✅ | T, O, Tab | Oct 2024 |
| Chandra | 9B | ✅ | ✅ | ❌ | ❌ | T, L, O, Tab, F | 2024 |
| MinerU2.5-2509-1.2B | 1.2B | ✅ | ✅ | ✅ | ❌ | T, L, Tab, F, O | Sep 2024 |
| GOT-OCR2.0 | 700M | ✅ | ❌ | ❌ | ❌ | T, O, F, Tab | Sep 2024 |
| Nanonets-OCR2-3B | 3B | ✅ | ✅ | ✅ | ❌ | T, F, O | 2024 |
| Qwen2.5-VL-3B | 3B | ✅ | ✅ | ✅ | ✅ | T, L, S, O | 2024 |
| Qwen2.5-VL-7B | 7B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | 2024 |
| Qwen2.5-VL-32B | 32B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | 2024 |
Legend:
- Tasks: T = Text Extract, L = Layout, O = OCR, S = Structured, Tab = Table, F = Formula, Chart = Chart Understanding
- ✅ = Fully supported | ⚠️ = Limited/Partial support | ❌ = Not supported
Backend Details¶
PyTorch Support¶
- All models support PyTorch via HuggingFace Transformers
- Primary development backend for all models
- Requirements: transformers>=4.46, torch>=2.0
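For orientation, here is a minimal Transformers-only sketch outside OmniDocs, using Qwen2.5-VL-3B-Instruct as the example; exact model classes, prompts, and minimum transformers versions vary per model, so treat this as a hedged template rather than a universal recipe.

```python
# Hedged sketch: direct Transformers inference with one of the models above
# (Qwen2.5-VL-3B-Instruct). Other models use different classes and prompts.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "doc.png"},
        {"type": "text", "text": "Convert this page to Markdown."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```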
VLLM Support (High-Throughput Production)¶
Fully Supported (✅):
- Qwen3-VL Series (vllm>=0.11.0)
- Qwen2.5-VL Series
- DeepSeek-OCR (official upstream)
- dots.ocr (recommended, vllm>=0.9.1)
- MinerU2.5
- olmOCR-2 (via olmOCR toolkit)
- Chandra
- LightOnOCR-2-1B (vllm>=0.11.1)
- Nanonets-OCR2-3B

Limited Support (⚠️):
- Granite-Docling-258M (untied weights required)
- PaddleOCR-VL (possible but not officially confirmed)

Not Supported (❌):
- GOT-OCR2.0
- Gemma-3-4B-IT
- LightOnOCR-1B (legacy)
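For reference, a hedged offline-inference sketch with vLLM's Python API, again using Qwen2.5-VL-3B-Instruct as the example; chat templates and minimum vLLM versions differ per model, and production deployments typically use vLLM's OpenAI-compatible server instead.

```python
# Hedged sketch: vLLM offline multimodal inference with Qwen2.5-VL-3B-Instruct.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", limit_mm_per_prompt={"image": 1})

# Qwen2.5-VL chat format; other models expect their own templates.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Convert this page to Markdown.<|im_end|>\n<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("doc.png")}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```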
MLX Support (Apple Silicon M1/M2/M3+)¶
Fully Supported via mlx-community (✅):
- Qwen3-VL Series - Collection - 2B, 4B, 8B, 32B (4-bit, 8-bit variants)
- Qwen2.5-VL Series - Collection - 3B, 7B, 32B, 72B (4-bit, 8-bit variants)
- DeepSeek-OCR - 4-bit, 8-bit
- Granite-Docling-258M - Official MLX
- MinerU2.5 - bf16
- Nanonets-OCR2-3B - 4-bit
Usage:

```bash
pip install mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Qwen3-VL-8B-Instruct-4bit \
  --prompt "Extract text from this document" --image doc.png
```
OpenAI-Compatible API Providers¶
OpenRouter (openrouter.ai):
- ✅ Qwen3-VL-235B-A22B ($0.45/$3.50 per M tokens)
- ✅ Qwen3-VL-30B-A3B
- ✅ Qwen2.5-VL-3B (SOTA visual understanding)
- ✅ Qwen2.5-VL-32B (structured outputs, math)
- ✅ Qwen2.5-VL-72B (best overall)

Novita AI (novita.ai):
- ✅ DeepSeek-OCR (Model Page)
- ✅ Qwen2.5-VL-72B (OCR + scientific reasoning)
- ✅ Qwen3-VL-8B ($0.08/$0.50 per M tokens)

Together AI (together.ai):
- ✅ Various vision-language models
- ✅ Lightweight models with multilingual support

Replicate (replicate.com):
- ✅ Vision models collection
- ✅ Pay-per-use inference

Others:
- DeepInfra: olmOCR-2-7B
- Parasail: olmOCR-2-7B
- Cirrascale: olmOCR-2-7B
API Integration Example:

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenAPIConfig

# OpenRouter
extractor = QwenTextExtractor(
    backend=QwenAPIConfig(
        model="qwen/qwen3-vl-8b-instruct",
        api_key="YOUR_OPENROUTER_KEY",
        base_url="https://openrouter.ai/api/v1"
    )
)

# Novita AI
extractor = QwenTextExtractor(
    backend=QwenAPIConfig(
        model="novita/qwen3-vl-8b-instruct",
        api_key="YOUR_NOVITA_KEY",
        base_url="https://api.novita.ai/v3/openai"
    )
)
```
Task Capability Matrix¶
| Task | Description | Model Count | Top Models |
|---|---|---|---|
| Text Extract (T) | Document → Markdown/HTML | 18 | LightOnOCR-2, Chandra, Qwen3-VL-8B |
| Layout (L) | Structure detection with bboxes | 8 | Qwen3-VL-8B, Chandra, MinerU2.5 |
| OCR (O) | Text + bbox coordinates | 15 | LightOnOCR-2, olmOCR-2, Chandra |
| Structured (S) | Schema-based extraction | 5 | Qwen3-VL (all), Qwen2.5-VL (all), Gemma-3 |
| Table (Tab) | Table detection/extraction | 12 | Qwen3-VL-8B, DeepSeek-OCR, olmOCR-2 |
| Formula (F) | Math expression recognition | 8 | Nanonets-OCR2, Qwen3-VL-8B, GOT-OCR2.0 |
Model Overview by Task Capability¶
Task Categories¶
| Task | Description | Model Count |
|---|---|---|
| text_extract | Document to Markdown/HTML conversion | 18 |
| layout | Document structure detection with bounding boxes | 8 |
| ocr | Text extraction with bbox coordinates | 6 |
| structured | Schema-based data extraction | 5 |
| table | Table detection and extraction | 4 |
| formula | Mathematical expression recognition | 3 |
🆕 Latest Models (January 2026)¶
DeepSeek-OCR-2¶
Released: January 27, 2026 | Parameters: 3B | License: MIT
HuggingFace: deepseek-ai/DeepSeek-OCR-2
Description: State-of-the-art 3B-parameter vision-language model with new DeepEncoder architecture. Unlike traditional OCR systems, DeepSeek OCR 2 focuses on image-to-text with stronger visual reasoning.
Key Features:
- DeepEncoder: 380M vision encoder (80M SAM-base + 300M CLIP-large)
- 97% precision at 10× visual token compression
- ~60% accuracy at 20× compression
- Strong document understanding beyond text extraction

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (high throughput)
- ✅ MLX (4-bit, 8-bit)
- ✅ API (Novita AI, OpenRouter)
Tasks: text_extract, ocr, table, formula
Links: - Model Card - GitHub
LightOnOCR-2-1B¶
Released: January 19, 2026 | Parameters: 1B | License: Apache 2.0
HuggingFace: lightonai/LightOnOCR-2-1B
Description: Second-generation 1B-parameter end-to-end vision-language OCR model. SOTA conversion of PDF renders to clean text without multi-stage pipelines.
Key Features:
- 83.2 on OlmOCR-Bench (SOTA, beats 9B Chandra)
- 5.7 pages/second on H100 (~493K pages/day)
- <$0.01 per 1,000 pages at cloud pricing
- Bbox variant for figure/image localization
Model Variants:
- LightOnOCR-2-1B - Default OCR
- LightOnOCR-2-1B-bbox - Best localization
- LightOnOCR-2-1B-bbox-soup - Balanced
Backends Supported: - ✅ PyTorch (transformers from source) - ✅ VLLM (vllm>=0.11.1)
Tasks: text_extract, ocr, layout (bbox variants)
OCRFlux-3B¶
Released: January 2026 | Parameters: 3B | License: Apache 2.0
HuggingFace: Fine-tuned from Qwen2.5-VL-3B-Instruct
Description: Multimodal LLM for converting PDFs and images to clean Markdown. Runs efficiently on consumer hardware (RTX 3090).
Key Features:
- Compact 3B architecture
- Clean Markdown output
- Consumer GPU compatible
- Based on Qwen2.5-VL
Backends Supported: - ✅ PyTorch - ✅ VLLM
Tasks: text_extract, ocr
🎯 Core Models (By Size & Release Date)¶
Ultra-Compact Models (<1B Parameters)¶
1. IBM Granite-Docling-258M¶
Released: December 2024 | Parameters: 258M | License: Apache 2.0
HuggingFace: ibm-granite/granite-docling-258M
Description: Ultra-compact vision-language model (VLM) for converting documents to machine-readable formats while fully preserving layout, tables, equations, and lists. Built on Idefics3 architecture with siglip2-base-patch16-512 vision encoder and Granite 165M LLM.
Key Features:
- End-to-end document understanding at 258M parameters
- Handles inline/floating math, code, table structure
- Rivals systems several times its size
- Extremely cost-effective

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ MLX (Apple Silicon) - ibm-granite/granite-docling-258M-mlx
- ✅ WebGPU - Demo Space
Integration:
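A hedged Transformers sketch for this entry, assuming the Idefics3-style processor/model classes noted above; the exact prompt string and the DocTags post-processing should be checked against the model card.

```python
# Hedged sketch: loading granite-docling-258M via Transformers (Idefics3-style).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # prompt per model card (assumption)
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[Image.open("doc.png")], return_tensors="pt").to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
# doctags is DocTags markup; the docling / docling-core libraries convert it to Markdown or JSON.
print(doctags)
```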
Dependencies: transformers, torch, pillow, docling
Tasks: text_extract, layout, table, formula
Links: - Model Card - MLX Version - Demo Space - Official Docs - Collection
2. stepfun-ai GOT-OCR2.0¶
Released: September 2024 | Parameters: 700M | License: Apache 2.0
HuggingFace: stepfun-ai/GOT-OCR2_0
Description: General OCR Theory model for multilingual OCR on plain documents, scene text, formatted documents, tables, charts, mathematical formulas, geometric shapes, molecular formulas, and sheet music.
Key Features:
- Interactive OCR with region-specific recognition (coordinate or color-based)
- Plain text OCR + formatted text OCR (markdown, LaTeX)
- Multi-page document processing
- Wide range of specialized content types

Model Variations:
- stepfun-ai/GOT-OCR2_0 - Original with custom code
- stepfun-ai/GOT-OCR-2.0-hf - HuggingFace-native transformers integration
Backends Supported: - ✅ PyTorch (HuggingFace Transformers) - ✅ Custom inference pipeline
Dependencies: transformers, torch, pillow
Tasks: text_extract, ocr, formula, table
Links: - Model Card - HF-Native Version
Compact Models (1-2B Parameters)¶
3. rednote-hilab dots.ocr¶
Released: December 2024 | Parameters: 1.7B | License: MIT
HuggingFace: rednote-hilab/dots.ocr
Description: Multilingual document parsing model built on a 1.7B LLM with SOTA performance. Provides faster inference than many high-performing models built on larger foundations.
Key Features:
- Task switching via prompt alteration only
- Competitive detection vs traditional models (DocLayout-YOLO)
- Built-in VLLM support for high throughput
- Released with paper arXiv:2512.02498

Model Variations:
- rednote-hilab/dots.ocr - Full model
- rednote-hilab/dots.ocr.base - Base variant

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended for production, vLLM 0.9.1+)
Dependencies: transformers, torch, vllm>=0.9.1 (recommended)
Tasks: text_extract, layout, table, formula, ocr
Links: - Model Card - GitHub - Live Demo - Paper - Collection
4. PaddlePaddle PaddleOCR-VL¶
Released: October 2025 | Parameters: 900M | License: Apache 2.0
HuggingFace: PaddlePaddle/PaddleOCR-VL
Description: Ultra-compact multilingual document parsing VLM with SOTA performance. Integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model.
Key Features:
- Supports 109 languages
- Excels at recognizing complex elements (text, tables, formulas, charts)
- Minimal resource consumption
- Fast inference speeds
- SOTA in page-level parsing and element-level recognition
Backends Supported: - ✅ PyTorch (HuggingFace Transformers - officially integrated) - ✅ PaddlePaddle framework
Dependencies: transformers, torch, paddlepaddle
Tasks: text_extract, layout, ocr, table, formula
Links: - Model Card - Online Demo - Collection - Transformers Docs - GitHub - PaddleOCR
5. LightOn AI LightOnOCR Series¶
Released: January 2026 (v2), October 2025 (v1) | Parameters: 1B | License: Apache 2.0
HuggingFace Models:
- lightonai/LightOnOCR-2-1B - Recommended for OCR
- lightonai/LightOnOCR-2-1B-bbox - Best localization
- lightonai/LightOnOCR-2-1B-bbox-soup - Balanced OCR + bbox
- lightonai/LightOnOCR-1B-1025 - Legacy v1
Description: Compact, end-to-end vision-language model for OCR and document understanding. State-of-the-art accuracy in its weight class while being several times faster than larger VLMs.
Key Features:
- LightOnOCR-2-1B: SOTA on OlmOCR-Bench (83.2 ± 0.9), outperforms Chandra-9B
- Performance: 3.3× faster than Chandra, 1.7× faster than OlmOCR, 5× faster than dots.ocr
- Variants: OCR-only, bbox-capable (figure/image localization), and balanced checkpoints
- Paper: arXiv:2601.14251

Model Comparison:

| Model | Use Case | Bbox Support |
|---|---|---|
| LightOnOCR-2-1B | Default for PDF→Text/Markdown | ❌ |
| LightOnOCR-2-1B-bbox | Best localization of figures/images | ✅ Best |
| LightOnOCR-2-1B-bbox-soup | Balanced OCR + localization | ✅ Balanced |
Backends Supported: - ✅ PyTorch (HuggingFace Transformers - upstream support) - ⚠️ Requires transformers from source for v2 (not yet in stable release)
Quantized Versions: - GGUF format
Dependencies: transformers>=4.48 (from source for v2), torch, pillow
Tasks: text_extract, ocr, layout (bbox variants only)
Links: - LightOnOCR-2 Blog - LightOnOCR-1 Blog - Demo Space - Paper (arXiv) - Organization
6. opendatalab MinerU2.5¶
Released: September 2024 | Parameters: 1.2B | License: Apache 2.0
HuggingFace: opendatalab/MinerU2.5-2509-1.2B
Description: Decoupled vision-language model for efficient high-resolution document parsing with state-of-the-art accuracy and low computational overhead.
Key Features:
- Two-stage parsing: global layout analysis on downsampled images → fine-grained content recognition on native-resolution crops
- Outperforms Gemini-2.5 Pro, Qwen2.5-VL-72B, GPT-4o, MonkeyOCR, dots.ocr, PP-StructureV3
- Large-scale diverse data engine for pretraining/fine-tuning
- New performance records in text, formula, table recognition, and reading order

Model Variations:
- opendatalab/MinerU2.5-2509-1.2B - Official model
- mlx-community/MinerU2.5-2509-1.2B-bf16 - MLX for Apple Silicon
- Mungert/MinerU2.5-2509-1.2B-GGUF - GGUF quantized

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (with OpenAI API specs)
- ✅ MLX (Apple Silicon)
Dependencies: transformers, torch, vllm (optional)
Tasks: text_extract, layout, table, formula, ocr
Links: - Model Card - Paper (arXiv:2509.22186) - MLX Version - GGUF Version
Small Models (2-4B Parameters)¶
7. Qwen3-VL-2B-Instruct¶
Released: October 2025 | Parameters: 2B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-2B-Instruct
Description: Multimodal LLM from Alibaba Cloud's Qwen team with comprehensive upgrades: superior text understanding/generation, deeper visual perception/reasoning, extended context, and stronger agent interaction.
Key Features:
- Dense and MoE architectures that scale from edge to cloud
- Instruct and reasoning-enhanced "Thinking" editions
- Enhanced spatial and video dynamics comprehension
- Part of Qwen3-VL multimodal retrieval framework (arXiv:2601.04720, 2026)

Model Variations:
- Qwen/Qwen3-VL-2B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-2B-Thinking - Reasoning-enhanced
- Qwen/Qwen3-VL-2B-Instruct-GGUF - Quantized GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (via mlx-community)
- ✅ API (via cloud providers)
Dependencies: transformers>=4.46, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Model Card - GitHub - Collection - GGUF Version
8. DeepSeek-OCR¶
Released: October 2024 | Parameters: ~3B | License: MIT
HuggingFace: deepseek-ai/DeepSeek-OCR
Description: High-accuracy OCR model from DeepSeek-AI for extracting text from complex visual inputs (documents, screenshots, receipts, natural scenes).
Key Features:
- Built for real-world documents: PDFs, forms, tables, handwritten/noisy text
- Outputs clean, structured Markdown
- VLLM support upstream, ~2500 tokens/s on A100 with vLLM
- Paper: arXiv:2510.18234

Model Variations:
- deepseek-ai/DeepSeek-OCR - Official BF16 (~6.7 GB)
- NexaAI/DeepSeek-OCR-GGUF - Quantized GGUF
Backends Supported: - ✅ PyTorch (HuggingFace Transformers) - ✅ VLLM (officially supported)
Requirements:
- Python 3.12.9 + CUDA 11.8
- torch==2.6.0, transformers==4.46.3, flash-attn==2.7.3
- L4 / A100 GPUs (≥16 GB VRAM)
Dependencies: transformers, torch, vllm, flash-attn, einops
Tasks: text_extract, ocr, table
Links: - Model Card - GitHub - GGUF Version - Demo Space
9. Nanonets-OCR2-3B¶
Released: 2024 | Parameters: 3B | License: Apache 2.0
HuggingFace: nanonets/Nanonets-OCR2-3B
Description: State-of-the-art image-to-markdown OCR model that transforms documents into structured markdown with intelligent content recognition and semantic tagging, optimized for LLM downstream processing.
Key Features:
- LaTeX equation recognition (inline $...$ and display $$...$$)
- Intelligent image description with structured tags (logos, charts, graphs)
- 125K context window
- ~7.53 GB model size

Model Variations:
- nanonets/Nanonets-OCR2-3B - Full BF16
- Mungert/Nanonets-OCR2-3B-GGUF - GGUF quantized
- mlx-community/Nanonets-OCR2-3B-4bit - MLX 4-bit
- yasserrmd/Nanonets-OCR2-3B - Ollama format

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ MLX (Apple Silicon)
- ✅ Ollama
Dependencies: transformers, torch, pillow
Tasks: text_extract, formula, ocr
Links: - Model Card - GGUF Version - MLX 4-bit - Ollama
10. Qwen3-VL-4B-Instruct¶
Released: October 2025 | Parameters: 4B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-4B-Instruct
Description: Mid-size Qwen3-VL model with balanced performance and efficiency. Part of comprehensive multimodal model series with text understanding, visual reasoning, and agent capabilities.
Model Variations:
- Qwen/Qwen3-VL-4B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-4B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (via mlx-community)
- ✅ API (via cloud providers)
Dependencies: transformers>=4.46, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Collection - GitHub
11. Google Gemma-3-4B-IT¶
Released: 2025 | Parameters: 4B | License: Gemma License
HuggingFace: google/gemma-3-4b-it
Description: Lightweight, state-of-the-art multimodal model from Google, built from the same research and technology as Gemini. Handles text and image input and generates text output.
Key Features:
- 128K context window
- Multilingual support (140+ languages)
- SigLIP image encoder (896×896 square images)
- Gemma-3-4B-IT beats Gemma-2-27B-IT on benchmarks

Model Variations:
- google/gemma-3-4b-it - Instruction-tuned (vision-capable)
- google/gemma-3-4b-pt - Pre-trained base
- google/gemma-3-4b-it-qat-q4_0-gguf - Quantized GGUF
- bartowski/google_gemma-3-4b-it-GGUF - Community GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ Google AI SDK
- ✅ API (Google AI Studio)
Dependencies: transformers>=4.46, torch, pillow
Tasks: text_extract, structured, ocr
Links: - Model Card - Blog Post - Transformers Docs - Google Docs - DeepMind Page
Medium Models (7-9B Parameters)¶
12. allenai olmOCR-2-7B-1025¶
Released: October 2025 | Parameters: 7B | License: Apache 2.0
HuggingFace: allenai/olmOCR-2-7B-1025
Description: State-of-the-art OCR for English-language digitized print documents. Fine-tuned from Qwen2.5-VL-7B-Instruct using olmOCR-mix-1025 dataset + GRPO RL training.
Key Features:
- 82.4 points on olmOCR-Bench (SOTA for real-world documents)
- Substantial improvements where OCR often fails (math equations, tables, tricky cases)
- Boosted via reinforcement learning (GRPO)

Model Variations:
- allenai/olmOCR-2-7B-1025 - Full BF16 version
- allenai/olmOCR-2-7B-1025-FP8 - Recommended FP8 quantization (practical use except fine-tuning)
- bartowski/allenai_olmOCR-2-7B-1025-GGUF - GGUF quantized
- richardyoung/olmOCR-2-7B-1025-GGUF - Alternative GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended via olmOCR toolkit)
- ✅ API (DeepInfra, Parasail, Cirrascale)
Best Usage: Via olmOCR toolkit with VLLM for efficient inference at scale (millions of documents).
Dependencies: transformers, torch, vllm, olmocr (toolkit)
Tasks: text_extract, ocr, table, formula
Links: - Model Card - FP8 Version - Blog Post - GGUF (bartowski)
13. Qwen3-VL-8B-Instruct¶
Released: October 2025 | Parameters: 8B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-8B-Instruct
Description: Primary model in Qwen3-VL series with optimal balance of performance and efficiency. Enhanced document parsing over Qwen2.5-VL with improved visual perception, text understanding, and advanced reasoning.
Key Features:
- Custom layout label support (flexible VLM)
- Extended context length
- Enhanced spatial and video comprehension
- Stronger agent interaction capabilities

Model Variations:
- Qwen/Qwen3-VL-8B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-8B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (Apple Silicon) - mlx-community/Qwen3-VL-8B-Instruct-4bit
- ✅ API (Novita AI, OpenRouter, etc.)

API Providers:
- Novita AI: 131K-token context, 33K-token max output
- Pricing: $0.08/M input tokens, $0.50/M output tokens
Dependencies: transformers>=4.46, torch, qwen-vl-utils, vllm (optional)
Tasks: text_extract, layout, structured, ocr, table, formula
Links: - Model Card - Collection - GitHub - MLX 4-bit
14. datalab-to Chandra¶
Released: 2024 | Parameters: 9B | License: Apache 2.0
HuggingFace: datalab-to/chandra
Description: OCR model handling complex tables, forms, and handwriting with full layout preservation. Built on Qwen3-VL for document understanding.
Key Features:
- 83.1 ± 0.9 overall on OlmOCR benchmark (outperforms DeepSeek OCR, dots.ocr, olmOCR)
- Strong grounding capabilities
- Supports 40+ languages
- Layout-aware output with bbox coordinates for every text block, table, and image
- Outputs in HTML, Markdown, and JSON with detailed layout

Use Cases:
- Handwritten forms
- Mathematical notation
- Multi-column layouts
- Complex tables

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (production throughput)

Installation: via the chandra-ocr package (see Dependencies below)
Model Variations:
- datalab-to/chandra - Official model
- noctrex/Chandra-OCR-GGUF - GGUF quantized
Dependencies: transformers, torch, vllm (optional), chandra-ocr
Tasks: text_extract, layout, ocr, table, formula
Links: - Model Card - GitHub - Blog Post - DeepWiki Docs - GGUF Version
Large Models (32B+ Parameters)¶
15. Qwen3-VL-32B-Instruct¶
Released: October 2025 | Parameters: 32B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-32B-Instruct
Description: Largest Qwen3-VL model with maximum performance for complex document understanding and multimodal reasoning tasks.
Key Features:
- Superior performance on complex documents
- Extended context length
- Enhanced reasoning capabilities
- Production-grade for demanding applications

Model Variations:
- Qwen/Qwen3-VL-32B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-32B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended for production)
- ✅ API (cloud providers)
GPU Requirements: A100 40GB+ or multi-GPU setup
Dependencies: transformers>=4.46, torch, qwen-vl-utils, vllm
Tasks: text_extract, layout, structured, ocr, table, formula
Links: - Model Card - Collection - GitHub
Specialized Models¶
16. docling-project/docling-models¶
Released: 2024 | Parameters: Various | License: Apache 2.0
HuggingFace: docling-project/docling-models
Description: Collection of models powering the Docling PDF document conversion package. Includes layout detection (RT-DETR) and table structure recognition (TableFormer).
Models Included:
1. Layout Model: RT-DETR for detecting document components - Labels: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
2. TableFormer Model: Table structure identification from images
Note: Superseded by granite-docling-258M for end-to-end document conversion (receives updates and support).
Backends Supported: - ✅ PyTorch (via Docling library)
Integration:
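A hedged sketch of the typical consumption path: these checkpoints are driven by the Docling library's converter rather than loaded directly.

```python
# Hedged sketch: the layout (RT-DETR) and TableFormer checkpoints are used
# internally by Docling's PDF conversion pipeline.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()          # fetches docling-models weights as needed
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())
```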
Dependencies: docling, transformers, torch
Tasks: layout, table
Links: - Model Card - Vision Models Docs - SmolDocling (legacy)
📦 Optional Models (Legacy/Alternative)¶
Qwen 2.5-VL Series (Previous Generation)¶
Qwen2.5-VL-3B-Instruct¶
Released: 2024 | Parameters: 3B | License: Apache 2.0
HuggingFace: Qwen/Qwen2.5-VL-3B-Instruct
Description: Previous generation Qwen VLM with strong visual understanding, agentic capabilities, video understanding (1+ hour), and structured outputs.
Key Features:
- Analyzes texts, charts, icons, graphics, layouts
- Visual agent capabilities (computer use, phone use)
- Video comprehension with temporal segment pinpointing
- ViT architecture with SwiGLU and RMSNorm
- Dynamic resolution + dynamic FPS sampling

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX
- ✅ API
Dependencies: transformers, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr
Links: - Model Card - Collection
Qwen2.5-VL-7B-Instruct¶
Released: 2024 | Parameters: 7B | License: Apache 2.0
HuggingFace: Qwen/Qwen2.5-VL-7B-Instruct
Description: Mid-size Qwen2.5-VL model with same capabilities as 3B variant but enhanced performance.
Model Variations:
- Qwen/Qwen2.5-VL-7B-Instruct - Official
- unsloth/Qwen2.5-VL-7B-Instruct-GGUF - GGUF quantized
- nvidia/Qwen2.5-VL-7B-Instruct-NVFP4 - NVIDIA FP4 optimized

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX
- ✅ API
Dependencies: transformers, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Model Card - Collection - GGUF Version
📊 Model Comparison Summary¶
By Release Date (2024-2026)¶
| Model | Release | Params | Benchmark Score |
|---|---|---|---|
| LightOnOCR-2-1B | Jan 2026 | 1B | 83.2 (OlmOCR) |
| dots.ocr | Dec 2024 | 1.7B | 79.1 (OlmOCR) |
| Granite-Docling-258M | Dec 2024 | 258M | N/A |
| Chandra | 2024 | 9B | 83.1 (OlmOCR) |
| Qwen3-VL Series | Oct 2025 | 2-32B | SOTA |
| PaddleOCR-VL | Oct 2025 | 900M | SOTA |
| olmOCR-2-7B | Oct 2025 | 7B | 82.4 (OlmOCR) |
| DeepSeek-OCR | Oct 2024 | 3B | 75.4 (OlmOCR) |
| GOT-OCR2.0 | Sep 2024 | 700M | N/A |
| MinerU2.5 | Sep 2024 | 1.2B | SOTA |
By Performance (OlmOCR-Bench)¶
| Rank | Model | Score | Params |
|---|---|---|---|
| 1 | LightOnOCR-2-1B | 83.2 ± 0.9 | 1B |
| 2 | Chandra | 83.1 ± 0.9 | 9B |
| 3 | olmOCR-2-7B | 82.4 | 7B |
| 4 | dots.ocr | 79.1 | 1.7B |
| 5 | olmOCR (v1) | 78.5 | 7B |
| 6 | DeepSeek-OCR | 75.4 ± 1.0 | 3B |
By Speed (Relative Performance)¶
| Model | Relative Speed (vs LightOnOCR-2-1B) | Params |
|---|---|---|
| LightOnOCR-2-1B | Fastest baseline | 1B |
| PaddleOCR-VL | 1.73× slower | 900M |
| DeepSeek-OCR (vLLM) | 1.73× slower | 3B |
| olmOCR-2 | 1.7× slower | 7B |
| Chandra | 3.3× slower | 9B |
| dots.ocr | 5× slower | 1.7B |
🔧 Backend Support Matrix¶
| Model | PyTorch | VLLM | MLX | API | GGUF |
|---|---|---|---|---|---|
| Granite-Docling-258M | ✅ | ⚠️ | ✅ | ❌ | ❌ |
| dots.ocr | ✅ | ✅ | ❌ | ❌ | ❌ |
| GOT-OCR2.0 | ✅ | ❌ | ❌ | ❌ | ❌ |
| PaddleOCR-VL | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| MinerU2.5 | ✅ | ✅ | ✅ | ❌ | ✅ |
| LightOnOCR-2-1B | ✅ | ✅ | ❌ | ❌ | ✅ |
| Qwen3-VL (all) | ✅ | ✅ | ✅ | ✅ | ✅ |
| DeepSeek-OCR | ✅ | ✅ | ✅ | ✅ | ✅ |
| Nanonets-OCR2-3B | ✅ | ✅ | ✅ | ❌ | ✅ |
| Gemma-3-4B-IT | ✅ | ❌ | ❌ | ✅ | ✅ |
| olmOCR-2-7B | ✅ | ✅ | ❌ | ✅ | ✅ |
| Chandra | ✅ | ✅ | ❌ | ❌ | ✅ |
| Qwen2.5-VL (all) | ✅ | ✅ | ✅ | ✅ | ✅ |
📚 Recommended Model Selection Guide¶
By Use Case¶
| Use Case | Recommended Model | Why |
|---|---|---|
| Edge/Mobile Deployment | Granite-Docling-258M | Ultra-compact (258M), MLX support |
| Fast OCR (CPU) | LightOnOCR-2-1B | Fastest in class, SOTA accuracy |
| Multilingual Documents | PaddleOCR-VL | 109 languages, minimal resources |
| High-Throughput Serving | dots.ocr + VLLM | Built for VLLM, fast inference |
| Best Accuracy (English) | LightOnOCR-2-1B or Chandra | SOTA on OlmOCR-Bench |
| Custom Layout Detection | Qwen3-VL-8B | Flexible VLM with prompt-based labels |
| Production Balanced | Qwen3-VL-8B or olmOCR-2-7B | Performance + reliability |
| Complex Documents | Chandra or Qwen3-VL-32B | Handles tables, forms, handwriting |
| Apple Silicon (M1/M2/M3) | Granite-Docling-258M (MLX) | Native MLX optimization |
| Cost-Effective API | Qwen3-VL-8B (Novita) | $0.08/M tokens input |
🚀 Quick Start Examples¶
Ultra-Compact (258M) - Granite-Docling¶
```python
from omnidocs.tasks.text_extraction import GraniteDoclingOCR, GraniteDoclingConfig

extractor = GraniteDoclingOCR(
    config=GraniteDoclingConfig(device="cuda")
)
result = extractor.extract(image, output_format="markdown")
```
Fastest OCR (1B) - LightOnOCR-2¶
```python
from omnidocs.tasks.text_extraction import LightOnOCRExtractor, LightOnOCRConfig

extractor = LightOnOCRExtractor(
    config=LightOnOCRConfig(
        model="lightonai/LightOnOCR-2-1B",
        device="cuda"
    )
)
result = extractor.extract(image, output_format="markdown")
```
High-Throughput (1.7B) - dots.ocr + VLLM¶
```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRVLLMConfig(
        model="rednote-hilab/dots.ocr",
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9
    )
)
result = extractor.extract(image, output_format="markdown")
```
Best Accuracy (7-9B) - olmOCR-2 or Chandra¶
```python
from omnidocs.tasks.text_extraction import OlmOCRExtractor, ChandraTextExtractor
from omnidocs.tasks.text_extraction.olm import OlmOCRVLLMConfig
from omnidocs.tasks.text_extraction.chandra import ChandraPyTorchConfig

# Option 1: olmOCR-2-7B with VLLM
extractor = OlmOCRExtractor(
    backend=OlmOCRVLLMConfig(
        model="allenai/olmOCR-2-7B-1025-FP8",
        tensor_parallel_size=1
    )
)

# Option 2: Chandra-9B
extractor = ChandraTextExtractor(
    backend=ChandraPyTorchConfig(
        model="datalab-to/chandra",
        device="cuda"
    )
)
```
Flexible Custom Layouts (8B) - Qwen3-VL¶
```python
from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(
    backend=QwenPyTorchConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        device="cuda"
    )
)
result = layout.extract(
    image,
    custom_labels=["code_block", "sidebar", "diagram"]
)
```
🎯 Current Focus: Layout Analysis Models¶
Phase 1: Multi-Backend VLM Integration¶
1. Qwen3-VL-8B-Instruct Integration¶
Status: 🟡 In Progress
Integrate Qwen3-VL-8B-Instruct for flexible layout detection with custom label support across all backends.
Key Features: - Enhanced document parsing over Qwen2.5-VL - Improved visual perception and text understanding - Advanced reasoning capabilities - Custom layout label support
Implementation Checklist:¶
- [ ] HuggingFace/PyTorch Backend (QwenLayoutDetector + QwenPyTorchConfig)
Model: Qwen/Qwen3-VL-8B-Instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/pytorch.py
```python
class QwenPyTorchConfig(BaseModel):
    model: str = "Qwen/Qwen3-VL-8B-Instruct"
    device: str = "cuda"
    torch_dtype: Literal["auto", "float16", "bfloat16"] = "auto"
    attn_implementation: Optional[str] = None  # "flash_attention_2" if available
    cache_dir: Optional[str] = None
```
Dependencies:
- torch, transformers
- qwen-vl-utils (model-specific utility)
Reference Implementation: See scripts/layout/modal_qwen3_vl_layout.py in the repository
Testing:
- Validate on synthetic document images
- Compare detection accuracy with ground truth
- Test custom label support
- [ ] VLLM Backend (QwenVLLMConfig)
Model: Qwen/Qwen3-VL-8B-Instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/vllm.py
```python
class QwenVLLMConfig(BaseModel):
    model: str = "Qwen/Qwen3-VL-8B-Instruct"
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None
    trust_remote_code: bool = True
```
Dependencies:
- vllm>=0.11.0 (required for Qwen3-VL support)
- torch>=2.0
Use Case: High-throughput batch processing (10+ documents/second)
Modal Config:
- GPU: A10G:1 (minimum), A100:1 (recommended for production)
- Image: VLLM GPU Image with flash-attn
Testing:
- Benchmark throughput vs PyTorch
- Validate output consistency
- Test batch processing
- [ ] MLX Backend (QwenMLXConfig)
Model: mlx-community/Qwen3-VL-8B-Instruct-4bit
Config Class: omnidocs/tasks/layout_analysis/qwen/mlx.py
```python
class QwenMLXConfig(BaseModel):
    model: str = "mlx-community/Qwen3-VL-8B-Instruct-4bit"
    quantization: Literal["4bit", "8bit"] = "4bit"
    max_tokens: int = 4096
```
Dependencies:
- mlx>=0.10
- mlx-vlm (as in the MLX usage example above)
Platform: Apple Silicon only (M1/M2/M3+)
Use Case: Local development and testing on macOS
Note: ⚠️ DO NOT deploy MLX to Modal - local development only
- [ ] API Backend (QwenAPIConfig)
Model: qwen3-vl-8b-instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/api.py
```python
class QwenAPIConfig(BaseModel):
    model: str = "novita/qwen3-vl-8b-instruct"
    api_key: str
    base_url: Optional[str] = None
    max_tokens: int = 4096
    temperature: float = 0.1
```
Provider: Novita AI
- Context Length: 131K tokens
- Max Output: 33K tokens
- Pricing:
  - Input: $0.08/M tokens
  - Output: $0.50/M tokens
Dependencies:
- litellm>=1.30
- openai>=1.0
Use Case:
- Serverless deployments
- No GPU infrastructure required
- Cost-effective for low-volume processing
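For context, the raw OpenAI-compatible request this backend would wrap might look like the following litellm sketch; the provider routing prefix and the prompt are illustrative assumptions.

```python
# Hedged sketch: calling an OpenAI-compatible provider (Novita AI here) via litellm.
import base64
import litellm

with open("doc.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = litellm.completion(
    model="openai/qwen3-vl-8b-instruct",           # routed as an OpenAI-compatible endpoint
    api_base="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_KEY",
    max_tokens=4096,
    temperature=0.1,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Return the layout elements of this page as JSON with bounding boxes."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```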
- [ ] Main Extractor Class (omnidocs/tasks/layout_analysis/qwen.py)
Implement unified QwenLayoutDetector class:
```python
from typing import Union, List, Optional
from PIL import Image

from .base import BaseLayoutExtractor
from .models import LayoutOutput
from .qwen import (
    QwenPyTorchConfig,
    QwenVLLMConfig,
    QwenMLXConfig,
    QwenAPIConfig,
)

QwenBackendConfig = Union[
    QwenPyTorchConfig,
    QwenVLLMConfig,
    QwenMLXConfig,
    QwenAPIConfig,
]


class QwenLayoutDetector(BaseLayoutExtractor):
    """Flexible VLM-based layout detector with custom label support."""

    def __init__(self, backend: QwenBackendConfig):
        self.backend_config = backend
        self._backend = self._create_backend()

    def extract(
        self,
        image: Image.Image,
        custom_labels: Optional[List[str]] = None,
    ) -> LayoutOutput:
        """
        Detect layout elements with optional custom labels.

        Args:
            image: PIL Image
            custom_labels: Optional custom layout categories
                Default: ["title", "paragraph", "table", "figure",
                          "caption", "formula", "list"]

        Returns:
            LayoutOutput with detected bounding boxes
        """
        # Implementation...
```
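One possible shape for the backend dispatch, shown only as a hedged sketch of the config-type-based selection this roadmap calls for; the backend classes and submodules referenced are hypothetical placeholders, not existing OmniDocs code.

```python
# Hedged sketch: _create_backend() selecting the implementation from the config type.
# QwenPyTorchBackend / QwenVLLMBackend / QwenMLXBackend / QwenAPIBackend and the
# .backends.* modules are hypothetical.
def _create_backend(self):
    if isinstance(self.backend_config, QwenPyTorchConfig):
        from .backends.pytorch import QwenPyTorchBackend
        return QwenPyTorchBackend(self.backend_config)
    if isinstance(self.backend_config, QwenVLLMConfig):
        from .backends.vllm import QwenVLLMBackend
        return QwenVLLMBackend(self.backend_config)
    if isinstance(self.backend_config, QwenMLXConfig):
        from .backends.mlx import QwenMLXBackend
        return QwenMLXBackend(self.backend_config)
    if isinstance(self.backend_config, QwenAPIConfig):
        from .backends.api import QwenAPIBackend
        return QwenAPIBackend(self.backend_config)
    raise ValueError(f"Unsupported backend config: {type(self.backend_config).__name__}")
```

Lazy, per-branch imports keep heavy dependencies (torch, vllm, mlx) optional for users who only need one backend.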
- [ ] Integration Tests
Test suite covering:
- All backend configurations
- Custom label functionality
- Cross-backend output consistency
- Edge cases (empty images, single elements, complex layouts)
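A hedged sketch of what one cross-backend, custom-label test could look like, assuming the detector and config classes above and a LayoutOutput.bboxes attribute whose items carry a label field (both assumptions about the eventual API):

```python
# Hedged test sketch; the result schema (result.bboxes, box.label) is an assumption.
import pytest
from PIL import Image

from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig, QwenVLLMConfig

CUSTOM_LABELS = ["title", "table", "figure"]

@pytest.mark.parametrize("config", [
    QwenPyTorchConfig(device="cuda"),
    QwenVLLMConfig(tensor_parallel_size=1),
])
def test_custom_labels_are_respected(config):
    detector = QwenLayoutDetector(backend=config)
    page = Image.new("RGB", (1024, 1448), "white")  # stand-in for a synthetic document image
    result = detector.extract(page, custom_labels=CUSTOM_LABELS)
    assert all(box.label in CUSTOM_LABELS for box in result.bboxes)
```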
- [ ] Documentation
  - API reference with examples for each backend
  - Performance comparison table (PyTorch vs VLLM vs MLX vs API)
  - Migration guide from Qwen2.5-VL
  - Custom label usage examples
- [ ] Modal Deployment Script
  Create production-ready deployment:
  - scripts/layout_omnidocs/modal_qwen_layout_vllm_online.py
  - Web endpoint for layout detection API
  - Batch processing support
  - Monitoring and logging
Phase 2: Additional Layout Models¶
2. RT-DETR Layout Detector¶
- [ ] Single-Backend Implementation (PyTorch only)
  - Model: RT-DETR (Facebook AI)
  - Fixed label support (COCO-based)
  - Real-time detection optimization
3. Surya Layout Detector¶
- [ ] Single-Backend Implementation (PyTorch only)
  - Model: vikp/surya_layout
  - Multi-language document support
  - Optimized for speed
4. Florence-2 Layout Detector¶
- [ ] Multi-Backend Implementation
  - HuggingFace/PyTorch backend
  - API backend (Microsoft Azure)
  - Object detection + dense captioning
🔮 Future Phases¶
Additional task categories will be added after layout analysis is complete:
- OCR Extraction: Surya-OCR, PaddleOCR, Qwen-OCR
- Text Extraction: VLM-based Markdown/HTML extraction
- Table Extraction: Table Transformer, Surya-Table
- Math Expression Extraction: UniMERNet, Surya-Math
- Advanced Features: Reading order, image captioning, chart understanding
- Package & Distribution: PyPI publishing, comprehensive documentation
🎯 Success Metrics (Layout Analysis)¶
Performance Targets¶
| Metric | Target | Current |
|---|---|---|
| Layout Detection Accuracy (mAP) | >90% | TBD |
| Inference Speed (PyTorch) | <2s per page | TBD |
| Inference Speed (VLLM) | <0.5s per page | TBD |
| Custom Label Support | 100% functional | TBD |
Quality Targets¶
- [ ] Type hints coverage: 100%
- [ ] Docstring coverage: 100%
- [ ] Test coverage: >80%
- [ ] All backends tested on production data
- [ ] Cross-backend output consistency validated
🔧 Infrastructure¶
Modal Deployment Standards¶
Consistency Requirements (as per CLAUDE.md):
- Volume Name: omnidocs
- Secret Name: adithya-hf-wandb
- CUDA Version: 12.4.0-devel-ubuntu22.04
- Python Version: 3.11 (3.12 for Qwen3-VL)
- Cache Directory: /data/.cache (HuggingFace)
- Model Cache: /data/omnidocs_models
- Dependency Management: .uv_pip_install() (NO version pinning)
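A hedged sketch of a Modal image/app wired to these standards; the names and paths mirror the list above, but this is illustrative rather than the project's actual deployment script.

```python
# Hedged sketch: Modal app following the consistency requirements above.
import modal

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .uv_pip_install("torch", "transformers", "qwen-vl-utils", "pillow")  # no version pinning
    .env({"HF_HOME": "/data/.cache"})
)

app = modal.App("omnidocs-layout", image=image)
volume = modal.Volume.from_name("omnidocs", create_if_missing=True)
secret = modal.Secret.from_name("adithya-hf-wandb")

@app.function(gpu="A10G", volumes={"/data": volume}, secrets=[secret], timeout=600)
def detect_layout(image_bytes: bytes) -> dict:
    # Model loading and inference would go here; weights are cached under /data/omnidocs_models.
    raise NotImplementedError
```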
GPU Configurations¶
| GPU | Use Case | Cost (est.) |
|---|---|---|
| A10G:1 | Development & Testing | $0.60/hr |
| A100:1 | Production Inference | $3.00/hr |
| A100:2 | High-Throughput VLLM | $6.00/hr |
📚 References¶
Design Documents¶
- Backend Architecture - Core design principles (see IMPLEMENTATION_PLAN/BACKEND_ARCHITECTURE.md)
- Developer Experience (DevEx) - API design and patterns (see IMPLEMENTATION_PLAN/DEVEX.md)
- Claude Development Guide - Implementation standards (see CLAUDE.md in repo root)
External Resources¶
📝 Notes¶
Implementation Order Rationale¶
- Qwen3-VL Priority: Multi-backend support demonstrates v2.0 architecture
- RT-DETR: Fast fixed-label detection for production use
- Surya: Multi-language support and speed optimization
- Florence-2: Microsoft's advanced VLM capabilities
Breaking Changes from v1.0¶
- String-based factory pattern removed (use class imports)
- Document class is now stateless (doesn't store results)
- Config classes are model-specific (not generic)
- Backend selection via config type (not string parameter)
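As an illustration of the last two points, a hedged before/after sketch; the v1.0 factory call is representative of the removed string-based pattern, not an exact signature.

```python
# v1.0 (removed): string-based factory + string backend selection (illustrative only)
# extractor = create_extractor("qwen_layout", backend="pytorch")

# v2.0: explicit class import; the backend is chosen by the config type passed in
from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(backend=QwenPyTorchConfig(device="cuda"))
```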
Last Updated: January 21, 2026 · Maintainer: Adithya S Kolavi · Version: 2.0.0-dev