OmniDocs Development Roadmap¶
📦 Target Model Support¶
- Research Date: February 2026
- Status: Comprehensive model research completed
- Models Ordered By: Release date (newest first within each size category)
🎯 Quick Reference: Model Capabilities & Backend Support¶
Comprehensive Model Comparison Table¶
| Model | Size | PyTorch | VLLM | MLX | OpenAI API | Tasks | Release |
|---|---|---|---|---|---|---|---|
| DeepSeek-OCR-2 | 3B | ✅ | ✅ | ✅ | ✅ | T, O, Tab, F | Jan 2026 |
| LightOnOCR-2-1B | 1B | ✅ | ✅ | ❌ | ❌ | T, O | Jan 2026 |
| LightOnOCR-2-1B-bbox | 1B | ✅ | ✅ | ❌ | ❌ | T, L, O | Jan 2026 |
| OCRFlux-3B | 3B | ✅ | ✅ | ❌ | ❌ | T, O | Jan 2026 |
| Qwen3-VL-2B | 2B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | Oct 2025 |
| Qwen3-VL-4B | 4B | ✅ | ✅ | ✅ | ❌ | T, L, S, O, Tab | Oct 2025 |
| Qwen3-VL-8B | 8B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab, F | Oct 2025 |
| Qwen3-VL-32B | 32B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab, F | Oct 2025 |
| olmOCR-2-7B | 7B | ✅ | ✅ | ❌ | ✅ | T, O, Tab, F | Oct 2025 |
| PaddleOCR-VL | 900M | ✅ | ⚠️ | ❌ | ❌ | T, L, O, Tab, F | Oct 2025 |
| LightOnOCR-1B | 1B | ✅ | ❌ | ❌ | ❌ | T, O | Oct 2025 |
| Granite-Vision-3.3-2B | 2B | ✅ | ❌ | ❌ | ❌ | T, L, Tab, Chart | Jun 2025 |
| Gemma-3-4B-IT | 4B | ✅ | ❌ | ❌ | ✅ | T, S, O | 2025 |
| Granite-Docling-258M | 258M | ✅ | ⚠️ | ✅ | ❌ | T, L, Tab, F | Dec 2024 |
| dots.ocr | 1.7B | ✅ | ✅ | ❌ | ❌ | T, L, Tab, F, O | Dec 2024 |
| DeepSeek-OCR | 3B | ✅ | ✅ | ✅ | ✅ | T, O, Tab | Oct 2024 |
| Chandra | 9B | ✅ | ✅ | ❌ | ❌ | T, L, O, Tab, F | 2024 |
| MinerU2.5-2509-1.2B | 1.2B | ✅ | ✅ | ✅ | ❌ | T, L, Tab, F, O | Sep 2024 |
| GOT-OCR2.0 | 700M | ✅ | ❌ | ❌ | ❌ | T, O, F, Tab | Sep 2024 |
| Nanonets-OCR2-3B | 3B | ✅ | ✅ | ✅ | ❌ | T, F, O | 2024 |
| Qwen2.5-VL-3B | 3B | ✅ | ✅ | ✅ | ✅ | T, L, S, O | 2024 |
| Qwen2.5-VL-7B | 7B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | 2024 |
| Qwen2.5-VL-32B | 32B | ✅ | ✅ | ✅ | ✅ | T, L, S, O, Tab | 2024 |
Legend:
- Tasks: T = Text Extract, L = Layout, O = OCR, S = Structured, Tab = Table, F = Formula, Chart = Chart Understanding
- ✅ = Fully supported | ⚠️ = Limited/Partial support | ❌ = Not supported
Backend Details¶
PyTorch Support¶
- All models support PyTorch via HuggingFace Transformers
- Primary development backend for all models
- Requirements: transformers>=4.46, torch>=2.0
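For orientation, here is a minimal Transformers-only sketch outside OmniDocs, using Qwen2.5-VL-3B-Instruct as the example; exact model classes, prompts, and minimum transformers versions vary per model, so treat this as a hedged template rather than a universal recipe.

```python
# Hedged sketch: direct Transformers inference with one of the models above
# (Qwen2.5-VL-3B-Instruct). Other models use different classes and prompts.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "doc.png"},
        {"type": "text", "text": "Convert this page to Markdown."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```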
VLLM Support (High-Throughput Production)¶
Fully Supported (✅):
- Qwen3-VL Series (vllm>=0.11.0)
- Qwen2.5-VL Series
- DeepSeek-OCR (official upstream)
- dots.ocr (recommended, vllm>=0.9.1)
- MinerU2.5
- olmOCR-2 (via olmOCR toolkit)
- Chandra
- LightOnOCR-2-1B (vllm>=0.11.1)
- Nanonets-OCR2-3B

Limited Support (⚠️):
- Granite-Docling-258M (untied weights required)
- PaddleOCR-VL (possible but not officially confirmed)

Not Supported (❌):
- GOT-OCR2.0
- Gemma-3-4B-IT
- LightOnOCR-1B (legacy)
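For reference, a hedged offline-inference sketch with vLLM's Python API, again using Qwen2.5-VL-3B-Instruct as the example; chat templates and minimum vLLM versions differ per model, and production deployments typically use vLLM's OpenAI-compatible server instead.

```python
# Hedged sketch: vLLM offline multimodal inference with Qwen2.5-VL-3B-Instruct.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", limit_mm_per_prompt={"image": 1})

# Qwen2.5-VL chat format; other models expect their own templates.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Convert this page to Markdown.<|im_end|>\n<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("doc.png")}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```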
MLX Support (Apple Silicon M1/M2/M3+)¶
Fully Supported via mlx-community (✅):
- Qwen3-VL Series - Collection - 2B, 4B, 8B, 32B (4-bit, 8-bit variants)
- Qwen2.5-VL Series - Collection - 3B, 7B, 32B, 72B (4-bit, 8-bit variants)
- DeepSeek-OCR - 4-bit, 8-bit
- Granite-Docling-258M - Official MLX
- MinerU2.5 - bf16
- Nanonets-OCR2-3B - 4-bit
Usage:

```bash
pip install mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Qwen3-VL-8B-Instruct-4bit \
  --prompt "Extract text from this document" --image doc.png
```
OpenAI-Compatible API Providers¶
OpenRouter (openrouter.ai):
- ✅ Qwen3-VL-235B-A22B ($0.45/$3.50 per M tokens)
- ✅ Qwen3-VL-30B-A3B
- ✅ Qwen2.5-VL-3B (SOTA visual understanding)
- ✅ Qwen2.5-VL-32B (structured outputs, math)
- ✅ Qwen2.5-VL-72B (best overall)

Novita AI (novita.ai):
- ✅ DeepSeek-OCR (Model Page)
- ✅ Qwen2.5-VL-72B (OCR + scientific reasoning)
- ✅ Qwen3-VL-8B ($0.08/$0.50 per M tokens)

Together AI (together.ai):
- ✅ Various vision-language models
- ✅ Lightweight models with multilingual support

Replicate (replicate.com):
- ✅ Vision models collection
- ✅ Pay-per-use inference

Others:
- DeepInfra: olmOCR-2-7B
- Parasail: olmOCR-2-7B
- Cirrascale: olmOCR-2-7B
API Integration Example:

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenAPIConfig

# OpenRouter
extractor = QwenTextExtractor(
    backend=QwenAPIConfig(
        model="qwen/qwen3-vl-8b-instruct",
        api_key="YOUR_OPENROUTER_KEY",
        base_url="https://openrouter.ai/api/v1"
    )
)

# Novita AI
extractor = QwenTextExtractor(
    backend=QwenAPIConfig(
        model="novita/qwen3-vl-8b-instruct",
        api_key="YOUR_NOVITA_KEY",
        base_url="https://api.novita.ai/v3/openai"
    )
)
```
Task Capability Matrix¶
| Task | Description | Model Count | Top Models |
|---|---|---|---|
| Text Extract (T) | Document → Markdown/HTML | 18 | LightOnOCR-2, Chandra, Qwen3-VL-8B |
| Layout (L) | Structure detection with bboxes | 8 | Qwen3-VL-8B, Chandra, MinerU2.5 |
| OCR (O) | Text + bbox coordinates | 15 | LightOnOCR-2, olmOCR-2, Chandra |
| Structured (S) | Schema-based extraction | 5 | Qwen3-VL (all), Qwen2.5-VL (all), Gemma-3 |
| Table (Tab) | Table detection/extraction | 12 | Qwen3-VL-8B, DeepSeek-OCR, olmOCR-2 |
| Formula (F) | Math expression recognition | 8 | Nanonets-OCR2, Qwen3-VL-8B, GOT-OCR2.0 |
Model Overview by Task Capability¶
Task Categories¶
| Task | Description | Model Count |
|---|---|---|
| text_extract | Document to Markdown/HTML conversion | 18 |
| layout | Document structure detection with bounding boxes | 8 |
| ocr | Text extraction with bbox coordinates | 6 |
| structured | Schema-based data extraction | 5 |
| table | Table detection and extraction | 4 |
| formula | Mathematical expression recognition | 3 |
🆕 Latest Models (January 2026)¶
DeepSeek-OCR-2¶
Released: January 27, 2026 | Parameters: 3B | License: MIT
HuggingFace: deepseek-ai/DeepSeek-OCR-2
Description: State-of-the-art 3B-parameter vision-language model with new DeepEncoder architecture. Unlike traditional OCR systems, DeepSeek OCR 2 focuses on image-to-text with stronger visual reasoning.
Key Features:
- DeepEncoder: 380M vision encoder (80M SAM-base + 300M CLIP-large)
- 97% precision at 10× visual token compression
- ~60% accuracy at 20× compression
- Strong document understanding beyond text extraction

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (high throughput)
- ✅ MLX (4-bit, 8-bit)
- ✅ API (Novita AI, OpenRouter)
Tasks: text_extract, ocr, table, formula
Links: - Model Card - GitHub
LightOnOCR-2-1B¶
Released: January 19, 2026 | Parameters: 1B | License: Apache 2.0
HuggingFace: lightonai/LightOnOCR-2-1B
Description: Second-generation 1B-parameter end-to-end vision-language OCR model. SOTA conversion of PDF renders to clean text without multi-stage pipelines.
Key Features:
- 83.2 on OlmOCR-Bench (SOTA, beats 9B Chandra)
- 5.7 pages/second on H100 (~493K pages/day)
- <$0.01 per 1,000 pages at cloud pricing
- Bbox variant for figure/image localization
Model Variants:
- LightOnOCR-2-1B - Default OCR
- LightOnOCR-2-1B-bbox - Best localization
- LightOnOCR-2-1B-bbox-soup - Balanced
Backends Supported: - ✅ PyTorch (transformers from source) - ✅ VLLM (vllm>=0.11.1)
Tasks: text_extract, ocr, layout (bbox variants)
OCRFlux-3B¶
Released: January 2026 | Parameters: 3B | License: Apache 2.0
HuggingFace: Fine-tuned from Qwen2.5-VL-3B-Instruct
Description: Multimodal LLM for converting PDFs and images to clean Markdown. Runs efficiently on consumer hardware (RTX 3090).
Key Features:
- Compact 3B architecture
- Clean Markdown output
- Consumer GPU compatible
- Based on Qwen2.5-VL
Backends Supported: - ✅ PyTorch - ✅ VLLM
Tasks: text_extract, ocr
🎯 Core Models (By Size & Release Date)¶
Ultra-Compact Models (<1B Parameters)¶
1. IBM Granite-Docling-258M¶
Released: December 2024 | Parameters: 258M | License: Apache 2.0
HuggingFace: ibm-granite/granite-docling-258M
Description: Ultra-compact vision-language model (VLM) for converting documents to machine-readable formats while fully preserving layout, tables, equations, and lists. Built on Idefics3 architecture with siglip2-base-patch16-512 vision encoder and Granite 165M LLM.
Key Features:
- End-to-end document understanding at 258M parameters
- Handles inline/floating math, code, table structure
- Rivals systems several times its size
- Extremely cost-effective

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ MLX (Apple Silicon) - ibm-granite/granite-docling-258M-mlx
- ✅ WebGPU - Demo Space
Integration:
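A hedged Transformers sketch for this entry, assuming the Idefics3-style processor/model classes noted above; the exact prompt string and the DocTags post-processing should be checked against the model card.

```python
# Hedged sketch: loading granite-docling-258M via Transformers (Idefics3-style).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # prompt per model card (assumption)
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[Image.open("doc.png")], return_tensors="pt").to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
# doctags is DocTags markup; the docling / docling-core libraries convert it to Markdown or JSON.
print(doctags)
```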
Dependencies: transformers, torch, pillow, docling
Tasks: text_extract, layout, table, formula
Links: - Model Card - MLX Version - Demo Space - Official Docs - Collection
2. stepfun-ai GOT-OCR2.0¶
Released: September 2024 | Parameters: 700M | License: Apache 2.0
HuggingFace: stepfun-ai/GOT-OCR2_0
Description: General OCR Theory model for multilingual OCR on plain documents, scene text, formatted documents, tables, charts, mathematical formulas, geometric shapes, molecular formulas, and sheet music.
Key Features:
- Interactive OCR with region-specific recognition (coordinate or color-based)
- Plain text OCR + formatted text OCR (markdown, LaTeX)
- Multi-page document processing
- Wide range of specialized content types

Model Variations:
- stepfun-ai/GOT-OCR2_0 - Original with custom code
- stepfun-ai/GOT-OCR-2.0-hf - HuggingFace-native transformers integration
Backends Supported: - ✅ PyTorch (HuggingFace Transformers) - ✅ Custom inference pipeline
Dependencies: transformers, torch, pillow
Tasks: text_extract, ocr, formula, table
Links: - Model Card - HF-Native Version
Compact Models (1-2B Parameters)¶
3. rednote-hilab dots.ocr¶
Released: December 2024 | Parameters: 1.7B | License: MIT
HuggingFace: rednote-hilab/dots.ocr
Description: Multilingual document parsing model built on a 1.7B LLM with SOTA performance. Provides faster inference than many high-performing models built on larger foundations.
Key Features:
- Task switching via prompt alteration only
- Competitive detection vs traditional models (DocLayout-YOLO)
- Built-in VLLM support for high throughput
- Released with paper arXiv:2512.02498

Model Variations:
- rednote-hilab/dots.ocr - Full model
- rednote-hilab/dots.ocr.base - Base variant

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended for production, vLLM 0.9.1+)
Dependencies: transformers, torch, vllm>=0.9.1 (recommended)
Tasks: text_extract, layout, table, formula, ocr
Links: - Model Card - GitHub - Live Demo - Paper - Collection
4. PaddlePaddle PaddleOCR-VL¶
Released: October 2025 | Parameters: 900M | License: Apache 2.0
HuggingFace: PaddlePaddle/PaddleOCR-VL
Description: Ultra-compact multilingual document parsing VLM with SOTA performance. Integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model.
Key Features:
- Supports 109 languages
- Excels at recognizing complex elements (text, tables, formulas, charts)
- Minimal resource consumption
- Fast inference speeds
- SOTA in page-level parsing and element-level recognition
Backends Supported: - ✅ PyTorch (HuggingFace Transformers - officially integrated) - ✅ PaddlePaddle framework
Dependencies: transformers, torch, paddlepaddle
Tasks: text_extract, layout, ocr, table, formula
Links: - Model Card - Online Demo - Collection - Transformers Docs - GitHub - PaddleOCR
5. LightOn AI LightOnOCR Series¶
Released: January 2026 (v2), October 2025 (v1) | Parameters: 1B | License: Apache 2.0
HuggingFace Models:
- lightonai/LightOnOCR-2-1B - Recommended for OCR
- lightonai/LightOnOCR-2-1B-bbox - Best localization
- lightonai/LightOnOCR-2-1B-bbox-soup - Balanced OCR + bbox
- lightonai/LightOnOCR-1B-1025 - Legacy v1
Description: Compact, end-to-end vision-language model for OCR and document understanding. State-of-the-art accuracy in its weight class while being several times faster than larger VLMs.
Key Features:
- LightOnOCR-2-1B: SOTA on OlmOCR-Bench (83.2 ± 0.9), outperforms Chandra-9B
- Performance: 3.3× faster than Chandra, 1.7× faster than OlmOCR, 5× faster than dots.ocr
- Variants: OCR-only, bbox-capable (figure/image localization), and balanced checkpoints
- Paper: arXiv:2601.14251

Model Comparison:

| Model | Use Case | Bbox Support |
|---|---|---|
| LightOnOCR-2-1B | Default for PDF→Text/Markdown | ❌ |
| LightOnOCR-2-1B-bbox | Best localization of figures/images | ✅ Best |
| LightOnOCR-2-1B-bbox-soup | Balanced OCR + localization | ✅ Balanced |
Backends Supported: - ✅ PyTorch (HuggingFace Transformers - upstream support) - ⚠️ Requires transformers from source for v2 (not yet in stable release)
Quantized Versions: - GGUF format
Dependencies: transformers>=4.48 (from source for v2), torch, pillow
Tasks: text_extract, ocr, layout (bbox variants only)
Links: - LightOnOCR-2 Blog - LightOnOCR-1 Blog - Demo Space - Paper (arXiv) - Organization
6. opendatalab MinerU2.5¶
Released: September 2024 | Parameters: 1.2B | License: Apache 2.0
HuggingFace: opendatalab/MinerU2.5-2509-1.2B
Description: Decoupled vision-language model for efficient high-resolution document parsing with state-of-the-art accuracy and low computational overhead.
Key Features:
- Two-stage parsing: global layout analysis on downsampled images → fine-grained content recognition on native-resolution crops
- Outperforms Gemini-2.5 Pro, Qwen2.5-VL-72B, GPT-4o, MonkeyOCR, dots.ocr, PP-StructureV3
- Large-scale diverse data engine for pretraining/fine-tuning
- New performance records in text, formula, table recognition, and reading order

Model Variations:
- opendatalab/MinerU2.5-2509-1.2B - Official model
- mlx-community/MinerU2.5-2509-1.2B-bf16 - MLX for Apple Silicon
- Mungert/MinerU2.5-2509-1.2B-GGUF - GGUF quantized

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (with OpenAI API specs)
- ✅ MLX (Apple Silicon)
Dependencies: transformers, torch, vllm (optional)
Tasks: text_extract, layout, table, formula, ocr
Links: - Model Card - Paper (arXiv:2509.22186) - MLX Version - GGUF Version
Small Models (2-4B Parameters)¶
7. Qwen3-VL-2B-Instruct¶
Released: October 2025 | Parameters: 2B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-2B-Instruct
Description: Multimodal LLM from Alibaba Cloud's Qwen team with comprehensive upgrades: superior text understanding/generation, deeper visual perception/reasoning, extended context, and stronger agent interaction.
Key Features:
- Dense and MoE architectures that scale from edge to cloud
- Instruct and reasoning-enhanced "Thinking" editions
- Enhanced spatial and video dynamics comprehension
- Part of Qwen3-VL multimodal retrieval framework (arXiv:2601.04720, 2026)

Model Variations:
- Qwen/Qwen3-VL-2B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-2B-Thinking - Reasoning-enhanced
- Qwen/Qwen3-VL-2B-Instruct-GGUF - Quantized GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (via mlx-community)
- ✅ API (via cloud providers)
Dependencies: transformers>=4.46, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Model Card - GitHub - Collection - GGUF Version
8. DeepSeek-OCR¶
Released: October 2024 | Parameters: ~3B | License: MIT
HuggingFace: deepseek-ai/DeepSeek-OCR
Description: High-accuracy OCR model from DeepSeek-AI for extracting text from complex visual inputs (documents, screenshots, receipts, natural scenes).
Key Features:
- Built for real-world documents: PDFs, forms, tables, handwritten/noisy text
- Outputs clean, structured Markdown
- VLLM support upstream, ~2500 tokens/s on A100 with vLLM
- Paper: arXiv:2510.18234

Model Variations:
- deepseek-ai/DeepSeek-OCR - Official BF16 (~6.7 GB)
- NexaAI/DeepSeek-OCR-GGUF - Quantized GGUF
Backends Supported: - ✅ PyTorch (HuggingFace Transformers) - ✅ VLLM (officially supported)
Requirements:
- Python 3.12.9 + CUDA 11.8
- torch==2.6.0, transformers==4.46.3, flash-attn==2.7.3
- L4 / A100 GPUs (≥16 GB VRAM)
Dependencies: transformers, torch, vllm, flash-attn, einops
Tasks: text_extract, ocr, table
Links: - Model Card - GitHub - GGUF Version - Demo Space
9. Nanonets-OCR2-3B¶
Released: 2024 | Parameters: 3B | License: Apache 2.0
HuggingFace: nanonets/Nanonets-OCR2-3B
Description: State-of-the-art image-to-markdown OCR model that transforms documents into structured markdown with intelligent content recognition and semantic tagging, optimized for LLM downstream processing.
Key Features:
- LaTeX equation recognition (inline $...$ and display $$...$$)
- Intelligent image description with structured tags (logos, charts, graphs)
- 125K context window
- ~7.53 GB model size

Model Variations:
- nanonets/Nanonets-OCR2-3B - Full BF16
- Mungert/Nanonets-OCR2-3B-GGUF - GGUF quantized
- mlx-community/Nanonets-OCR2-3B-4bit - MLX 4-bit
- yasserrmd/Nanonets-OCR2-3B - Ollama format

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ MLX (Apple Silicon)
- ✅ Ollama
Dependencies: transformers, torch, pillow
Tasks: text_extract, formula, ocr
Links: - Model Card - GGUF Version - MLX 4-bit - Ollama
10. Qwen3-VL-4B-Instruct¶
Released: October 2025 | Parameters: 4B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-4B-Instruct
Description: Mid-size Qwen3-VL model with balanced performance and efficiency. Part of comprehensive multimodal model series with text understanding, visual reasoning, and agent capabilities.
Model Variations:
- Qwen/Qwen3-VL-4B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-4B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (via mlx-community)
- ✅ API (via cloud providers)
Dependencies: transformers>=4.46, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Collection - GitHub
11. Google Gemma-3-4B-IT¶
Released: 2025 | Parameters: 4B | License: Gemma License
HuggingFace: google/gemma-3-4b-it
Description: Lightweight, state-of-the-art multimodal model from Google, built from the same research and technology as Gemini. Handles text and image input and generates text output.
Key Features:
- 128K context window
- Multilingual support (140+ languages)
- SigLIP image encoder (896×896 square images)
- Gemma-3-4B-IT beats Gemma-2-27B-IT on benchmarks

Model Variations:
- google/gemma-3-4b-it - Instruction-tuned (vision-capable)
- google/gemma-3-4b-pt - Pre-trained base
- google/gemma-3-4b-it-qat-q4_0-gguf - Quantized GGUF
- bartowski/google_gemma-3-4b-it-GGUF - Community GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ Google AI SDK
- ✅ API (Google AI Studio)
Dependencies: transformers>=4.46, torch, pillow
Tasks: text_extract, structured, ocr
Links: - Model Card - Blog Post - Transformers Docs - Google Docs - DeepMind Page
Medium Models (7-9B Parameters)¶
12. allenai olmOCR-2-7B-1025¶
Released: October 2025 | Parameters: 7B | License: Apache 2.0
HuggingFace: allenai/olmOCR-2-7B-1025
Description: State-of-the-art OCR for English-language digitized print documents. Fine-tuned from Qwen2.5-VL-7B-Instruct using olmOCR-mix-1025 dataset + GRPO RL training.
Key Features:
- 82.4 points on olmOCR-Bench (SOTA for real-world documents)
- Substantial improvements where OCR often fails (math equations, tables, tricky cases)
- Boosted via reinforcement learning (GRPO)

Model Variations:
- allenai/olmOCR-2-7B-1025 - Full BF16 version
- allenai/olmOCR-2-7B-1025-FP8 - Recommended FP8 quantization (practical use except fine-tuning)
- bartowski/allenai_olmOCR-2-7B-1025-GGUF - GGUF quantized
- richardyoung/olmOCR-2-7B-1025-GGUF - Alternative GGUF

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended via olmOCR toolkit)
- ✅ API (DeepInfra, Parasail, Cirrascale)
Best Usage: Via olmOCR toolkit with VLLM for efficient inference at scale (millions of documents).
Dependencies: transformers, torch, vllm, olmocr (toolkit)
Tasks: text_extract, ocr, table, formula
Links: - Model Card - FP8 Version - Blog Post - GGUF (bartowski)
13. Qwen3-VL-8B-Instruct¶
Released: October 2025 | Parameters: 8B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-8B-Instruct
Description: Primary model in Qwen3-VL series with optimal balance of performance and efficiency. Enhanced document parsing over Qwen2.5-VL with improved visual perception, text understanding, and advanced reasoning.
Key Features:
- Custom layout label support (flexible VLM)
- Extended context length
- Enhanced spatial and video comprehension
- Stronger agent interaction capabilities

Model Variations:
- Qwen/Qwen3-VL-8B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-8B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX (Apple Silicon) - mlx-community/Qwen3-VL-8B-Instruct-4bit
- ✅ API (Novita AI, OpenRouter, etc.)

API Providers:
- Novita AI: 131K-token context, 33K-token max output
- Pricing: $0.08/M input tokens, $0.50/M output tokens
Dependencies: transformers>=4.46, torch, qwen-vl-utils, vllm (optional)
Tasks: text_extract, layout, structured, ocr, table, formula
Links: - Model Card - Collection - GitHub - MLX 4-bit
14. datalab-to Chandra¶
Released: 2024 | Parameters: 9B | License: Apache 2.0
HuggingFace: datalab-to/chandra
Description: OCR model handling complex tables, forms, and handwriting with full layout preservation. Built on Qwen3-VL for document understanding.
Key Features:
- 83.1 ± 0.9 overall on OlmOCR benchmark (outperforms DeepSeek OCR, dots.ocr, olmOCR)
- Strong grounding capabilities
- Supports 40+ languages
- Layout-aware output with bbox coordinates for every text block, table, and image
- Outputs in HTML, Markdown, and JSON with detailed layout

Use Cases:
- Handwritten forms
- Mathematical notation
- Multi-column layouts
- Complex tables

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (production throughput)

Installation: via the chandra-ocr package (see Dependencies below)
Model Variations:
- datalab-to/chandra - Official model
- noctrex/Chandra-OCR-GGUF - GGUF quantized
Dependencies: transformers, torch, vllm (optional), chandra-ocr
Tasks: text_extract, layout, ocr, table, formula
Links: - Model Card - GitHub - Blog Post - DeepWiki Docs - GGUF Version
Large Models (32B+ Parameters)¶
15. Qwen3-VL-32B-Instruct¶
Released: October 2025 | Parameters: 32B | License: Apache 2.0
HuggingFace: Qwen/Qwen3-VL-32B-Instruct
Description: Largest Qwen3-VL model with maximum performance for complex document understanding and multimodal reasoning tasks.
Key Features:
- Superior performance on complex documents
- Extended context length
- Enhanced reasoning capabilities
- Production-grade for demanding applications

Model Variations:
- Qwen/Qwen3-VL-32B-Instruct - Instruction-tuned
- Qwen/Qwen3-VL-32B-Thinking - Reasoning-enhanced

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM (recommended for production)
- ✅ API (cloud providers)
GPU Requirements: A100 40GB+ or multi-GPU setup
Dependencies: transformers>=4.46, torch, qwen-vl-utils, vllm
Tasks: text_extract, layout, structured, ocr, table, formula
Links: - Model Card - Collection - GitHub
Specialized Models¶
16. docling-project/docling-models¶
Released: 2024 | Parameters: Various | License: Apache 2.0
HuggingFace: docling-project/docling-models
Description: Collection of models powering the Docling PDF document conversion package. Includes layout detection (RT-DETR) and table structure recognition (TableFormer).
Models Included:
1. Layout Model: RT-DETR for detecting document components - Labels: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
2. TableFormer Model: Table structure identification from images
Note: Superseded by granite-docling-258M for end-to-end document conversion (receives updates and support).
Backends Supported: - ✅ PyTorch (via Docling library)
Integration:
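A hedged sketch of the typical consumption path: these checkpoints are driven by the Docling library's converter rather than loaded directly.

```python
# Hedged sketch: the layout (RT-DETR) and TableFormer checkpoints are used
# internally by Docling's PDF conversion pipeline.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()          # fetches docling-models weights as needed
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())
```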
Dependencies: docling, transformers, torch
Tasks: layout, table
Links: - Model Card - Vision Models Docs - SmolDocling (legacy)
📦 Optional Models (Legacy/Alternative)¶
Qwen 2.5-VL Series (Previous Generation)¶
Qwen2.5-VL-3B-Instruct¶
Released: 2024 | Parameters: 3B | License: Apache 2.0
HuggingFace: Qwen/Qwen2.5-VL-3B-Instruct
Description: Previous generation Qwen VLM with strong visual understanding, agentic capabilities, video understanding (1+ hour), and structured outputs.
Key Features:
- Analyzes texts, charts, icons, graphics, layouts
- Visual agent capabilities (computer use, phone use)
- Video comprehension with temporal segment pinpointing
- ViT architecture with SwiGLU and RMSNorm
- Dynamic resolution + dynamic FPS sampling

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX
- ✅ API
Dependencies: transformers, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr
Links: - Model Card - Collection
Qwen2.5-VL-7B-Instruct¶
Released: 2024 | Parameters: 7B | License: Apache 2.0
HuggingFace: Qwen/Qwen2.5-VL-7B-Instruct
Description: Mid-size Qwen2.5-VL model with same capabilities as 3B variant but enhanced performance.
Model Variations:
- Qwen/Qwen2.5-VL-7B-Instruct - Official
- unsloth/Qwen2.5-VL-7B-Instruct-GGUF - GGUF quantized
- nvidia/Qwen2.5-VL-7B-Instruct-NVFP4 - NVIDIA FP4 optimized

Backends Supported:
- ✅ PyTorch (HuggingFace Transformers)
- ✅ VLLM
- ✅ MLX
- ✅ API
Dependencies: transformers, torch, qwen-vl-utils
Tasks: text_extract, layout, structured, ocr, table
Links: - Model Card - Collection - GGUF Version
📊 Model Comparison Summary¶
By Release Date (2024-2026)¶
| Model | Release | Params | Benchmark Score |
|---|---|---|---|
| LightOnOCR-2-1B | Jan 2026 | 1B | 83.2 (OlmOCR) |
| dots.ocr | Dec 2024 | 1.7B | 79.1 (OlmOCR) |
| Granite-Docling-258M | Dec 2024 | 258M | N/A |
| Chandra | 2024 | 9B | 83.1 (OlmOCR) |
| Qwen3-VL Series | Oct 2025 | 2-32B | SOTA |
| PaddleOCR-VL | Oct 2025 | 900M | SOTA |
| olmOCR-2-7B | Oct 2025 | 7B | 82.4 (OlmOCR) |
| DeepSeek-OCR | Oct 2024 | 3B | 75.4 (OlmOCR) |
| GOT-OCR2.0 | Sep 2024 | 700M | N/A |
| MinerU2.5 | Sep 2024 | 1.2B | SOTA |
By Performance (OlmOCR-Bench)¶
| Rank | Model | Score | Params |
|---|---|---|---|
| 1 | LightOnOCR-2-1B | 83.2 ± 0.9 | 1B |
| 2 | Chandra | 83.1 ± 0.9 | 9B |
| 3 | olmOCR-2-7B | 82.4 | 7B |
| 4 | dots.ocr | 79.1 | 1.7B |
| 5 | olmOCR (v1) | 78.5 | 7B |
| 6 | DeepSeek-OCR | 75.4 ± 1.0 | 3B |
By Speed (Relative Performance)¶
| Model | Relative Speed (vs LightOnOCR-2-1B) | Params |
|---|---|---|
| LightOnOCR-2-1B | Fastest baseline | 1B |
| PaddleOCR-VL | 1.73× slower | 900M |
| DeepSeek-OCR (vLLM) | 1.73× slower | 3B |
| olmOCR-2 | 1.7× slower | 7B |
| Chandra | 3.3× slower | 9B |
| dots.ocr | 5× slower | 1.7B |
🔧 Backend Support Matrix¶
| Model | PyTorch | VLLM | MLX | API | GGUF |
|---|---|---|---|---|---|
| Granite-Docling-258M | ✅ | ⚠️ | ✅ | ❌ | ❌ |
| dots.ocr | ✅ | ✅ | ❌ | ❌ | ❌ |
| GOT-OCR2.0 | ✅ | ❌ | ❌ | ❌ | ❌ |
| PaddleOCR-VL | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| MinerU2.5 | ✅ | ✅ | ✅ | ❌ | ✅ |
| LightOnOCR-2-1B | ✅ | ✅ | ❌ | ❌ | ✅ |
| Qwen3-VL (all) | ✅ | ✅ | ✅ | ✅ | ✅ |
| DeepSeek-OCR | ✅ | ✅ | ✅ | ✅ | ✅ |
| Nanonets-OCR2-3B | ✅ | ✅ | ✅ | ❌ | ✅ |
| Gemma-3-4B-IT | ✅ | ❌ | ❌ | ✅ | ✅ |
| olmOCR-2-7B | ✅ | ✅ | ❌ | ✅ | ✅ |
| Chandra | ✅ | ✅ | ❌ | ❌ | ✅ |
| Qwen2.5-VL (all) | ✅ | ✅ | ✅ | ✅ | ✅ |
📚 Recommended Model Selection Guide¶
By Use Case¶
| Use Case | Recommended Model | Why |
|---|---|---|
| Edge/Mobile Deployment | Granite-Docling-258M | Ultra-compact (258M), MLX support |
| Fast OCR (CPU) | LightOnOCR-2-1B | Fastest in class, SOTA accuracy |
| Multilingual Documents | PaddleOCR-VL | 109 languages, minimal resources |
| High-Throughput Serving | dots.ocr + VLLM | Built for VLLM, fast inference |
| Best Accuracy (English) | LightOnOCR-2-1B or Chandra | SOTA on OlmOCR-Bench |
| Custom Layout Detection | Qwen3-VL-8B | Flexible VLM with prompt-based labels |
| Production Balanced | Qwen3-VL-8B or olmOCR-2-7B | Performance + reliability |
| Complex Documents | Chandra or Qwen3-VL-32B | Handles tables, forms, handwriting |
| Apple Silicon (M1/M2/M3) | Granite-Docling-258M (MLX) | Native MLX optimization |
| Cost-Effective API | Qwen3-VL-8B (Novita) | $0.08/M tokens input |
🚀 Quick Start Examples¶
Ultra-Compact (258M) - Granite-Docling¶
```python
from omnidocs.tasks.text_extraction import GraniteDoclingOCR, GraniteDoclingConfig

extractor = GraniteDoclingOCR(
    config=GraniteDoclingConfig(device="cuda")
)
result = extractor.extract(image, output_format="markdown")
```
Fastest OCR (1B) - LightOnOCR-2¶
```python
from omnidocs.tasks.text_extraction import LightOnOCRExtractor, LightOnOCRConfig

extractor = LightOnOCRExtractor(
    config=LightOnOCRConfig(
        model="lightonai/LightOnOCR-2-1B",
        device="cuda"
    )
)
result = extractor.extract(image, output_format="markdown")
```
High-Throughput (1.7B) - dots.ocr + VLLM¶
```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRVLLMConfig(
        model="rednote-hilab/dots.ocr",
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9
    )
)
result = extractor.extract(image, output_format="markdown")
```
Best Accuracy (7-9B) - olmOCR-2 or Chandra¶
```python
from omnidocs.tasks.text_extraction import OlmOCRExtractor, ChandraTextExtractor
from omnidocs.tasks.text_extraction.olm import OlmOCRVLLMConfig
from omnidocs.tasks.text_extraction.chandra import ChandraPyTorchConfig

# Option 1: olmOCR-2-7B with VLLM
extractor = OlmOCRExtractor(
    backend=OlmOCRVLLMConfig(
        model="allenai/olmOCR-2-7B-1025-FP8",
        tensor_parallel_size=1
    )
)

# Option 2: Chandra-9B
extractor = ChandraTextExtractor(
    backend=ChandraPyTorchConfig(
        model="datalab-to/chandra",
        device="cuda"
    )
)
```
Flexible Custom Layouts (8B) - Qwen3-VL¶
```python
from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(
    backend=QwenPyTorchConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        device="cuda"
    )
)
result = layout.extract(
    image,
    custom_labels=["code_block", "sidebar", "diagram"]
)
```
🎯 Current Focus: Layout Analysis Models¶
Phase 1: Multi-Backend VLM Integration¶
1. Qwen3-VL-8B-Instruct Integration¶
Status: 🟡 In Progress
Integrate Qwen3-VL-8B-Instruct for flexible layout detection with custom label support across all backends.
Key Features: - Enhanced document parsing over Qwen2.5-VL - Improved visual perception and text understanding - Advanced reasoning capabilities - Custom layout label support
Implementation Checklist:¶
- [ ] HuggingFace/PyTorch Backend (QwenLayoutDetector + QwenPyTorchConfig)
Model: Qwen/Qwen3-VL-8B-Instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/pytorch.py
```python
class QwenPyTorchConfig(BaseModel):
    model: str = "Qwen/Qwen3-VL-8B-Instruct"
    device: str = "cuda"
    torch_dtype: Literal["auto", "float16", "bfloat16"] = "auto"
    attn_implementation: Optional[str] = None  # "flash_attention_2" if available
    cache_dir: Optional[str] = None
```
Dependencies:
- torch, transformers
- qwen-vl-utils (model-specific utility)
Reference Implementation: See scripts/layout/modal_qwen3_vl_layout.py in the repository
Testing:
- Validate on synthetic document images
- Compare detection accuracy with ground truth
- Test custom label support
- [ ] VLLM Backend (QwenVLLMConfig)
Model: Qwen/Qwen3-VL-8B-Instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/vllm.py
```python
class QwenVLLMConfig(BaseModel):
    model: str = "Qwen/Qwen3-VL-8B-Instruct"
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None
    trust_remote_code: bool = True
```
Dependencies:
- vllm>=0.11.0 (required for Qwen3-VL support)
- torch>=2.0
Use Case: High-throughput batch processing (10+ documents/second)
Modal Config:
- GPU: A10G:1 (minimum), A100:1 (recommended for production)
- Image: VLLM GPU Image with flash-attn
Testing:
- Benchmark throughput vs PyTorch
- Validate output consistency
- Test batch processing
- [ ] MLX Backend (QwenMLXConfig)
Model: mlx-community/Qwen3-VL-8B-Instruct-4bit
Config Class: omnidocs/tasks/layout_analysis/qwen/mlx.py
```python
class QwenMLXConfig(BaseModel):
    model: str = "mlx-community/Qwen3-VL-8B-Instruct-4bit"
    quantization: Literal["4bit", "8bit"] = "4bit"
    max_tokens: int = 4096
```
Dependencies:
- mlx>=0.10
- mlx-vlm (as in the MLX usage example above)
Platform: Apple Silicon only (M1/M2/M3+)
Use Case: Local development and testing on macOS
Note: ⚠️ DO NOT deploy MLX to Modal - local development only
- [ ] API Backend (QwenAPIConfig)
Model: qwen3-vl-8b-instruct
Config Class: omnidocs/tasks/layout_analysis/qwen/api.py
```python
class QwenAPIConfig(BaseModel):
    model: str = "novita/qwen3-vl-8b-instruct"
    api_key: str
    base_url: Optional[str] = None
    max_tokens: int = 4096
    temperature: float = 0.1
```
Provider: Novita AI
- Context Length: 131K tokens
- Max Output: 33K tokens
- Pricing:
  - Input: $0.08/M tokens
  - Output: $0.50/M tokens
Dependencies:
- litellm>=1.30
- openai>=1.0
Use Case:
- Serverless deployments
- No GPU infrastructure required
- Cost-effective for low-volume processing
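For context, the raw OpenAI-compatible request this backend would wrap might look like the following litellm sketch; the provider routing prefix and the prompt are illustrative assumptions.

```python
# Hedged sketch: calling an OpenAI-compatible provider (Novita AI here) via litellm.
import base64
import litellm

with open("doc.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = litellm.completion(
    model="openai/qwen3-vl-8b-instruct",           # routed as an OpenAI-compatible endpoint
    api_base="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_KEY",
    max_tokens=4096,
    temperature=0.1,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Return the layout elements of this page as JSON with bounding boxes."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```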
- [ ] Main Extractor Class (omnidocs/tasks/layout_analysis/qwen.py)
Implement unified QwenLayoutDetector class:
```python
from typing import Union, List, Optional
from PIL import Image

from .base import BaseLayoutExtractor
from .models import LayoutOutput
from .qwen import (
    QwenPyTorchConfig,
    QwenVLLMConfig,
    QwenMLXConfig,
    QwenAPIConfig,
)

QwenBackendConfig = Union[
    QwenPyTorchConfig,
    QwenVLLMConfig,
    QwenMLXConfig,
    QwenAPIConfig,
]


class QwenLayoutDetector(BaseLayoutExtractor):
    """Flexible VLM-based layout detector with custom label support."""

    def __init__(self, backend: QwenBackendConfig):
        self.backend_config = backend
        self._backend = self._create_backend()

    def extract(
        self,
        image: Image.Image,
        custom_labels: Optional[List[str]] = None,
    ) -> LayoutOutput:
        """
        Detect layout elements with optional custom labels.

        Args:
            image: PIL Image
            custom_labels: Optional custom layout categories
                Default: ["title", "paragraph", "table", "figure",
                          "caption", "formula", "list"]

        Returns:
            LayoutOutput with detected bounding boxes
        """
        # Implementation...
```
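One possible shape for the backend dispatch, shown only as a hedged sketch of the config-type-based selection this roadmap calls for; the backend classes and submodules referenced are hypothetical placeholders, not existing OmniDocs code.

```python
# Hedged sketch: _create_backend() selecting the implementation from the config type.
# QwenPyTorchBackend / QwenVLLMBackend / QwenMLXBackend / QwenAPIBackend and the
# .backends.* modules are hypothetical.
def _create_backend(self):
    if isinstance(self.backend_config, QwenPyTorchConfig):
        from .backends.pytorch import QwenPyTorchBackend
        return QwenPyTorchBackend(self.backend_config)
    if isinstance(self.backend_config, QwenVLLMConfig):
        from .backends.vllm import QwenVLLMBackend
        return QwenVLLMBackend(self.backend_config)
    if isinstance(self.backend_config, QwenMLXConfig):
        from .backends.mlx import QwenMLXBackend
        return QwenMLXBackend(self.backend_config)
    if isinstance(self.backend_config, QwenAPIConfig):
        from .backends.api import QwenAPIBackend
        return QwenAPIBackend(self.backend_config)
    raise ValueError(f"Unsupported backend config: {type(self.backend_config).__name__}")
```

Lazy, per-branch imports keep heavy dependencies (torch, vllm, mlx) optional for users who only need one backend.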
- [ ] Integration Tests
Test suite covering:
- All backend configurations
- Custom label functionality
- Cross-backend output consistency
- Edge cases (empty images, single elements, complex layouts)
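A hedged sketch of what one cross-backend, custom-label test could look like, assuming the detector and config classes above and a LayoutOutput.bboxes attribute whose items carry a label field (both assumptions about the eventual API):

```python
# Hedged test sketch; the result schema (result.bboxes, box.label) is an assumption.
import pytest
from PIL import Image

from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig, QwenVLLMConfig

CUSTOM_LABELS = ["title", "table", "figure"]

@pytest.mark.parametrize("config", [
    QwenPyTorchConfig(device="cuda"),
    QwenVLLMConfig(tensor_parallel_size=1),
])
def test_custom_labels_are_respected(config):
    detector = QwenLayoutDetector(backend=config)
    page = Image.new("RGB", (1024, 1448), "white")  # stand-in for a synthetic document image
    result = detector.extract(page, custom_labels=CUSTOM_LABELS)
    assert all(box.label in CUSTOM_LABELS for box in result.bboxes)
```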
- [ ] Documentation
  - API reference with examples for each backend
  - Performance comparison table (PyTorch vs VLLM vs MLX vs API)
  - Migration guide from Qwen2.5-VL
  - Custom label usage examples
- [ ] Modal Deployment Script
  Create production-ready deployment:
  - scripts/layout_omnidocs/modal_qwen_layout_vllm_online.py
  - Web endpoint for layout detection API
  - Batch processing support
  - Monitoring and logging
Phase 2: Additional Layout Models¶
2. RT-DETR Layout Detector¶
- [ ] Single-Backend Implementation (PyTorch only)
  - Model: RT-DETR (Facebook AI)
  - Fixed label support (COCO-based)
  - Real-time detection optimization
3. Surya Layout Detector¶
- [ ] Single-Backend Implementation (PyTorch only)
  - Model: vikp/surya_layout
  - Multi-language document support
  - Optimized for speed
4. Florence-2 Layout Detector¶
- [ ] Multi-Backend Implementation
  - HuggingFace/PyTorch backend
  - API backend (Microsoft Azure)
  - Object detection + dense captioning
🔮 Future Phases¶
Additional task categories will be added after layout analysis is complete:
- OCR Extraction: Surya-OCR, PaddleOCR, Qwen-OCR
- Text Extraction: VLM-based Markdown/HTML extraction
- Table Extraction: Table Transformer, Surya-Table
- Math Expression Extraction: UniMERNet, Surya-Math
- Advanced Features: Reading order, image captioning, chart understanding
- Package & Distribution: PyPI publishing, comprehensive documentation
🎯 Success Metrics (Layout Analysis)¶
Performance Targets¶
| Metric | Target | Current |
|---|---|---|
| Layout Detection Accuracy (mAP) | >90% | TBD |
| Inference Speed (PyTorch) | <2s per page | TBD |
| Inference Speed (VLLM) | <0.5s per page | TBD |
| Custom Label Support | 100% functional | TBD |
Quality Targets¶
- [ ] Type hints coverage: 100%
- [ ] Docstring coverage: 100%
- [ ] Test coverage: >80%
- [ ] All backends tested on production data
- [ ] Cross-backend output consistency validated
🔧 Infrastructure¶
Modal Deployment Standards¶
Consistency Requirements (as per CLAUDE.md):
- Volume Name: omnidocs
- Secret Name: adithya-hf-wandb
- CUDA Version: 12.4.0-devel-ubuntu22.04
- Python Version: 3.11 (3.12 for Qwen3-VL)
- Cache Directory: /data/.cache (HuggingFace)
- Model Cache: /data/omnidocs_models
- Dependency Management: .uv_pip_install() (NO version pinning)
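A hedged sketch of a Modal image/app wired to these standards; the names and paths mirror the list above, but this is illustrative rather than the project's actual deployment script.

```python
# Hedged sketch: Modal app following the consistency requirements above.
import modal

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .uv_pip_install("torch", "transformers", "qwen-vl-utils", "pillow")  # no version pinning
    .env({"HF_HOME": "/data/.cache"})
)

app = modal.App("omnidocs-layout", image=image)
volume = modal.Volume.from_name("omnidocs", create_if_missing=True)
secret = modal.Secret.from_name("adithya-hf-wandb")

@app.function(gpu="A10G", volumes={"/data": volume}, secrets=[secret], timeout=600)
def detect_layout(image_bytes: bytes) -> dict:
    # Model loading and inference would go here; weights are cached under /data/omnidocs_models.
    raise NotImplementedError
```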
GPU Configurations¶
| GPU | Use Case | Cost (est.) |
|---|---|---|
| A10G:1 | Development & Testing | $0.60/hr |
| A100:1 | Production Inference | $3.00/hr |
| A100:2 | High-Throughput VLLM | $6.00/hr |
📚 References¶
Design Documents¶
- Backend Architecture - Core design principles (see IMPLEMENTATION_PLAN/BACKEND_ARCHITECTURE.md)
- Developer Experience (DevEx) - API design and patterns (see IMPLEMENTATION_PLAN/DEVEX.md)
- Claude Development Guide - Implementation standards (see CLAUDE.md in repo root)
External Resources¶
📝 Notes¶
Implementation Order Rationale¶
- Qwen3-VL Priority: Multi-backend support demonstrates v2.0 architecture
- RT-DETR: Fast fixed-label detection for production use
- Surya: Multi-language support and speed optimization
- Florence-2: Microsoft's advanced VLM capabilities
Breaking Changes from v1.0¶
- String-based factory pattern removed (use class imports)
- Document class is now stateless (doesn't store results)
- Config classes are model-specific (not generic)
- Backend selection via config type (not string parameter)
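As an illustration of the last two points, a hedged before/after sketch; the v1.0 factory call is representative of the removed string-based pattern, not an exact signature.

```python
# v1.0 (removed): string-based factory + string backend selection (illustrative only)
# extractor = create_extractor("qwen_layout", backend="pytorch")

# v2.0: explicit class import; the backend is chosen by the config type passed in
from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(backend=QwenPyTorchConfig(device="cuda"))
```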
Last Updated: January 21, 2026 · Maintainer: Adithya S Kolavi · Version: 2.0.0-dev