Concepts¶
Understanding the OmniDocs architecture, config patterns, backend system, and document model.
Architecture Overview¶
Core Flow¶
User Code
↓
Document.from_pdf() → PIL Images
↓
Extractor(config) → Load model with backend
↓
extractor.extract(image) → Pydantic output
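A minimal sketch of this flow end to end, using the Document loader and the Qwen text extractor configs covered later on this page:
from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenPyTorchConfig

doc = Document.from_pdf("file.pdf")              # source data only, no results stored
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(device="cuda")     # hardware/backend chosen at init
)
result = extractor.extract(doc.get_page(0))      # Pydantic output for one page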
Design Principles¶
| Principle | What It Means |
|---|---|
| Unified API | .extract() for all tasks |
| Class imports | from omnidocs.tasks.x import Model (no string factories) |
| Type-safe configs | Pydantic validation, IDE autocomplete |
| Stateless Document | Document = source data, not results |
| Config = capability | Available configs show supported backends |
| Init vs Extract | Config sets hardware, extract sets task params |
Component Architecture¶
omnidocs/
├── document.py              # Document class (stateless)
├── tasks/
│   ├── text_extraction/     # Text → Markdown/HTML
│   ├── layout_extraction/   # Structure detection
│   ├── ocr_extraction/      # Text + bounding boxes
│   ├── table_extraction/    # Table structure extraction
│   ├── reading_order/       # Logical reading sequence
│   └── ...
└── inference/
    ├── pytorch.py           # HuggingFace/torch
    ├── vllm.py              # High-throughput
    ├── mlx.py               # Apple Silicon
    └── api.py               # LiteLLM
Config Pattern¶
Single-Backend Models¶
Models that support only one backend (typically PyTorch):
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
# Pattern: {Model}Config → config= parameter
layout = DocLayoutYOLO(
    config=DocLayoutYOLOConfig(
        device="cuda",
        confidence=0.25,
    )
)
Multi-Backend Models¶
Models that support multiple backends:
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import (
    QwenPyTorchConfig,  # Local GPU
    QwenVLLMConfig,     # High throughput
    QwenMLXConfig,      # Apple Silicon
    QwenAPIConfig,      # Cloud API
)
# Pattern: {Model}{Backend}Config → backend= parameter
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(device="cuda")
)
Config Naming¶
| Model Type | Naming Pattern | Parameter |
|---|---|---|
| Single-backend | {Model}Config | config= |
| Multi-backend | {Model}{Backend}Config | backend= |
What Goes Where¶
Init (config/backend)
- Model name/path
- Device (cuda, cpu, mps)
- Quantization, dtype
- Backend-specific settings

Extract (method params)
- Output format (markdown, html)
- Custom prompts
- Task-specific options
- Per-call settings
# Init: hardware/model setup
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(
        model="Qwen/Qwen3-VL-8B",
        device="cuda",
        torch_dtype="bfloat16",
    )
)
# Extract: task parameters
result = extractor.extract(
    image,
    output_format="markdown",
    include_layout=True,
)
Backend System¶
Backend Comparison¶
| Backend | Use Case | Requirements |
|---|---|---|
| PyTorch | Development, local GPU | CUDA 12+ or CPU |
| VLLM | Production, high throughput | NVIDIA GPU 24GB+ |
| MLX | Apple Silicon development | M1/M2/M3 Mac |
| API | No GPU, cloud-first | API key + internet |
Backend Selection¶
# PyTorch - development default
from omnidocs.tasks.text_extraction.qwen import QwenPyTorchConfig
backend = QwenPyTorchConfig(device="cuda", torch_dtype="bfloat16")
# VLLM - production throughput
from omnidocs.tasks.text_extraction.qwen import QwenVLLMConfig
backend = QwenVLLMConfig(tensor_parallel_size=2, gpu_memory_utilization=0.9)
# MLX - Apple Silicon
from omnidocs.tasks.text_extraction.qwen import QwenMLXConfig
backend = QwenMLXConfig(quantization="4bit")
# API - cloud
from omnidocs.tasks.text_extraction.qwen import QwenAPIConfig
backend = QwenAPIConfig(api_key="sk-...", base_url="https://...")
Switching Backends¶
Switching backends only requires changing the config:
# Development: PyTorch
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(device="cuda")
)
# Production: switch to VLLM
extractor = QwenTextExtractor(
    backend=QwenVLLMConfig(tensor_parallel_size=2)
)
# Same .extract() API works for both
result = extractor.extract(image, output_format="markdown")
Discoverability¶
Available backends = importable config classes:
# Check what backends a model supports
from omnidocs.tasks.text_extraction.qwen import (
    QwenPyTorchConfig,  # ✓ PyTorch supported
    QwenVLLMConfig,     # ✓ VLLM supported
    QwenMLXConfig,      # ✓ MLX supported
    QwenAPIConfig,      # ✓ API supported
)
# If import fails → backend not supported for that model
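If you need to check this programmatically, a small sketch using a plain try/except around the import (purely illustrative here, since Qwen supports all four backends):
# Prefer VLLM when its config class exists for the model, otherwise fall back to the API backend
try:
    from omnidocs.tasks.text_extraction.qwen import QwenVLLMConfig as PreferredConfig
except ImportError:
    from omnidocs.tasks.text_extraction.qwen import QwenAPIConfig as PreferredConfig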
Document Model¶
Design: Stateless¶
Document contains source data only, not analysis results.
Why?
- Clean separation of concerns
- User controls caching strategy
- Memory efficient
- Works with any workflow
doc = Document.from_pdf("file.pdf") # Just loads PDF
result = extractor.extract(doc.get_page(0)) # User manages result
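Because analysis results live outside the Document, the caller decides how to keep them; a minimal sketch, holding results in a plain dict keyed by page index:
results = {}
for i, page in enumerate(doc.iter_pages()):
    results[i] = extractor.extract(page)  # caller owns the results, not the Document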
Loading Methods¶
from omnidocs import Document
# From file
doc = Document.from_pdf("file.pdf", dpi=150)
# From URL
doc = Document.from_url("https://example.com/doc.pdf")
# From bytes
doc = Document.from_bytes(pdf_bytes, filename="doc.pdf")
# From images
doc = Document.from_image("page.png")
doc = Document.from_images(["p1.png", "p2.png"])
Lazy Loading¶
Pages render on demand, then cache:
doc = Document.from_pdf("large.pdf") # Fast: no rendering yet
page = doc.get_page(0) # Renders now (~200ms)
page = doc.get_page(0) # Cached: instant
Memory Management¶
# Efficient iteration (one page at a time)
for page in doc.iter_pages():
    result = extractor.extract(page)
    save(result)
# Clear cache for large documents
doc.clear_cache() # All pages
doc.clear_cache(page=0) # Specific page
# Context manager
with Document.from_pdf("file.pdf") as doc:
    # Use doc
    pass  # Auto-closes
Metadata¶
doc.page_count # Number of pages
doc.metadata.source_type # "file", "url", "bytes"
doc.metadata.file_name # Filename
doc.metadata.file_size # Size in bytes
doc.metadata.format # "pdf", "png", etc.
doc.to_dict() # Serialize metadata
Key Patterns¶
Pattern 1: Single Page¶
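A minimal sketch, reusing the stateless-Document example from above:
doc = Document.from_pdf("file.pdf")
result = extractor.extract(doc.get_page(0))
save("page_0.md", result.content)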
Pattern 2: All Pages¶
for i, page in enumerate(doc.iter_pages()):
    result = extractor.extract(page)
    save(f"page_{i}.md", result.content)
Pattern 3: Memory Control¶
for i, page in enumerate(doc.iter_pages()):
    result = extractor.extract(page)
    save(result)
    if i % 10 == 0:
        doc.clear_cache()  # Free every 10 pages
Pattern 4: Environment-Based Backend¶
import os
if os.getenv("USE_VLLM"):
    backend = QwenVLLMConfig(tensor_parallel_size=2)
elif os.getenv("USE_API"):
    backend = QwenAPIConfig(api_key=os.getenv("API_KEY"))
else:
    backend = QwenPyTorchConfig(device="cuda")
extractor = QwenTextExtractor(backend=backend)
Model Cache¶
OmniDocs includes a unified model cache that automatically shares loaded models across extractors. When two extractors use the same underlying model (e.g., text extraction and layout detection both using Qwen3-VL), the model is loaded once and shared.
How It Works¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutMLXConfig
# First extractor loads the model (~30s)
text_extractor = QwenTextExtractor(backend=QwenTextMLXConfig())
# Second extractor reuses the cached model (instant)
layout_detector = QwenLayoutDetector(backend=QwenLayoutMLXConfig())
The cache normalizes config class names to detect sharing opportunities. QwenTextMLXConfig and QwenLayoutMLXConfig both resolve to the same cache key because they use the same model and backend settings.
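To confirm the sharing, you can inspect the cache after constructing both extractors; a sketch using the get_cache_info() helper shown under Configuration below:
from omnidocs import get_cache_info

info = get_cache_info()
print(info["num_entries"])  # expected: 1, since both extractors resolve to the same cache key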
Cache Features¶
| Feature | Description |
|---|---|
| Cross-task sharing | Text + layout extractors share one model |
| LRU eviction | Oldest unused models evicted when cache is full |
| Reference counting | Models stay cached while any extractor uses them |
| Thread-safe | Safe for concurrent access |
| Runtime param exclusion | max_tokens, temperature don't affect cache key |
Configuration¶
from omnidocs import set_cache_config, get_cache_info, clear_cache
# Set max cached models (default: 10)
set_cache_config(max_entries=5)
# Check what's cached
info = get_cache_info()
print(f"Cached models: {info['num_entries']}")
# Clear all cached models
clear_cache()
What Gets Shared¶
Models share cache when they have the same:
- Model family (Qwen, MinerUVL, etc.)
- Backend type (PyTorch, VLLM, MLX)
- Model loading parameters (model name, device, dtype, GPU memory settings)
Parameters that only affect inference (max_tokens, temperature, max_new_tokens) are excluded from the cache key, so a text extractor with max_tokens=8192 and a layout detector with max_tokens=4096 still share the same model.
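A sketch of that exclusion (assuming max_tokens is accepted directly by these config classes):
text_backend = QwenTextMLXConfig(max_tokens=8192)      # runtime param, excluded from cache key
layout_backend = QwenLayoutMLXConfig(max_tokens=4096)  # runtime param, excluded from cache key
# Both resolve to the same cache key, so the underlying model is loaded once and shared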
Supported Models¶
All models in OmniDocs support caching:
| Model | Cross-task sharing |
|---|---|
| Qwen3-VL | Text + Layout share model |
| MinerU VL | Text + Layout share model |
| Nanonets OCR2 | Single-task cache |
| Granite Docling | Single-task cache |
| DotsOCR | Single-task cache |
| RT-DETR | Single-task cache |
| DocLayout-YOLO | Single-task cache |
| PaddleOCR | Single-task cache |
| EasyOCR | Single-task cache |
| TableFormer | Single-task cache |
For a detailed guide on using the cache, see the Model Cache Guide.
To control where models are downloaded on disk, see the Cache Management Guide.
Trade-offs¶
| Choice | Option A | Option B |
|---|---|---|
| Speed vs Quality | 2B model (fast) | 8B+ model (accurate) |
| Setup vs Throughput | PyTorch (simple) | VLLM (10x faster) |
| Privacy vs Convenience | Local (private) | API (no setup) |
| Memory vs Speed | Lazy loading | Load all pages |
Summary¶
| Concept | Key Point |
|---|---|
| Architecture | Image → Extractor → Pydantic output |
| Configs | Single-backend: config=, Multi-backend: backend= |
| Backends | PyTorch (dev), VLLM (prod), MLX (Mac), API (cloud) |
| Document | Stateless, lazy-loaded, user manages results |
| Model Cache | Auto-shares models across extractors, LRU eviction |