OmniDocs - Backend Architecture
Status: ✅ Design Complete | Last Updated: January 20, 2026 | Version: 2.0.0
Overview
OmniDocs supports 4 inference backends:
| Backend | Use Case | Platform | Key Dependencies |
|---|---|---|---|
| PyTorch | Default local inference | CPU/GPU | torch, transformers |
| VLLM | High-throughput serving | GPU only | vllm |
| MLX | Apple Silicon optimization | macOS M1/M2/M3+ | mlx, mlx-lm |
| API | Hosted models | Cloud | litellm |
Core Architecture Principles
Separation of Concerns: __init__ vs .extract()
OmniDocs maintains a clear separation between model initialization and runtime parameters:
__init__ (via config) - Model Setup & Verification
- Which model to use
- Which backend (PyTorch/VLLM/MLX/API)
- Model loading settings (device, dtype, quantization)
- Download and cache paths
- Model verification and validation
.extract() - Runtime Task Parameters
- Output format (markdown/html)
- Custom prompts
- Task-specific options (include_layout, custom_labels)
- Per-call inference settings
Example:
```python
# Init: Model setup (happens once)
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(
        model="Qwen/Qwen2-VL-7B",   # Which model
        device="cuda",              # Where to run
        torch_dtype="bfloat16",     # How to load
    )
)

# Extract: Runtime params (can vary per call)
result1 = extractor.extract(image1, output_format="markdown")
result2 = extractor.extract(image2, output_format="html", custom_prompt="...")
```
Design Principle: Model-Specific Configs
Each model defines its own config classes for supported backends. This provides:
- IDE Autocomplete - Only relevant parameters shown
- Type Safety - Pydantic validation at config creation
- Clear Discoverability - Config exists = backend supported
- No Abstraction Leakage - Each backend can have unique parameters (see the validation sketch just below)
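For instance, because `extra = "forbid"` is set on every config, a VLLM-only knob passed to the PyTorch config fails loudly at construction time. A minimal sketch, using the `QwenPyTorchConfig` defined later in this document:

```python
from pydantic import ValidationError

from omnidocs.tasks.text_extraction.qwen.pytorch import QwenPyTorchConfig

try:
    config = QwenPyTorchConfig(
        model="Qwen/Qwen2-VL-7B",
        tensor_parallel_size=2,  # VLLM-only parameter -- rejected here
    )
except ValidationError as e:
    print(e)
    # tensor_parallel_size: Extra inputs are not permitted
```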
Config Class Structure
Single-Backend Model
Models with only one backend (e.g., DocLayoutYOLO = PyTorch only):
```python
# omnidocs/tasks/layout_analysis/doc_layout_yolo.py
from typing import Optional

from PIL import Image
from pydantic import BaseModel, Field

from omnidocs.tasks.layout_analysis.models import LayoutOutput


class DocLayoutYOLOConfig(BaseModel):
    """Configuration for DocLayoutYOLO model."""

    device: str = Field(default="cuda", description="Device to run on")
    model_path: Optional[str] = Field(default=None, description="Custom model weights")
    img_size: int = Field(default=1024, description="Input image size")
    confidence: float = Field(default=0.25, ge=0.0, le=1.0)

    class Config:
        extra = "forbid"  # Raise error on unknown params


class DocLayoutYOLO:
    """DocLayout-YOLO layout detector. PyTorch only."""

    def __init__(self, config: DocLayoutYOLOConfig):
        self.config = config
        self._load_model()

    def _load_model(self):
        """Load model with PyTorch."""
        import torch
        # Load model...

    def extract(self, image: Image.Image) -> LayoutOutput:
        """Run layout detection."""
        # Inference...
        pass
```
Multi-Backend Model
Models with multiple backends (e.g., Qwen = PyTorch, VLLM, MLX, API):
```python
# omnidocs/tasks/text_extraction/qwen/__init__.py
from typing import Optional, Union

from PIL import Image

# Backend configs live in sibling modules of the qwen/ subpackage
# (defined in the next section)
from .pytorch import QwenPyTorchConfig
from .vllm import QwenVLLMConfig
from .mlx import QwenMLXConfig
from .api import QwenAPIConfig

from omnidocs.tasks.text_extraction.models import TextOutput  # assumed location,
# mirroring omnidocs.tasks.layout_analysis.models

# Union type for all supported backends
QwenBackendConfig = Union[
    QwenPyTorchConfig,
    QwenVLLMConfig,
    QwenMLXConfig,
    QwenAPIConfig,
]


class QwenTextExtractor:
    """Qwen VLM text extractor. Supports PyTorch, VLLM, MLX, API backends."""

    def __init__(self, backend: QwenBackendConfig):
        self.backend_config = backend
        self._backend = self._create_backend()

    def _create_backend(self):
        """Create appropriate backend based on config type."""
        if isinstance(self.backend_config, QwenPyTorchConfig):
            from omnidocs.inference.pytorch import PyTorchInference
            return PyTorchInference(self.backend_config)
        elif isinstance(self.backend_config, QwenVLLMConfig):
            from omnidocs.inference.vllm import VLLMInference
            return VLLMInference(self.backend_config)
        elif isinstance(self.backend_config, QwenMLXConfig):
            from omnidocs.inference.mlx import MLXInference
            return MLXInference(self.backend_config)
        elif isinstance(self.backend_config, QwenAPIConfig):
            from omnidocs.inference.api import APIInference
            return APIInference(self.backend_config)
        else:
            raise TypeError(f"Unknown backend config: {type(self.backend_config)}")

    def extract(
        self,
        image: Image.Image,
        output_format: str = "markdown",
        include_layout: bool = False,
        custom_prompt: Optional[str] = None,
    ) -> TextOutput:
        """Extract text from image.

        Args:
            image: PIL Image
            output_format: "markdown" or "html"
            include_layout: Include layout information
            custom_prompt: Override default prompt

        Returns:
            TextOutput with extracted content
        """
        prompt = custom_prompt or self._get_default_prompt(output_format, include_layout)
        raw_output = self._backend.infer(image, prompt)
        return self._postprocess(raw_output, output_format)
```
Backend Config Definitions
PyTorch Config
```python
# omnidocs/tasks/text_extraction/qwen/pytorch.py
from typing import Literal, Optional

from pydantic import BaseModel, Field


class QwenPyTorchConfig(BaseModel):
    """PyTorch/HuggingFace backend configuration for Qwen."""

    model: str = Field(..., description="HuggingFace model ID")
    device: str = Field(default="cuda", description="Device (cuda/cpu)")
    torch_dtype: Literal["float16", "bfloat16", "float32"] = Field(
        default="bfloat16",
        description="Torch dtype for model"
    )
    trust_remote_code: bool = Field(default=True)
    device_map: Optional[str] = Field(default="auto")
    max_memory: Optional[dict] = Field(default=None)
    quantization: Optional[Literal["4bit", "8bit"]] = Field(default=None)

    class Config:
        extra = "forbid"
```
VLLM Config
```python
# omnidocs/tasks/text_extraction/qwen/vllm.py
from typing import Optional

from pydantic import BaseModel, Field


class QwenVLLMConfig(BaseModel):
    """VLLM backend configuration for Qwen."""

    model: str = Field(..., description="HuggingFace model ID")
    tensor_parallel_size: int = Field(default=1, ge=1)
    gpu_memory_utilization: float = Field(default=0.9, ge=0.1, le=1.0)
    max_model_len: Optional[int] = Field(default=None)
    enforce_eager: bool = Field(default=False)
    trust_remote_code: bool = Field(default=True)
    dtype: str = Field(default="bfloat16")

    # VLLM-specific features
    enable_prefix_caching: bool = Field(default=False)
    enable_chunked_prefill: bool = Field(default=False)

    class Config:
        extra = "forbid"
```
MLX Config
```python
# omnidocs/tasks/text_extraction/qwen/mlx.py
from typing import Literal, Optional

from pydantic import BaseModel, Field


class QwenMLXConfig(BaseModel):
    """MLX backend configuration for Qwen (Apple Silicon)."""

    model: str = Field(..., description="MLX model path or HuggingFace ID")
    quantization: Optional[Literal["4bit", "8bit"]] = Field(default=None)
    max_tokens: int = Field(default=4096)

    class Config:
        extra = "forbid"
```
API Config
```python
# omnidocs/tasks/text_extraction/qwen/api.py
from typing import Dict, Optional

from pydantic import BaseModel, Field


class QwenAPIConfig(BaseModel):
    """API backend configuration for Qwen (hosted or proxy)."""

    model: str = Field(..., description="API model identifier")
    api_key: str = Field(..., description="API key")
    base_url: Optional[str] = Field(
        default=None,
        description="Custom API endpoint (for proxies)"
    )
    rate_limit: int = Field(default=10, ge=1, description="Requests per minute")
    timeout: int = Field(default=30, ge=1, description="Request timeout in seconds")
    max_retries: int = Field(default=3, ge=0)
    custom_headers: Optional[Dict[str, str]] = Field(default=None)

    class Config:
        extra = "forbid"
```
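As a usage sketch (all values hypothetical), an API config pointed at a self-hosted proxy might look like:

```python
from omnidocs.tasks.text_extraction.qwen.api import QwenAPIConfig

# Hypothetical values -- never hard-code real API keys
config = QwenAPIConfig(
    model="openai/qwen2-vl-7b",                         # identifier the proxy exposes
    api_key="sk-...",
    base_url="https://llm-proxy.internal.example.com/v1",
    rate_limit=30,
    custom_headers={"X-Team": "docs-pipeline"},
)
```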
Inference Utilities
The omnidocs/inference/ module contains shared utilities for each backend:
```text
omnidocs/inference/
├── __init__.py
├── base.py      # Base inference class
├── pytorch.py   # PyTorch utilities
├── vllm.py      # VLLM utilities
├── mlx.py       # MLX utilities
└── api.py       # LiteLLM/API utilities
```
Base Inference Class
```python
# omnidocs/inference/base.py
from abc import ABC, abstractmethod
from typing import Any

from PIL import Image


class BaseInference(ABC):
    """Base class for inference backends."""

    @abstractmethod
    def load_model(self) -> None:
        """Load model into memory."""
        pass

    @abstractmethod
    def infer(self, image: Image.Image, prompt: str) -> Any:
        """Run inference."""
        pass

    @abstractmethod
    def unload(self) -> None:
        """Free resources."""
        pass
```
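Any new backend only has to satisfy this three-method contract. As a minimal sketch, a hypothetical `EchoInference` (not part of OmniDocs, but handy in tests) would look like:

```python
# omnidocs/inference/dummy.py  (hypothetical module, not in OmniDocs)
from PIL import Image

from .base import BaseInference


class EchoInference(BaseInference):
    """Toy backend that 'infers' by echoing the prompt; useful in unit tests."""

    def load_model(self) -> None:
        self._ready = True  # nothing to actually load

    def infer(self, image: Image.Image, prompt: str) -> str:
        # A real backend would run a model here
        return f"[echo {image.size[0]}x{image.size[1]}] {prompt}"

    def unload(self) -> None:
        self._ready = False
```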
PyTorch Inference
```python
# omnidocs/inference/pytorch.py
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

from .base import BaseInference


class PyTorchInference(BaseInference):
    """PyTorch/HuggingFace inference backend."""

    def __init__(self, config):
        self.config = config
        self.model = None
        self.processor = None
        self.load_model()

    def load_model(self):
        dtype_map = {
            "float16": torch.float16,
            "bfloat16": torch.bfloat16,
            "float32": torch.float32,
        }
        self.processor = AutoProcessor.from_pretrained(
            self.config.model,
            trust_remote_code=self.config.trust_remote_code,
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model,
            torch_dtype=dtype_map[self.config.torch_dtype],
            device_map=self.config.device_map,
            trust_remote_code=self.config.trust_remote_code,
        )
        self.model.eval()

    def infer(self, image: Image.Image, prompt: str):
        inputs = self.processor(
            text=prompt,
            images=image,
            return_tensors="pt",
        ).to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=4096,
            )
        return self.processor.decode(outputs[0], skip_special_tokens=True)

    def unload(self):
        # Set to None (rather than del) so unload() is safe to call twice
        if self.model is not None:
            self.model = None
            self.processor = None
            torch.cuda.empty_cache()
```
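Because `load_model()` runs in `__init__` and `unload()` releases GPU memory, swapping models on a single GPU is explicit. A hedged sketch (the configs `config_a` and `config_b` are hypothetical):

```python
from omnidocs.inference.pytorch import PyTorchInference

extractor = PyTorchInference(config_a)    # loads weights in __init__
text_a = extractor.infer(image, prompt)

extractor.unload()                        # frees weights, empties the CUDA cache

extractor = PyTorchInference(config_b)    # now there is room on the GPU
text_b = extractor.infer(image, prompt)
```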
VLLM Inference
```python
# omnidocs/inference/vllm.py
from PIL import Image

from .base import BaseInference


class VLLMInference(BaseInference):
    """VLLM inference backend."""

    def __init__(self, config):
        self.config = config
        self.llm = None
        self.load_model()

    def load_model(self):
        from vllm import LLM

        self.llm = LLM(
            model=self.config.model,
            tensor_parallel_size=self.config.tensor_parallel_size,
            gpu_memory_utilization=self.config.gpu_memory_utilization,
            max_model_len=self.config.max_model_len,
            enforce_eager=self.config.enforce_eager,
            trust_remote_code=self.config.trust_remote_code,
            dtype=self.config.dtype,
        )

    def infer(self, image: Image.Image, prompt: str):
        from vllm import SamplingParams

        sampling_params = SamplingParams(
            max_tokens=4096,
            temperature=0.0,
        )
        outputs = self.llm.generate(
            {
                "prompt": prompt,
                "multi_modal_data": {"image": image},
            },
            sampling_params=sampling_params,
        )
        return outputs[0].outputs[0].text

    def unload(self):
        # Set to None (rather than del) so repeated unload() calls are safe
        if self.llm is not None:
            self.llm = None
```
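Batching is where VLLM earns its keep. Assuming a vLLM version whose `generate()` accepts a list of request dicts of the same shape as the single-image call above, a hypothetical batched helper might look like:

```python
from vllm import SamplingParams


def infer_batch(llm, images, prompt: str):
    """Run one generate() call over many images so vLLM can schedule them together.

    Hypothetical helper, not OmniDocs API; assumes list-of-dicts input support.
    """
    sampling_params = SamplingParams(max_tokens=4096, temperature=0.0)
    requests = [
        {"prompt": prompt, "multi_modal_data": {"image": img}}
        for img in images
    ]
    outputs = llm.generate(requests, sampling_params=sampling_params)
    # Outputs come back in request order
    return [out.outputs[0].text for out in outputs]
```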
API Inference
```python
# omnidocs/inference/api.py
import base64
from io import BytesIO

from PIL import Image

from .base import BaseInference


class APIInference(BaseInference):
    """LiteLLM/API inference backend."""

    def __init__(self, config):
        self.config = config
        self.load_model()

    def load_model(self):
        """Validate API configuration."""
        import litellm

        # Configure LiteLLM
        if self.config.base_url:
            litellm.api_base = self.config.base_url

    def infer(self, image: Image.Image, prompt: str):
        import litellm

        # Convert image to base64
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_base64 = base64.b64encode(buffered.getvalue()).decode()

        response = litellm.completion(
            model=self.config.model,
            api_key=self.config.api_key,
            base_url=self.config.base_url,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            # OpenAI-style messages expect a nested
                            # {"url": ...} object, not a bare string
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                        },
                    ],
                }
            ],
            timeout=self.config.timeout,
            num_retries=self.config.max_retries,
        )
        return response.choices[0].message.content

    def unload(self):
        pass  # Nothing to unload
```
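Note that `QwenAPIConfig` declares `rate_limit`, but the sketch above never consults it. One way to honor it client-side is a small limiter like the following (a hypothetical helper, not OmniDocs API):

```python
import threading
import time


class RateLimiter:
    """Blocks so that at most `per_minute` calls proceed per 60-second window."""

    def __init__(self, per_minute: int):
        self.min_interval = 60.0 / per_minute
        self._lock = threading.Lock()
        self._last_call = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            sleep_for = self.min_interval - (now - self._last_call)
            if sleep_for > 0:
                time.sleep(sleep_for)
            self._last_call = time.monotonic()


# Inside APIInference.__init__ one could add:
#     self._limiter = RateLimiter(config.rate_limit)
# and at the top of infer():
#     self._limiter.wait()
```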
Dependency Management
pyproject.toml Structure
```toml
[project]
name = "omnidocs"
dependencies = [
    "pydantic>=2.0",
    "pillow>=10.0",
    "numpy>=1.24",
]

[project.optional-dependencies]
# Individual backends
pytorch = [
    "torch>=2.0",
    "torchvision>=0.15",
    "transformers>=4.40",
]
vllm = [
    "vllm>=0.4.0",
    "torch>=2.0",
]
mlx = [
    "mlx>=0.10",
    "mlx-lm>=0.10",
]
api = [
    "litellm>=1.30",
    "openai>=1.0",
]

# Convenience groups (self-referencing extras)
local = ["omnidocs[pytorch]"]
all-local = ["omnidocs[pytorch,vllm,mlx]"]
all = ["omnidocs[pytorch,vllm,mlx,api]"]

# Development
dev = ["omnidocs[all]", "pytest", "black", "mypy"]
```
Installation Examples
```bash
# Minimal (no inference)
pip install omnidocs

# PyTorch only (most common)
pip install omnidocs[pytorch]

# High-throughput serving
pip install omnidocs[vllm]

# Apple Silicon
pip install omnidocs[mlx]

# API only (no local inference)
pip install omnidocs[api]

# Everything
pip install omnidocs[all]
```
Lazy Import Pattern
To avoid import errors when backends aren't installed:
```python
# omnidocs/tasks/text_extraction/qwen/__init__.py
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .pytorch import QwenPyTorchConfig
    from .vllm import QwenVLLMConfig
    from .mlx import QwenMLXConfig
    from .api import QwenAPIConfig


class QwenTextExtractor:
    def __init__(self, backend):
        self.backend_config = backend
        self._backend = None
        self._load_backend()

    def _load_backend(self):
        """Lazy load backend based on config type."""
        config = self.backend_config
        config_type = type(config).__name__
        if config_type == "QwenPyTorchConfig":
            try:
                from omnidocs.inference.pytorch import PyTorchInference
            except ImportError:
                raise ImportError(
                    "PyTorch backend requires torch and transformers. "
                    "Install with: pip install omnidocs[pytorch]"
                )
            self._backend = PyTorchInference(config)
        elif config_type == "QwenVLLMConfig":
            try:
                from omnidocs.inference.vllm import VLLMInference
            except ImportError:
                raise ImportError(
                    "VLLM backend requires vllm. "
                    "Install with: pip install omnidocs[vllm]"
                )
            self._backend = VLLMInference(config)
        # ... etc
```
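The if/elif chain grows linearly with backends. A hypothetical refactor (same behavior, not the documented implementation) keeps the mapping as data, so the install hints live in one place:

```python
import importlib

# config class name -> (module path, backend class name, pip extra)
_BACKENDS = {
    "QwenPyTorchConfig": ("omnidocs.inference.pytorch", "PyTorchInference", "pytorch"),
    "QwenVLLMConfig":    ("omnidocs.inference.vllm",    "VLLMInference",    "vllm"),
    "QwenMLXConfig":     ("omnidocs.inference.mlx",     "MLXInference",     "mlx"),
    "QwenAPIConfig":     ("omnidocs.inference.api",     "APIInference",     "api"),
}


def _load_backend(self):  # drop-in body for QwenTextExtractor._load_backend
    config = self.backend_config
    try:
        module_path, class_name, extra = _BACKENDS[type(config).__name__]
    except KeyError:
        raise TypeError(f"Unknown backend config: {type(config)}")
    try:
        module = importlib.import_module(module_path)
    except ImportError as e:
        raise ImportError(
            f"{class_name} requires optional dependencies. "
            f"Install with: pip install omnidocs[{extra}]"
        ) from e
    self._backend = getattr(module, class_name)(config)
```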
Error Handling
Config Validation
```python
from pydantic import ValidationError

try:
    config = QwenVLLMConfig(
        model="Qwen/Qwen2-VL-7B",
        tensor_parallel_size=-1,  # Invalid!
    )
except ValidationError as e:
    print(e)
    # tensor_parallel_size: Input should be greater than or equal to 1
```
Backend Not Installed
```python
try:
    extractor = QwenTextExtractor(
        backend=QwenVLLMConfig(model="Qwen/Qwen2-VL-7B")
    )
except ImportError as e:
    print(e)
    # VLLM backend requires vllm. Install with: pip install omnidocs[vllm]
```
Invalid Backend for Model
```python
# DotsOCR doesn't support API
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor

# This import would fail because DotsOCRAPIConfig doesn't exist
# from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

# The user naturally discovers DotsOCR doesn't support API
# because there's no config class to import
```
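A related benefit of the Union-of-configs pattern: supported backends are programmatically discoverable. A small sketch, assuming `QwenBackendConfig` is exported from the package:

```python
from typing import get_args

from omnidocs.tasks.text_extraction.qwen import QwenBackendConfig

# The Union type itself is the registry of supported backends
for cfg_cls in get_args(QwenBackendConfig):
    print(cfg_cls.__name__)
# QwenPyTorchConfig, QwenVLLMConfig, QwenMLXConfig, QwenAPIConfig
```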
Layout Detection: Fixed vs Flexible Models
Overview
Layout detection models in OmniDocs fall into two categories based on label flexibility:
| Category | Examples | Label Support | Implementation |
|---|---|---|---|
| Fixed Labels | DocLayoutYOLO, RT-DETR | Predefined only | Trained model classes |
| Flexible VLM | Qwen, Florence-2 | Custom via prompting | Vision-language models |
Fixed Label Models
Models: DocLayoutYOLO, RTDETRLayoutDetector, SuryaLayoutDetector
These models are trained on specific label sets (title, text, table, figure, etc.) and cannot detect custom elements.
```python
# omnidocs/tasks/layout_analysis/doc_layout_yolo.py
from PIL import Image

from omnidocs.tasks.layout_analysis.models import LayoutBox, LayoutOutput


class DocLayoutYOLO:
    """Fixed label layout detector. PyTorch only."""

    FIXED_LABELS = ["title", "text", "list", "table", "figure", "caption", "formula"]

    def __init__(self, config: DocLayoutYOLOConfig):
        self.config = config
        self._load_model()

    def extract(self, image: Image.Image) -> LayoutOutput:
        """Extract layout with predefined labels only.

        Args:
            image: PIL Image

        Returns:
            LayoutOutput with bboxes using FIXED_LABELS
        """
        # Run YOLO detection
        detections = self.model(image)

        # Map to fixed labels
        bboxes = []
        for det in detections:
            label = self.FIXED_LABELS[det.class_id]
            bboxes.append(LayoutBox(label=label, bbox=det.bbox, confidence=det.conf))

        return LayoutOutput(bboxes=bboxes)
```
Flexible VLM Models
Models: QwenLayoutDetector, Florence2LayoutDetector, VLMLayoutDetector
These models use vision-language prompting and can detect arbitrary, user-defined layout elements.
```python
# omnidocs/tasks/layout_analysis/qwen.py
from typing import List, Optional, Union

from PIL import Image

from omnidocs.tasks.layout_analysis.models import CustomLabel, LayoutOutput


class QwenLayoutDetector:
    """Flexible VLM layout detector. Supports custom labels."""

    DEFAULT_LABELS = ["title", "text", "list", "table", "figure", "caption", "formula"]

    def __init__(self, backend: QwenBackendConfig):
        self.backend_config = backend
        self._backend = self._create_backend()

    def extract(
        self,
        image: Image.Image,
        custom_labels: Optional[Union[List[str], List[CustomLabel]]] = None,
    ) -> LayoutOutput:
        """Extract layout with flexible label support.

        Args:
            image: PIL Image
            custom_labels:
                - None: Use DEFAULT_LABELS
                - List[str]: Simple custom label names
                - List[CustomLabel]: Structured labels with metadata

        Returns:
            LayoutOutput with detected elements
        """
        # Normalize labels
        if custom_labels is None:
            labels = [CustomLabel(name=name) for name in self.DEFAULT_LABELS]
        else:
            labels = self._normalize_labels(custom_labels)

        # Build detection prompt
        prompt = self._build_prompt(labels)

        # Run VLM inference
        raw_output = self._backend.infer(image, prompt)

        # Parse results
        return self._parse_detections(raw_output, labels)

    def _normalize_labels(
        self,
        labels: Union[List[str], List[CustomLabel]],
    ) -> List[CustomLabel]:
        """Convert string labels to CustomLabel objects."""
        normalized = []
        for label in labels:
            if isinstance(label, str):
                normalized.append(CustomLabel(name=label))
            elif isinstance(label, CustomLabel):
                normalized.append(label)
        return normalized

    def _build_prompt(self, labels: List[CustomLabel]) -> str:
        """Build detection prompt from labels."""
        label_descriptions = []
        for label in labels:
            if label.detection_prompt:
                # Use custom detection prompt
                label_descriptions.append(f"- {label.name}: {label.detection_prompt}")
            else:
                # Use label name only
                label_descriptions.append(f"- {label.name}")

        label_block = "\n".join(label_descriptions)
        return (
            "Detect the following layout elements in this document image:\n"
            f"{label_block}\n"
            "Return bounding boxes [x1, y1, x2, y2] for each detected element."
        )
```
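`extract()` ends by calling `_parse_detections()`, which is not shown above. A hedged sketch follows, assuming the model actually emits the requested `label: [x1, y1, x2, y2]` line format and that `LayoutBox` accepts a `CustomLabel` as its label; production parsing has to be far more defensive:

```python
import re


def _parse_detections(self, raw_output: str, labels: List[CustomLabel]) -> LayoutOutput:
    """Parse 'name: [x1, y1, x2, y2]' lines emitted by the VLM (hypothetical sketch)."""
    by_name = {label.name: label for label in labels}
    pattern = re.compile(r"(\w+)\s*:\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")

    bboxes = []
    for match in pattern.finditer(raw_output):
        name = match.group(1)
        if name not in by_name:
            continue  # model emitted a label we did not ask for
        coords = [int(g) for g in match.groups()[1:]]
        # VLMs give no calibrated score, hence confidence=None (assumed field)
        bboxes.append(LayoutBox(label=by_name[name], bbox=coords, confidence=None))

    return LayoutOutput(bboxes=bboxes)
```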
CustomLabel Definition
```python
# omnidocs/tasks/layout_analysis/models.py
from typing import Optional

from pydantic import BaseModel, Field


class CustomLabel(BaseModel):
    """Custom layout label definition for flexible VLM models."""

    name: str = Field(..., description="Label identifier (e.g., 'code_block')")
    description: Optional[str] = Field(
        default=None,
        description="Human-readable description"
    )
    detection_prompt: Optional[str] = Field(
        default=None,
        description="Custom prompt hint for model to use during detection"
    )
    color: Optional[str] = Field(
        default=None,
        description="Visualization color (hex or name)"
    )

    class Config:
        extra = "allow"  # Users can add custom fields
```
Usage Examples
Fixed Model (Simple, Fast):
```python
from omnidocs.tasks.layout_analysis import DocLayoutYOLO, DocLayoutYOLOConfig

layout = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = layout.extract(image)
# Returns: title, text, table, figure (fixed set)
```
Flexible VLM (Simple Strings):
```python
from omnidocs.tasks.layout_analysis import QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(
    backend=QwenPyTorchConfig(model="Qwen/Qwen2-VL-7B")
)

# Detect custom elements
result = layout.extract(
    image,
    custom_labels=["code_block", "sidebar", "pull_quote"]
)
```
Flexible VLM (Structured Labels):
```python
from omnidocs.tasks.layout_analysis import CustomLabel, QwenLayoutDetector
from omnidocs.tasks.layout_analysis.qwen import QwenPyTorchConfig

layout = QwenLayoutDetector(
    backend=QwenPyTorchConfig(model="Qwen/Qwen2-VL-7B")
)

result = layout.extract(
    image,
    custom_labels=[
        CustomLabel(
            name="code_block",
            description="Source code listings",
            detection_prompt="Regions with monospace text and syntax highlighting",
            color="#2ecc71",
        ),
        CustomLabel(
            name="sidebar",
            description="Supplementary content boxes",
            detection_prompt="Boxed regions with background color or borders",
            color="#3498db",
        ),
    ]
)

# Access metadata
for box in result.bboxes:
    print(f"{box.label.name}: {box.label.description}")
```
Benefits
| Feature | Fixed Models | Flexible VLMs |
|---|---|---|
| Speed | ⚡ Fast | 🐢 Slower (VLM inference) |
| Accuracy | ⭐⭐⭐ High (trained) | ⭐⭐ Good (prompted) |
| Custom Labels | ❌ No | ✅ Yes |
| Label Metadata | ❌ No | ✅ Yes (CustomLabel) |
| Detection Prompts | ❌ No | ✅ Yes |
| Extensibility | ❌ No | ✅ Yes (extra fields) |
| Use Case | Standard documents | Any document type |
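When both kinds of detector are available, the table suggests a simple policy: use the fixed model whenever the requested labels fall inside its trained set. A hypothetical helper (not OmniDocs API; `yolo` and `qwen` are assumed detector instances):

```python
def pick_detector(requested_labels, fixed_detector, vlm_detector):
    """Prefer the fast fixed-label model when it covers every requested label."""
    if set(requested_labels) <= set(fixed_detector.FIXED_LABELS):
        return fixed_detector
    return vlm_detector


detector = pick_detector(["title", "table"], yolo, qwen)  # -> yolo (fast path)
detector = pick_detector(["sidebar"], yolo, qwen)         # -> qwen (custom label)
```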
Summary
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Config Pattern | Model-specific classes | IDE support, type safety |
| Backend Discovery | Import exists = supported | Obvious, no guessing |
| Lazy Imports | Load on use | Avoid dependency errors |
| Validation | Pydantic | Early error detection |
| Error Messages | Clear install instructions | Good UX |
Config Naming Convention
| Model Type | Config Location | Naming |
|---|---|---|
| Single-backend | Same file as model | {Model}Config |
| Multi-backend | Subfolder | {Model}{Backend}Config |
Parameter Naming
| Model Type | Parameter |
|---|---|
| Single-backend | config= |
| Multi-backend | backend= |
Last Updated: January 20, 2026 | Status: ✅ Design Complete