Overview¶
OmniDocs Task Modules.
Each task module provides extractors for specific document processing tasks.
Available task modules
- layout_extraction: Detect document structure (titles, tables, figures, etc.)
- ocr_extraction: Extract text with bounding boxes from images
- text_extraction: Convert document images to HTML/Markdown
- table_extraction: Extract table structure and content
- reading_order: Determine logical reading sequence of document elements
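A minimal sketch of how these modules are typically combined on a single page image is shown below; the extractor choices, the device setting, and the page.png path are illustrative and mirror the examples later on this page.

from PIL import Image
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

image = Image.open("page.png")  # illustrative input path

# Detect layout regions (titles, tables, figures, ...)
layout = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda")).extract(image)

# Extract text with bounding boxes from the same page
ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"])).extract(image)

for box in layout.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
for block in ocr.text_blocks:
    print(f"'{block.text}' @ {block.bbox.to_list()}")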
layout_extraction
¶
Layout Extraction Module.
Provides extractors for detecting document layout elements such as titles, text blocks, figures, tables, formulas, and captions.
Available Extractors
- DocLayoutYOLO: YOLO-based layout detector (fast, accurate)
- RTDETRLayoutExtractor: Transformer-based detector (more categories)
- QwenLayoutDetector: VLM-based detector with custom label support (multi-backend)
- MinerUVLLayoutDetector: MinerU VL 1.2B layout detector (multi-backend)
Example
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = extractor.extract(image)
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")

# VLM-based detection with custom labels
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
BaseLayoutExtractor
¶
Bases: ABC
Abstract base class for layout extractors.
All layout extraction models must inherit from this class and implement the required methods.
Example
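The Example block is empty in the generated docs; the sketch below shows the intended subclassing pattern. The import paths and the LayoutOutput(bboxes=[]) constructor call are assumptions based on the rest of this page.

from omnidocs.tasks.layout_extraction import BaseLayoutExtractor
from omnidocs.tasks.layout_extraction.models import LayoutOutput

class NoOpLayoutExtractor(BaseLayoutExtractor):
    """Toy extractor that returns no detections (illustration only)."""

    def extract(self, image) -> LayoutOutput:
        # A real subclass would normalize the input and run a model here;
        # the `bboxes` field name is assumed from `result.bboxes` usage above.
        return LayoutOutput(bboxes=[])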
extract
abstractmethod
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput containing detected layout boxes with standardized labels |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If image format is not supported |
| `RuntimeError` | If model is not loaded or inference fails |
Source code in omnidocs/tasks/layout_extraction/base.py
batch_extract
¶
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]

Run layout extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.

| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput in same order as input |
Examples:
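No example survived conversion here; a hedged usage sketch (file names illustrative) follows.

from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))

def on_progress(current: int, total: int) -> None:
    print(f"{current}/{total} images processed")

results = extractor.batch_extract(
    ["page_1.png", "page_2.png", "page_3.png"],
    progress_callback=on_progress,
)
print(len(results))  # one LayoutOutput per input image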
Source code in omnidocs/tasks/layout_extraction/base.py
extract_document
¶
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]

Run layout extraction on all pages of a document.

| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput, one per page |
Examples:
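Likewise, a hedged sketch; constructing a Document is not covered on this page, so `doc` is left as a placeholder rather than invented.

from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
# `doc` is a Document instance obtained elsewhere
outputs = extractor.extract_document(doc, progress_callback=lambda cur, total: print(f"{cur}/{total}"))
for page_number, layout in enumerate(outputs, start=1):
    print(f"page {page_number}: {len(layout.bboxes)} boxes")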
Source code in omnidocs/tasks/layout_extraction/base.py
DocLayoutYOLO
¶
Bases: BaseLayoutExtractor
DocLayout-YOLO layout extractor.
A YOLO-based model optimized for document layout detection. Detects: title, text, figure, table, formula, captions, etc.
This is a single-backend model (PyTorch only).
Example
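A minimal sketch mirroring the module-level example above; the page.png path is illustrative, and the device string can be changed if no GPU is available.

from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = extractor.extract("page.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")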
Initialize DocLayout-YOLO extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model_path, etc. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
DocLayoutYOLOConfig
¶
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right.
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/layout_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels |
| `image_height` | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in 0-1024 range |
Example
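A short sketch of the conversion, using the documented from_list/to_list helpers; the pixel values are illustrative.

from omnidocs.tasks.layout_extraction.models import BoundingBox

box = BoundingBox.from_list([100, 50, 300, 200])             # absolute pixel coordinates
norm = box.to_normalized(image_width=1000, image_height=800)
print(norm.to_list())                                        # scaled onto the virtual 1024x1024 canvas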
Source code in omnidocs/tasks/layout_extraction/models.py
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels |
| `image_height` | Target image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates |
Source code in omnidocs/tasks/layout_extraction/models.py
CustomLabel
¶
Bases: BaseModel
Type-safe custom layout label definition for VLM-based models.
VLM models like Qwen3-VL support flexible custom labels beyond the standard LayoutLabel enum. Use this class to define custom labels with validation.
Example
from omnidocs.tasks.layout_extraction import CustomLabel
# Simple custom label
code_block = CustomLabel(name="code_block")
# With metadata
sidebar = CustomLabel(
    name="sidebar",
    description="Secondary content panel",
    color="#9B59B6",
)
# Use with QwenLayoutDetector
result = detector.extract(image, custom_labels=[code_block, sidebar])
LabelMapping
¶
Base class for model-specific label mappings.
Each model maps its native labels to standardized LayoutLabel values.
Initialize label mapping.
| PARAMETER | DESCRIPTION |
|---|---|
| `mapping` | Dict mapping model-specific labels to LayoutLabel enum values |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutBox
¶
Bases: BaseModel
Single detected layout element with label, bounding box, and confidence.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width |
| `image_height` | Original image height |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutLabel
¶
Bases: str, Enum
Standardized layout labels used across all layout extractors.
These provide a consistent vocabulary regardless of which model is used.
LayoutOutput
¶
Bases: BaseModel
Complete layout extraction results for a single image.
filter_by_label
¶
filter_by_confidence
¶
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
sort_by_position
¶
Return a new LayoutOutput with boxes sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order) |
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bboxes
¶
Get all bounding boxes in normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata. |
Example
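A sketch, assuming `result` is a LayoutOutput produced by one of the extractors above:

for item in result.get_normalized_bboxes():
    print(item)  # dict with normalized bbox coordinates plus metadata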
Source code in omnidocs/tasks/layout_extraction/models.py
visualize
¶
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_labels: bool = True,
    show_confidence: bool = True,
    line_width: int = 3,
    font_size: int = 12,
) -> Image.Image

Visualize layout detection results on the image.
Draws bounding boxes with labels and confidence scores on the image. Each layout category has a distinct color for easy identification.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified) |
| `output_path` | Optional path to save the visualization |
| `show_labels` | Whether to show label text |
| `show_confidence` | Whether to show confidence scores |
| `line_width` | Width of bounding box lines |
| `font_size` | Size of label text (note: uses default font) |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn |
Example
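A sketch, assuming `result` is a LayoutOutput for page.png (path illustrative); the keyword arguments are those listed in the signature above.

from PIL import Image

page = Image.open("page.png")
annotated = result.visualize(
    page,
    output_path="layout_vis.png",   # optional; the annotated image is also returned
    show_labels=True,
    show_confidence=True,
)
annotated.show()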
Source code in omnidocs/tasks/layout_extraction/models.py
load_json
classmethod
¶
Load a LayoutOutput instance from a JSON file.
Reads a JSON file and deserializes its contents into a LayoutOutput object. Uses Pydantic's model_validate_json for proper handling of nested objects.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to JSON file containing serialized LayoutOutput data. Can be string or pathlib.Path object. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | Deserialized layout output instance from file. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the specified file does not exist. |
| `UnicodeDecodeError` | If file cannot be decoded as UTF-8. |
| `ValueError` | If file contents are not valid JSON. |
| `ValidationError` | If JSON data doesn't match LayoutOutput schema. |
Example
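The original example survived conversion only as its printed output (kept below); a sketch consistent with that output, using an illustrative file name:

from omnidocs.tasks.layout_extraction.models import LayoutOutput

result = LayoutOutput.load_json("layout_result.json")
print(f"Found {len(result.bboxes)} elements")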
Found 5 elements
Source code in omnidocs/tasks/layout_extraction/models.py
save_json
¶
Save LayoutOutput instance to a JSON file.
Serializes the LayoutOutput object to JSON and writes it to a file. Automatically creates parent directories if they don't exist. Uses UTF-8 encoding for compatibility and proper handling of special characters.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where JSON file should be saved. Can be string or pathlib.Path object. Parent directories will be created if they don't exist. |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | None |

| RAISES | DESCRIPTION |
|---|---|
| `OSError` | If file cannot be written due to permission or disk errors. |
| `TypeError` | If file_path is not a string or Path object. |
Example
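A sketch, assuming `result` is a LayoutOutput from any extractor above; the path is illustrative.

result.save_json("output/layout_result.json")  # parent directories are created automatically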
Source code in omnidocs/tasks/layout_extraction/models.py
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig
# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Basic extraction with default labels
result = detector.extract(image)
# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace backend), `QwenLayoutVLLMConfig` (VLLM high-throughput backend), `QwenLayoutMLXConfig` (MLX backend for Apple Silicon), `QwenLayoutAPIConfig` (API backend, e.g. OpenRouter) |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput

Run layout detection on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |
| `custom_labels` | Optional custom labels to detect. Can be: None (use default labels: title, text, table, figure, etc.), `List[str]` (simple label names such as ["code_block", "sidebar"]), or `List[CustomLabel]` (typed labels with metadata) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If model is not loaded |
| `ValueError` | If image format is not supported |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
RTDETRConfig
¶
RTDETRLayoutExtractor
¶
Bases: BaseLayoutExtractor
RT-DETR layout extractor using HuggingFace Transformers.
A transformer-based real-time detection model for document layout. Detects: title, text, table, figure, list, formula, captions, headers, footers.
This is a single-backend model (PyTorch/Transformers only).
Example
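A minimal sketch; RTDETRConfig is documented on this page and its init mentions a device setting, but the exact config field names are assumptions.

from omnidocs.tasks.layout_extraction import RTDETRLayoutExtractor, RTDETRConfig

extractor = RTDETRLayoutExtractor(config=RTDETRConfig(device="cuda"))  # device field assumed
result = extractor.extract("page.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")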
Initialize RT-DETR layout extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model settings, etc. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
VLMLayoutDetector
¶
Bases: BaseLayoutExtractor
Provider-agnostic VLM layout detector using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom labels for flexible detection.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
# Default labels
result = detector.extract("document.png")
# Custom labels
result = detector.extract("document.png", custom_labels=["code_block", "sidebar"])
Initialize VLM layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration with model and provider details. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
    prompt: Optional[str] = None,
) -> LayoutOutput

Run layout detection on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |
| `custom_labels` | Optional custom labels to detect. Can be: None (use default labels: title, text, table, figure, etc.), `List[str]` (simple label names such as ["code_block", "sidebar"]), or `List[CustomLabel]` (typed labels with metadata) |
| `prompt` | Custom prompt. If None, builds a default detection prompt. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
base
¶
Base class for layout extractors.
Defines the abstract interface that all layout extractors must implement.
BaseLayoutExtractor
¶
Bases: ABC
Abstract base class for layout extractors.
All layout extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput containing detected layout boxes with standardized labels |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If image format is not supported |
| `RuntimeError` | If model is not loaded or inference fails |
Source code in omnidocs/tasks/layout_extraction/base.py
batch_extract
¶
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]

Run layout extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.

| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput in same order as input |
Examples:
Source code in omnidocs/tasks/layout_extraction/base.py
extract_document
¶
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]

Run layout extraction on all pages of a document.

| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput, one per page |
Examples:
Source code in omnidocs/tasks/layout_extraction/base.py
doc_layout_yolo
¶
DocLayout-YOLO layout extractor.
A YOLO-based model for document layout detection, optimized for academic papers and technical documents.
Model: juliozhao/DocLayout-YOLO-DocStructBench
DocLayoutYOLOConfig
¶
DocLayoutYOLO
¶
Bases: BaseLayoutExtractor
DocLayout-YOLO layout extractor.
A YOLO-based model optimized for document layout detection. Detects: title, text, figure, table, formula, captions, etc.
This is a single-backend model (PyTorch only).
Example
Initialize DocLayout-YOLO extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model_path, etc. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
mineruvl
¶
MinerU VL layout detection module.
MinerU VL can be used for standalone layout detection, returning detected regions with types and bounding boxes.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
MinerUVLLayoutAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL layout detection.
Example
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
MinerUVLLayoutMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL layout detection on Apple Silicon.
Example
MinerUVLLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL layout detection.
Example
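The detector example above already shows the only documented field of this config, so a minimal sketch is:

from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)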
MinerUVLLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL layout detection.
Example
api
¶
API backend configuration for MinerU VL layout detection.
MinerUVLLayoutAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL layout detection.
Example
detector
¶
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code.
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig
detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
mlx
¶
MLX backend configuration for MinerU VL layout detection (Apple Silicon).
MinerUVLLayoutMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL layout detection on Apple Silicon.
Example
pytorch
¶
PyTorch backend configuration for MinerU VL layout detection.
MinerUVLLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL layout detection.
Example
models
¶
Pydantic models for layout extraction outputs.
Defines standardized output types and label enums for layout detection.
Coordinate Systems
- Absolute (default): Coordinates in pixels relative to original image size
- Normalized (0-1024): Coordinates scaled to 0-1024 range (virtual 1024x1024 canvas)
Use bbox.to_normalized(width, height) or output.get_normalized_bboxes()
to convert to normalized coordinates.
Example
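A sketch of converting between the two coordinate systems described above; the pixel values and image size are illustrative.

from omnidocs.tasks.layout_extraction.models import BoundingBox

bbox = BoundingBox.from_list([250, 125, 500, 375])   # absolute pixels
norm = bbox.to_normalized(1000, 500)                 # onto the virtual 1024x1024 canvas
back = norm.to_absolute(1000, 500)                   # round-trip back to pixels
print(norm.to_list(), back.to_list())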
LayoutLabel
¶
Bases: str, Enum
Standardized layout labels used across all layout extractors.
These provide a consistent vocabulary regardless of which model is used.
CustomLabel
¶
Bases: BaseModel
Type-safe custom layout label definition for VLM-based models.
VLM models like Qwen3-VL support flexible custom labels beyond the standard LayoutLabel enum. Use this class to define custom labels with validation.
Example
from omnidocs.tasks.layout_extraction import CustomLabel
# Simple custom label
code_block = CustomLabel(name="code_block")
# With metadata
sidebar = CustomLabel(
    name="sidebar",
    description="Secondary content panel",
    color="#9B59B6",
)
# Use with QwenLayoutDetector
result = detector.extract(image, custom_labels=[code_block, sidebar])
LabelMapping
¶
Base class for model-specific label mappings.
Each model maps its native labels to standardized LayoutLabel values.
Initialize label mapping.
| PARAMETER | DESCRIPTION |
|---|---|
| `mapping` | Dict mapping model-specific labels to LayoutLabel enum values |
Source code in omnidocs/tasks/layout_extraction/models.py
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right.
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/layout_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels |
| `image_height` | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in 0-1024 range |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels |
| `image_height` | Target image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutBox
¶
Bases: BaseModel
Single detected layout element with label, bounding box, and confidence.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width |
| `image_height` | Original image height |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutOutput
¶
Bases: BaseModel
Complete layout extraction results for a single image.
filter_by_label
¶
filter_by_confidence
¶
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
sort_by_position
¶
Return a new LayoutOutput with boxes sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order) |
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bboxes
¶
Get all bounding boxes in normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
visualize
¶
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_labels: bool = True,
    show_confidence: bool = True,
    line_width: int = 3,
    font_size: int = 12,
) -> Image.Image

Visualize layout detection results on the image.
Draws bounding boxes with labels and confidence scores on the image. Each layout category has a distinct color for easy identification.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified) |
| `output_path` | Optional path to save the visualization |
| `show_labels` | Whether to show label text |
| `show_confidence` | Whether to show confidence scores |
| `line_width` | Width of bounding box lines |
| `font_size` | Size of label text (note: uses default font) |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
load_json
classmethod
¶
Load a LayoutOutput instance from a JSON file.
Reads a JSON file and deserializes its contents into a LayoutOutput object. Uses Pydantic's model_validate_json for proper handling of nested objects.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to JSON file containing serialized LayoutOutput data. Can be string or pathlib.Path object. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | Deserialized layout output instance from file. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the specified file does not exist. |
| `UnicodeDecodeError` | If file cannot be decoded as UTF-8. |
| `ValueError` | If file contents are not valid JSON. |
| `ValidationError` | If JSON data doesn't match LayoutOutput schema. |
Example
Found 5 elements
Source code in omnidocs/tasks/layout_extraction/models.py
save_json
¶
Save LayoutOutput instance to a JSON file.
Serializes the LayoutOutput object to JSON and writes it to a file. Automatically creates parent directories if they don't exist. Uses UTF-8 encoding for compatibility and proper handling of special characters.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where JSON file should be saved. Can be string or pathlib.Path object. Parent directories will be created if they don't exist. |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | None |

| RAISES | DESCRIPTION |
|---|---|
| `OSError` | If file cannot be written due to permission or disk errors. |
| `TypeError` | If file_path is not a string or Path object. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
qwen
¶
Qwen3-VL backend configurations and detector for layout detection.
Available backends
- QwenLayoutPyTorchConfig: PyTorch/HuggingFace backend
- QwenLayoutVLLMConfig: VLLM high-throughput backend
- QwenLayoutMLXConfig: MLX backend for Apple Silicon
- QwenLayoutAPIConfig: API backend (OpenRouter, etc.)
Example
QwenLayoutAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen layout detection.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig
# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Basic extraction with default labels
result = detector.extract(image)
# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace backend), `QwenLayoutVLLMConfig` (VLLM high-throughput backend), `QwenLayoutMLXConfig` (MLX backend for Apple Silicon), `QwenLayoutAPIConfig` (API backend, e.g. OpenRouter) |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput

Run layout detection on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |
| `custom_labels` | Optional custom labels to detect. Can be: None (use default labels: title, text, table, figure, etc.), `List[str]` (simple label names such as ["code_block", "sidebar"]), or `List[CustomLabel]` (typed labels with metadata) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If model is not loaded |
| `ValueError` | If image format is not supported |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
QwenLayoutMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen layout detection.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
QwenLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen layout detection.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
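The detector examples above use this config directly, so a minimal sketch is:

from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)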
QwenLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen layout detection.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
api
¶
API backend configuration for Qwen3-VL layout detection.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
QwenLayoutAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen layout detection.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
detector
¶
Qwen3-VL layout detector.
A Vision-Language Model for flexible layout detection with custom label support. Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = detector.extract(image)
# With custom labels
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig
# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Basic extraction with default labels
result = detector.extract(image)
# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace backend), `QwenLayoutVLLMConfig` (VLLM high-throughput backend), `QwenLayoutMLXConfig` (MLX backend for Apple Silicon), `QwenLayoutAPIConfig` (API backend, e.g. OpenRouter) |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput

Run layout detection on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |
| `custom_labels` | Optional custom labels to detect. Can be: None (use default labels: title, text, table, figure, etc.), `List[str]` (simple label names such as ["code_block", "sidebar"]), or `List[CustomLabel]` (typed labels with metadata) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If model is not loaded |
| `ValueError` | If image format is not supported |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
mlx
¶
MLX backend configuration for Qwen3-VL layout detection.
QwenLayoutMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen layout detection.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL layout detection.
QwenLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen layout detection.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
vllm
¶
VLLM backend configuration for Qwen3-VL layout detection.
QwenLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen layout detection.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
rtdetr
¶
RT-DETR layout extractor.
A transformer-based real-time detection model for document layout detection. Uses HuggingFace Transformers implementation.
Model: HuggingPanda/docling-layout
RTDETRConfig
¶
RTDETRLayoutExtractor
¶
Bases: BaseLayoutExtractor
RT-DETR layout extractor using HuggingFace Transformers.
A transformer-based real-time detection model for document layout. Detects: title, text, table, figure, list, formula, captions, headers, footers.
This is a single-backend model (PyTorch/Transformers only).
Example
Initialize RT-DETR layout extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model settings, etc. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
vlm
¶
VLM layout detector.
A provider-agnostic Vision-Language Model layout detector using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
result = detector.extract("document.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.bbox}")
VLMLayoutDetector
¶
Bases: BaseLayoutExtractor
Provider-agnostic VLM layout detector using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom labels for flexible detection.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
# Default labels
result = detector.extract("document.png")
# Custom labels
result = detector.extract("document.png", custom_labels=["code_block", "sidebar"])
Initialize VLM layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration with model and provider details. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
    prompt: Optional[str] = None,
) -> LayoutOutput

Run layout detection on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |
| `custom_labels` | Optional custom labels to detect. Can be: None (use default labels: title, text, table, figure, etc.), `List[str]` (simple label names such as ["code_block", "sidebar"]), or `List[CustomLabel]` (typed labels with metadata) |
| `prompt` | Custom prompt. If None, builds a default detection prompt. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
ocr_extraction
¶
OCR Extraction Module.
Provides extractors for detecting text with bounding boxes from document images. Returns text content along with spatial coordinates (unlike Text Extraction which returns formatted Markdown/HTML without coordinates).
Available Extractors
- TesseractOCR: Open-source OCR (CPU, requires system Tesseract)
- EasyOCR: PyTorch-based OCR (CPU/GPU, 80+ languages)
- PaddleOCR: PaddlePaddle-based OCR (CPU/GPU, excellent CJK support)
Key Difference from Text Extraction
- OCR Extraction: Text + Bounding Boxes (spatial location)
- Text Extraction: Markdown/HTML (formatted document export)
Example
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig
ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))
result = ocr.extract(image)
for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox.to_list()} (conf: {block.confidence:.2f})")
# With EasyOCR
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
ocr = EasyOCR(config=EasyOCRConfig(languages=["en", "ch_sim"], gpu=True))
result = ocr.extract(image)
# With PaddleOCR
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig
ocr = PaddleOCR(config=PaddleOCRConfig(lang="ch", device="cpu"))
result = ocr.extract(image)
BaseOCRExtractor
¶
Bases: ABC
Abstract base class for OCR extractors.
All OCR extraction models must inherit from this class and implement the required methods.
Example
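A sketch using a concrete extractor through the shared interface; Tesseract is taken from the module example above, and the page.png path is illustrative.

from omnidocs.tasks.ocr_extraction import BaseOCRExtractor, TesseractOCR, TesseractOCRConfig

ocr: BaseOCRExtractor = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))
result = ocr.extract("page.png")
for block in result.text_blocks:
    print(block.text, block.confidence)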
extract
abstractmethod
¶
Run OCR extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput containing detected text blocks with bounding boxes |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If image format is not supported |
| `RuntimeError` | If OCR engine is not initialized or extraction fails |
Source code in omnidocs/tasks/ocr_extraction/base.py
batch_extract
¶
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]

Run OCR extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.

| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput in same order as input |
Examples:
Source code in omnidocs/tasks/ocr_extraction/base.py
extract_document
¶
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]

Run OCR extraction on all pages of a document.

| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput, one per page |
Examples:
Source code in omnidocs/tasks/ocr_extraction/base.py
EasyOCR
¶
Bases: BaseOCRExtractor
EasyOCR text extractor.
Single-backend model (PyTorch - CPU/GPU).
Example
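A sketch mirroring the module-level example above; the page.png path is illustrative and the extract keyword arguments come from the signature below.

from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig

ocr = EasyOCR(config=EasyOCRConfig(languages=["en", "ch_sim"], gpu=True))  # gpu=False for CPU-only
result = ocr.extract("page.png", paragraph=False, text_threshold=0.7)
for block in result.text_blocks:
    print(f"'{block.text}' (conf: {block.confidence:.2f})")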
Initialize EasyOCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If easyocr is not installed |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
extract
¶
extract(
    image: Union[Image, ndarray, str, Path],
    detail: int = 1,
    paragraph: bool = False,
    min_size: int = 10,
    text_threshold: float = 0.7,
    low_text: float = 0.4,
    link_threshold: float = 0.4,
    canvas_size: int = 2560,
    mag_ratio: float = 1.0,
) -> OCROutput

Run OCR on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |
| `detail` | 0 = simple output, 1 = detailed with boxes |
| `paragraph` | Combine results into paragraphs |
| `min_size` | Minimum text box size |
| `text_threshold` | Text confidence threshold |
| `low_text` | Low text bound |
| `link_threshold` | Link threshold for text joining |
| `canvas_size` | Max image dimension for processing |
| `mag_ratio` | Magnification ratio |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If detail is not 0 or 1 |
| `RuntimeError` | If EasyOCR is not initialized |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
extract_batch
¶
Run OCR on multiple images.
| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of input images |
| `**kwargs` | Arguments passed to extract() |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput objects |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
EasyOCRConfig
¶
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right. For rotated text, use the polygon field in TextBlock instead.
Example
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/ocr_extraction/models.py
from_polygon
classmethod
¶
Create axis-aligned bounding box from polygon points.
| PARAMETER | DESCRIPTION |
|---|---|
| `polygon` | List of [x, y] points (usually 4 for quadrilateral) |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox that encloses all polygon points |
Source code in omnidocs/tasks/ocr_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels |
| `image_height` | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/ocr_extraction/models.py
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels |
| `image_height` | Target image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
OCRGranularity
¶
Bases: str, Enum
OCR detection granularity levels.
Different OCR engines return results at different granularity levels. This enum standardizes the options across all extractors.
OCROutput
¶
Bases: BaseModel
Complete OCR extraction results for a single image.
Contains all detected text blocks with their bounding boxes, plus metadata about the extraction.
Example
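A sketch of typical post-processing with the helpers documented below; `result` is an OCROutput from any extractor on this page, the 0.5 threshold is illustrative, and the positional argument to filter_by_confidence is an assumption.

confident = result.filter_by_confidence(0.5)              # keep blocks at or above the threshold
ordered = confident.sort_by_position(top_to_bottom=True)  # reading order
for block in ordered.text_blocks:
    print(block.text)
ordered.save_json("ocr_result.json")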
filter_by_confidence
¶
Filter text blocks by minimum confidence.
filter_by_granularity
¶
Filter text blocks by granularity level.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
sort_by_position
¶
Return a new OCROutput with blocks sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | New OCROutput with sorted text blocks |
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_blocks
¶
Get all text blocks with normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata. |
Source code in omnidocs/tasks/ocr_extraction/models.py
visualize
¶
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_text: bool = True,
    show_confidence: bool = False,
    line_width: int = 2,
    box_color: str = "#2ECC71",
    text_color: str = "#000000",
) -> Image.Image

Visualize OCR results on the image.
Draws bounding boxes around detected text with optional labels.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified) |
| `output_path` | Optional path to save the visualization |
| `show_text` | Whether to show detected text |
| `show_confidence` | Whether to show confidence scores |
| `line_width` | Width of bounding box lines |
| `box_color` | Color for bounding boxes (hex) |
| `text_color` | Color for text labels (hex) |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn |
Source code in omnidocs/tasks/ocr_extraction/models.py
load_json
classmethod
¶
Load an OCROutput instance from a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to JSON file |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput instance |
Source code in omnidocs/tasks/ocr_extraction/models.py
save_json
¶
Save OCROutput instance to a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where JSON file should be saved |
Source code in omnidocs/tasks/ocr_extraction/models.py
TextBlock
¶
Bases: BaseModel
Single detected text element with text, bounding box, and confidence.
This is the fundamental unit of OCR output - can represent a character, word, line, or block depending on the OCR model and configuration.
Example
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width |
| `image_height` | Original image height |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
PaddleOCR
¶
Bases: BaseOCRExtractor
PaddleOCR text extractor.
Single-backend model (PaddlePaddle - CPU/GPU).
Example
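A sketch mirroring the module-level example above; the page.png path is illustrative.

from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig

ocr = PaddleOCR(config=PaddleOCRConfig(lang="ch", device="cpu"))
result = ocr.extract("page.png")
for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox.to_list()}")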
Initialize PaddleOCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If paddleocr or paddlepaddle is not installed |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
extract
¶
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
PaddleOCRConfig
¶
TesseractOCR
¶
Bases: BaseOCRExtractor
Tesseract OCR extractor.
Single-backend model (CPU only). Requires system Tesseract installation.
Example
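A sketch mirroring the module-level example above; the page.png path is illustrative, and extract_lines is the method documented further below.

from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))  # requires a system Tesseract install
words = ocr.extract("page.png")         # word-level blocks
lines = ocr.extract_lines("page.png")   # line-level blocks
print(len(words.text_blocks), len(lines.text_blocks))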
Initialize Tesseract OCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If Tesseract is not installed |
| `ImportError` | If pytesseract is not installed |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract
¶
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks at word level |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract_lines
¶
Run OCR and return line-level blocks.
Groups words into lines based on Tesseract's line detection.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with line-level text blocks |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
TesseractOCRConfig
¶
base
¶
Base class for OCR extractors.
Defines the abstract interface that all OCR extractors must implement.
BaseOCRExtractor
¶
Bases: ABC
Abstract base class for OCR extractors.
All OCR extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
Run OCR extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str` / `Path` (path to an image file) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput containing detected text blocks with bounding boxes |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If image format is not supported |
| `RuntimeError` | If OCR engine is not initialized or extraction fails |
Source code in omnidocs/tasks/ocr_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[OCROutput]
Run OCR extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[OCROutput] | List of OCROutput in same order as input |
Examples:
Source code in omnidocs/tasks/ocr_extraction/base.py
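A hedged sketch of the progress hook described above; the image paths are illustrative, and the callback simply receives (current, total) after each image.
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig

extractor = EasyOCR(config=EasyOCRConfig())

def on_progress(current: int, total: int) -> None:
    # Matches the documented function(current, total) shape.
    print(f"OCR progress: {current}/{total}")

images = ["page_01.png", "page_02.png", "page_03.png"]  # illustrative paths
results = extractor.batch_extract(images, progress_callback=on_progress)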
extract_document
¶
extract_document(
document: Document,
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[OCROutput]
Run OCR extraction on all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[OCROutput] | List of OCROutput, one per page |
Examples:
Source code in omnidocs/tasks/ocr_extraction/base.py
easyocr
¶
EasyOCR extractor.
EasyOCR is a PyTorch-based OCR engine with excellent multi-language support.
- GPU accelerated (optional)
- Supports 80+ languages
- Good for scene text and printed documents
Python Package
pip install easyocr
Model Download Location
By default, EasyOCR downloads models to ~/.EasyOCR/. This can be overridden with the model_storage_directory parameter.
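A minimal usage sketch with the default EasyOCRConfig and an illustrative file name. The 0.5 threshold and the positional use of filter_by_confidence are assumptions; the other calls are documented on OCROutput below.
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig

extractor = EasyOCR(config=EasyOCRConfig())
result = extractor.extract("scanned_page.png")      # word/line blocks with bounding boxes

# Keep confident detections, sort into reading order, and persist to JSON.
confident = result.filter_by_confidence(0.5)
ordered = confident.sort_by_position(top_to_bottom=True)
ordered.save_json("scanned_page.ocr.json")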
EasyOCRConfig
¶
EasyOCR
¶
Bases: BaseOCRExtractor
EasyOCR text extractor.
Single-backend model (PyTorch - CPU/GPU).
Example
Initialize EasyOCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | If easyocr is not installed |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
detail: int = 1,
paragraph: bool = False,
min_size: int = 10,
text_threshold: float = 0.7,
low_text: float = 0.4,
link_threshold: float = 0.4,
canvas_size: int = 2560,
mag_ratio: float = 1.0,
) -> OCROutput
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or path) |
| detail | 0 = simple output, 1 = detailed with boxes |
| paragraph | Combine results into paragraphs |
| min_size | Minimum text box size |
| text_threshold | Text confidence threshold |
| low_text | Low text bound |
| link_threshold | Link threshold for text joining |
| canvas_size | Max image dimension for processing |
| mag_ratio | Magnification ratio |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | OCROutput with detected text blocks |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If detail is not 0 or 1 |
| RuntimeError | If EasyOCR is not initialized |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
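For hard-to-read scans the detection can be tuned. The sketch below only uses keywords from the parameter table above and assumes an already initialized EasyOCR instance; the specific values are a starting point, not recommended defaults.
# Tighten detection for small, dense text.
result = extractor.extract(
    "dense_scan.png",     # illustrative path
    detail=1,             # detailed output with boxes
    paragraph=False,      # keep word/line level blocks
    min_size=5,           # allow smaller text boxes
    text_threshold=0.6,   # lower confidence threshold
    canvas_size=3000,     # larger processing canvas
    mag_ratio=1.5,        # upscale before detection
)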
extract_batch
¶
Run OCR on multiple images.
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of input images |
| **kwargs | Arguments passed to extract() |

| RETURNS | DESCRIPTION |
|---|---|
| List[OCROutput] | List of OCROutput objects |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
models
¶
Pydantic models for OCR extraction outputs.
Defines standardized output types for OCR detection including text blocks with bounding boxes, confidence scores, and granularity levels.
Key difference from Text Extraction:
- OCR returns text WITH bounding boxes (word/line/character level)
- Text Extraction returns formatted text (MD/HTML) WITHOUT bboxes
Coordinate Systems
- Absolute (default): Coordinates in pixels relative to original image size
- Normalized (0-1024): Coordinates scaled to 0-1024 range (virtual 1024x1024 canvas)
Use bbox.to_normalized(width, height) or output.get_normalized_blocks()
to convert to normalized coordinates.
Example
OCRGranularity
¶
Bases: str, Enum
OCR detection granularity levels.
Different OCR engines return results at different granularity levels. This enum standardizes the options across all extractors.
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right. For rotated text, use the polygon field in TextBlock instead.
Example
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/ocr_extraction/models.py
from_polygon
classmethod
¶
Create axis-aligned bounding box from polygon points.
| PARAMETER | DESCRIPTION |
|---|---|
| polygon | List of [x, y] points (usually 4 for quadrilateral) |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | BoundingBox that encloses all polygon points |
Source code in omnidocs/tasks/ocr_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| image_width | Original image width in pixels |
| image_height | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/ocr_extraction/models.py
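A small worked sketch of the 0-1024 normalization, built only from from_list, to_normalized, and to_absolute as documented here; the import path follows the "Source code in" note above, and the exact rounding behavior is an assumption.
from omnidocs.tasks.ocr_extraction.models import BoundingBox

box = BoundingBox.from_list([200, 100, 600, 300])
norm = box.to_normalized(image_width=2000, image_height=1000)
# Assuming a plain linear scale onto the virtual 1024x1024 canvas:
#   x values scale by 1024/2000 (200 -> ~102, 600 -> ~307)
#   y values scale by 1024/1000 (100 -> ~102, 300 -> ~307)
back = norm.to_absolute(image_width=2000, image_height=1000)  # round-trips to pixel space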
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| image_width | Target image width in pixels |
| image_height | Target image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | New BoundingBox with absolute pixel coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
TextBlock
¶
Bases: BaseModel
Single detected text element with text, bounding box, and confidence.
This is the fundamental unit of OCR output - can represent a character, word, line, or block depending on the OCR model and configuration.
Example
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| image_width | Original image width |
| image_height | Original image height |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | BoundingBox with normalized coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
OCROutput
¶
Bases: BaseModel
Complete OCR extraction results for a single image.
Contains all detected text blocks with their bounding boxes, plus metadata about the extraction.
Example
filter_by_confidence
¶
Filter text blocks by minimum confidence.
filter_by_granularity
¶
Filter text blocks by granularity level.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
sort_by_position
¶
Return a new OCROutput with blocks sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| top_to_bottom | If True, sort by y-coordinate (reading order) |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | New OCROutput with sorted text blocks |
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_blocks
¶
Get all text blocks with normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict] | List of dicts with normalized bbox coordinates and metadata |
Source code in omnidocs/tasks/ocr_extraction/models.py
visualize
¶
visualize(
image: Image,
output_path: Optional[Union[str, Path]] = None,
show_text: bool = True,
show_confidence: bool = False,
line_width: int = 2,
box_color: str = "#2ECC71",
text_color: str = "#000000",
) -> Image.Image
Visualize OCR results on the image.
Draws bounding boxes around detected text with optional labels.
| PARAMETER | DESCRIPTION |
|---|---|
| image | PIL Image to draw on (will be copied, not modified) |
| output_path | Optional path to save the visualization |
| show_text | Whether to show detected text |
| show_confidence | Whether to show confidence scores |
| line_width | Width of bounding box lines |
| box_color | Color for bounding boxes (hex) |
| text_color | Color for text labels (hex) |

| RETURNS | DESCRIPTION |
|---|---|
| Image | PIL Image with visualizations drawn |
Source code in omnidocs/tasks/ocr_extraction/models.py
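A usage sketch for visualize that only uses keyword arguments from the table above; the file names and the already-initialized `extractor` (any OCR extractor from this module) are illustrative.
from PIL import Image

page = Image.open("scanned_page.png")
result = extractor.extract(page)

annotated = result.visualize(
    page,
    output_path="scanned_page_ocr.png",  # also saved to disk
    show_text=True,
    show_confidence=True,
    line_width=3,
    box_color="#2ECC71",
)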
load_json
classmethod
¶
Load an OCROutput instance from a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to JSON file |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | OCROutput instance |
Source code in omnidocs/tasks/ocr_extraction/models.py
save_json
¶
Save OCROutput instance to a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path where JSON file should be saved |
Source code in omnidocs/tasks/ocr_extraction/models.py
paddleocr
¶
PaddleOCR extractor.
PaddleOCR is an OCR toolkit developed by Baidu/PaddlePaddle.
- Excellent for CJK languages (Chinese, Japanese, Korean)
- GPU accelerated
- Supports layout analysis + OCR
Python Package
pip install paddleocr paddlepaddle      # CPU version
pip install paddleocr paddlepaddle-gpu  # GPU version
Model Download Location
By default, PaddleOCR downloads models to ~/.paddleocr/
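A minimal usage sketch, assuming the default PaddleOCRConfig (its fields are not documented on this page) and an illustrative file name.
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig

extractor = PaddleOCR(config=PaddleOCRConfig())
result = extractor.extract("chinese_invoice.png")
result.save_json("chinese_invoice.ocr.json")   # save_json is documented on OCROutput above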
PaddleOCRConfig
¶
PaddleOCR
¶
Bases: BaseOCRExtractor
PaddleOCR text extractor.
Single-backend model (PaddlePaddle - CPU/GPU).
Example
Initialize PaddleOCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | If paddleocr or paddlepaddle is not installed |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
extract
¶
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | OCROutput with detected text blocks |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
tesseract
¶
Tesseract OCR extractor.
Tesseract is an open-source OCR engine maintained by Google.
- CPU-based (no GPU required)
- Requires system installation of Tesseract
- Good for printed text, supports 100+ languages
System Requirements
- macOS: brew install tesseract
- Ubuntu: sudo apt-get install tesseract-ocr
- Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
Python Package
pip install pytesseract
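A minimal usage sketch contrasting word-level extract with line-level extract_lines; the default config and the file name are illustrative.
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

extractor = TesseractOCR(config=TesseractOCRConfig())
words = extractor.extract("letter.png")        # word-level blocks (see extract below)
lines = extractor.extract_lines("letter.png")  # line-level blocks (see extract_lines below)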
TesseractOCRConfig
¶
TesseractOCR
¶
Bases: BaseOCRExtractor
Tesseract OCR extractor.
Single-backend model (CPU only). Requires system Tesseract installation.
Example
Initialize Tesseract OCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If Tesseract is not installed |
| ImportError | If pytesseract is not installed |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract
¶
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | OCROutput with detected text blocks at word level |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract_lines
¶
Run OCR and return line-level blocks.
Groups words into lines based on Tesseract's line detection.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| OCROutput | OCROutput with line-level text blocks |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
reading_order
¶
Reading Order Module.
Provides predictors for determining the logical reading sequence of document elements based on layout detection and spatial analysis.
Available Predictors
- RuleBasedReadingOrderPredictor: Rule-based predictor using R-tree indexing
Example
from omnidocs.tasks.reading_order import RuleBasedReadingOrderPredictor, ElementType
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
# Initialize components
layout_extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig())
ocr = EasyOCR(config=EasyOCRConfig())
predictor = RuleBasedReadingOrderPredictor()
# Process document
layout = layout_extractor.extract(image)
ocr_result = ocr.extract(image)
reading_order = predictor.predict(layout, ocr_result)
# Get text in reading order
text = reading_order.get_full_text()
# Get elements by type
tables = reading_order.get_elements_by_type(ElementType.TABLE)
# Get caption associations
for elem in reading_order.ordered_elements:
if elem.element_type == ElementType.FIGURE:
captions = reading_order.get_captions_for(elem.original_id)
print(f"Figure {elem.original_id} captions: {[c.text for c in captions]}")
BaseReadingOrderPredictor
¶
Bases: ABC
Abstract base class for reading order predictors.
Reading order predictors take layout detection and OCR results and produce a properly ordered sequence of document elements.
Example
predict
abstractmethod
¶
predict(
layout: LayoutOutput,
ocr: Optional[OCROutput] = None,
page_no: int = 0,
) -> ReadingOrderOutput
Predict reading order for a single page.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | Layout detection results with bounding boxes |
| ocr | Optional OCR results. If provided, text will be matched to layout elements by bbox overlap. |
| page_no | Page number (for multi-page documents) |

| RETURNS | DESCRIPTION |
|---|---|
| ReadingOrderOutput | ReadingOrderOutput with ordered elements and associations |
Example
Source code in omnidocs/tasks/reading_order/base.py
predict_multi_page
¶
predict_multi_page(
layouts: List[LayoutOutput],
ocrs: Optional[List[OCROutput]] = None,
) -> List[ReadingOrderOutput]
Predict reading order for multiple pages.
| PARAMETER | DESCRIPTION |
|---|---|
| layouts | List of layout results, one per page |
| ocrs | Optional list of OCR results, one per page |

| RETURNS | DESCRIPTION |
|---|---|
| List[ReadingOrderOutput] | List of ReadingOrderOutput, one per page |
Source code in omnidocs/tasks/reading_order/base.py
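A hedged sketch of multi-page prediction, assuming `layout_extractor`, `ocr`, and `predictor` are initialized as in the module example above and `doc` is a Document instance.
# One result per page from the documented extract_document helpers.
layouts = layout_extractor.extract_document(doc)   # List[LayoutOutput]
ocrs = ocr.extract_document(doc)                   # List[OCROutput]

pages = predictor.predict_multi_page(layouts, ocrs=ocrs)
for page_no, page in enumerate(pages):
    print(page_no, page.get_full_text()[:80])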
BoundingBox
¶
Bases: BaseModel
Bounding box in pixel coordinates.
to_list
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/reading_order/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
| PARAMETER | DESCRIPTION |
|---|---|
| image_width | Original image width in pixels |
| image_height | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/reading_order/models.py
ElementType
¶
Bases: str, Enum
Type of document element for reading order.
OrderedElement
¶
Bases: BaseModel
A document element with its reading order position.
Combines layout detection results with OCR text and assigns a reading order index.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/reading_order/models.py
ReadingOrderOutput
¶
Bases: BaseModel
Complete reading order prediction result.
Provides:
- Ordered list of document elements
- Caption-to-element associations
- Footnote-to-element associations
- Merge suggestions for split elements
Example
get_full_text
¶
Get concatenated text in reading order.
Excludes page headers, footers, captions, and footnotes from main text flow.
Source code in omnidocs/tasks/reading_order/models.py
get_elements_by_type
¶
get_captions_for
¶
Get caption elements for a given element ID.
Source code in omnidocs/tasks/reading_order/models.py
get_footnotes_for
¶
Get footnote elements for a given element ID.
Source code in omnidocs/tasks/reading_order/models.py
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/reading_order/models.py
save_json
¶
Save to JSON file.
load_json
classmethod
¶
RuleBasedReadingOrderPredictor
¶
Bases: BaseReadingOrderPredictor
Rule-based reading order predictor using spatial analysis.
Uses R-tree spatial indexing and rule-based algorithms to determine the logical reading sequence of document elements. This is a CPU-only implementation that doesn't require GPU resources.
Features:
- Multi-column layout detection
- Header/footer separation
- Caption-to-figure/table association
- Footnote linking
- Element merge suggestions
Example
from omnidocs.tasks.reading_order import RuleBasedReadingOrderPredictor
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
# Initialize components
layout_extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig())
ocr = EasyOCR(config=EasyOCRConfig())
predictor = RuleBasedReadingOrderPredictor()
# Process document
layout = layout_extractor.extract(image)
ocr_result = ocr.extract(image)
reading_order = predictor.predict(layout, ocr_result)
# Get text in reading order
text = reading_order.get_full_text()
Initialize the reading order predictor.
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
predict
¶
predict(
layout: LayoutOutput,
ocr: Optional[OCROutput] = None,
page_no: int = 0,
) -> ReadingOrderOutput
Predict reading order for a single page.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | Layout detection results with bounding boxes |
| ocr | Optional OCR results for text content |
| page_no | Page number (for multi-page documents) |

| RETURNS | DESCRIPTION |
|---|---|
| ReadingOrderOutput | ReadingOrderOutput with ordered elements and associations |
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
base
¶
Base class for reading order predictors.
Defines the abstract interface that all reading order predictors must implement.
BaseReadingOrderPredictor
¶
Bases: ABC
Abstract base class for reading order predictors.
Reading order predictors take layout detection and OCR results and produce a properly ordered sequence of document elements.
Example
predict
abstractmethod
¶
predict(
layout: LayoutOutput,
ocr: Optional[OCROutput] = None,
page_no: int = 0,
) -> ReadingOrderOutput
Predict reading order for a single page.
| PARAMETER | DESCRIPTION |
|---|---|
layout
|
Layout detection results with bounding boxes
TYPE:
|
ocr
|
Optional OCR results. If provided, text will be matched to layout elements by bbox overlap.
TYPE:
|
page_no
|
Page number (for multi-page documents)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ReadingOrderOutput
|
ReadingOrderOutput with ordered elements and associations |
Example
Source code in omnidocs/tasks/reading_order/base.py
predict_multi_page
¶
predict_multi_page(
layouts: List[LayoutOutput],
ocrs: Optional[List[OCROutput]] = None,
) -> List[ReadingOrderOutput]
Predict reading order for multiple pages.
| PARAMETER | DESCRIPTION |
|---|---|
layouts
|
List of layout results, one per page
TYPE:
|
ocrs
|
Optional list of OCR results, one per page
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[ReadingOrderOutput]
|
List of ReadingOrderOutput, one per page |
Source code in omnidocs/tasks/reading_order/base.py
models
¶
Pydantic models for reading order prediction.
Takes layout detection and OCR results, produces ordered element sequence with caption and footnote associations.
Example
# Get layout and OCR
layout = layout_extractor.extract(image)
ocr = ocr_extractor.extract(image)
# Predict reading order
reading_order = predictor.predict(layout, ocr)
# Iterate in reading order
for element in reading_order.ordered_elements:
print(f"{element.index}: [{element.element_type}] {element.text[:50]}...")
# Get caption associations
for fig_id, caption_ids in reading_order.caption_map.items():
print(f"Figure {fig_id} has captions: {caption_ids}")
ElementType
¶
Bases: str, Enum
Type of document element for reading order.
BoundingBox
¶
Bases: BaseModel
Bounding box in pixel coordinates.
to_list
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/reading_order/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
| PARAMETER | DESCRIPTION |
|---|---|
image_width
|
Original image width in pixels
TYPE:
|
image_height
|
Original image height in pixels
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
BoundingBox
|
New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/reading_order/models.py
OrderedElement
¶
Bases: BaseModel
A document element with its reading order position.
Combines layout detection results with OCR text and assigns a reading order index.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/reading_order/models.py
ReadingOrderOutput
¶
Bases: BaseModel
Complete reading order prediction result.
Provides: - Ordered list of document elements - Caption-to-element associations - Footnote-to-element associations - Merge suggestions for split elements
Example
get_full_text
¶
Get concatenated text in reading order.
Excludes page headers, footers, captions, and footnotes from main text flow.
Source code in omnidocs/tasks/reading_order/models.py
get_elements_by_type
¶
get_captions_for
¶
Get caption elements for a given element ID.
Source code in omnidocs/tasks/reading_order/models.py
get_footnotes_for
¶
Get footnote elements for a given element ID.
Source code in omnidocs/tasks/reading_order/models.py
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/reading_order/models.py
save_json
¶
Save to JSON file.
load_json
classmethod
¶
rule_based
¶
Rule-based reading order predictor module.
Provides rule-based reading order prediction using spatial analysis.
RuleBasedReadingOrderPredictor
¶
Bases: BaseReadingOrderPredictor
Rule-based reading order predictor using spatial analysis.
Uses R-tree spatial indexing and rule-based algorithms to determine the logical reading sequence of document elements. This is a CPU-only implementation that doesn't require GPU resources.
Features: - Multi-column layout detection - Header/footer separation - Caption-to-figure/table association - Footnote linking - Element merge suggestions
Example
from omnidocs.tasks.reading_order import RuleBasedReadingOrderPredictor
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
# Initialize components
layout_extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig())
ocr = EasyOCR(config=EasyOCRConfig())
predictor = RuleBasedReadingOrderPredictor()
# Process document
layout = layout_extractor.extract(image)
ocr_result = ocr.extract(image)
reading_order = predictor.predict(layout, ocr_result)
# Get text in reading order
text = reading_order.get_full_text()
Initialize the reading order predictor.
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
predict
¶
predict(
layout: LayoutOutput,
ocr: Optional[OCROutput] = None,
page_no: int = 0,
) -> ReadingOrderOutput
Predict reading order for a single page.
| PARAMETER | DESCRIPTION |
|---|---|
layout
|
Layout detection results with bounding boxes
TYPE:
|
ocr
|
Optional OCR results for text content
TYPE:
|
page_no
|
Page number (for multi-page documents)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ReadingOrderOutput
|
ReadingOrderOutput with ordered elements and associations |
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
predictor
¶
Rule-based reading order predictor.
Uses spatial analysis and R-tree indexing to determine the logical reading sequence of document elements. Self-contained implementation without external dependencies on docling-ibm-models.
Based on the algorithm from docling-ibm-models, adapted for omnidocs.
RuleBasedReadingOrderPredictor
¶
Bases: BaseReadingOrderPredictor
Rule-based reading order predictor using spatial analysis.
Uses R-tree spatial indexing and rule-based algorithms to determine the logical reading sequence of document elements. This is a CPU-only implementation that doesn't require GPU resources.
Features: - Multi-column layout detection - Header/footer separation - Caption-to-figure/table association - Footnote linking - Element merge suggestions
Example
from omnidocs.tasks.reading_order import RuleBasedReadingOrderPredictor
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
# Initialize components
layout_extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig())
ocr = EasyOCR(config=EasyOCRConfig())
predictor = RuleBasedReadingOrderPredictor()
# Process document
layout = layout_extractor.extract(image)
ocr_result = ocr.extract(image)
reading_order = predictor.predict(layout, ocr_result)
# Get text in reading order
text = reading_order.get_full_text()
Initialize the reading order predictor.
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
predict
¶
predict(
layout: LayoutOutput,
ocr: Optional[OCROutput] = None,
page_no: int = 0,
) -> ReadingOrderOutput
Predict reading order for a single page.
| PARAMETER | DESCRIPTION |
|---|---|
layout
|
Layout detection results with bounding boxes
TYPE:
|
ocr
|
Optional OCR results for text content
TYPE:
|
page_no
|
Page number (for multi-page documents)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ReadingOrderOutput
|
ReadingOrderOutput with ordered elements and associations |
Source code in omnidocs/tasks/reading_order/rule_based/predictor.py
structured_extraction
¶
Structured Extraction Module.
Provides extractors for extracting structured data from document images using Pydantic schemas for type-safe output.
Example
from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor
class Invoice(BaseModel):
vendor: str
total: float
items: list[str]
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)
result = extractor.extract(
"invoice.png",
schema=Invoice,
prompt="Extract invoice details from this document.",
)
print(result.data.vendor, result.data.total)
BaseStructuredExtractor
¶
Bases: ABC
Abstract base class for structured extractors.
Structured extractors return data matching a user-provided Pydantic schema.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
schema: type[BaseModel],
prompt: str,
) -> StructuredOutput
Extract structured data from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| schema | Pydantic model class defining the expected output structure |
| prompt | Extraction prompt describing what to extract |

| RETURNS | DESCRIPTION |
|---|---|
| StructuredOutput | StructuredOutput containing the validated data |
Source code in omnidocs/tasks/structured_extraction/base.py
StructuredOutput
¶
Bases: BaseModel
Output from structured extraction.
Contains the extracted data as a validated Pydantic model instance, along with metadata about the extraction.
VLMStructuredExtractor
¶
Bases: BaseStructuredExtractor
Provider-agnostic VLM structured extractor using litellm.
Extracts structured data from document images using any cloud VLM API. Uses litellm's native response_format support to send Pydantic schemas to providers that support structured output (OpenAI, Gemini, etc.).
Example
from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor
class Invoice(BaseModel):
vendor: str
total: float
items: list[str]
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)
result = extractor.extract("invoice.png", schema=Invoice, prompt="Extract invoice fields")
print(result.data.vendor)
Initialize VLM structured extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | VLM API configuration with model and provider details |
Source code in omnidocs/tasks/structured_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
schema: type[BaseModel],
prompt: str,
) -> StructuredOutput
Extract structured data from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| schema | Pydantic model class defining the expected output structure |
| prompt | Extraction prompt describing what to extract |

| RETURNS | DESCRIPTION |
|---|---|
| StructuredOutput | StructuredOutput containing the validated data |
Source code in omnidocs/tasks/structured_extraction/vlm.py
base
¶
Base class for structured extractors.
Defines the abstract interface for extracting structured data from document images.
BaseStructuredExtractor
¶
Bases: ABC
Abstract base class for structured extractors.
Structured extractors return data matching a user-provided Pydantic schema.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
schema: type[BaseModel],
prompt: str,
) -> StructuredOutput
Extract structured data from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
schema
|
Pydantic model class defining the expected output structure.
TYPE:
|
prompt
|
Extraction prompt describing what to extract.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
StructuredOutput
|
StructuredOutput containing the validated data. |
Source code in omnidocs/tasks/structured_extraction/base.py
models
¶
Pydantic models for structured extraction outputs.
StructuredOutput
¶
Bases: BaseModel
Output from structured extraction.
Contains the extracted data as a validated Pydantic model instance, along with metadata about the extraction.
vlm
¶
VLM structured extractor.
A provider-agnostic Vision-Language Model structured extractor using litellm. Extracts structured data matching a Pydantic schema from document images.
Example
from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor
class Invoice(BaseModel):
vendor: str
total: float
items: list[str]
date: str
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)
result = extractor.extract(
image="invoice.png",
schema=Invoice,
prompt="Extract invoice details from this document.",
)
print(result.data.vendor, result.data.total)
VLMStructuredExtractor
¶
Bases: BaseStructuredExtractor
Provider-agnostic VLM structured extractor using litellm.
Extracts structured data from document images using any cloud VLM API. Uses litellm's native response_format support to send Pydantic schemas to providers that support structured output (OpenAI, Gemini, etc.).
Example
from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor
class Invoice(BaseModel):
vendor: str
total: float
items: list[str]
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)
result = extractor.extract("invoice.png", schema=Invoice, prompt="Extract invoice fields")
print(result.data.vendor)
Initialize VLM structured extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
VLM API configuration with model and provider details.
TYPE:
|
Source code in omnidocs/tasks/structured_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
schema: type[BaseModel],
prompt: str,
) -> StructuredOutput
Extract structured data from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
schema
|
Pydantic model class defining the expected output structure.
TYPE:
|
prompt
|
Extraction prompt describing what to extract.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
StructuredOutput
|
StructuredOutput containing the validated data. |
Source code in omnidocs/tasks/structured_extraction/vlm.py
table_extraction
¶
Table Extraction Module.
Provides extractors for detecting and extracting table structure from document images. Outputs structured table data with cells, spans, and multiple export formats (HTML, Markdown, Pandas DataFrame).
Available Extractors
- TableFormerExtractor: Transformer-based table structure extractor
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Initialize extractor
extractor = TableFormerExtractor(
config=TableFormerConfig(mode="fast", device="cuda")
)
# Extract table structure
result = extractor.extract(table_image)
# Get HTML output
html = result.to_html()
# Get DataFrame
df = result.to_dataframe()
# Get Markdown
md = result.to_markdown()
# Access cells
for cell in result.cells:
print(f"[{cell.row},{cell.col}] {cell.text}")
BaseTableExtractor
¶
Bases: ABC
Abstract base class for table structure extractors.
Table extractors analyze table images to detect cell structure, identify headers, and extract text content.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
ocr_output: Optional[OCROutput] = None,
) -> TableOutput
Extract table structure from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Table image (should be cropped to table region) |
| ocr_output | Optional OCR results for cell text matching. If not provided, model will attempt to extract text. |

| RETURNS | DESCRIPTION |
|---|---|
| TableOutput | TableOutput with cells, structure, and export methods |
Example
Source code in omnidocs/tasks/table_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
ocr_outputs: Optional[List[OCROutput]] = None,
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TableOutput]
Extract tables from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of table images |
| ocr_outputs | Optional list of OCR results (same length as images) |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[TableOutput] | List of TableOutput in same order as input |
Examples:
Source code in omnidocs/tasks/table_extraction/base.py
extract_document
¶
extract_document(
document: Document,
table_bboxes: Optional[List[List[float]]] = None,
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TableOutput]
Extract tables from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance |
| table_bboxes | Optional list of table bounding boxes per page. Each element should be a list of [x1, y1, x2, y2] coords. |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[TableOutput] | List of TableOutput, one per detected table |
Examples:
Source code in omnidocs/tasks/table_extraction/base.py
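A hedged sketch of per-page table extraction: the coordinates are made up for illustration, `doc` is assumed to be a two-page Document, and the extractor is the documented TableFormerExtractor.
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig

extractor = TableFormerExtractor(config=TableFormerConfig(mode="fast"))

# One entry per page; each entry is a list of [x1, y1, x2, y2] table boxes.
table_bboxes = [
    [[100.0, 200.0, 900.0, 600.0]],  # one table on page 1
    [],                              # no tables on page 2
]
tables = extractor.extract_document(doc, table_bboxes=table_bboxes)
for table in tables:
    print(table.to_markdown())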
BoundingBox
¶
Bases: BaseModel
Bounding box in pixel coordinates.
to_list
¶
to_xyxy
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/table_extraction/models.py
from_ltrb
classmethod
¶
Create from left, top, right, bottom coordinates.
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
| PARAMETER | DESCRIPTION |
|---|---|
| image_width | Original image width in pixels |
| image_height | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| BoundingBox | New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/table_extraction/models.py
CellType
¶
Bases: str, Enum
Type of table cell.
TableCell
¶
Bases: BaseModel
Single table cell with position, span, and content.
The cell position uses 0-indexed row/column indices. Spans indicate how many rows/columns the cell occupies.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/table_extraction/models.py
TableOutput
¶
Bases: BaseModel
Complete table extraction result.
Provides multiple export formats and utility methods for working with extracted table data.
Example
get_cell
¶
Get cell at specific position.
Handles merged cells by returning the cell that covers the position.
Source code in omnidocs/tasks/table_extraction/models.py
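A short sketch of cell lookup on a merged table, assuming `result` is a TableOutput from extract above. The (row, col) argument order and the None check are assumptions; cell.row, cell.col, and cell.text appear in the module example above.
# Any position covered by a spanning cell resolves to the covering cell.
cell = result.get_cell(0, 1)
if cell is not None:   # defensive; the return type is not documented here
    print(f"[{cell.row},{cell.col}] {cell.text}")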
get_row
¶
get_column
¶
to_html
¶
Convert table to HTML string.
| PARAMETER | DESCRIPTION |
|---|---|
| include_styles | Whether to include basic CSS styling |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML table string |
Source code in omnidocs/tasks/table_extraction/models.py
to_dataframe
¶
Convert table to Pandas DataFrame.
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | pandas.DataFrame with table data |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | If pandas is not installed |
Source code in omnidocs/tasks/table_extraction/models.py
to_markdown
¶
Convert table to Markdown format.
Note: Markdown tables don't support merged cells, so spans are ignored and only the top-left cell value is used.
| RETURNS | DESCRIPTION |
|---|---|
| str | Markdown table string |
Source code in omnidocs/tasks/table_extraction/models.py
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/table_extraction/models.py
save_json
¶
Save to JSON file.
load_json
classmethod
¶
TableFormerConfig
¶
Bases: BaseModel
Configuration for TableFormer table structure extractor.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags and cell bounding boxes.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| mode | Inference mode - "fast" or "accurate" |
| device | Device for inference - "cpu", "cuda", "mps", or "auto" |
| num_threads | Number of CPU threads for inference |
| do_cell_matching | Whether to match predicted cells with OCR text cells |
| artifacts_path | Path to pre-downloaded model artifacts |
| repo_id | HuggingFace model repository |
| revision | Model revision/tag |
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Fast mode
extractor = TableFormerExtractor(config=TableFormerConfig(mode="fast"))
# Accurate mode with GPU
extractor = TableFormerExtractor(
config=TableFormerConfig(
mode="accurate",
device="cuda",
do_cell_matching=True,
)
)
TableFormerExtractor
¶
Bases: BaseTableExtractor
Table structure extractor using TableFormer model.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags. It can detect:
- Cell boundaries (bounding boxes)
- Row and column spans
- Header cells (column and row headers)
- Section rows
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Initialize extractor
extractor = TableFormerExtractor(
config=TableFormerConfig(mode="fast", device="cuda")
)
# Extract table structure
result = extractor.extract(table_image)
# Get HTML output
html = result.to_html()
# Get DataFrame
df = result.to_dataframe()
Initialize TableFormer extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | TableFormerConfig with model settings |
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
ocr_output: Optional[OCROutput] = None,
) -> TableOutput
Extract table structure from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Table image (should be cropped to table region) |
| ocr_output | Optional OCR results for cell text matching |

| RETURNS | DESCRIPTION |
|---|---|
| TableOutput | TableOutput with cells, structure, and export methods |
Example
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
TableFormerMode
¶
Bases: str, Enum
TableFormer inference mode.
base
¶
Base class for table extractors.
Defines the abstract interface that all table extractors must implement.
BaseTableExtractor
¶
Bases: ABC
Abstract base class for table structure extractors.
Table extractors analyze table images to detect cell structure, identify headers, and extract text content.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
ocr_output: Optional[OCROutput] = None,
) -> TableOutput
Extract table structure from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Table image (should be cropped to table region)
TYPE:
|
ocr_output
|
Optional OCR results for cell text matching. If not provided, model will attempt to extract text.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TableOutput
|
TableOutput with cells, structure, and export methods |
Example
Source code in omnidocs/tasks/table_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
ocr_outputs: Optional[List[OCROutput]] = None,
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TableOutput]
Extract tables from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
images
|
List of table images
TYPE:
|
ocr_outputs
|
Optional list of OCR results (same length as images)
TYPE:
|
progress_callback
|
Optional function(current, total) for progress
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[TableOutput]
|
List of TableOutput in same order as input |
Examples:
Source code in omnidocs/tasks/table_extraction/base.py
extract_document
¶
extract_document(
document: Document,
table_bboxes: Optional[List[List[float]]] = None,
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TableOutput]
Extract tables from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
document
|
Document instance
TYPE:
|
table_bboxes
|
Optional list of table bounding boxes per page. Each element should be a list of [x1, y1, x2, y2] coords.
TYPE:
|
progress_callback
|
Optional function(current, total) for progress
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[TableOutput]
|
List of TableOutput, one per detected table |
Examples:
Source code in omnidocs/tasks/table_extraction/base.py
models
¶
Pydantic models for table extraction outputs.
Provides structured table data with cells, spans, and multiple export formats including HTML, Markdown, and Pandas DataFrame conversion.
Example
CellType
¶
Bases: str, Enum
Type of table cell.
BoundingBox
¶
Bases: BaseModel
Bounding box in pixel coordinates.
to_list
¶
to_xyxy
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/table_extraction/models.py
from_ltrb
classmethod
¶
Create from left, top, right, bottom coordinates.
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
| PARAMETER | DESCRIPTION |
|---|---|
image_width
|
Original image width in pixels
TYPE:
|
image_height
|
Original image height in pixels
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
BoundingBox
|
New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/table_extraction/models.py
TableCell
¶
Bases: BaseModel
Single table cell with position, span, and content.
The cell position uses 0-indexed row/column indices. Spans indicate how many rows/columns the cell occupies.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/table_extraction/models.py
TableOutput
¶
Bases: BaseModel
Complete table extraction result.
Provides multiple export formats and utility methods for working with extracted table data.
Example
get_cell
¶
Get cell at specific position.
Handles merged cells by returning the cell that covers the position.
Source code in omnidocs/tasks/table_extraction/models.py
get_row
¶
get_column
¶
to_html
¶
Convert table to HTML string.
| PARAMETER | DESCRIPTION |
|---|---|
include_styles
|
Whether to include basic CSS styling
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
HTML table string |
Source code in omnidocs/tasks/table_extraction/models.py
to_dataframe
¶
Convert table to Pandas DataFrame.
| RETURNS | DESCRIPTION |
|---|---|
|
pandas.DataFrame with table data |
| RAISES | DESCRIPTION |
|---|---|
ImportError
|
If pandas is not installed |
Source code in omnidocs/tasks/table_extraction/models.py
to_markdown
¶
Convert table to Markdown format.
Note: Markdown tables don't support merged cells, so spans are ignored and only the top-left cell value is used.
| RETURNS | DESCRIPTION |
|---|---|
str
|
Markdown table string |
Source code in omnidocs/tasks/table_extraction/models.py
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/table_extraction/models.py
save_json
¶
Save to JSON file.
load_json
classmethod
¶
tableformer
¶
TableFormer module for table structure extraction.
Provides the TableFormer-based table structure extractor.
TableFormerConfig
¶
Bases: BaseModel
Configuration for TableFormer table structure extractor.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags and cell bounding boxes.
| ATTRIBUTE | DESCRIPTION |
|---|---|
mode |
Inference mode - "fast" or "accurate"
TYPE:
|
device |
Device for inference - "cpu", "cuda", "mps", or "auto"
TYPE:
|
num_threads |
Number of CPU threads for inference
TYPE:
|
do_cell_matching |
Whether to match predicted cells with OCR text cells
TYPE:
|
artifacts_path |
Path to pre-downloaded model artifacts
TYPE:
|
repo_id |
HuggingFace model repository
TYPE:
|
revision |
Model revision/tag
TYPE:
|
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Fast mode
extractor = TableFormerExtractor(config=TableFormerConfig(mode="fast"))
# Accurate mode with GPU
extractor = TableFormerExtractor(
config=TableFormerConfig(
mode="accurate",
device="cuda",
do_cell_matching=True,
)
)
TableFormerMode
¶
Bases: str, Enum
TableFormer inference mode.
TableFormerExtractor
¶
Bases: BaseTableExtractor
Table structure extractor using TableFormer model.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags. It can detect: - Cell boundaries (bounding boxes) - Row and column spans - Header cells (column and row headers) - Section rows
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Initialize extractor
extractor = TableFormerExtractor(
config=TableFormerConfig(mode="fast", device="cuda")
)
# Extract table structure
result = extractor.extract(table_image)
# Get HTML output
html = result.to_html()
# Get DataFrame
df = result.to_dataframe()
Initialize TableFormer extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
TableFormerConfig with model settings
TYPE:
|
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
ocr_output: Optional[OCROutput] = None,
) -> TableOutput
Extract table structure from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Table image (should be cropped to table region)
TYPE:
|
ocr_output
|
Optional OCR results for cell text matching
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TableOutput
|
TableOutput with cells, structure, and export methods |
Example
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
config
¶
Configuration for TableFormer table structure extractor.
TableFormer uses a dual-decoder transformer architecture with OTSL+ support for recognizing table structure from images.
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Fast mode (default)
extractor = TableFormerExtractor(config=TableFormerConfig())
# Accurate mode with GPU
extractor = TableFormerExtractor(
config=TableFormerConfig(
mode="accurate",
device="cuda",
do_cell_matching=True,
)
)
TableFormerMode
¶
Bases: str, Enum
TableFormer inference mode.
TableFormerConfig
¶
Bases: BaseModel
Configuration for TableFormer table structure extractor.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags and cell bounding boxes.
| ATTRIBUTE | DESCRIPTION |
|---|---|
mode |
Inference mode - "fast" or "accurate"
TYPE:
|
device |
Device for inference - "cpu", "cuda", "mps", or "auto"
TYPE:
|
num_threads |
Number of CPU threads for inference
TYPE:
|
do_cell_matching |
Whether to match predicted cells with OCR text cells
TYPE:
|
artifacts_path |
Path to pre-downloaded model artifacts
TYPE:
|
repo_id |
HuggingFace model repository
TYPE:
|
revision |
Model revision/tag
TYPE:
|
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Fast mode
extractor = TableFormerExtractor(config=TableFormerConfig(mode="fast"))
# Accurate mode with GPU
extractor = TableFormerExtractor(
config=TableFormerConfig(
mode="accurate",
device="cuda",
do_cell_matching=True,
)
)
pytorch
¶
TableFormer extractor implementation using PyTorch backend.
Uses the TFPredictor from docling-ibm-models for table structure recognition.
TableFormerExtractor
¶
Bases: BaseTableExtractor
Table structure extractor using TableFormer model.
TableFormer is a transformer-based model that predicts table structure using OTSL (Optimal Table Structure Language) tags. It can detect: - Cell boundaries (bounding boxes) - Row and column spans - Header cells (column and row headers) - Section rows
Example
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
# Initialize extractor
extractor = TableFormerExtractor(
config=TableFormerConfig(mode="fast", device="cuda")
)
# Extract table structure
result = extractor.extract(table_image)
# Get HTML output
html = result.to_html()
# Get DataFrame
df = result.to_dataframe()
Initialize TableFormer extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
TableFormerConfig with model settings
TYPE:
|
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
ocr_output: Optional[OCROutput] = None,
) -> TableOutput
Extract table structure from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Table image (should be cropped to table region)
TYPE:
|
ocr_output
|
Optional OCR results for cell text matching
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TableOutput
|
TableOutput with cells, structure, and export methods |
Example
Source code in omnidocs/tasks/table_extraction/tableformer/pytorch.py
text_extraction
¶
Text Extraction Module.
Provides extractors for converting document images to structured text formats (HTML, Markdown, JSON). Uses Vision-Language Models for accurate text extraction with formatting preservation and optional layout detection.
Available Extractors
- QwenTextExtractor: Qwen3-VL based extractor (multi-backend)
- DotsOCRTextExtractor: Dots OCR with layout-aware extraction (PyTorch/VLLM/API)
- NanonetsTextExtractor: Nanonets OCR2-3B for text extraction (PyTorch/VLLM)
- GraniteDoclingTextExtractor: IBM Granite Docling for document conversion (multi-backend)
- MinerUVLTextExtractor: MinerU VL 1.2B with layout-aware two-step extraction (multi-backend)
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
BaseTextExtractor
¶
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image as: PIL.Image.Image (PIL image object), np.ndarray (numpy array, HWC format, RGB), or str / Path (path to image file) |
| output_format | Desired output format: "html" (structured HTML) or "markdown" (Markdown format) |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If image format or output_format is not supported |
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format |
| output_format | Desired output format |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput in same order as input |
Examples:
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
extract_document
¶
extract_document(
document: Document,
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance |
| output_format | Desired output format |
| progress_callback | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput, one per page |
Examples:
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
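The per-page outputs can then be written out individually (TextOutput.content holds the extracted text, as in the examples above):
from pathlib import Path

for page_number, page in enumerate(results, start=1):
    Path(f"page_{page_number}.md").write_text(page.content)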
Source code in omnidocs/tasks/text_extraction/base.py
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information including: - 11 layout categories (Caption, Footnote, Formula, List-item, etc.) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown, LaTeX, HTML) - Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend - DotsOCRVLLMConfig: VLLM high-throughput backend - DotsOCRAPIConfig: API backend (online VLLM server)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown", "html", or "json")
TYPE:
|
include_layout
|
Include layout bounding boxes in output
TYPE:
|
custom_prompt
|
Override default extraction prompt
TYPE:
|
max_tokens
|
Maximum tokens for generation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DotsOCRTextOutput
|
DotsOCRTextOutput with extracted content and optional layout |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded or inference fails |
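As a further sketch, the optional arguments can be combined; the exact shape of the JSON content is defined by the model output and is not reproduced here:
result = extractor.extract(
    image,
    output_format="json",
    include_layout=True,
    custom_prompt=None,   # pass a string here to override the default prompt
    max_tokens=8192,
)
print(result.content)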
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs the DocTags format, which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API config)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown" or "html")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction: 1. Layout detection (detect regions) 2. Content recognition (extract text/table/equation from each region)
Supports multiple backends: - PyTorch (HuggingFace Transformers) - VLLM (high-throughput GPU) - MLX (Apple Silicon) - API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image
TYPE:
|
output_format
|
Output format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[TextOutput, List[ContentBlock]]
|
Tuple of (TextOutput, List[ContentBlock]) |
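A usage sketch, following the block.type / block.content access used in the mineruvl module example later in this page:
result, blocks = extractor.extract_with_blocks(image, output_format="markdown")
for block in blocks:
    print(block.type, block.content[:50])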
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with: - Layout detection (11 categories) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown/LaTeX/HTML) - Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
Example
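A short access sketch, reusing the field names shown in the DotsOCRTextExtractor examples (content, layout, num_layout_elements, and per-element category/bbox/text):
result = extractor.extract(image, include_layout=True)
print(result.num_layout_elements)
for elem in result.layout:
    print(elem.category, elem.bbox, (elem.text or "")[:40])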
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
bbox |
Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024)
TYPE:
|
category |
Layout category (e.g., "Text", "Title", "Table", "Formula")
TYPE:
|
text |
Extracted text content (None for pictures)
TYPE:
|
confidence |
Detection confidence score (optional)
TYPE:
|
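Because bbox is normalized to 0-1024, converting back to pixel coordinates is a simple rescale. A sketch, assuming elem is a LayoutElement and image is the original PIL image:
width, height = image.size
x1, y1, x2, y2 = elem.bbox
pixel_bbox = (
    x1 / 1024 * width,
    y1 / 1024 * height,
    x2 / 1024 * width,
    y2 / 1024 * height,
)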
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics:
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
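A small sketch of using the enum; the member names are assumed to mirror the format names listed above, and since OutputFormat subclasses str each member should compare equal to the plain string passed as output_format (values assumed to be lowercase):
from omnidocs.tasks.text_extraction import OutputFormat  # import path assumed

print(list(OutputFormat))       # inspect the supported formats
fmt = OutputFormat.MARKDOWN     # member name assumed from the list above
result = extractor.extract(image, output_format=fmt)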
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon
TYPE:
|
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Accepted for API compatibility (default: "markdown")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
# Default extraction
result = extractor.extract("document.png", output_format="markdown")
# Custom prompt
result = extractor.extract(
"document.png",
prompt="Extract only the table data as markdown",
)
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
VLM API configuration with model and provider details.
TYPE:
|
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
prompt: Optional[str] = None,
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
output_format
|
Desired output format ("html" or "markdown").
TYPE:
|
prompt
|
Custom prompt. If None, uses a task-specific default prompt.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content. |
Source code in omnidocs/tasks/text_extraction/vlm.py
base
¶
Base class for text extractors.
Defines the abstract interface that all text extractors must implement.
BaseTextExtractor
¶
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If image format or output_format is not supported |
RuntimeError
|
If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).
| PARAMETER | DESCRIPTION |
|---|---|
images
|
List of images in any supported format
TYPE:
|
output_format
|
Desired output format
TYPE:
|
progress_callback
|
Optional function(current, total) for progress
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[TextOutput]
|
List of TextOutput in same order as input |
Examples:
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
extract_document
¶
extract_document(
document: Document,
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
document
|
Document instance
TYPE:
|
output_format
|
Desired output format
TYPE:
|
progress_callback
|
Optional function(current, total) for progress
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[TextOutput]
|
List of TextOutput, one per page |
Examples:
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
dotsocr
¶
Dots OCR text extractor and backend configurations.
Available backends: - PyTorch: DotsOCRPyTorchConfig (local GPU inference) - VLLM: DotsOCRVLLMConfig (offline batch inference) - API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information including: - 11 layout categories (Caption, Footnote, Formula, List-item, etc.) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown, LaTeX, HTML) - Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend - DotsOCRVLLMConfig: VLLM high-throughput backend - DotsOCRAPIConfig: API backend (online VLLM server)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown", "html", or "json")
TYPE:
|
include_layout
|
Include layout bounding boxes in output
TYPE:
|
custom_prompt
|
Override default extraction prompt
TYPE:
|
max_tokens
|
Maximum tokens for generation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DotsOCRTextOutput
|
DotsOCRTextOutput with extracted content and optional layout |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations like: - PagedAttention for efficient KV cache management - Continuous batching for higher throughput - Optimized CUDA kernels
Example
api
¶
API backend configuration for Dots OCR (VLLM online server).
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
extractor
¶
Dots OCR text extractor with layout-aware extraction.
A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
print(f"{elem.category}: {elem.bbox}")
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information including: - 11 layout categories (Caption, Footnote, Formula, List-item, etc.) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown, LaTeX, HTML) - Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend - DotsOCRVLLMConfig: VLLM high-throughput backend - DotsOCRAPIConfig: API backend (online VLLM server)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown", "html", or "json")
TYPE:
|
include_layout
|
Include layout bounding boxes in output
TYPE:
|
custom_prompt
|
Override default extraction prompt
TYPE:
|
max_tokens
|
Maximum tokens for generation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DotsOCRTextOutput
|
DotsOCRTextOutput with extracted content and optional layout |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
pytorch
¶
PyTorch backend configuration for Dots OCR.
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
vllm
¶
VLLM backend configuration for Dots OCR.
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations like: - PagedAttention for efficient KV cache management - Continuous batching for higher throughput - Optimized CUDA kernels
Example
granitedocling
¶
Granite Docling text extraction with multi-backend support.
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs the DocTags format, which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API config)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown" or "html")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
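A minimal sketch that relies on the documented default and assumes the config can be constructed without arguments:
config = GraniteDoclingTextVLLMConfig()            # revision defaults to "untied"
extractor = GraniteDoclingTextExtractor(backend=config)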
api
¶
API backend configuration for Granite Docling text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
extractor
¶
Granite Docling text extractor with multi-backend support.
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs the DocTags format, which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API config)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown" or "html")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
mlx
¶
MLX backend configuration for Granite Docling text extraction (Apple Silicon).
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
pytorch
¶
PyTorch backend configuration for Granite Docling text extraction.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
vllm
¶
VLLM backend configuration for Granite Docling text extraction.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
mineruvl
¶
MinerU VL text extraction module.
MinerU VL is a vision-language model for document layout detection and text/table/equation recognition. It performs two-step extraction: 1. Layout Detection: Detect regions with types (text, table, equation, etc.) 2. Content Recognition: Extract content from each detected region
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
# Initialize with PyTorch backend
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
# Extract text
result = extractor.extract(image)
print(result.content)
# Extract with detailed blocks
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
print(f"{block.type}: {block.content[:50]}...")
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction: 1. Layout detection (detect regions) 2. Content recognition (extract text/table/equation from each region)
Supports multiple backends: - PyTorch (HuggingFace Transformers) - VLLM (high-throughput GPU) - MLX (Apple Silicon) - API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image
TYPE:
|
output_format
|
Output format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[TextOutput, List[ContentBlock]]
|
Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
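Since coordinates are normalized to [0, 1], an equivalent manual conversion looks like the sketch below (the bbox attribute name and [x1, y1, x2, y2] ordering are assumptions; to_absolute presumably encapsulates this given the image size):
width, height = image.size
x1, y1, x2, y2 = block.bbox
pixel_bbox = (x1 * width, y1 * height, x2 * width, y2 * height)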
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
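Constructing it keeps the greedy-style defaults shown in the signature, overriding only what is needed (the import path is assumed from the package listing above):
from omnidocs.tasks.text_extraction.mineruvl import MinerUSamplingParams  # import path assumed

params = MinerUSamplingParams(max_new_tokens=4096)  # temperature=0.0, top_k=1 remain as defaults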
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with: - PagedAttention for efficient KV cache - Continuous batching - Optimized CUDA kernels
Example
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
api
¶
API backend configuration for MinerU VL text extraction.
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
extractor
¶
MinerU VL text extractor with layout-aware two-step extraction.
MinerU VL performs document extraction in two steps: 1. Layout Detection: Detect regions with types (text, table, equation, etc.) 2. Content Recognition: Extract text/table/equation content from each region
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction: 1. Layout detection (detect regions) 2. Content recognition (extract text/table/equation from each region)
Supports multiple backends: - PyTorch (HuggingFace Transformers) - VLLM (high-throughput GPU) - MLX (Apple Silicon) - API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image
TYPE:
|
output_format
|
Output format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[TextOutput, List[ContentBlock]]
|
Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
mlx
¶
MLX backend configuration for MinerU VL text extraction (Apple Silicon).
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
pytorch
¶
PyTorch/HuggingFace backend configuration for MinerU VL text extraction.
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
utils
¶
MinerU VL utilities for document extraction.
Contains data structures, parsing, prompts, and post-processing functions for MinerU VL document extraction pipeline.
This file contains code adapted from mineru-vl-utils:
- https://github.com/opendatalab/mineru-vl-utils
- https://pypi.org/project/mineru-vl-utils/
The original mineru-vl-utils is licensed under AGPL-3.0, Copyright (c) OpenDataLab:
https://github.com/opendatalab/mineru-vl-utils/blob/main/LICENSE.md
Adapted components
- BlockType enum (from structs.py)
- ContentBlock data structure (from structs.py)
- OTSL to HTML table conversion (from post_process/otsl2html.py)
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
convert_bbox
¶
Convert bbox from model output (0-1000) to normalized format (0-1).
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
parse_angle
¶
Parse rotation angle from model output tail string.
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
get_rgb_image
¶
Convert image to RGB mode.
prepare_for_layout
¶
prepare_for_layout(
image: Image,
layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image
Prepare image for layout detection.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
resize_by_need
¶
Resize image if needed based on aspect ratio constraints.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
prepare_for_extract
¶
prepare_for_extract(
image: Image,
blocks: List[ContentBlock],
prompts: Dict[str, str] = None,
sampling_params: Dict[str, SamplingParams] = None,
skip_types: set = None,
) -> Tuple[
List[Image.Image],
List[str],
List[SamplingParams],
List[int],
]
Prepare cropped images for content extraction.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
simple_post_process
¶
Simple post-processing: convert OTSL tables to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
vllm
¶
VLLM backend configuration for MinerU VL text extraction.
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with: - PagedAttention for efficient KV cache - Continuous batching - Optimized CUDA kernels
Example
models
¶
Pydantic models for text extraction outputs.
Defines output types and format enums for text extraction.
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics:
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
Example
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
bbox |
Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024)
TYPE:
|
category |
Layout category (e.g., "Text", "Title", "Table", "Formula")
TYPE:
|
text |
Extracted text content (None for pictures)
TYPE:
|
confidence |
Detection confidence score (optional)
TYPE:
|
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with: - Layout detection (11 categories) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown/LaTeX/HTML) - Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
Example
nanonets
¶
Nanonets OCR2-3B backend configurations and extractor for text extraction.
Available backends
- NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
- NanonetsTextVLLMConfig: VLLM high-throughput backend
- NanonetsTextMLXConfig: MLX backend for Apple Silicon
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon
TYPE:
|
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Accepted for API compatibility (default: "markdown")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
extractor
¶
Nanonets OCR2-3B text extractor.
A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.
Supports PyTorch and VLLM backends.
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon
TYPE:
|
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Accepted for API compatibility (default: "markdown")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
mlx
¶
MLX backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
vllm
¶
VLLM backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
qwen
¶
Qwen3-VL backend configurations and extractor for text extraction.
Available backends
- QwenTextPyTorchConfig: PyTorch/HuggingFace backend
- QwenTextVLLMConfig: VLLM high-throughput backend
- QwenTextMLXConfig: MLX backend for Apple Silicon
- QwenTextAPIConfig: API backend (OpenRouter, etc.)
Example
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
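The API config then plugs into the extractor like any other backend:
extractor = QwenTextExtractor(backend=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)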
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
api
¶
API backend configuration for Qwen3-VL text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
extractor
¶
Qwen3-VL text extractor.
A Vision-Language Model for extracting text from document images as structured HTML or Markdown.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
mlx
¶
MLX backend configuration for Qwen3-VL text extraction.
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL text extraction.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
vllm
¶
VLLM backend configuration for Qwen3-VL text extraction.
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
vlm
¶
VLM text extractor.
A provider-agnostic Vision-Language Model text extractor using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)
# With custom prompt
result = extractor.extract("document.png", prompt="Extract only table data as markdown")
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
# Default extraction
result = extractor.extract("document.png", output_format="markdown")
# Custom prompt
result = extractor.extract(
"document.png",
prompt="Extract only the table data as markdown",
)
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
VLM API configuration with model and provider details.
TYPE:
|
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
prompt: Optional[str] = None,
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
output_format
|
Desired output format ("html" or "markdown").
TYPE:
|
prompt
|
Custom prompt. If None, uses a task-specific default prompt.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content. |