Overview¶
Layout Extraction Module.
Provides extractors for detecting document layout elements such as titles, text blocks, figures, tables, formulas, and captions.
Available Extractors
- DocLayoutYOLO: YOLO-based layout detector (fast, accurate)
- RTDETRLayoutExtractor: Transformer-based detector (more categories)
- QwenLayoutDetector: VLM-based detector with custom label support (multi-backend)
- MinerUVLLayoutDetector: MinerU VL 1.2B layout detector (multi-backend)
Example
```python
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = extractor.extract(image)
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")

# VLM-based detection with custom labels
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
```
BaseLayoutExtractor
¶
Bases: ABC
Abstract base class for layout extractors.
All layout extraction models must inherit from this class and implement the required methods.
Example
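A minimal subclass sketch. Only `extract` is abstract (see below); the field names on `LayoutOutput`, `LayoutBox`, and `BoundingBox` are inferred from the attribute access shown elsewhere on this page, so treat them as assumptions:

```python
from pathlib import Path
from typing import Union

import numpy as np
from PIL import Image

from omnidocs.tasks.layout_extraction import BaseLayoutExtractor
from omnidocs.tasks.layout_extraction.models import (
    BoundingBox,
    LayoutBox,
    LayoutLabel,
    LayoutOutput,
)


class WholePageExtractor(BaseLayoutExtractor):
    """Toy extractor that labels the entire page as a single text block."""

    def extract(self, image: Union[Image.Image, np.ndarray, str, Path]) -> LayoutOutput:
        # Normalize the three supported input forms to a PIL image.
        if isinstance(image, (str, Path)):
            image = Image.open(image)
        elif isinstance(image, np.ndarray):
            image = Image.fromarray(image)
        width, height = image.size
        box = LayoutBox(
            label=LayoutLabel.TEXT,  # assumed enum member name
            bbox=BoundingBox(x1=0, y1=0, x2=width, y2=height),
            confidence=1.0,
        )
        return LayoutOutput(bboxes=[box])
```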
extract
abstractmethod
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` object, an `np.ndarray` (HWC format, RGB), or a `str`/`Path` pointing to an image file. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput containing detected layout boxes with standardized labels. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the image format is not supported. |
| `RuntimeError` | If the model is not loaded or inference fails. |
Source code in omnidocs/tasks/layout_extraction/base.py
batch_extract
¶
```python
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]
```
Run layout extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format. TYPE: `List[Union[Image, ndarray, str, Path]]` |
| `progress_callback` | Optional `function(current, total)` for progress reporting. TYPE: `Optional[Callable[[int, int], None]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput in the same order as the input. |
Examples:
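A usage sketch; the file names and callback are illustrative:

```python
pages = ["page_01.png", "page_02.png", "page_03.png"]

def on_progress(current: int, total: int) -> None:
    print(f"processed {current}/{total}")

results = extractor.batch_extract(pages, progress_callback=on_progress)
for path, layout in zip(pages, results):
    print(f"{path}: {len(layout.bboxes)} boxes")
```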
Source code in omnidocs/tasks/layout_extraction/base.py
extract_document
¶
```python
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]
```
Run layout extraction on all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance. TYPE: `Document` |
| `progress_callback` | Optional `function(current, total)` for progress reporting. TYPE: `Optional[Callable[[int, int], None]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput, one per page. |
Examples:
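A sketch, assuming `document` is an already-constructed Document instance (the Document API is documented elsewhere):

```python
outputs = extractor.extract_document(
    document,
    progress_callback=lambda current, total: print(f"page {current}/{total}"),
)
for page_number, layout in enumerate(outputs, start=1):
    print(f"page {page_number}: {len(layout.bboxes)} elements")
```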
Source code in omnidocs/tasks/layout_extraction/base.py
DocLayoutYOLO
¶
Bases: BaseLayoutExtractor
DocLayout-YOLO layout extractor.
A YOLO-based model optimized for document layout detection. Detects: title, text, figure, table, formula, captions, etc.
This is a single-backend model (PyTorch only).
Example
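Basic usage, mirroring the module-level example (the file name is illustrative):

```python
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

extractor = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = extractor.extract("page.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
```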
Initialize DocLayout-YOLO extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model_path, and other settings. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
DocLayoutYOLOConfig
¶
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
```python
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
```
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes. |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right.
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/layout_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels. |
| `image_height` | Original image height in pixels. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in the 0-1024 range. |
Example
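A worked example on a 2048x1024 page; x values scale by 1024/2048 = 0.5 and y values by 1024/1024 = 1.0 (the `x1`..`y2` constructor fields follow the convention stated above):

```python
from omnidocs.tasks.layout_extraction.models import BoundingBox

bbox = BoundingBox(x1=100, y1=200, x2=300, y2=400)
norm = bbox.to_normalized(image_width=2048, image_height=1024)
# norm is (x1=50, y1=200, x2=150, y2=400)
```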
Source code in omnidocs/tasks/layout_extraction/models.py
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels. |
| `image_height` | Target image height in pixels. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates. |
Source code in omnidocs/tasks/layout_extraction/models.py
CustomLabel
¶
Bases: BaseModel
Type-safe custom layout label definition for VLM-based models.
VLM models like Qwen3-VL support flexible custom labels beyond the standard LayoutLabel enum. Use this class to define custom labels with validation.
Example
```python
from omnidocs.tasks.layout_extraction import CustomLabel

# Simple custom label
code_block = CustomLabel(name="code_block")

# With metadata
sidebar = CustomLabel(
    name="sidebar",
    description="Secondary content panel",
    color="#9B59B6",
)

# Use with QwenLayoutDetector
result = detector.extract(image, custom_labels=[code_block, sidebar])
```
LabelMapping
¶
Base class for model-specific label mappings.
Each model maps its native labels to standardized LayoutLabel values.
Initialize label mapping.
| PARAMETER | DESCRIPTION |
|---|---|
| `mapping` | Dict mapping model-specific labels to LayoutLabel enum values. |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutBox
¶
Bases: BaseModel
Single detected layout element with label, bounding box, and confidence.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width. |
| `image_height` | Original image height. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates. |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutLabel
¶
Bases: str, Enum
Standardized layout labels used across all layout extractors.
These provide a consistent vocabulary regardless of which model is used.
LayoutOutput
¶
Bases: BaseModel
Complete layout extraction results for a single image.
filter_by_label
¶
filter_by_confidence
¶
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
sort_by_position
¶
Return a new LayoutOutput with boxes sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order). |
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bboxes
¶
Get all bounding boxes in normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata. |
Example
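A usage sketch; the exact dict keys are not listed on this page, so inspect one entry to confirm:

```python
for item in result.get_normalized_bboxes():
    print(item)  # dict with normalized bbox coordinates plus metadata
```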
Source code in omnidocs/tasks/layout_extraction/models.py
visualize
¶
```python
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_labels: bool = True,
    show_confidence: bool = True,
    line_width: int = 3,
    font_size: int = 12,
) -> Image.Image
```
Visualize layout detection results on the image.
Draws bounding boxes with labels and confidence scores on the image. Each layout category has a distinct color for easy identification.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified). TYPE: `Image` |
| `output_path` | Optional path to save the visualization. TYPE: `Optional[Union[str, Path]]` |
| `show_labels` | Whether to show label text. TYPE: `bool`, default `True` |
| `show_confidence` | Whether to show confidence scores. TYPE: `bool`, default `True` |
| `line_width` | Width of bounding box lines. TYPE: `int`, default `3` |
| `font_size` | Size of label text (note: uses the default font). TYPE: `int`, default `12` |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn. |
Example
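A usage sketch based on the signature above:

```python
annotated = result.visualize(
    image,                             # PIL image the boxes were detected on
    output_path="layout_overlay.png",  # also saves the result to disk
    show_labels=True,
    show_confidence=True,
    line_width=2,
)
annotated.show()  # the original image is left unmodified
```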
Source code in omnidocs/tasks/layout_extraction/models.py
load_json
classmethod
¶
Load a LayoutOutput instance from a JSON file.
Reads a JSON file and deserializes its contents into a LayoutOutput object. Uses Pydantic's model_validate_json for proper handling of nested objects.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to the JSON file containing serialized LayoutOutput data. Can be a string or `pathlib.Path` object. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | Deserialized layout output instance loaded from the file. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the specified file does not exist. |
| `UnicodeDecodeError` | If the file cannot be decoded as UTF-8. |
| `ValueError` | If the file contents are not valid JSON. |
| `ValidationError` | If the JSON data doesn't match the LayoutOutput schema. |
Example
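A sketch whose printed count matches the sample output below; the file name is illustrative:

```python
from omnidocs.tasks.layout_extraction.models import LayoutOutput

result = LayoutOutput.load_json("layout.json")
print(f"Found {len(result.bboxes)} elements")
```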
Found 5 elements
Source code in omnidocs/tasks/layout_extraction/models.py
save_json
¶
Save LayoutOutput instance to a JSON file.
Serializes the LayoutOutput object to JSON and writes it to a file. Automatically creates parent directories if they don't exist. Uses UTF-8 encoding for compatibility and proper handling of special characters.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where the JSON file should be saved. Can be a string or `pathlib.Path` object. Parent directories are created if they don't exist. |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | |

| RAISES | DESCRIPTION |
|---|---|
| `OSError` | If the file cannot be written due to permission or disk errors. |
| `TypeError` | If `file_path` is not a string or Path object. |
Example
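A usage sketch:

```python
result.save_json("outputs/page_01/layout.json")  # parent directories are created automatically
```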
Source code in omnidocs/tasks/layout_extraction/models.py
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Basic extraction with default labels
result = detector.extract(image)

# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])

# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
```
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace), `QwenLayoutVLLMConfig` (VLLM high-throughput), `QwenLayoutMLXConfig` (MLX for Apple Silicon), or `QwenLayoutAPIConfig` (API, e.g. OpenRouter). |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput
```
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` object, an `np.ndarray` (HWC format, RGB), or a `str`/`Path` pointing to an image file. |
| `custom_labels` | Optional custom labels to detect: `None` for default labels (title, text, table, figure, etc.), a `List[str]` of simple label names such as `["code_block", "sidebar"]`, or a `List[CustomLabel]` of typed labels with metadata. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If the model is not loaded. |
| `ValueError` | If the image format is not supported. |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
RTDETRConfig
¶
RTDETRLayoutExtractor
¶
Bases: BaseLayoutExtractor
RT-DETR layout extractor using HuggingFace Transformers.
A transformer-based real-time detection model for document layout. Detects: title, text, table, figure, list, formula, captions, headers, footers.
This is a single-backend model (PyTorch/Transformers only).
Example
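A sketch mirroring the DocLayoutYOLO usage; the `device` field on RTDETRConfig is an assumption based on the description below:

```python
from omnidocs.tasks.layout_extraction import RTDETRLayoutExtractor, RTDETRConfig

extractor = RTDETRLayoutExtractor(config=RTDETRConfig(device="cuda"))  # device field assumed
result = extractor.extract("page.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
```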
Initialize RT-DETR layout extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model settings, etc. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
VLMLayoutDetector
¶
Bases: BaseLayoutExtractor
Provider-agnostic VLM layout detector using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom labels for flexible detection.
Example
```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)

# Default labels
result = detector.extract("document.png")

# Custom labels
result = detector.extract("document.png", custom_labels=["code_block", "sidebar"])
```
Initialize VLM layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration with model and provider details. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
    prompt: Optional[str] = None,
) -> LayoutOutput
```
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |
| `custom_labels` | Optional custom labels to detect: `None` for default labels (title, text, table, figure, etc.), a `List[str]` of simple label names such as `["code_block", "sidebar"]`, or a `List[CustomLabel]` of typed labels with metadata. |
| `prompt` | Custom prompt. If None, a default detection prompt is built. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
base
¶
Base class for layout extractors.
Defines the abstract interface that all layout extractors must implement.
BaseLayoutExtractor
¶
Bases: ABC
Abstract base class for layout extractors.
All layout extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` object, an `np.ndarray` (HWC format, RGB), or a `str`/`Path` pointing to an image file. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput containing detected layout boxes with standardized labels. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the image format is not supported. |
| `RuntimeError` | If the model is not loaded or inference fails. |
Source code in omnidocs/tasks/layout_extraction/base.py
batch_extract
¶
```python
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]
```
Run layout extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format. TYPE: `List[Union[Image, ndarray, str, Path]]` |
| `progress_callback` | Optional `function(current, total)` for progress reporting. TYPE: `Optional[Callable[[int, int], None]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput in the same order as the input. |
Examples:
Source code in omnidocs/tasks/layout_extraction/base.py
extract_document
¶
```python
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[LayoutOutput]
```
Run layout extraction on all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance. TYPE: `Document` |
| `progress_callback` | Optional `function(current, total)` for progress reporting. TYPE: `Optional[Callable[[int, int], None]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `List[LayoutOutput]` | List of LayoutOutput, one per page. |
Examples:
Source code in omnidocs/tasks/layout_extraction/base.py
doc_layout_yolo
¶
DocLayout-YOLO layout extractor.
A YOLO-based model for document layout detection, optimized for academic papers and technical documents.
Model: juliozhao/DocLayout-YOLO-DocStructBench
DocLayoutYOLOConfig
¶
DocLayoutYOLO
¶
Bases: BaseLayoutExtractor
DocLayout-YOLO layout extractor.
A YOLO-based model optimized for document layout detection. Detects: title, text, figure, table, formula, captions, etc.
This is a single-backend model (PyTorch only).
Example
Initialize DocLayout-YOLO extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model_path, and other settings. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/doc_layout_yolo.py
mineruvl
¶
MinerU VL layout detection module.
MinerU VL can be used for standalone layout detection, returning detected regions with types and bounding boxes.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
```python
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
```
MinerUVLLayoutAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL layout detection.
Example
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
```python
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
```
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes. |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
MinerUVLLayoutMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL layout detection on Apple Silicon.
Example
MinerUVLLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL layout detection.
Example
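The only field exercised elsewhere on this page is `device`:

```python
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

config = MinerUVLLayoutPyTorchConfig(device="cuda")
```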
MinerUVLLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL layout detection.
Example
api
¶
API backend configuration for MinerU VL layout detection.
MinerUVLLayoutAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL layout detection.
Example
detector
¶
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code.
MinerUVLLayoutDetector
¶
Bases: BaseLayoutExtractor
MinerU VL layout detector.
Uses MinerU2.5-2509-1.2B for document layout detection. Detects 22+ element types including text, titles, tables, equations, figures, code, and more.
For full document extraction (layout + content), use MinerUVLTextExtractor from the text_extraction module instead.
Example
```python
from omnidocs.tasks.layout_extraction import MinerUVLLayoutDetector
from omnidocs.tasks.layout_extraction.mineruvl import MinerUVLLayoutPyTorchConfig

detector = MinerUVLLayoutDetector(
    backend=MinerUVLLayoutPyTorchConfig(device="cuda")
)
result = detector.extract(image)
for box in result.bboxes:
    print(f"{box.label}: {box.confidence:.2f}")
```
Initialize MinerU VL layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
extract
¶
Detect layout elements in the image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with standardized labels and bounding boxes. |
Source code in omnidocs/tasks/layout_extraction/mineruvl/detector.py
mlx
¶
MLX backend configuration for MinerU VL layout detection (Apple Silicon).
MinerUVLLayoutMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL layout detection on Apple Silicon.
Example
pytorch
¶
PyTorch backend configuration for MinerU VL layout detection.
MinerUVLLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL layout detection.
Example
models
¶
Pydantic models for layout extraction outputs.
Defines standardized output types and label enums for layout detection.
Coordinate Systems
- Absolute (default): Coordinates in pixels relative to original image size
- Normalized (0-1024): Coordinates scaled to 0-1024 range (virtual 1024x1024 canvas)
Use bbox.to_normalized(width, height) or output.get_normalized_bboxes()
to convert to normalized coordinates.
Example
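A round-trip sketch of the two coordinate systems described above, assuming `bbox` is a BoundingBox taken from a detection result:

```python
# Pixel space -> virtual 1024x1024 canvas -> back to pixel space.
norm = bbox.to_normalized(image_width, image_height)
restored = norm.to_absolute(image_width, image_height)
```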
LayoutLabel
¶
Bases: str, Enum
Standardized layout labels used across all layout extractors.
These provide a consistent vocabulary regardless of which model is used.
CustomLabel
¶
Bases: BaseModel
Type-safe custom layout label definition for VLM-based models.
VLM models like Qwen3-VL support flexible custom labels beyond the standard LayoutLabel enum. Use this class to define custom labels with validation.
Example
```python
from omnidocs.tasks.layout_extraction import CustomLabel

# Simple custom label
code_block = CustomLabel(name="code_block")

# With metadata
sidebar = CustomLabel(
    name="sidebar",
    description="Secondary content panel",
    color="#9B59B6",
)

# Use with QwenLayoutDetector
result = detector.extract(image, custom_labels=[code_block, sidebar])
```
LabelMapping
¶
Base class for model-specific label mappings.
Each model maps its native labels to standardized LayoutLabel values.
Initialize label mapping.
| PARAMETER | DESCRIPTION |
|---|---|
| `mapping` | Dict mapping model-specific labels to LayoutLabel enum values. |
Source code in omnidocs/tasks/layout_extraction/models.py
BoundingBox
¶
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right.
to_list
¶
to_xyxy
¶
to_xywh
¶
from_list
classmethod
¶
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/layout_extraction/models.py
to_normalized
¶
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels. |
| `image_height` | Original image height in pixels. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in the 0-1024 range. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
to_absolute
¶
Convert from normalized (0-1024) to absolute pixel coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels. |
| `image_height` | Target image height in pixels. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates. |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutBox
¶
Bases: BaseModel
Single detected layout element with label, bounding box, and confidence.
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bbox
¶
Get bounding box in normalized (0-1024) coordinates.
| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width. |
| `image_height` | Original image height. |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates. |
Source code in omnidocs/tasks/layout_extraction/models.py
LayoutOutput
¶
Bases: BaseModel
Complete layout extraction results for a single image.
filter_by_label
¶
filter_by_confidence
¶
to_dict
¶
Convert to dictionary representation.
Source code in omnidocs/tasks/layout_extraction/models.py
sort_by_position
¶
Return a new LayoutOutput with boxes sorted by position.
| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order). |
Source code in omnidocs/tasks/layout_extraction/models.py
get_normalized_bboxes
¶
Get all bounding boxes in normalized (0-1024) coordinates.
| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
visualize
¶
```python
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_labels: bool = True,
    show_confidence: bool = True,
    line_width: int = 3,
    font_size: int = 12,
) -> Image.Image
```
Visualize layout detection results on the image.
Draws bounding boxes with labels and confidence scores on the image. Each layout category has a distinct color for easy identification.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified). TYPE: `Image` |
| `output_path` | Optional path to save the visualization. TYPE: `Optional[Union[str, Path]]` |
| `show_labels` | Whether to show label text. TYPE: `bool`, default `True` |
| `show_confidence` | Whether to show confidence scores. TYPE: `bool`, default `True` |
| `line_width` | Width of bounding box lines. TYPE: `int`, default `3` |
| `font_size` | Size of label text (note: uses the default font). TYPE: `int`, default `12` |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
load_json
classmethod
¶
Load a LayoutOutput instance from a JSON file.
Reads a JSON file and deserializes its contents into a LayoutOutput object. Uses Pydantic's model_validate_json for proper handling of nested objects.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to the JSON file containing serialized LayoutOutput data. Can be a string or `pathlib.Path` object. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | Deserialized layout output instance loaded from the file. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the specified file does not exist. |
| `UnicodeDecodeError` | If the file cannot be decoded as UTF-8. |
| `ValueError` | If the file contents are not valid JSON. |
| `ValidationError` | If the JSON data doesn't match the LayoutOutput schema. |
Example
Found 5 elements
Source code in omnidocs/tasks/layout_extraction/models.py
save_json
¶
Save LayoutOutput instance to a JSON file.
Serializes the LayoutOutput object to JSON and writes it to a file. Automatically creates parent directories if they don't exist. Uses UTF-8 encoding for compatibility and proper handling of special characters.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where the JSON file should be saved. Can be a string or `pathlib.Path` object. Parent directories are created if they don't exist. |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | |

| RAISES | DESCRIPTION |
|---|---|
| `OSError` | If the file cannot be written due to permission or disk errors. |
| `TypeError` | If `file_path` is not a string or Path object. |
Example
Source code in omnidocs/tasks/layout_extraction/models.py
qwen
¶
Qwen3-VL backend configurations and detector for layout detection.
Available backends
- QwenLayoutPyTorchConfig: PyTorch/HuggingFace backend
- QwenLayoutVLLMConfig: VLLM high-throughput backend
- QwenLayoutMLXConfig: MLX backend for Apple Silicon
- QwenLayoutAPIConfig: API backend (OpenRouter, etc.)
Example
QwenLayoutAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen layout detection.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
```python
import os

# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
```
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Basic extraction with default labels
result = detector.extract(image)

# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])

# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
```
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace), `QwenLayoutVLLMConfig` (VLLM high-throughput), `QwenLayoutMLXConfig` (MLX for Apple Silicon), or `QwenLayoutAPIConfig` (API, e.g. OpenRouter). |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput
```
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` object, an `np.ndarray` (HWC format, RGB), or a `str`/`Path` pointing to an image file. |
| `custom_labels` | Optional custom labels to detect: `None` for default labels (title, text, table, figure, etc.), a `List[str]` of simple label names such as `["code_block", "sidebar"]`, or a `List[CustomLabel]` of typed labels with metadata. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If the model is not loaded. |
| `ValueError` | If the image format is not supported. |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
QwenLayoutMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen layout detection.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
QwenLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen layout detection.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
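The `model` field is the one exercised elsewhere on this page:

```python
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

backend = QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
```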
QwenLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen layout detection.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
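A sketch assuming the VLLM config accepts the same `model` field as the PyTorch config; this is not confirmed on this page:

```python
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutVLLMConfig

backend = QwenLayoutVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")  # model field assumed
```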
api
¶
API backend configuration for Qwen3-VL layout detection.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
QwenLayoutAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen layout detection.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
```python
import os

# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
```
detector
¶
Qwen3-VL layout detector.
A Vision-Language Model for flexible layout detection with custom label support. Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = detector.extract(image)

# With custom labels
result = detector.extract(image, custom_labels=["code_block", "sidebar"])
```
QwenLayoutDetector
¶
Bases: BaseLayoutExtractor
Qwen3-VL Vision-Language Model layout detector.
A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.
Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Basic extraction with default labels
result = detector.extract(image)

# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])

# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)
```
Initialize Qwen layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenLayoutPyTorchConfig` (PyTorch/HuggingFace), `QwenLayoutVLLMConfig` (VLLM high-throughput), `QwenLayoutMLXConfig` (MLX for Apple Silicon), or `QwenLayoutAPIConfig` (API, e.g. OpenRouter). |
Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput
```
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` object, an `np.ndarray` (HWC format, RGB), or a `str`/`Path` pointing to an image file. |
| `custom_labels` | Optional custom labels to detect: `None` for default labels (title, text, table, figure, etc.), a `List[str]` of simple label names such as `["code_block", "sidebar"]`, or a `List[CustomLabel]` of typed labels with metadata. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If the model is not loaded. |
| `ValueError` | If the image format is not supported. |

Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
mlx
¶
MLX backend configuration for Qwen3-VL layout detection.
QwenLayoutMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen layout detection.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL layout detection.
QwenLayoutPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen layout detection.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
vllm
¶
VLLM backend configuration for Qwen3-VL layout detection.
QwenLayoutVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen layout detection.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
rtdetr
¶
RT-DETR layout extractor.
A transformer-based real-time detection model for document layout detection. Uses HuggingFace Transformers implementation.
Model: HuggingPanda/docling-layout
RTDETRConfig
¶
RTDETRLayoutExtractor
¶
Bases: BaseLayoutExtractor
RT-DETR layout extractor using HuggingFace Transformers.
A transformer-based real-time detection model for document layout. Detects: title, text, table, figure, list, formula, captions, headers, footers.
This is a single-backend model (PyTorch/Transformers only).
Example
Initialize RT-DETR layout extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object with device, model settings, etc. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
extract
¶
Run layout extraction on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path). |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |
Source code in omnidocs/tasks/layout_extraction/rtdetr.py
vlm
¶
VLM layout detector.
A provider-agnostic Vision-Language Model layout detector using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
result = detector.extract("document.png")
for box in result.bboxes:
    print(f"{box.label.value}: {box.bbox}")
```
VLMLayoutDetector
¶
Bases: BaseLayoutExtractor
Provider-agnostic VLM layout detector using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom labels for flexible detection.
Example
```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)

# Default labels
result = detector.extract("document.png")

# Custom labels
result = detector.extract("document.png", custom_labels=["code_block", "sidebar"])
```
Initialize VLM layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration with model and provider details. |
Source code in omnidocs/tasks/layout_extraction/vlm.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
    prompt: Optional[str] = None,
) -> LayoutOutput
```
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). |
| `custom_labels` | Optional custom labels to detect: `None` for default labels (title, text, table, figure, etc.), a `List[str]` of simple label names such as `["code_block", "sidebar"]`, or a `List[CustomLabel]` of typed labels with metadata. |
| `prompt` | Custom prompt. If None, a default detection prompt is built. |

| RETURNS | DESCRIPTION |
|---|---|
| `LayoutOutput` | LayoutOutput with detected layout boxes. |