DocLayout-YOLO Layout Detection

Model Overview

DocLayout-YOLO is a YOLO-based (You Only Look Once) object detector specifically optimized for document layout analysis. It's the fastest layout detection model in OmniDocs, making it ideal for processing large document collections.

Model ID: juliozhao/DocLayout-YOLO-DocStructBench
Architecture: YOLOv10 (object detection)
Training Focus: Academic papers, technical documents, arXiv papers
Framework: PyTorch only (no other backends)

Key Capabilities

  • Fast Inference: 0.1-0.3s per page (fastest in OmniDocs)
  • 10 Layout Categories: Title, text, figures, tables, formulas, captions, etc.
  • Fixed Labels: Standardized output across all documents
  • Document-Optimized: Trained on 100K+ academic papers
  • Confidence Scores: Per-detection confidence for filtering

Limitations

  • PyTorch only: No vLLM, MLX, or API backends
  • GPU strongly recommended: CPU inference works but is far slower (see Performance Characteristics)
  • Fixed categories: Cannot customize labels
  • English-focused: Optimized for English documents
  • Specialized: Best for academic/technical documents
  • Layout only: Does not extract text content (use with OCR/VLM)

Installation & Configuration

Basic Installation

# Install with layout analysis support
pip install omnidocs[pytorch]

# Specifically install doclayout-yolo
pip install doclayout-yolo
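
To confirm the environment before the first model download, a quick sanity check (generic PyTorch/import checks, nothing OmniDocs-specific; this assumes the pip package exposes the doclayout_yolo module, which is its documented import name):

# Verify imports and GPU visibility
import torch
import doclayout_yolo  # noqa: F401 -- module installed by the pip package

print(f"doclayout-yolo import OK, CUDA available: {torch.cuda.is_available()}")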

Configuration

from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

config = DocLayoutYOLOConfig(
    device="cuda",           # GPU required
    model_path=None,         # Auto-download from HuggingFace
    img_size=1024,           # Input image size
    confidence=0.25,         # Detection confidence threshold
)

extractor = DocLayoutYOLO(config=config)

Config Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cuda" | Device: "cuda", "mps", or "cpu" (GPU strongly recommended) |
| model_path | str | None | Path to model weights, or None for auto-download |
| img_size | int | 1024 | Input size for inference (320-1920) |
| confidence | float | 0.25 | Confidence threshold (0-1; higher = stricter) |

Layout Categories (10 Fixed)

DocLayout-YOLO detects exactly 10 layout element types:

| Category | Description | Common Content |
|---|---|---|
| Title | Document/section title | "Introduction", "Methods" |
| Plain text | Body paragraph | Main content paragraphs |
| Figure | Image/diagram (content region) | Graphs, plots, photos |
| Figure caption | Caption for figures | "Fig. 1: System Overview" |
| Table | Tabular data (content region) | Data tables, matrices |
| Table caption | Caption for tables | "Table 2: Performance Results" |
| Table footnote | Notes under tables | Footnotes, explanations |
| Formula | Isolated equation | Display math: $E=mc^2$ |
| Formula caption | Caption for formulas | "Equation 3.1: Distance metric" |
| Abandon | Elements to ignore | Watermarks, page numbers, artifacts |
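
Downstream pipelines usually drop Abandon regions before further processing. A minimal sketch, assuming the enum's string values mirror the category names in lowercase (check LayoutLabel in your install); result is the LayoutOutput returned by extractor.extract() in the examples below:

from collections import Counter

# Tally detections per category
counts = Counter(box.label.value for box in result.bboxes)
print(counts)

# Keep everything except ignorable regions (watermarks, page numbers, ...)
content_boxes = [b for b in result.bboxes if b.label.value != "abandon"]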

Usage Examples

Basic Layout Detection

from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
from PIL import Image

# Initialize
config = DocLayoutYOLOConfig(device="cuda", confidence=0.3)
extractor = DocLayoutYOLO(config=config)

# Load image
image = Image.open("document.png")

# Extract layout
result = extractor.extract(image)

# Access results
print(f"Found {result.element_count} elements")
print(f"Labels: {result.labels_found}")

for box in result.bboxes:
    print(f"  {box.label.value}: confidence={box.confidence:.2f}")
    print(f"    bbox={box.bbox.to_list()}")

Filter by Confidence

# Keep only high-confidence detections
high_conf = result.filter_by_confidence(min_confidence=0.5)

for box in high_conf:
    print(f"{box.label.value} ({box.confidence:.2%})")

Filter by Label

from omnidocs.tasks.layout_extraction import LayoutLabel

# Extract only text regions
text_boxes = result.filter_by_label(LayoutLabel.TEXT)
print(f"Found {len(text_boxes)} text blocks")

# Extract only figures
figures = result.filter_by_label(LayoutLabel.FIGURE)
for fig in figures:
    x1, y1, x2, y2 = fig.bbox.to_xyxy()
    width = x2 - x1
    height = y2 - y1
    print(f"Figure: {width}x{height} at ({x1}, {y1})")

Normalized Coordinates

# Get bounding boxes normalized to 0-1024 scale
normalized = result.get_normalized_bboxes()

for box_dict in normalized:
    print(f"{box_dict['label']}:")
    print(f"  bbox (0-1024): {box_dict['bbox']}")
    print(f"  confidence: {box_dict['confidence']:.2f}")

Visualization

from PIL import Image

# Load original image
image = Image.open("document.png")

# Create visualization with bounding boxes
viz = result.visualize(
    image,
    output_path="layout_visualization.png",
    show_labels=True,
    show_confidence=True,
    line_width=2,
)

# Display
viz.show()

Batch Processing

from pathlib import Path
import json

# Process multiple documents
doc_dir = Path("documents/")
results = {}

for img_path in sorted(doc_dir.glob("*.png")):
    print(f"Processing {img_path.name}...")
    image = Image.open(img_path)
    layout = extractor.extract(image)

    results[img_path.name] = layout.to_dict()

# Save results
with open("layout_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Summary statistics
total_elements = sum(r["element_count"] for r in results.values())
avg_elements = total_elements / len(results)
print(f"Total elements: {total_elements}")
print(f"Average per document: {avg_elements:.1f}")

Performance Characteristics

Speed Comparison

| Device | Image Size | Time |
|---|---|---|
| A10G GPU | 1024x1024 | 0.1-0.2s |
| A10G GPU | 2048x2048 | 0.3-0.5s |
| CPU | 1024x1024 | 5-10s |
| CPU | 2048x2048 | 15-30s |

Memory Requirements

| Batch Size | VRAM | Device |
|---|---|---|
| 1 (single) | 2-4 GB | A10G |
| 2-4 | 4-8 GB | A10G |
| 1 | 1-2 GB | A100 |

Typical Detection Counts

| Document Type | Elements | Speed |
|---|---|---|
| Single page paper | 10-30 | 0.1s |
| Research paper (5 pages) | 50-150 | 0.5s |
| Scanned book page | 20-40 | 0.15s |
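
These numbers vary with hardware, drivers, and page resolution; a minimal timing sketch for measuring per-page latency on your own setup (the first call is excluded because it includes model load and CUDA warm-up):

import time
from PIL import Image

image = Image.open("document.png")
extractor.extract(image)  # warm-up: model load + CUDA init

n_runs = 10
start = time.perf_counter()
for _ in range(n_runs):
    extractor.extract(image)
elapsed = time.perf_counter() - start
print(f"{elapsed / n_runs:.3f}s per page")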

Troubleshooting

Model Download Issues

Symptom: Model fails to download on first run

Solution:

# Set cache directory
import os
os.environ["HF_HOME"] = "/path/to/cache"

# Pre-download the model
from huggingface_hub import snapshot_download
snapshot_download("juliozhao/DocLayout-YOLO-DocStructBench")

# Now use extractor (will use cached model)
extractor = DocLayoutYOLO(config=config)
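
For air-gapped machines you can also point model_path at the downloaded weights instead of relying on auto-download. A sketch that globs for the .pt file rather than hard-coding its name (verify the snapshot contents in your cache):

from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path(snapshot_download("juliozhao/DocLayout-YOLO-DocStructBench"))
weights = next(local_dir.glob("*.pt"))  # avoid hard-coding the weight filename

config = DocLayoutYOLOConfig(model_path=str(weights), device="cuda")
extractor = DocLayoutYOLO(config=config)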

Confidence Threshold Tuning

Symptom: Too many false positives OR missing real elements

Solutions:

# Too many false positives → increase confidence
config = DocLayoutYOLOConfig(confidence=0.5)  # Stricter

# Missing elements → decrease confidence
config = DocLayoutYOLOConfig(confidence=0.1)  # More lenient

# Find optimal threshold
from PIL import Image
image = Image.open("test.png")

for conf in [0.1, 0.25, 0.5, 0.75]:
    config = DocLayoutYOLOConfig(confidence=conf)
    extractor = DocLayoutYOLO(config=config)
    result = extractor.extract(image)
    print(f"Confidence {conf}: {result.element_count} elements")

Image Size Issues

Symptom: Poor detection on very large or small images

Solutions:

from PIL import Image

image = Image.open("document.png")
print(f"Original size: {image.size}")

# Downscale oversized pages to the model's working size
# (thumbnail only shrinks; upscale small images with image.resize() instead)
target_size = 1024
image.thumbnail((target_size, target_size), Image.Resampling.LANCZOS)

result = extractor.extract(image)
print(f"Found {result.element_count} elements")

Model Selection Guide

When to Use DocLayout-YOLO

Best for:

  • Fast batch processing of large document collections
  • Academic papers and technical documents
  • When speed is critical (real-time requirements)
  • Layout detection only (add OCR/VLM separately, as in the pipelines below)

Not ideal for:

  • Extracting actual text (use Qwen or DotsOCR)
  • Complex or unusual layouts (use the Qwen layout detector)
  • Handwritten documents (use Surya)
  • Custom layout categories (use Qwen)

DocLayout-YOLO vs Other Layout Models

| Feature | DocLayout-YOLO | RT-DETR | Qwen Layout |
|---|---|---|---|
| Speed | Very fast | Fast | Medium |
| Categories | 10 (fixed) | 12+ (fixed) | Unlimited (custom) |
| Backend | PyTorch only | PyTorch | Multi-backend |
| Memory | 2-4 GB | 4-8 GB | 8-16 GB |
| Quality | Good | Excellent | Excellent |
| Use Case | Fast detection | Precision | Flexibility |

Choose DocLayout-YOLO if you need fast detection for batch processing.
Choose Qwen Layout if you need flexible categories or better quality.


API Reference

DocLayoutYOLO.extract()

def extract(image: Union[Image.Image, np.ndarray, str, Path]) -> LayoutOutput:
    """
    Run layout extraction on an image.

    Args:
        image: Input image (PIL Image, numpy array, or path)

    Returns:
        LayoutOutput with detected layout boxes
    """

LayoutOutput Properties

result = extractor.extract(image)

# Basic properties
result.bboxes              # List[LayoutBox] - all detections
result.element_count       # Number of elements
result.labels_found        # Set of unique labels
result.image_width         # Source image width
result.image_height        # Source image height
result.model_name          # "DocLayout-YOLO"

# Filter methods
result.filter_by_label(label)        # Filter by LayoutLabel
result.filter_by_confidence(min_conf) # Filter by confidence

# Coordinate conversion
result.get_normalized_bboxes()  # Convert to 0-1024 scale
result.sort_by_position()       # Sort by reading order

# Export
result.to_dict()           # Convert to dictionary
result.visualize(image)    # Create visualization
result.save_json(path)     # Save to JSON file
result.load_json(path)     # Load from JSON file

LayoutBox Properties

for box in result.bboxes:
    box.label             # LayoutLabel enum
    box.bbox              # BoundingBox object
    box.confidence        # float (0-1)
    box.class_id          # int - YOLO class ID
    box.original_label    # str - original YOLO label
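
original_label keeps the raw YOLO class name alongside the shared LayoutLabel mapping, which is handy when debugging category assignments. A short sketch using only the fields above:

# Print the raw-to-mapped label correspondence for confident detections
for box in result.bboxes:
    if box.confidence < 0.5:
        continue
    print(f"class {box.class_id}: {box.original_label} -> {box.label.value}")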

BoundingBox Methods

bbox = box.bbox

# Access coordinates
bbox.x1, bbox.y1          # Top-left corner
bbox.x2, bbox.y2          # Bottom-right corner
bbox.width                # Width in pixels
bbox.height               # Height in pixels
bbox.area                 # Area in pixels²
bbox.center               # (center_x, center_y) tuple

# Convert formats
bbox.to_list()            # [x1, y1, x2, y2]
bbox.to_xyxy()            # (x1, y1, x2, y2)
bbox.to_xywh()            # (x, y, width, height)

# Normalize to 0-1024 range
normalized = bbox.to_normalized(image_width, image_height)

# Convert back to absolute
absolute = normalized.to_absolute(image_width, image_height)
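
As a quick check of the 0-1024 convention, normalizing and converting back should reproduce the original pixel coordinates (up to rounding):

box = result.bboxes[0]

norm = box.bbox.to_normalized(result.image_width, result.image_height)
restored = norm.to_absolute(result.image_width, result.image_height)

print(box.bbox.to_list())   # original pixel coordinates
print(norm.to_list())       # same box on the 0-1024 scale
print(restored.to_list())   # round trip back to pixels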

Advanced Usage

Reading Order Detection

# Sort detections into reading order (top to bottom, left to right)
sorted_result = result.sort_by_position(top_to_bottom=True)

for i, box in enumerate(sorted_result.bboxes):
    print(f"{i+1}. {box.label.value} at ({box.bbox.y1:.0f}, {box.bbox.x1:.0f})")

Region-Based Processing

# Get all elements in upper half of page
upper_half = [
    box for box in result.bboxes
    if box.bbox.y1 < result.image_height // 2
]

# Get all large elements (> 1/4 page width)
page_width = result.image_width
large_elements = [
    box for box in result.bboxes
    if box.bbox.width > page_width // 4
]

print(f"Upper half: {len(upper_half)} elements")
print(f"Large: {len(large_elements)} elements")

Export to Different Formats

# Save as JSON for downstream processing
result.save_json("layout.json")

# Convert to dict for custom serialization
layout_dict = result.to_dict()

# Export to COCO format (for computer vision tools)
coco_format = {
    "images": [{
        "id": 0,
        "width": result.image_width,
        "height": result.image_height,
        "file_name": "document.png"
    }],
    "annotations": [
        {
            "id": i,
            "image_id": 0,
            "category_id": box.class_id,
            "bbox": list(box.bbox.to_xywh()),  # COCO format: [x, y, w, h]
            "area": box.bbox.area,
            "iscrowd": 0,
        }
        for i, box in enumerate(result.bboxes)
    ],
}
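
Most COCO tooling also expects a top-level categories list; one way to derive it from the detections themselves (using the class_id and original_label fields from the API reference) and write the file:

import json

coco_format["categories"] = [
    {"id": cid, "name": name}
    for cid, name in sorted({(b.class_id, b.original_label) for b in result.bboxes})
]

with open("layout_coco.json", "w") as f:
    json.dump(coco_format, f, indent=2)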

Integration with Text Extraction

Pipeline: Layout + OCR

from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig
from PIL import Image

# Step 1: Detect layout
layout_result = extractor.extract(image)

# Step 2: Extract text from regions
ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))

for box in layout_result.bboxes:
    if box.label.value in ["text", "title"]:
        # Crop region
        x1, y1, x2, y2 = box.bbox.to_xyxy()
        region = image.crop((x1, y1, x2, y2))

        # OCR the region
        ocr_result = ocr.extract(region)

        print(f"{box.label.value}: {ocr_result.full_text}")

Pipeline: Layout + VLM

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Step 1: Detect layout
layout_result = extractor.extract(image)

# Step 2: Extract text per element
extractor_qwen = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(device="cuda")
)

for i, box in enumerate(layout_result.bboxes):
    # Crop region
    x1, y1, x2, y2 = box.bbox.to_xyxy()
    region = image.crop((x1, y1, x2, y2))

    # Extract with Qwen
    result = extractor_qwen.extract(region)

    print(f"Element {i} ({box.label.value}):")
    print(result.content)
    print()

See Also