
RT-DETR

RT-DETR layout extractor.

A transformer-based real-time detection model for document layout detection. Uses HuggingFace Transformers implementation.

Model: HuggingPanda/docling-layout

RTDETRConfig

Bases: BaseModel

Configuration for RT-DETR layout extractor.

This is a single-backend model (PyTorch/Transformers only).

Example
config = RTDETRConfig(device="cuda", confidence=0.4)
extractor = RTDETRLayoutExtractor(config=config)

RTDETRLayoutExtractor

RTDETRLayoutExtractor(config: RTDETRConfig)

Bases: BaseLayoutExtractor

RT-DETR layout extractor using HuggingFace Transformers.

A transformer-based real-time detection model for document layout. Detects: title, text, table, figure, list, formula, caption, header, and footer regions.

This is a single-backend model (PyTorch/Transformers only).

Example
from omnidocs.tasks.layout_extraction import RTDETRLayoutExtractor, RTDETRConfig

extractor = RTDETRLayoutExtractor(config=RTDETRConfig(device="cuda"))
result = extractor.extract(image)

for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
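Each box in `result.bboxes` carries a standardized `label` and a `confidence` score, so common post-processing (filtering out weak detections, counting region types) reduces to plain list operations. A minimal sketch, using `(label, confidence)` tuples as stand-ins for the `LayoutBox` objects above; the detections and the 0.4 threshold are illustrative:

```python
from collections import Counter

# Hypothetical detections standing in for result.bboxes:
# each tuple is (label, confidence), as printed in the loop above.
detections = [
    ("title", 0.92),
    ("text", 0.88),
    ("table", 0.47),
    ("text", 0.35),
]

# Keep only confident detections (threshold value is illustrative).
confident = [(label, conf) for label, conf in detections if conf >= 0.4]

# Count how many regions of each type survived.
counts = Counter(label for label, _ in confident)
print(counts)
```

With real output, the same pattern applies directly to `result.bboxes`, filtering on `box.confidence` and counting `box.label.value`.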

Initialize RT-DETR layout extractor.

PARAMETER DESCRIPTION
config

Configuration object with device, model settings, etc.

TYPE: RTDETRConfig

Source code in omnidocs/tasks/layout_extraction/rtdetr.py
def __init__(self, config: RTDETRConfig):
    """
    Initialize RT-DETR layout extractor.

    Args:
        config: Configuration object with device, model settings, etc.
    """
    self.config = config
    self._model = None
    self._processor = None
    self._device = self._resolve_device(config.device)
    self._model_path = self._resolve_model_path(config.model_path)

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
) -> LayoutOutput

Run layout extraction on an image.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or path)

TYPE: Union[Image, ndarray, str, Path]

RETURNS DESCRIPTION
LayoutOutput

LayoutOutput with detected layout boxes

Source code in omnidocs/tasks/layout_extraction/rtdetr.py
def extract(self, image: Union[Image.Image, np.ndarray, str, Path]) -> LayoutOutput:
    """
    Run layout extraction on an image.

    Args:
        image: Input image (PIL Image, numpy array, or path)

    Returns:
        LayoutOutput with detected layout boxes
    """
    import torch

    if self._model is None or self._processor is None:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    img_width, img_height = pil_image.size

    # Preprocess
    inputs = self._processor(
        images=pil_image,
        return_tensors="pt",
        size={"height": self.config.image_size, "width": self.config.image_size},
    )

    # Move to device
    inputs = {k: v.to(self._device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

    # Run inference
    with torch.no_grad():
        outputs = self._model(**inputs)

    # Post-process results
    target_sizes = torch.tensor([[img_height, img_width]])
    results = self._processor.post_process_object_detection(
        outputs,
        target_sizes=target_sizes,
        threshold=self.config.confidence,
    )[0]

    # Parse detections
    layout_boxes = []

    for score, label_id, box in zip(results["scores"], results["labels"], results["boxes"]):
        confidence = float(score.item())
        class_id = int(label_id.item())

        # Get original label from model config
        # Note: The model outputs 0-indexed class IDs, but id2label has background at index 0,
        # so we add 1 to map correctly (e.g., model output 8 -> id2label[9] = "Table")
        original_label = self._model.config.id2label.get(class_id + 1, f"class_{class_id}")

        # Map to standardized label
        standard_label = RTDETR_MAPPING.to_standard(original_label)

        # Box coordinates
        box_coords = box.cpu().tolist()

        layout_boxes.append(
            LayoutBox(
                label=standard_label,
                bbox=BoundingBox.from_list(box_coords),
                confidence=confidence,
                class_id=class_id,
                original_label=original_label,
            )
        )

    # Sort by y-coordinate (top to bottom reading order)
    layout_boxes.sort(key=lambda b: (b.bbox.y1, b.bbox.x1))

    return LayoutOutput(
        bboxes=layout_boxes,
        image_width=img_width,
        image_height=img_height,
        model_name="RT-DETR (docling-layout)",
    )