Overview

Dots OCR text extractor and backend configurations.

Available backends:

- PyTorch: DotsOCRPyTorchConfig (local GPU inference)
- VLLM: DotsOCRVLLMConfig (offline batch inference)
- API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)

api

API backend configuration for Dots OCR (VLLM online server).

DotsOCRAPIConfig

Bases: BaseModel

API backend configuration for Dots OCR.

This config is for accessing a deployed VLLM server via an OpenAI-compatible API. It is typically used with the modal_dotsocr_vllm_online.py deployment.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
    model="dotsocr",
    api_base="https://your-modal-app.modal.run/v1",
    api_key="optional-key",
)
extractor = DotsOCRTextExtractor(backend=config)
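
The same config works against any OpenAI-compatible VLLM endpoint, not just Modal. A minimal sketch against a locally served model; the vllm serve invocation and flag are standard vLLM CLI usage, and the served model name must match the config's model field:

# Assumes a local server started with something like:
#   vllm serve rednote-hilab/dots.ocr --served-model-name dotsocr --port 8000
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
    model="dotsocr",                      # must match --served-model-name
    api_base="http://localhost:8000/v1",  # OpenAI-compatible endpoint
    api_key="unused",                     # local servers often accept any key
)
extractor = DotsOCRTextExtractor(backend=config)
print(extractor.extract("page.png").content)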

extractor

Dots OCR text extractor with layout-aware extraction.

A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
    print(f"{elem.category}: {elem.bbox}")

DotsOCRTextExtractor

DotsOCRTextExtractor(backend: DotsOCRBackendConfig)

Bases: BaseTextExtractor

Dots OCR Vision-Language Model text extractor with layout detection.

Extracts text from document images with layout information, including:

- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation

Supports PyTorch, VLLM, and API backends.
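
Because bounding boxes are reported on a fixed 0-1024 grid rather than in pixels, mapping them back onto the source image is a plain rescale by the image's width and height. A minimal sketch; the (x0, y0, x1, y1) corner order is an illustrative assumption, not something this reference specifies:

def bbox_to_pixels(bbox, image_size):
    """Rescale a 0-1024-normalized bbox to pixel coordinates."""
    w, h = image_size      # PIL convention: (width, height)
    x0, y0, x1, y1 = bbox  # assumed corner order
    return (x0 * w / 1024, y0 * h / 1024, x1 * w / 1024, y1 * h / 1024)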

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)

Initialize Dots OCR text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:

- DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
- DotsOCRVLLMConfig: VLLM high-throughput backend
- DotsOCRAPIConfig: API backend (online VLLM server)

TYPE: DotsOCRBackendConfig

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def __init__(self, backend: DotsOCRBackendConfig):
    """
    Initialize Dots OCR text extractor.

    Args:
        backend: Backend configuration. One of:
            - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
            - DotsOCRVLLMConfig: VLLM high-throughput backend
            - DotsOCRAPIConfig: API backend (online VLLM server)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._model: Any = None
    self._loaded = False

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput

Extract text from image using Dots OCR.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown", "html", or "json")

TYPE: Literal['markdown', 'html', 'json'] DEFAULT: 'markdown'

include_layout

Include layout bounding boxes in output

TYPE: bool DEFAULT: False

custom_prompt

Override default extraction prompt

TYPE: Optional[str] DEFAULT: None

max_tokens

Maximum tokens for generation

TYPE: int DEFAULT: 8192

RETURNS DESCRIPTION
DotsOCRTextOutput

DotsOCRTextOutput with extracted content and optional layout

RAISES DESCRIPTION
RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["markdown", "html", "json"] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput:
    """
    Extract text from image using Dots OCR.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown", "html", or "json")
        include_layout: Include layout bounding boxes in output
        custom_prompt: Override default extraction prompt
        max_tokens: Maximum tokens for generation

    Returns:
        DotsOCRTextOutput with extracted content and optional layout

    Raises:
        RuntimeError: If model is not loaded or inference fails
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    img = self._prepare_image(image)

    # Get prompt
    prompt = custom_prompt or DOTS_OCR_PROMPT

    # Run inference based on backend
    config_type = type(self.backend_config).__name__

    if config_type == "DotsOCRPyTorchConfig":
        raw_output = self._infer_pytorch(img, prompt, max_tokens)
    elif config_type == "DotsOCRVLLMConfig":
        raw_output = self._infer_vllm(img, prompt, max_tokens)
    elif config_type == "DotsOCRAPIConfig":
        raw_output = self._infer_api(img, prompt, max_tokens)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse output
    return self._parse_output(
        raw_output,
        img.size,
        output_format,
        include_layout,
    )
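
Since every backend funnels through the same parsing step, output_format only changes how the recovered text is serialized, and include_layout controls whether boxes are attached. A short sketch cycling through the three formats, reusing the extractor from the example above (the input file name is hypothetical):

from pathlib import Path

for fmt, ext in (("markdown", "md"), ("html", "html"), ("json", "json")):
    result = extractor.extract("page.png", output_format=fmt, include_layout=True)
    Path(f"page.{ext}").write_text(result.content)
    print(fmt, result.num_layout_elements)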

pytorch

PyTorch backend configuration for Dots OCR.

DotsOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Dots OCR.

Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
    torch_dtype="bfloat16",
)
extractor = DotsOCRTextExtractor(backend=config)
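
Because every layout element carries one of the 11 predefined category labels, filtering a page down to a single region type is a one-liner. A small sketch, reusing the elem.category and elem.bbox attributes from the extractor example above:

result = extractor.extract(image, include_layout=True)

# Keep only table regions; "Table" is one of the 11 predefined categories.
tables = [elem for elem in result.layout if elem.category == "Table"]
for table in tables:
    print(table.bbox)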

vllm

VLLM backend configuration for Dots OCR.

DotsOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for Dots OCR.

VLLM provides high-throughput inference with optimizations like:

- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)
extractor = DotsOCRTextExtractor(backend=config)
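
Here tensor_parallel_size=2 shards the model across two GPUs, and gpu_memory_utilization=0.9 lets VLLM reserve 90% of each GPU's memory for weights and KV cache. A sketch of offline batch use under those settings, assuming a directory of page images; the loop simply calls extract per page and leaves any internal batching to the backend:

from pathlib import Path

results = {}
for page in sorted(Path("scans").glob("*.png")):  # hypothetical input dir
    results[page.name] = extractor.extract(page, output_format="markdown")

for name, result in results.items():
    print(name, len(result.content))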