Overview

Dots OCR text extractor and backend configurations.

Available backends:

- PyTorch: DotsOCRPyTorchConfig (local GPU inference)
- VLLM: DotsOCRVLLMConfig (offline batch inference)
- API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)

api

API backend configuration for Dots OCR (VLLM online server).

DotsOCRAPIConfig

Bases: BaseModel

API backend configuration for Dots OCR.

This config is for accessing a deployed VLLM server via an OpenAI-compatible API. It is typically used with the modal_dotsocr_vllm_online.py deployment.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
    model="dotsocr",
    api_base="https://your-modal-app.modal.run/v1",
    api_key="optional-key",
)
extractor = DotsOCRTextExtractor(backend=config)
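
The same config works against any OpenAI-compatible VLLM endpoint, not just Modal. A minimal sketch against a locally served model; the vllm serve invocation and flag are standard vLLM CLI usage, and the served model name must match the config's model field:

# Assumes a local server started with something like:
#   vllm serve rednote-hilab/dots.ocr --served-model-name dotsocr --port 8000
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
    model="dotsocr",                      # must match --served-model-name
    api_base="http://localhost:8000/v1",  # OpenAI-compatible endpoint
    api_key="unused",                     # local servers often accept any key
)
extractor = DotsOCRTextExtractor(backend=config)
print(extractor.extract("page.png").content)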

extractor

Dots OCR text extractor with layout-aware extraction.

A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
    print(f"{elem.category}: {elem.bbox}")

DotsOCRTextExtractor

DotsOCRTextExtractor(backend: DotsOCRBackendConfig)

Bases: BaseTextExtractor

Dots OCR Vision-Language Model text extractor with layout detection.

Extracts text from document images with layout information, including:

- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation

Supports PyTorch, VLLM, and API backends.
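
Because bounding boxes are reported on a fixed 0-1024 grid rather than in pixels, mapping them back onto the source image is a plain rescale by the image's width and height. A minimal sketch; the (x0, y0, x1, y1) corner order is an illustrative assumption, not something this reference specifies:

def bbox_to_pixels(bbox, image_size):
    """Rescale a 0-1024-normalized bbox to pixel coordinates."""
    w, h = image_size      # PIL convention: (width, height)
    x0, y0, x1, y1 = bbox  # assumed corner order
    return (x0 * w / 1024, y0 * h / 1024, x1 * w / 1024, y1 * h / 1024)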

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)

Initialize Dots OCR text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:

- DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
- DotsOCRVLLMConfig: VLLM high-throughput backend
- DotsOCRAPIConfig: API backend (online VLLM server)

TYPE: DotsOCRBackendConfig

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def __init__(self, backend: DotsOCRBackendConfig):
    """
    Initialize Dots OCR text extractor.

    Args:
        backend: Backend configuration. One of:
            - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
            - DotsOCRVLLMConfig: VLLM high-throughput backend
            - DotsOCRAPIConfig: API backend (online VLLM server)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._model: Any = None
    self._loaded = False

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput

Extract text from image using Dots OCR.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown", "html", or "json")

TYPE: Literal['markdown', 'html', 'json'] DEFAULT: 'markdown'

include_layout

Include layout bounding boxes in output

TYPE: bool DEFAULT: False

custom_prompt

Override default extraction prompt

TYPE: Optional[str] DEFAULT: None

max_tokens

Maximum tokens for generation

TYPE: int DEFAULT: 8192

RETURNS DESCRIPTION
DotsOCRTextOutput

DotsOCRTextOutput with extracted content and optional layout

RAISES DESCRIPTION
RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["markdown", "html", "json"] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput:
    """
    Extract text from image using Dots OCR.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown", "html", or "json")
        include_layout: Include layout bounding boxes in output
        custom_prompt: Override default extraction prompt
        max_tokens: Maximum tokens for generation

    Returns:
        DotsOCRTextOutput with extracted content and optional layout

    Raises:
        RuntimeError: If model is not loaded or inference fails
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    img = self._prepare_image(image)

    # Get prompt
    prompt = custom_prompt or DOTS_OCR_PROMPT

    # Run inference based on backend
    config_type = type(self.backend_config).__name__

    if config_type == "DotsOCRPyTorchConfig":
        raw_output = self._infer_pytorch(img, prompt, max_tokens)
    elif config_type == "DotsOCRVLLMConfig":
        raw_output = self._infer_vllm(img, prompt, max_tokens)
    elif config_type == "DotsOCRAPIConfig":
        raw_output = self._infer_api(img, prompt, max_tokens)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse output
    return self._parse_output(
        raw_output,
        img.size,
        output_format,
        include_layout,
    )
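
Since every backend funnels through the same parsing step, output_format only changes how the recovered text is serialized, and include_layout controls whether boxes are attached. A short sketch cycling through the three formats, reusing the extractor from the example above (the input file name is hypothetical):

from pathlib import Path

for fmt, ext in (("markdown", "md"), ("html", "html"), ("json", "json")):
    result = extractor.extract("page.png", output_format=fmt, include_layout=True)
    Path(f"page.{ext}").write_text(result.content)
    print(fmt, result.num_layout_elements)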

pytorch

PyTorch backend configuration for Dots OCR.

DotsOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Dots OCR.

Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
    model="rednote-hilab/dots.ocr",
    device="cuda",
    torch_dtype="bfloat16",
)
extractor = DotsOCRTextExtractor(backend=config)
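
Because every layout element carries one of the 11 predefined category labels, filtering a page down to a single region type is a one-liner. A small sketch, reusing the elem.category and elem.bbox attributes from the extractor example above:

result = extractor.extract(image, include_layout=True)

# Keep only table regions; "Table" is one of the 11 predefined categories.
tables = [elem for elem in result.layout if elem.category == "Table"]
for table in tables:
    print(table.bbox)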

vllm

VLLM backend configuration for Dots OCR.

DotsOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for Dots OCR.

VLLM provides high-throughput inference with optimizations like:

- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)
extractor = DotsOCRTextExtractor(backend=config)
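
Here tensor_parallel_size=2 shards the model across two GPUs, and gpu_memory_utilization=0.9 lets VLLM reserve 90% of each GPU's memory for weights and KV cache. A sketch of offline batch use under those settings, assuming a directory of page images; the loop simply calls extract per page and leaves any internal batching to the backend:

from pathlib import Path

results = {}
for page in sorted(Path("scans").glob("*.png")):  # hypothetical input dir
    results[page.name] = extractor.extract(page, output_format="markdown")

for name, result in results.items():
    print(name, len(result.content))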