Overview

Text Extraction Module.

Provides extractors for converting document images to structured text formats (HTML, Markdown, JSON). Uses Vision-Language Models for accurate text extraction with formatting preservation and optional layout detection.

Available Extractors
  • QwenTextExtractor: Qwen3-VL based extractor (multi-backend)
  • DotsOCRTextExtractor: Dots OCR with layout-aware extraction (PyTorch/VLLM/API)
  • NanonetsTextExtractor: Nanonets OCR2-3B for text extraction (PyTorch/VLLM)
  • GraniteDoclingTextExtractor: IBM Granite Docling for document conversion (multi-backend)
  • MinerUVLTextExtractor: MinerU VL 1.2B with layout-aware two-step extraction (multi-backend)
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
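For multi-page inputs, the same extractor can be driven page by page via extract_document (documented below). A minimal sketch, assuming Document is importable from the omnidocs package root (adjust the import to your installation) and that paper.pdf exists locally:

from pathlib import Path

from omnidocs import Document  # import path assumed; adjust if Document lives elsewhere
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")

out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)
for i, page in enumerate(results, start=1):
    # One Markdown file per page, e.g. pages/page_001.md
    (out_dir / f"page_{i:03d}.md").write_text(page.content, encoding="utf-8")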

BaseTextExtractor

Bases: ABC

Abstract base class for text extractors.

All text extraction models must inherit from this class and implement the required methods.

Example
class MyTextExtractor(BaseTextExtractor):
    def __init__(self, config: MyConfig):
        self.config = config
        self._load_model()

    def _load_model(self):
        # Load model weights
        pass

    def extract(self, image, output_format="markdown"):
        # Run extraction
        return TextOutput(...)

extract abstractmethod

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as:
  • PIL.Image.Image: PIL image object
  • np.ndarray: Numpy array (HWC format, RGB)
  • str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format: - "html": Structured HTML - "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
ValueError

If image format or output_format is not supported

RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/base.py
@abstractmethod
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        ValueError: If image format or output_format is not supported
        RuntimeError: If model is not loaded or inference fails
    """
    pass

batch_extract

batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from multiple images.

Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).

PARAMETER DESCRIPTION
images

List of images in any supported format

TYPE: List[Union[Image, ndarray, str, Path]]

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

Optional function(current, total) for progress

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput in same order as input

Examples:

images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def batch_extract(
    self,
    images: List[Union[Image.Image, np.ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from multiple images.

    Default implementation loops over extract(). Subclasses can override
    for optimized batching (e.g., VLLM).

    Args:
        images: List of images in any supported format
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput in same order as input

    Examples:
        ```python
        images = [doc.get_page(i) for i in range(doc.page_count)]
        results = extractor.batch_extract(images, output_format="markdown")
        ```
    """
    results = []
    total = len(images)

    for i, image in enumerate(images):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(image, output_format=output_format)
        results.append(result)

    return results
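The progress_callback hook receives (current, total) once per image before it is processed. A minimal sketch wiring it to a simple console progress line, assuming extractor is any extractor from this module and images is a list of page images as in the example above:

def print_progress(current: int, total: int) -> None:
    # Called once per image before it is processed
    print(f"extracting page {current}/{total}", end="\r")

results = extractor.batch_extract(
    images,
    output_format="markdown",
    progress_callback=print_progress,
)
print()  # finish the progress line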

extract_document

extract_document(
    document: Document,
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from all pages of a document.

PARAMETER DESCRIPTION
document

Document instance

TYPE: Document

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

Optional function(current, total) for progress

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput, one per page

Examples:

doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def extract_document(
    self,
    document: "Document",
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from all pages of a document.

    Args:
        document: Document instance
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput, one per page

    Examples:
        ```python
        doc = Document.from_pdf("paper.pdf")
        results = extractor.extract_document(doc, output_format="markdown")
        ```
    """
    results = []
    total = document.page_count

    for i, page in enumerate(document.iter_pages()):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(page, output_format=output_format)
        results.append(result)

    return results

DotsOCRTextExtractor

DotsOCRTextExtractor(backend: DotsOCRBackendConfig)

Bases: BaseTextExtractor

Dots OCR Vision-Language Model text extractor with layout detection.

Extracts text from document images with layout information including:
  • 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
  • Bounding boxes (normalized to 0-1024)
  • Multi-format text (Markdown, LaTeX, HTML)
  • Reading order preservation

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)

Initialize Dots OCR text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
  • DotsOCRVLLMConfig: VLLM high-throughput backend
  • DotsOCRAPIConfig: API backend (online VLLM server)

TYPE: DotsOCRBackendConfig

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def __init__(self, backend: DotsOCRBackendConfig):
    """
    Initialize Dots OCR text extractor.

    Args:
        backend: Backend configuration. One of:
            - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
            - DotsOCRVLLMConfig: VLLM high-throughput backend
            - DotsOCRAPIConfig: API backend (online VLLM server)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._model: Any = None
    self._loaded = False

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput

Extract text from image using Dots OCR.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown", "html", or "json")

TYPE: Literal['markdown', 'html', 'json'] DEFAULT: 'markdown'

include_layout

Include layout bounding boxes in output

TYPE: bool DEFAULT: False

custom_prompt

Override default extraction prompt

TYPE: Optional[str] DEFAULT: None

max_tokens

Maximum tokens for generation

TYPE: int DEFAULT: 8192

RETURNS DESCRIPTION
DotsOCRTextOutput

DotsOCRTextOutput with extracted content and optional layout

RAISES DESCRIPTION
RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["markdown", "html", "json"] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput:
    """
    Extract text from image using Dots OCR.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown", "html", or "json")
        include_layout: Include layout bounding boxes in output
        custom_prompt: Override default extraction prompt
        max_tokens: Maximum tokens for generation

    Returns:
        DotsOCRTextOutput with extracted content and optional layout

    Raises:
        RuntimeError: If model is not loaded or inference fails
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    img = self._prepare_image(image)

    # Get prompt
    prompt = custom_prompt or DOTS_OCR_PROMPT

    # Run inference based on backend
    config_type = type(self.backend_config).__name__

    if config_type == "DotsOCRPyTorchConfig":
        raw_output = self._infer_pytorch(img, prompt, max_tokens)
    elif config_type == "DotsOCRVLLMConfig":
        raw_output = self._infer_vllm(img, prompt, max_tokens)
    elif config_type == "DotsOCRAPIConfig":
        raw_output = self._infer_api(img, prompt, max_tokens)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse output
    return self._parse_output(
        raw_output,
        img.size,
        output_format,
        include_layout,
    )
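Because custom_prompt replaces the default extraction prompt entirely, it can be used to narrow the task. A sketch assuming an initialized DotsOCRTextExtractor; the prompt wording is illustrative only, and the structure of the model's response under a custom prompt is not guaranteed by the extractor API:

# Ask the model to focus on tables; behaviour beyond this prompt is model-dependent.
result = extractor.extract(
    image,
    output_format="markdown",
    custom_prompt="Extract only the tables on this page and return them as HTML.",
    max_tokens=4096,
)
print(result.content)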

GraniteDoclingTextExtractor

GraniteDoclingTextExtractor(
    backend: GraniteDoclingTextBackendConfig,
)

Bases: BaseTextExtractor

Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.

Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs the DocTags format, which is converted to Markdown using the docling_core library.

Example

from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)

config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)

Initialize Granite Docling extractor with backend configuration.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API config)

TYPE: GraniteDoclingTextBackendConfig

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def __init__(self, backend: GraniteDoclingTextBackendConfig):
    """
    Initialize Granite Docling extractor with backend configuration.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API config)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded: bool = False

    # Backend-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image using Granite Docling.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown" or "html")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image using Granite Docling.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown" or "html")

    Returns:
        TextOutput with extracted content
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Dispatch to backend-specific inference
    config_type = type(self.backend_config).__name__

    if config_type == "GraniteDoclingTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "GraniteDoclingTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "GraniteDoclingTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    elif config_type == "GraniteDoclingTextAPIConfig":
        raw_output = self._infer_api(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Convert DocTags to Markdown
    markdown_output = self._convert_doctags_to_markdown(raw_output, pil_image)

    # For HTML output, wrap in basic HTML structure
    if output_format == "html":
        content = f"<html><body>\n{markdown_output}\n</body></html>"
    else:
        content = markdown_output

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=self._extract_plain_text(markdown_output),
        image_width=width,
        image_height=height,
        model_name=f"Granite-Docling-258M ({config_type.replace('Config', '')})",
    )

MinerUVLTextExtractor

MinerUVLTextExtractor(backend: MinerUVLTextBackendConfig)

Bases: BaseTextExtractor

MinerU VL text extractor with layout-aware extraction.

Performs two-step extraction:
  1. Layout detection (detect regions)
  2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:
  • PyTorch (HuggingFace Transformers)
  • VLLM (high-throughput GPU)
  • MLX (Apple Silicon)
  • API (VLLM OpenAI-compatible server)

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

print(result.content)  # Combined text + tables + equations
print(result.blocks)   # List of ContentBlock objects

Initialize MinerU VL text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API)

TYPE: MinerUVLTextBackendConfig

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def __init__(self, backend: MinerUVLTextBackendConfig):
    """
    Initialize MinerU VL text extractor.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API)
    """
    self.backend_config = backend
    self._client = None
    self._loaded = False
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text with layout-aware two-step extraction.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ('html' or 'markdown')

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content and metadata

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text with layout-aware two-step extraction.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ('html' or 'markdown')

    Returns:
        TextOutput with extracted content and metadata
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Step 1: Layout detection
    blocks = self._detect_layout(pil_image)

    # Step 2: Content extraction for each block
    blocks = self._extract_content(pil_image, blocks)

    # Post-process (OTSL to HTML for tables)
    blocks = simple_post_process(blocks)

    # Combine content
    content = self._combine_content(blocks, output_format)

    # Build raw output with blocks info
    raw_output = self._build_raw_output(blocks)

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

extract_with_blocks

extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]

Extract text and return both TextOutput and ContentBlocks.

This method provides access to the detailed block information including bounding boxes and block types.

PARAMETER DESCRIPTION
image

Input image

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
tuple[TextOutput, List[ContentBlock]]

Tuple of (TextOutput, List[ContentBlock])

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract_with_blocks(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]:
    """
    Extract text and return both TextOutput and ContentBlocks.

    This method provides access to the detailed block information
    including bounding boxes and block types.

    Args:
        image: Input image
        output_format: Output format

    Returns:
        Tuple of (TextOutput, List[ContentBlock])
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Two-step extraction
    blocks = self._detect_layout(pil_image)
    blocks = self._extract_content(pil_image, blocks)
    blocks = simple_post_process(blocks)

    content = self._combine_content(blocks, output_format)
    raw_output = self._build_raw_output(blocks)

    text_output = TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

    return text_output, blocks
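extract_with_blocks is the method to use when per-region geometry matters. A sketch assuming an initialized MinerUVLTextExtractor; the ContentBlock field names used below (type, bbox, content) are assumptions for illustration and should be checked against the ContentBlock model:

text_output, blocks = extractor.extract_with_blocks(image, output_format="markdown")

print(text_output.content)  # combined page content
for block in blocks:
    # Field names are assumptions; see the ContentBlock model for the actual schema.
    print(block.type, block.bbox, (block.content or "")[:60])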

DotsOCRTextOutput

Bases: BaseModel

Text extraction output from Dots OCR with layout information.

Dots OCR provides structured output with:
  • Layout detection (11 categories)
  • Bounding boxes (normalized to 0-1024)
  • Multi-format text (Markdown/LaTeX/HTML)
  • Reading order preservation

Layout Categories

Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title

Text Formatting
  • Text/Title/Section-header: Markdown
  • Formula: LaTeX
  • Table: HTML
  • Picture: (text omitted)
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
result = extractor.extract(image, include_layout=True)
print(result.content)  # Full text with formatting
for elem in result.layout:
        print(f"{elem.category}: {elem.bbox}")

num_layout_elements property

num_layout_elements: int

Number of detected layout elements.

content_length property

content_length: int

Length of extracted content in characters.

LayoutElement

Bases: BaseModel

Single layout element from document layout detection.

Represents a detected region in the document with its bounding box, category label, and extracted text content.

ATTRIBUTE DESCRIPTION
bbox

Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024)

TYPE: List[int]

category

Layout category (e.g., "Text", "Title", "Table", "Formula")

TYPE: str

text

Extracted text content (None for pictures)

TYPE: Optional[str]

confidence

Detection confidence score (optional)

TYPE: Optional[float]
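Because bbox values are normalized to a 0-1024 grid, they must be rescaled before cropping or drawing on the original image. A minimal sketch, assuming an initialized DotsOCRTextExtractor; the scaling itself is an assumption based on the stated 0-1024 normalization:

from PIL import Image

def bbox_to_pixels(bbox, image_width, image_height):
    # Rescale [x1, y1, x2, y2] from the 0-1024 grid to pixel coordinates.
    x1, y1, x2, y2 = bbox
    sx = image_width / 1024
    sy = image_height / 1024
    return [int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy)]

page = Image.open("page.png")
result = extractor.extract(page, include_layout=True)
for elem in result.layout:
    print(elem.category, bbox_to_pixels(elem.bbox, page.width, page.height))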

OutputFormat

Bases: str, Enum

Supported text extraction output formats.

Each format has different characteristics:
  • HTML: Structured with div elements, preserves layout semantics
  • MARKDOWN: Portable, human-readable, good for documentation
  • JSON: Structured data with layout information (Dots OCR)
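Since OutputFormat subclasses str, its members compare equal to their plain string values, so TextOutput.format can be checked either way. A small sketch, assuming the import path below and the member names listed above:

from omnidocs.tasks.text_extraction import OutputFormat  # import path assumed

fmt = OutputFormat("markdown")
print(fmt is OutputFormat.MARKDOWN)  # True: constructing from the value returns the member
print(fmt == "markdown")             # True: str Enum members compare equal to their value

if result.format == OutputFormat.MARKDOWN:
    print("result.content is Markdown")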

TextOutput

Bases: BaseModel

Text extraction output from a document image.

Contains the extracted text content in the requested format, along with optional raw output and plain text versions.

Example
result = extractor.extract(image, output_format="markdown")
print(result.content)  # Clean markdown
print(result.plain_text)  # Plain text without formatting

content_length property

content_length: int

Length of the extracted content in characters.

word_count property

word_count: int

Approximate word count of the plain text.
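The two convenience properties allow quick sanity checks on a result without reading the text itself. A short sketch, assuming a result from any extractor above:

result = extractor.extract(image, output_format="markdown")
print(f"{result.content_length} characters, ~{result.word_count} words")

# Flag pages where the model returned suspiciously little text.
if result.word_count < 5:
    print("Warning: page may be blank or extraction may have failed")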

NanonetsTextExtractor

NanonetsTextExtractor(backend: NanonetsTextBackendConfig)

Bases: BaseTextExtractor

Nanonets OCR2-3B Vision-Language Model text extractor.

Extracts text from document images with support for:
  • Tables (output as HTML)
  • Equations (output as LaTeX)
  • Image captions (wrapped in tags)
  • Watermarks (wrapped in tags)
  • Page numbers (wrapped in tags)
  • Checkboxes (using ☐ and ☑ symbols)

Supports PyTorch, VLLM, and MLX backends.

Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
    backend=NanonetsTextPyTorchConfig()
)

# Extract text
result = extractor.extract(image)
print(result.content)
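The class docstring notes that captions, watermarks, and page numbers come back wrapped in tags; the exact tag names are model-defined and not documented here. A sketch of stripping such wrapper tags, where the tag names <watermark> and <page_number> are placeholders used purely for illustration:

import re

# Placeholder tag names; replace with the tags actually emitted by the model.
NOISE_TAGS = ("watermark", "page_number")

cleaned = result.content
for tag in NOISE_TAGS:
    # Drop the wrapped content entirely, e.g. "<watermark>DRAFT</watermark>".
    cleaned = re.sub(rf"<{tag}>.*?</{tag}>", "", cleaned, flags=re.DOTALL)

print(cleaned.strip())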

Initialize Nanonets text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
  • NanonetsTextVLLMConfig: VLLM high-throughput backend
  • NanonetsTextMLXConfig: MLX backend for Apple Silicon

TYPE: NanonetsTextBackendConfig

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def __init__(self, backend: NanonetsTextBackendConfig):
    """
    Initialize Nanonets text extractor.

    Args:
        backend: Backend configuration. One of:
            - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
            - NanonetsTextVLLMConfig: VLLM high-throughput backend
            - NanonetsTextMLXConfig: MLX backend for Apple Silicon
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    # MLX-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.

PARAMETER DESCRIPTION
image

Input image as:
  • PIL.Image.Image: PIL image object
  • np.ndarray: Numpy array (HWC format, RGB)
  • str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format is not supported

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Note: Nanonets OCR2 produces a unified output format that includes
    tables as HTML and equations as LaTeX inline. The output_format
    parameter is accepted for API compatibility but does not change
    the output structure.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "NanonetsTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "NanonetsTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "NanonetsTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    cleaned_output = raw_output.replace("<|im_end|>", "").strip()

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned_output,
        image_width=width,
        image_height=height,
        model_name=f"Nanonets-OCR2-3B ({type(self.backend_config).__name__})",
    )

QwenTextExtractor

QwenTextExtractor(backend: QwenTextBackendConfig)

Bases: BaseTextExtractor

Qwen3-VL Vision-Language Model text extractor.

Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)

Initialize Qwen text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • QwenTextPyTorchConfig: PyTorch/HuggingFace backend
  • QwenTextVLLMConfig: VLLM high-throughput backend
  • QwenTextMLXConfig: MLX backend for Apple Silicon
  • QwenTextAPIConfig: API backend (OpenRouter, etc.)

TYPE: QwenTextBackendConfig

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def __init__(self, backend: QwenTextBackendConfig):
    """
    Initialize Qwen text extractor.

    Args:
        backend: Backend configuration. One of:
            - QwenTextPyTorchConfig: PyTorch/HuggingFace backend
            - QwenTextVLLMConfig: VLLM high-throughput backend
            - QwenTextMLXConfig: MLX backend for Apple Silicon
            - QwenTextAPIConfig: API backend (OpenRouter, etc.)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as:
  • PIL.Image.Image: PIL image object
  • np.ndarray: Numpy array (HWC format, RGB)
  • str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format or output_format is not supported

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML with div elements
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format or output_format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Get prompt for output format
    prompt = QWEN_PROMPTS[output_format]

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "QwenTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image, prompt)
    elif config_type == "QwenTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image, prompt)
    elif config_type == "QwenTextMLXConfig":
        raw_output = self._infer_mlx(pil_image, prompt)
    elif config_type == "QwenTextAPIConfig":
        raw_output = self._infer_api(pil_image, prompt)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    if output_format == "html":
        cleaned_output = _clean_html_output(raw_output)
    else:
        cleaned_output = _clean_markdown_output(raw_output)

    # Extract plain text
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"Qwen3-VL ({type(self.backend_config).__name__})",
    )

VLMTextExtractor

VLMTextExtractor(config: VLMAPIConfig)

Bases: BaseTextExtractor

Provider-agnostic VLM text extractor using litellm.

Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.

Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)

# Default extraction
result = extractor.extract("document.png", output_format="markdown")

# Custom prompt
result = extractor.extract(
    "document.png",
    prompt="Extract only the table data as markdown",
)
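Because VLMTextExtractor routes requests through litellm, switching providers only requires changing the model string and exporting the matching API key. A sketch under the assumption that the usual litellm conventions apply; the model identifier and environment variable below are examples, not an exhaustive list:

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# OpenAI via litellm; set the OPENAI_API_KEY environment variable before running.
extractor = VLMTextExtractor(config=VLMAPIConfig(model="gpt-4o"))
result = extractor.extract("document.png", output_format="html")
print(result.content)

# Anthropic, Azure, OpenRouter, etc. work the same way: change the litellm model
# string and export the matching API key environment variable.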

Initialize VLM text extractor.

PARAMETER DESCRIPTION
config

VLM API configuration with model and provider details.

TYPE: VLMAPIConfig

Source code in omnidocs/tasks/text_extraction/vlm.py
def __init__(self, config: VLMAPIConfig):
    """
    Initialize VLM text extractor.

    Args:
        config: VLM API configuration with model and provider details.
    """
    self.config = config
    self._loaded = True

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
    prompt: Optional[str] = None,
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path).

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format ("html" or "markdown").

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

prompt

Custom prompt. If None, uses a task-specific default prompt.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content.

Source code in omnidocs/tasks/text_extraction/vlm.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
    prompt: Optional[str] = None,
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image (PIL Image, numpy array, or file path).
        output_format: Desired output format ("html" or "markdown").
        prompt: Custom prompt. If None, uses a task-specific default prompt.

    Returns:
        TextOutput containing extracted text content.
    """
    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    final_prompt = prompt or DEFAULT_PROMPTS[output_format]
    raw_output = vlm_completion(self.config, final_prompt, pil_image)
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=raw_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"VLM ({self.config.model})",
    )

base

Base class for text extractors.

Defines the abstract interface that all text extractors must implement.

BaseTextExtractor

Bases: ABC

Abstract base class for text extractors.

All text extraction models must inherit from this class and implement the required methods.

Example
class MyTextExtractor(BaseTextExtractor):
    def __init__(self, config: MyConfig):
        self.config = config
        self._load_model()

    def _load_model(self):
        # Load model weights
        pass

    def extract(self, image, output_format="markdown"):
        # Run extraction
        return TextOutput(...)

extract abstractmethod

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as:
  • PIL.Image.Image: PIL image object
  • np.ndarray: Numpy array (HWC format, RGB)
  • str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format: - "html": Structured HTML - "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
ValueError

If image format or output_format is not supported

RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/base.py
@abstractmethod
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        ValueError: If image format or output_format is not supported
        RuntimeError: If model is not loaded or inference fails
    """
    pass

batch_extract

batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from multiple images.

Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).

PARAMETER DESCRIPTION
images

List of images in any supported format

TYPE: List[Union[Image, ndarray, str, Path]]

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

Optional function(current, total) for progress

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput in same order as input

Examples:

images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def batch_extract(
    self,
    images: List[Union[Image.Image, np.ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from multiple images.

    Default implementation loops over extract(). Subclasses can override
    for optimized batching (e.g., VLLM).

    Args:
        images: List of images in any supported format
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput in same order as input

    Examples:
        ```python
        images = [doc.get_page(i) for i in range(doc.page_count)]
        results = extractor.batch_extract(images, output_format="markdown")
        ```
    """
    results = []
    total = len(images)

    for i, image in enumerate(images):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(image, output_format=output_format)
        results.append(result)

    return results

extract_document

extract_document(
    document: Document,
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from all pages of a document.

PARAMETER DESCRIPTION
document

Document instance

TYPE: Document

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

Optional function(current, total) for progress

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput, one per page

Examples:

doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def extract_document(
    self,
    document: "Document",
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from all pages of a document.

    Args:
        document: Document instance
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput, one per page

    Examples:
        ```python
        doc = Document.from_pdf("paper.pdf")
        results = extractor.extract_document(doc, output_format="markdown")
        ```
    """
    results = []
    total = document.page_count

    for i, page in enumerate(document.iter_pages()):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(page, output_format=output_format)
        results.append(result)

    return results

dotsocr

Dots OCR text extractor and backend configurations.

Available backends:
  • PyTorch: DotsOCRPyTorchConfig (local GPU inference)
  • VLLM: DotsOCRVLLMConfig (offline batch inference)
  • API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)

DotsOCRAPIConfig

Bases: BaseModel

API backend configuration for Dots OCR.

This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
        model="dotsocr",
        api_base="https://your-modal-app.modal.run/v1",
        api_key="optional-key",
    )
extractor = DotsOCRTextExtractor(backend=config)

DotsOCRTextExtractor

DotsOCRTextExtractor(backend: DotsOCRBackendConfig)

Bases: BaseTextExtractor

Dots OCR Vision-Language Model text extractor with layout detection.

Extracts text from document images with layout information including:
  • 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
  • Bounding boxes (normalized to 0-1024)
  • Multi-format text (Markdown, LaTeX, HTML)
  • Reading order preservation

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)

Initialize Dots OCR text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
  • DotsOCRVLLMConfig: VLLM high-throughput backend
  • DotsOCRAPIConfig: API backend (online VLLM server)

TYPE: DotsOCRBackendConfig

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def __init__(self, backend: DotsOCRBackendConfig):
    """
    Initialize Dots OCR text extractor.

    Args:
        backend: Backend configuration. One of:
            - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
            - DotsOCRVLLMConfig: VLLM high-throughput backend
            - DotsOCRAPIConfig: API backend (online VLLM server)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._model: Any = None
    self._loaded = False

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput

Extract text from image using Dots OCR.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown", "html", or "json")

TYPE: Literal['markdown', 'html', 'json'] DEFAULT: 'markdown'

include_layout

Include layout bounding boxes in output

TYPE: bool DEFAULT: False

custom_prompt

Override default extraction prompt

TYPE: Optional[str] DEFAULT: None

max_tokens

Maximum tokens for generation

TYPE: int DEFAULT: 8192

RETURNS DESCRIPTION
DotsOCRTextOutput

DotsOCRTextOutput with extracted content and optional layout

RAISES DESCRIPTION
RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["markdown", "html", "json"] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput:
    """
    Extract text from image using Dots OCR.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown", "html", or "json")
        include_layout: Include layout bounding boxes in output
        custom_prompt: Override default extraction prompt
        max_tokens: Maximum tokens for generation

    Returns:
        DotsOCRTextOutput with extracted content and optional layout

    Raises:
        RuntimeError: If model is not loaded or inference fails
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    img = self._prepare_image(image)

    # Get prompt
    prompt = custom_prompt or DOTS_OCR_PROMPT

    # Run inference based on backend
    config_type = type(self.backend_config).__name__

    if config_type == "DotsOCRPyTorchConfig":
        raw_output = self._infer_pytorch(img, prompt, max_tokens)
    elif config_type == "DotsOCRVLLMConfig":
        raw_output = self._infer_vllm(img, prompt, max_tokens)
    elif config_type == "DotsOCRAPIConfig":
        raw_output = self._infer_api(img, prompt, max_tokens)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse output
    return self._parse_output(
        raw_output,
        img.size,
        output_format,
        include_layout,
    )

DotsOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Dots OCR.

Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
        model="rednote-hilab/dots.ocr",
        device="cuda",
        torch_dtype="bfloat16",
    )
extractor = DotsOCRTextExtractor(backend=config)

DotsOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for Dots OCR.

VLLM provides high-throughput inference with optimizations like:
  • PagedAttention for efficient KV cache management
  • Continuous batching for higher throughput
  • Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
        model="rednote-hilab/dots.ocr",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
    )
extractor = DotsOCRTextExtractor(backend=config)

api

API backend configuration for Dots OCR (VLLM online server).

DotsOCRAPIConfig

Bases: BaseModel

API backend configuration for Dots OCR.

This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

config = DotsOCRAPIConfig(
        model="dotsocr",
        api_base="https://your-modal-app.modal.run/v1",
        api_key="optional-key",
    )
extractor = DotsOCRTextExtractor(backend=config)

extractor

Dots OCR text extractor with layout-aware extraction.

A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
    print(f"{elem.category}: {elem.bbox}")

DotsOCRTextExtractor

DotsOCRTextExtractor(backend: DotsOCRBackendConfig)

Bases: BaseTextExtractor

Dots OCR Vision-Language Model text extractor with layout detection.

Extracts text from document images with layout information including:
  • 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
  • Bounding boxes (normalized to 0-1024)
  • Multi-format text (Markdown, LaTeX, HTML)
  • Reading order preservation

Supports PyTorch, VLLM, and API backends.

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)

Initialize Dots OCR text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
  • DotsOCRVLLMConfig: VLLM high-throughput backend
  • DotsOCRAPIConfig: API backend (online VLLM server)

TYPE: DotsOCRBackendConfig

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def __init__(self, backend: DotsOCRBackendConfig):
    """
    Initialize Dots OCR text extractor.

    Args:
        backend: Backend configuration. One of:
            - DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
            - DotsOCRVLLMConfig: VLLM high-throughput backend
            - DotsOCRAPIConfig: API backend (online VLLM server)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._model: Any = None
    self._loaded = False

    # Load model
    self._load_model()
extract

extract

    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput

Extract text from image using Dots OCR.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown", "html", or "json")

TYPE: Literal['markdown', 'html', 'json'] DEFAULT: 'markdown'

include_layout

Include layout bounding boxes in output

TYPE: bool DEFAULT: False

custom_prompt

Override default extraction prompt

TYPE: Optional[str] DEFAULT: None

max_tokens

Maximum tokens for generation

TYPE: int DEFAULT: 8192

RETURNS DESCRIPTION
DotsOCRTextOutput

DotsOCRTextOutput with extracted content and optional layout

RAISES DESCRIPTION
RuntimeError

If model is not loaded or inference fails

Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["markdown", "html", "json"] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput:
    """
    Extract text from image using Dots OCR.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown", "html", or "json")
        include_layout: Include layout bounding boxes in output
        custom_prompt: Override default extraction prompt
        max_tokens: Maximum tokens for generation

    Returns:
        DotsOCRTextOutput with extracted content and optional layout

    Raises:
        RuntimeError: If model is not loaded or inference fails
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    img = self._prepare_image(image)

    # Get prompt
    prompt = custom_prompt or DOTS_OCR_PROMPT

    # Run inference based on backend
    config_type = type(self.backend_config).__name__

    if config_type == "DotsOCRPyTorchConfig":
        raw_output = self._infer_pytorch(img, prompt, max_tokens)
    elif config_type == "DotsOCRVLLMConfig":
        raw_output = self._infer_vllm(img, prompt, max_tokens)
    elif config_type == "DotsOCRAPIConfig":
        raw_output = self._infer_api(img, prompt, max_tokens)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse output
    return self._parse_output(
        raw_output,
        img.size,
        output_format,
        include_layout,
    )

pytorch

PyTorch backend configuration for Dots OCR.

DotsOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Dots OCR.

Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(
        model="rednote-hilab/dots.ocr",
        device="cuda",
        torch_dtype="bfloat16",
    )
extractor = DotsOCRTextExtractor(backend=config)

vllm

VLLM backend configuration for Dots OCR.

DotsOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for Dots OCR.

VLLM provides high-throughput inference with optimizations like:
  • PagedAttention for efficient KV cache management
  • Continuous batching for higher throughput
  • Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
        model="rednote-hilab/dots.ocr",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
    )
extractor = DotsOCRTextExtractor(backend=config)

granitedocling

Granite Docling text extraction with multi-backend support.

GraniteDoclingTextAPIConfig

Bases: BaseModel

Configuration for Granite Docling text extraction via API.

Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.

API keys can be passed directly or read from environment variables.

Example
# OpenRouter
config = GraniteDoclingTextAPIConfig(
    model="openrouter/ibm-granite/granite-docling-258M",
)

GraniteDoclingTextExtractor

GraniteDoclingTextExtractor(
    backend: GraniteDoclingTextBackendConfig,
)

Bases: BaseTextExtractor

Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.

Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs the DocTags format, which is converted to Markdown using the docling_core library.

Example

from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)

config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)

Initialize Granite Docling extractor with backend configuration.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API config)

TYPE: GraniteDoclingTextBackendConfig

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def __init__(self, backend: GraniteDoclingTextBackendConfig):
    """
    Initialize Granite Docling extractor with backend configuration.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API config)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded: bool = False

    # Backend-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image using Granite Docling.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown" or "html")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image using Granite Docling.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown" or "html")

    Returns:
        TextOutput with extracted content
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Dispatch to backend-specific inference
    config_type = type(self.backend_config).__name__

    if config_type == "GraniteDoclingTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "GraniteDoclingTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "GraniteDoclingTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    elif config_type == "GraniteDoclingTextAPIConfig":
        raw_output = self._infer_api(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Convert DocTags to Markdown
    markdown_output = self._convert_doctags_to_markdown(raw_output, pil_image)

    # For HTML output, wrap in basic HTML structure
    if output_format == "html":
        content = f"<html><body>\n{markdown_output}\n</body></html>"
    else:
        content = markdown_output

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=self._extract_plain_text(markdown_output),
        image_width=width,
        image_height=height,
        model_name=f"Granite-Docling-258M ({config_type.replace('Config', '')})",
    )
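
The _convert_doctags_to_markdown step above relies on docling_core. A minimal sketch of what that conversion typically looks like, mirroring the upstream docling examples (the extractor's internal helper may differ in detail):

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# raw_output is the DocTags string produced by the model, pil_image the page image
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([raw_output], [pil_image])
doc = DoclingDocument.load_from_doctags(doctags_doc)
markdown = doc.export_to_markdown()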

GraniteDoclingTextMLXConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with MLX backend.

This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.

GraniteDoclingTextPyTorchConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with PyTorch backend.

GraniteDoclingTextVLLMConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with VLLM backend.

IMPORTANT: This config uses revision="untied" by default; the untied checkpoint is required for VLLM compatibility because the default Granite Docling weights are tied.

api

API backend configuration for Granite Docling text extraction.

Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).

GraniteDoclingTextAPIConfig

Bases: BaseModel

Configuration for Granite Docling text extraction via API.

Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.

API keys can be passed directly or read from environment variables.

Example
# OpenRouter
config = GraniteDoclingTextAPIConfig(
    model="openrouter/ibm-granite/granite-docling-258M",
)

extractor

Granite Docling text extractor with multi-backend support.

GraniteDoclingTextExtractor

GraniteDoclingTextExtractor(
    backend: GraniteDoclingTextBackendConfig,
)

Bases: BaseTextExtractor

Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.

Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.

Example

from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)

Initialize Granite Docling extractor with backend configuration.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API config)

TYPE: GraniteDoclingTextBackendConfig

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def __init__(self, backend: GraniteDoclingTextBackendConfig):
    """
    Initialize Granite Docling extractor with backend configuration.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API config)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded: bool = False

    # Backend-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    self._load_model()
extract
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image using Granite Docling.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ("markdown" or "html")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content

Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image using Granite Docling.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ("markdown" or "html")

    Returns:
        TextOutput with extracted content
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Dispatch to backend-specific inference
    config_type = type(self.backend_config).__name__

    if config_type == "GraniteDoclingTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "GraniteDoclingTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "GraniteDoclingTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    elif config_type == "GraniteDoclingTextAPIConfig":
        raw_output = self._infer_api(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Convert DocTags to Markdown
    markdown_output = self._convert_doctags_to_markdown(raw_output, pil_image)

    # For HTML output, wrap in basic HTML structure
    if output_format == "html":
        content = f"<html><body>\n{markdown_output}\n</body></html>"
    else:
        content = markdown_output

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=self._extract_plain_text(markdown_output),
        image_width=width,
        image_height=height,
        model_name=f"Granite-Docling-258M ({config_type.replace('Config', '')})",
    )

mlx

MLX backend configuration for Granite Docling text extraction (Apple Silicon).

GraniteDoclingTextMLXConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with MLX backend.

This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.

pytorch

PyTorch backend configuration for Granite Docling text extraction.

GraniteDoclingTextPyTorchConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with PyTorch backend.

vllm

VLLM backend configuration for Granite Docling text extraction.

GraniteDoclingTextVLLMConfig

Bases: BaseModel

Configuration for Granite Docling text extraction with VLLM backend.

IMPORTANT: This config uses revision="untied" by default; the untied checkpoint is required for VLLM compatibility because the default Granite Docling weights are tied.

mineruvl

MinerU VL text extraction module.

MinerU VL is a vision-language model for document layout detection and text/table/equation recognition. It performs two-step extraction:
  1. Layout Detection: Detect regions with types (text, table, equation, etc.)
  2. Content Recognition: Extract content from each detected region

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

# Initialize with PyTorch backend
extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)

# Extract text
result = extractor.extract(image)
print(result.content)

# Extract with detailed blocks
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
    print(f"{block.type}: {block.content[:50]}...")

MinerUVLTextAPIConfig

Bases: BaseModel

API backend config for MinerU VL text extraction.

Connects to a deployed VLLM server with OpenAI-compatible API.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextAPIConfig(
        server_url="https://your-server.modal.run"
    )
)
result = extractor.extract(image)

MinerUVLTextExtractor

MinerUVLTextExtractor(backend: MinerUVLTextBackendConfig)

Bases: BaseTextExtractor

MinerU VL text extractor with layout-aware extraction.

Performs two-step extraction:
  1. Layout detection (detect regions)
  2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:
  • PyTorch (HuggingFace Transformers)
  • VLLM (high-throughput GPU)
  • MLX (Apple Silicon)
  • API (VLLM OpenAI-compatible server)

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

print(result.content)  # Combined text + tables + equations
# Per-block details (types, bounding boxes) are available via extract_with_blocks()

Initialize MinerU VL text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API)

TYPE: MinerUVLTextBackendConfig

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def __init__(self, backend: MinerUVLTextBackendConfig):
    """
    Initialize MinerU VL text extractor.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API)
    """
    self.backend_config = backend
    self._client = None
    self._loaded = False
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text with layout-aware two-step extraction.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ('html' or 'markdown')

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content and metadata

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text with layout-aware two-step extraction.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ('html' or 'markdown')

    Returns:
        TextOutput with extracted content and metadata
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Step 1: Layout detection
    blocks = self._detect_layout(pil_image)

    # Step 2: Content extraction for each block
    blocks = self._extract_content(pil_image, blocks)

    # Post-process (OTSL to HTML for tables)
    blocks = simple_post_process(blocks)

    # Combine content
    content = self._combine_content(blocks, output_format)

    # Build raw output with blocks info
    raw_output = self._build_raw_output(blocks)

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

extract_with_blocks

extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]

Extract text and return both TextOutput and ContentBlocks.

This method provides access to the detailed block information including bounding boxes and block types.

PARAMETER DESCRIPTION
image

Input image

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
tuple[TextOutput, List[ContentBlock]]

Tuple of (TextOutput, List[ContentBlock])

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract_with_blocks(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]:
    """
    Extract text and return both TextOutput and ContentBlocks.

    This method provides access to the detailed block information
    including bounding boxes and block types.

    Args:
        image: Input image
        output_format: Output format

    Returns:
        Tuple of (TextOutput, List[ContentBlock])
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Two-step extraction
    blocks = self._detect_layout(pil_image)
    blocks = self._extract_content(pil_image, blocks)
    blocks = simple_post_process(blocks)

    content = self._combine_content(blocks, output_format)
    raw_output = self._build_raw_output(blocks)

    text_output = TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

    return text_output, blocks

MinerUVLTextMLXConfig

Bases: BaseModel

MLX backend config for MinerU VL text extraction on Apple Silicon.

Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextMLXConfig()
)
result = extractor.extract(image)

MinerUVLTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend config for MinerU VL text extraction.

Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

BlockType

Bases: str, Enum

MinerU VL block types (22 categories).

ContentBlock

Bases: BaseModel

A detected content block with type, bounding box, angle, and content.

Coordinates are normalized to [0, 1] range relative to image dimensions.

to_absolute

to_absolute(
    image_width: int, image_height: int
) -> List[int]

Convert normalized bbox to absolute pixel coordinates.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def to_absolute(self, image_width: int, image_height: int) -> List[int]:
    """Convert normalized bbox to absolute pixel coordinates."""
    x1, y1, x2, y2 = self.bbox
    return [
        int(x1 * image_width),
        int(y1 * image_height),
        int(x2 * image_width),
        int(y2 * image_height),
    ]

MinerUSamplingParams

MinerUSamplingParams(
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
)

Bases: SamplingParams

Default sampling parameters optimized for MinerU VL.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def __init__(
    self,
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
):
    super().__init__(
        temperature,
        top_p,
        top_k,
        presence_penalty,
        frequency_penalty,
        repetition_penalty,
        no_repeat_ngram_size,
        max_new_tokens,
    )
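
The defaults above (temperature 0.0, top_k 1, top_p 0.01) amount to near-greedy decoding, which suits deterministic document extraction. Individual fields can still be overridden, for example to cap generation length:

params = MinerUSamplingParams(max_new_tokens=2048)  # keep greedy defaults, limit output length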

SamplingParams dataclass

SamplingParams(
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    no_repeat_ngram_size: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
)

Sampling parameters for text generation.

MinerUVLTextVLLMConfig

Bases: BaseModel

VLLM backend config for MinerU VL text extraction.

Uses VLLM for high-throughput GPU inference with:
  • PagedAttention for efficient KV cache
  • Continuous batching
  • Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )
)
result = extractor.extract(image)

convert_otsl_to_html

convert_otsl_to_html(otsl_content: str) -> str

Convert OTSL table format to HTML.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def convert_otsl_to_html(otsl_content: str) -> str:
    """Convert OTSL table format to HTML."""
    if otsl_content.startswith("<table") and otsl_content.endswith("</table>"):
        return otsl_content

    pattern = r"(" + r"|".join(ALL_OTSL_TOKENS) + r")"
    tokens = re.findall(pattern, otsl_content)
    text_parts = re.split(pattern, otsl_content)
    text_parts = [part for part in text_parts if part.strip()]

    split_row_tokens = [list(y) for x, y in itertools.groupby(tokens, lambda z: z == OTSL_NL) if not x]
    if not split_row_tokens:
        return ""

    max_cols = max(len(row) for row in split_row_tokens)
    for row in split_row_tokens:
        while len(row) < max_cols:
            row.append(OTSL_ECEL)

    def count_right(tokens_grid, c, r, which_tokens):
        span = 0
        c_iter = c
        while c_iter < len(tokens_grid[r]) and tokens_grid[r][c_iter] in which_tokens:
            c_iter += 1
            span += 1
        return span

    def count_down(tokens_grid, c, r, which_tokens):
        span = 0
        r_iter = r
        while r_iter < len(tokens_grid) and tokens_grid[r_iter][c] in which_tokens:
            r_iter += 1
            span += 1
        return span

    table_cells = []
    r_idx = 0
    c_idx = 0

    for i, text in enumerate(text_parts):
        if text in [OTSL_FCEL, OTSL_ECEL]:
            row_span = 1
            col_span = 1
            cell_text = ""
            right_offset = 1

            if text != OTSL_ECEL and i + 1 < len(text_parts):
                next_text = text_parts[i + 1]
                if next_text not in ALL_OTSL_TOKENS:
                    cell_text = next_text
                    right_offset = 2

            if i + right_offset < len(text_parts):
                next_right = text_parts[i + right_offset]
                if next_right in [OTSL_LCEL, OTSL_XCEL]:
                    col_span += count_right(split_row_tokens, c_idx + 1, r_idx, [OTSL_LCEL, OTSL_XCEL])

            if r_idx + 1 < len(split_row_tokens) and c_idx < len(split_row_tokens[r_idx + 1]):
                next_bottom = split_row_tokens[r_idx + 1][c_idx]
                if next_bottom in [OTSL_UCEL, OTSL_XCEL]:
                    row_span += count_down(split_row_tokens, c_idx, r_idx + 1, [OTSL_UCEL, OTSL_XCEL])

            table_cells.append(
                {
                    "text": cell_text.strip(),
                    "row_span": row_span,
                    "col_span": col_span,
                    "start_row": r_idx,
                    "start_col": c_idx,
                }
            )

        if text in [OTSL_FCEL, OTSL_ECEL, OTSL_LCEL, OTSL_UCEL, OTSL_XCEL]:
            c_idx += 1
        if text == OTSL_NL:
            r_idx += 1
            c_idx = 0

    num_rows = len(split_row_tokens)
    num_cols = max_cols
    grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]

    for cell in table_cells:
        for i in range(cell["start_row"], min(cell["start_row"] + cell["row_span"], num_rows)):
            for j in range(cell["start_col"], min(cell["start_col"] + cell["col_span"], num_cols)):
                grid[i][j] = cell

    html_parts = []
    for i in range(num_rows):
        html_parts.append("<tr>")
        for j in range(num_cols):
            cell = grid[i][j]
            if cell is None:
                continue
            if cell["start_row"] != i or cell["start_col"] != j:
                continue

            content = html.escape(cell["text"])
            tag = "td"
            parts = [f"<{tag}"]
            if cell["row_span"] > 1:
                parts.append(f' rowspan="{cell["row_span"]}"')
            if cell["col_span"] > 1:
                parts.append(f' colspan="{cell["col_span"]}"')
            parts.append(f">{content}</{tag}>")
            html_parts.append("".join(parts))
        html_parts.append("</tr>")

    return f"<table>{''.join(html_parts)}</table>"

parse_layout_output

parse_layout_output(output: str) -> List[ContentBlock]

Parse layout detection model output into ContentBlocks.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def parse_layout_output(output: str) -> List[ContentBlock]:
    """Parse layout detection model output into ContentBlocks."""
    blocks = []
    for line in output.split("\n"):
        match = re.match(LAYOUT_REGEX, line)
        if not match:
            continue
        x1, y1, x2, y2, ref_type, tail = match.groups()
        bbox = convert_bbox((x1, y1, x2, y2))
        if bbox is None:
            continue
        ref_type = ref_type.lower()
        if ref_type not in BLOCK_TYPES:
            continue
        angle = parse_angle(tail)
        blocks.append(
            ContentBlock(
                type=BlockType(ref_type),
                bbox=bbox,
                angle=angle,
            )
        )
    return blocks

api

API backend configuration for MinerU VL text extraction.

MinerUVLTextAPIConfig

Bases: BaseModel

API backend config for MinerU VL text extraction.

Connects to a deployed VLLM server with OpenAI-compatible API.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextAPIConfig(
        server_url="https://your-server.modal.run"
    )
)
result = extractor.extract(image)

extractor

MinerU VL text extractor with layout-aware two-step extraction.

MinerU VL performs document extraction in two steps:
  1. Layout Detection: Detect regions with types (text, table, equation, etc.)
  2. Content Recognition: Extract text/table/equation content from each region

MinerUVLTextExtractor

MinerUVLTextExtractor(backend: MinerUVLTextBackendConfig)

Bases: BaseTextExtractor

MinerU VL text extractor with layout-aware extraction.

Performs two-step extraction:
  1. Layout detection (detect regions)
  2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:
  • PyTorch (HuggingFace Transformers)
  • VLLM (high-throughput GPU)
  • MLX (Apple Silicon)
  • API (VLLM OpenAI-compatible server)

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

print(result.content)  # Combined text + tables + equations
# Per-block details (types, bounding boxes) are available via extract_with_blocks()

Initialize MinerU VL text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API)

TYPE: MinerUVLTextBackendConfig

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def __init__(self, backend: MinerUVLTextBackendConfig):
    """
    Initialize MinerU VL text extractor.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API)
    """
    self.backend_config = backend
    self._client = None
    self._loaded = False
    self._load_model()
extract
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text with layout-aware two-step extraction.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ('html' or 'markdown')

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content and metadata

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text with layout-aware two-step extraction.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ('html' or 'markdown')

    Returns:
        TextOutput with extracted content and metadata
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Step 1: Layout detection
    blocks = self._detect_layout(pil_image)

    # Step 2: Content extraction for each block
    blocks = self._extract_content(pil_image, blocks)

    # Post-process (OTSL to HTML for tables)
    blocks = simple_post_process(blocks)

    # Combine content
    content = self._combine_content(blocks, output_format)

    # Build raw output with blocks info
    raw_output = self._build_raw_output(blocks)

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )
extract_with_blocks
extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]

Extract text and return both TextOutput and ContentBlocks.

This method provides access to the detailed block information including bounding boxes and block types.

PARAMETER DESCRIPTION
image

Input image

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
tuple[TextOutput, List[ContentBlock]]

Tuple of (TextOutput, List[ContentBlock])

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract_with_blocks(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]:
    """
    Extract text and return both TextOutput and ContentBlocks.

    This method provides access to the detailed block information
    including bounding boxes and block types.

    Args:
        image: Input image
        output_format: Output format

    Returns:
        Tuple of (TextOutput, List[ContentBlock])
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Two-step extraction
    blocks = self._detect_layout(pil_image)
    blocks = self._extract_content(pil_image, blocks)
    blocks = simple_post_process(blocks)

    content = self._combine_content(blocks, output_format)
    raw_output = self._build_raw_output(blocks)

    text_output = TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

    return text_output, blocks

mlx

MLX backend configuration for MinerU VL text extraction (Apple Silicon).

MinerUVLTextMLXConfig

Bases: BaseModel

MLX backend config for MinerU VL text extraction on Apple Silicon.

Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextMLXConfig()
)
result = extractor.extract(image)

pytorch

PyTorch/HuggingFace backend configuration for MinerU VL text extraction.

MinerUVLTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend config for MinerU VL text extraction.

Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

utils

MinerU VL utilities for document extraction.

Contains data structures, parsing, prompts, and post-processing functions for MinerU VL document extraction pipeline.

This file contains code adapted from mineru-vl-utils

https://github.com/opendatalab/mineru-vl-utils
https://pypi.org/project/mineru-vl-utils/

The original mineru-vl-utils is licensed under AGPL-3.0 (Copyright (c) OpenDataLab): https://github.com/opendatalab/mineru-vl-utils/blob/main/LICENSE.md

Adapted components
  • BlockType enum (from structs.py)
  • ContentBlock data structure (from structs.py)
  • OTSL to HTML table conversion (from post_process/otsl2html.py)

BlockType

Bases: str, Enum

MinerU VL block types (22 categories).

ContentBlock

Bases: BaseModel

A detected content block with type, bounding box, angle, and content.

Coordinates are normalized to [0, 1] range relative to image dimensions.

to_absolute
to_absolute(
    image_width: int, image_height: int
) -> List[int]

Convert normalized bbox to absolute pixel coordinates.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def to_absolute(self, image_width: int, image_height: int) -> List[int]:
    """Convert normalized bbox to absolute pixel coordinates."""
    x1, y1, x2, y2 = self.bbox
    return [
        int(x1 * image_width),
        int(y1 * image_height),
        int(x2 * image_width),
        int(y2 * image_height),
    ]

SamplingParams dataclass

SamplingParams(
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    no_repeat_ngram_size: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
)

Sampling parameters for text generation.

MinerUSamplingParams

MinerUSamplingParams(
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
)

Bases: SamplingParams

Default sampling parameters optimized for MinerU VL.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def __init__(
    self,
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
):
    super().__init__(
        temperature,
        top_p,
        top_k,
        presence_penalty,
        frequency_penalty,
        repetition_penalty,
        no_repeat_ngram_size,
        max_new_tokens,
    )

convert_bbox

convert_bbox(bbox: Sequence) -> Optional[List[float]]

Convert bbox from model output (0-1000) to normalized format (0-1).

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def convert_bbox(bbox: Sequence) -> Optional[List[float]]:
    """Convert bbox from model output (0-1000) to normalized format (0-1)."""
    bbox = tuple(map(int, bbox))
    if any(coord < 0 or coord > 1000 for coord in bbox):
        return None
    x1, y1, x2, y2 = bbox
    x1, x2 = (x2, x1) if x2 < x1 else (x1, x2)
    y1, y2 = (y2, y1) if y2 < y1 else (y1, y2)
    if x1 == x2 or y1 == y2:
        return None
    return [coord / 1000.0 for coord in (x1, y1, x2, y2)]
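
A few concrete inputs and outputs, following the source above:

print(convert_bbox((100, 200, 500, 800)))   # [0.1, 0.2, 0.5, 0.8]
print(convert_bbox((500, 200, 100, 800)))   # swapped x coordinates reordered -> [0.1, 0.2, 0.5, 0.8]
print(convert_bbox((0, 0, 0, 500)))         # zero-width box -> None
print(convert_bbox((0, 0, 1200, 500)))      # coordinate outside 0-1000 -> None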

parse_angle

parse_angle(tail: str) -> Literal[None, 0, 90, 180, 270]

Parse rotation angle from model output tail string.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def parse_angle(tail: str) -> Literal[None, 0, 90, 180, 270]:
    """Parse rotation angle from model output tail string."""
    for token, angle in ANGLE_MAPPING.items():
        if token in tail:
            return angle
    return None

parse_layout_output

parse_layout_output(output: str) -> List[ContentBlock]

Parse layout detection model output into ContentBlocks.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def parse_layout_output(output: str) -> List[ContentBlock]:
    """Parse layout detection model output into ContentBlocks."""
    blocks = []
    for line in output.split("\n"):
        match = re.match(LAYOUT_REGEX, line)
        if not match:
            continue
        x1, y1, x2, y2, ref_type, tail = match.groups()
        bbox = convert_bbox((x1, y1, x2, y2))
        if bbox is None:
            continue
        ref_type = ref_type.lower()
        if ref_type not in BLOCK_TYPES:
            continue
        angle = parse_angle(tail)
        blocks.append(
            ContentBlock(
                type=BlockType(ref_type),
                bbox=bbox,
                angle=angle,
            )
        )
    return blocks

get_rgb_image

get_rgb_image(image: Image) -> Image.Image

Convert image to RGB mode.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def get_rgb_image(image: Image.Image) -> Image.Image:
    """Convert image to RGB mode."""
    if image.mode == "P":
        image = image.convert("RGBA")
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image

prepare_for_layout

prepare_for_layout(
    image: Image,
    layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image

Prepare image for layout detection.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def prepare_for_layout(
    image: Image.Image,
    layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image:
    """Prepare image for layout detection."""
    image = get_rgb_image(image)
    image = image.resize(layout_size, Image.Resampling.BICUBIC)
    return image

resize_by_need

resize_by_need(
    image: Image, min_edge: int = 28, max_ratio: float = 50
) -> Image.Image

Resize image if needed based on aspect ratio constraints.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def resize_by_need(
    image: Image.Image,
    min_edge: int = 28,
    max_ratio: float = 50,
) -> Image.Image:
    """Resize image if needed based on aspect ratio constraints."""
    edge_ratio = max(image.size) / min(image.size)
    if edge_ratio > max_ratio:
        width, height = image.size
        if width > height:
            new_w, new_h = width, math.ceil(width / max_ratio)
        else:
            new_w, new_h = math.ceil(height / max_ratio), height
        new_image = Image.new(image.mode, (new_w, new_h), (255, 255, 255))
        new_image.paste(image, (int((new_w - width) / 2), int((new_h - height) / 2)))
        image = new_image
    if min(image.size) < min_edge:
        scale = min_edge / min(image.size)
        new_w, new_h = round(image.width * scale), round(image.height * scale)
        image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
    return image
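
Two worked cases for resize_by_need, derived directly from the source above:

from PIL import Image

# A 3000x20 strip exceeds the 50:1 aspect limit, so it is pasted onto a
# 3000x60 white canvas (3000 / 50 = 60); the short edge is then already >= 28.
strip = Image.new("RGB", (3000, 20), (0, 0, 0))
print(resize_by_need(strip).size)  # (3000, 60)

# A 10x10 crop is below min_edge=28, so it is upscaled to 28x28.
tiny = Image.new("RGB", (10, 10))
print(resize_by_need(tiny).size)   # (28, 28)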

prepare_for_extract

prepare_for_extract(
    image: Image,
    blocks: List[ContentBlock],
    prompts: Dict[str, str] = None,
    sampling_params: Dict[str, SamplingParams] = None,
    skip_types: set = None,
) -> Tuple[
    List[Image.Image],
    List[str],
    List[SamplingParams],
    List[int],
]

Prepare cropped images for content extraction.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def prepare_for_extract(
    image: Image.Image,
    blocks: List[ContentBlock],
    prompts: Dict[str, str] = None,
    sampling_params: Dict[str, SamplingParams] = None,
    skip_types: set = None,
) -> Tuple[List[Image.Image], List[str], List[SamplingParams], List[int]]:
    """Prepare cropped images for content extraction."""
    if prompts is None:
        prompts = DEFAULT_PROMPTS
    if sampling_params is None:
        sampling_params = DEFAULT_SAMPLING_PARAMS
    if skip_types is None:
        skip_types = {"image", "list", "equation_block"}

    image = get_rgb_image(image)
    width, height = image.size

    block_images = []
    prompt_list = []
    params_list = []
    indices = []

    for idx, block in enumerate(blocks):
        if block.type.value in skip_types:
            continue

        x1, y1, x2, y2 = block.bbox
        scaled_bbox = (x1 * width, y1 * height, x2 * width, y2 * height)
        block_image = image.crop(scaled_bbox)

        if block_image.width < 1 or block_image.height < 1:
            continue

        if block.angle in [90, 180, 270]:
            block_image = block_image.rotate(block.angle, expand=True)

        block_image = resize_by_need(block_image)
        block_images.append(block_image)

        block_type = block.type.value
        prompt = prompts.get(block_type) or prompts.get("[default]")
        prompt_list.append(prompt)

        params = sampling_params.get(block_type) or sampling_params.get("[default]")
        params_list.append(params)
        indices.append(idx)

    return block_images, prompt_list, params_list, indices

convert_otsl_to_html

convert_otsl_to_html(otsl_content: str) -> str

Convert OTSL table format to HTML.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def convert_otsl_to_html(otsl_content: str) -> str:
    """Convert OTSL table format to HTML."""
    if otsl_content.startswith("<table") and otsl_content.endswith("</table>"):
        return otsl_content

    pattern = r"(" + r"|".join(ALL_OTSL_TOKENS) + r")"
    tokens = re.findall(pattern, otsl_content)
    text_parts = re.split(pattern, otsl_content)
    text_parts = [part for part in text_parts if part.strip()]

    split_row_tokens = [list(y) for x, y in itertools.groupby(tokens, lambda z: z == OTSL_NL) if not x]
    if not split_row_tokens:
        return ""

    max_cols = max(len(row) for row in split_row_tokens)
    for row in split_row_tokens:
        while len(row) < max_cols:
            row.append(OTSL_ECEL)

    def count_right(tokens_grid, c, r, which_tokens):
        span = 0
        c_iter = c
        while c_iter < len(tokens_grid[r]) and tokens_grid[r][c_iter] in which_tokens:
            c_iter += 1
            span += 1
        return span

    def count_down(tokens_grid, c, r, which_tokens):
        span = 0
        r_iter = r
        while r_iter < len(tokens_grid) and tokens_grid[r_iter][c] in which_tokens:
            r_iter += 1
            span += 1
        return span

    table_cells = []
    r_idx = 0
    c_idx = 0

    for i, text in enumerate(text_parts):
        if text in [OTSL_FCEL, OTSL_ECEL]:
            row_span = 1
            col_span = 1
            cell_text = ""
            right_offset = 1

            if text != OTSL_ECEL and i + 1 < len(text_parts):
                next_text = text_parts[i + 1]
                if next_text not in ALL_OTSL_TOKENS:
                    cell_text = next_text
                    right_offset = 2

            if i + right_offset < len(text_parts):
                next_right = text_parts[i + right_offset]
                if next_right in [OTSL_LCEL, OTSL_XCEL]:
                    col_span += count_right(split_row_tokens, c_idx + 1, r_idx, [OTSL_LCEL, OTSL_XCEL])

            if r_idx + 1 < len(split_row_tokens) and c_idx < len(split_row_tokens[r_idx + 1]):
                next_bottom = split_row_tokens[r_idx + 1][c_idx]
                if next_bottom in [OTSL_UCEL, OTSL_XCEL]:
                    row_span += count_down(split_row_tokens, c_idx, r_idx + 1, [OTSL_UCEL, OTSL_XCEL])

            table_cells.append(
                {
                    "text": cell_text.strip(),
                    "row_span": row_span,
                    "col_span": col_span,
                    "start_row": r_idx,
                    "start_col": c_idx,
                }
            )

        if text in [OTSL_FCEL, OTSL_ECEL, OTSL_LCEL, OTSL_UCEL, OTSL_XCEL]:
            c_idx += 1
        if text == OTSL_NL:
            r_idx += 1
            c_idx = 0

    num_rows = len(split_row_tokens)
    num_cols = max_cols
    grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]

    for cell in table_cells:
        for i in range(cell["start_row"], min(cell["start_row"] + cell["row_span"], num_rows)):
            for j in range(cell["start_col"], min(cell["start_col"] + cell["col_span"], num_cols)):
                grid[i][j] = cell

    html_parts = []
    for i in range(num_rows):
        html_parts.append("<tr>")
        for j in range(num_cols):
            cell = grid[i][j]
            if cell is None:
                continue
            if cell["start_row"] != i or cell["start_col"] != j:
                continue

            content = html.escape(cell["text"])
            tag = "td"
            parts = [f"<{tag}"]
            if cell["row_span"] > 1:
                parts.append(f' rowspan="{cell["row_span"]}"')
            if cell["col_span"] > 1:
                parts.append(f' colspan="{cell["col_span"]}"')
            parts.append(f">{content}</{tag}>")
            html_parts.append("".join(parts))
        html_parts.append("</tr>")

    return f"<table>{''.join(html_parts)}</table>"

simple_post_process

simple_post_process(
    blocks: List[ContentBlock],
) -> List[ContentBlock]

Simple post-processing: convert OTSL tables to HTML.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def simple_post_process(blocks: List[ContentBlock]) -> List[ContentBlock]:
    """Simple post-processing: convert OTSL tables to HTML."""
    for block in blocks:
        if block.type == BlockType.TABLE and block.content:
            try:
                block.content = convert_otsl_to_html(block.content)
            except Exception:
                pass
    return blocks

vllm

VLLM backend configuration for MinerU VL text extraction.

MinerUVLTextVLLMConfig

Bases: BaseModel

VLLM backend config for MinerU VL text extraction.

Uses VLLM for high-throughput GPU inference with:
  • PagedAttention for efficient KV cache
  • Continuous batching
  • Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )
)
result = extractor.extract(image)

models

Pydantic models for text extraction outputs.

Defines output types and format enums for text extraction.

OutputFormat

Bases: str, Enum

Supported text extraction output formats.

Each format has different characteristics
  • HTML: Structured with div elements, preserves layout semantics
  • MARKDOWN: Portable, human-readable, good for documentation
  • JSON: Structured data with layout information (Dots OCR)

TextOutput

Bases: BaseModel

Text extraction output from a document image.

Contains the extracted text content in the requested format, along with optional raw output and plain text versions.

Example
result = extractor.extract(image, output_format="markdown")
print(result.content)  # Clean markdown
print(result.plain_text)  # Plain text without formatting

content_length property

content_length: int

Length of the extracted content in characters.

word_count property

word_count: int

Approximate word count of the plain text.
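
These properties make quick sanity checks easy; a short sketch (assuming the extractor populated plain_text):

result = extractor.extract(image, output_format="markdown")
if result.content_length == 0:
    print("No text extracted")
else:
    print(f"{result.word_count} words, {result.content_length} characters")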

LayoutElement

Bases: BaseModel

Single layout element from document layout detection.

Represents a detected region in the document with its bounding box, category label, and extracted text content.

ATTRIBUTE DESCRIPTION
bbox

Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024)

TYPE: List[int]

category

Layout category (e.g., "Text", "Title", "Table", "Formula")

TYPE: str

text

Extracted text content (None for pictures)

TYPE: Optional[str]

confidence

Detection confidence score (optional)

TYPE: Optional[float]

DotsOCRTextOutput

Bases: BaseModel

Text extraction output from Dots OCR with layout information.

Dots OCR provides structured output with:
  • Layout detection (11 categories)
  • Bounding boxes (normalized to 0-1024)
  • Multi-format text (Markdown/LaTeX/HTML)
  • Reading order preservation

Layout Categories

Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title

Text Formatting
  • Text/Title/Section-header: Markdown
  • Formula: LaTeX
  • Table: HTML
  • Picture: (text omitted)
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
result = extractor.extract(image, include_layout=True)
print(result.content)  # Full text with formatting
for elem in result.layout:
        print(f"{elem.category}: {elem.bbox}")

num_layout_elements property

num_layout_elements: int

Number of detected layout elements.

content_length property

content_length: int

Length of extracted content in characters.

nanonets

Nanonets OCR2-3B backend configurations and extractor for text extraction.

Available backends
  • NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
  • NanonetsTextVLLMConfig: VLLM high-throughput backend
  • NanonetsTextMLXConfig: MLX backend for Apple Silicon
Example
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
config = NanonetsTextPyTorchConfig()

NanonetsTextExtractor

NanonetsTextExtractor(backend: NanonetsTextBackendConfig)

Bases: BaseTextExtractor

Nanonets OCR2-3B Vision-Language Model text extractor.

Extracts text from document images with support for:
  • Tables (output as HTML)
  • Equations (output as LaTeX)
  • Image captions (wrapped in tags)
  • Watermarks (wrapped in tags)
  • Page numbers (wrapped in tags)
  • Checkboxes (using ☐ and ☑ symbols)

Supports PyTorch, VLLM, and MLX backends.

Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
        backend=NanonetsTextPyTorchConfig()
    )

# Extract text
result = extractor.extract(image)
print(result.content)

Initialize Nanonets text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon

TYPE: NanonetsTextBackendConfig

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def __init__(self, backend: NanonetsTextBackendConfig):
    """
    Initialize Nanonets text extractor.

    Args:
        backend: Backend configuration. One of:
            - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
            - NanonetsTextVLLMConfig: VLLM high-throughput backend
            - NanonetsTextMLXConfig: MLX backend for Apple Silicon
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    # MLX-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.

PARAMETER DESCRIPTION
image

Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format is not supported

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Note: Nanonets OCR2 produces a unified output format that includes
    tables as HTML and equations as LaTeX inline. The output_format
    parameter is accepted for API compatibility but does not change
    the output structure.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "NanonetsTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "NanonetsTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "NanonetsTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    cleaned_output = raw_output.replace("<|im_end|>", "").strip()

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned_output,
        image_width=width,
        image_height=height,
        model_name=f"Nanonets-OCR2-3B ({type(self.backend_config).__name__})",
    )

NanonetsTextMLXConfig

Bases: BaseModel

MLX backend configuration for Nanonets OCR2-3B text extraction.

This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm

Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Example
config = NanonetsTextMLXConfig(
        model="mlx-community/Nanonets-OCR2-3B-bf16",
    )

NanonetsTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.

This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate

Example
config = NanonetsTextPyTorchConfig(
        device="cuda",
        torch_dtype="float16",
    )

NanonetsTextVLLMConfig

Bases: BaseModel

VLLM backend configuration for Nanonets OCR2-3B text extraction.

This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils

Example
config = NanonetsTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )
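
Because the VLLM backend targets batch workloads, it pairs naturally with batch_extract from the base class. A sketch, assuming page_images is a list of PIL images:

extractor = NanonetsTextExtractor(
    backend=NanonetsTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )
)

def on_progress(done: int, total: int) -> None:
    print(f"{done}/{total} pages")

results = extractor.batch_extract(page_images, progress_callback=on_progress)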

extractor

Nanonets OCR2-3B text extractor.

A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.

Supports PyTorch, VLLM, and MLX backends.

Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

extractor = NanonetsTextExtractor(
        backend=NanonetsTextPyTorchConfig()
    )
result = extractor.extract(image)
print(result.content)

NanonetsTextExtractor

NanonetsTextExtractor(backend: NanonetsTextBackendConfig)

Bases: BaseTextExtractor

Nanonets OCR2-3B Vision-Language Model text extractor.

Extracts text from document images with support for:
  • Tables (output as HTML)
  • Equations (output as LaTeX)
  • Image captions (wrapped in tags)
  • Watermarks (wrapped in tags)
  • Page numbers (wrapped in tags)
  • Checkboxes (using ☐ and ☑ symbols)

Supports PyTorch, VLLM, and MLX backends.

Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
        backend=NanonetsTextPyTorchConfig()
    )

# Extract text
result = extractor.extract(image)
print(result.content)

Initialize Nanonets text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon

TYPE: NanonetsTextBackendConfig

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def __init__(self, backend: NanonetsTextBackendConfig):
    """
    Initialize Nanonets text extractor.

    Args:
        backend: Backend configuration. One of:
            - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
            - NanonetsTextVLLMConfig: VLLM high-throughput backend
            - NanonetsTextMLXConfig: MLX backend for Apple Silicon
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    # MLX-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.

PARAMETER DESCRIPTION
image

Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format is not supported

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Note: Nanonets OCR2 produces a unified output format that includes
    tables as HTML and equations as LaTeX inline. The output_format
    parameter is accepted for API compatibility but does not change
    the output structure.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "NanonetsTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "NanonetsTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "NanonetsTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    cleaned_output = raw_output.replace("<|im_end|>", "").strip()

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned_output,
        image_width=width,
        image_height=height,
        model_name=f"Nanonets-OCR2-3B ({type(self.backend_config).__name__})",
    )

mlx

MLX backend configuration for Nanonets OCR2-3B text extraction.

NanonetsTextMLXConfig

Bases: BaseModel

MLX backend configuration for Nanonets OCR2-3B text extraction.

This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS with M1 or later chips. Requires: mlx, mlx-vlm

Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Example
config = NanonetsTextMLXConfig(
        model="mlx-community/Nanonets-OCR2-3B-bf16",
    )

pytorch

PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.

NanonetsTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.

This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate

Example
config = NanonetsTextPyTorchConfig(
        device="cuda",
        torch_dtype="float16",
    )

vllm

VLLM backend configuration for Nanonets OCR2-3B text extraction.

NanonetsTextVLLMConfig

Bases: BaseModel

VLLM backend configuration for Nanonets OCR2-3B text extraction.

This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils

Example
config = NanonetsTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )

qwen

Qwen3-VL backend configurations and extractor for text extraction.

Available backends
  • QwenTextPyTorchConfig: PyTorch/HuggingFace backend
  • QwenTextVLLMConfig: VLLM high-throughput backend
  • QwenTextMLXConfig: MLX backend for Apple Silicon
  • QwenTextAPIConfig: API backend (OpenRouter, etc.)
Example
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
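
All four backend configs plug into the same QwenTextExtractor constructor; a minimal sketch of selecting one at runtime (the OMNIDOCS_QWEN_BACKEND variable name is illustrative, and the configs are assumed to be importable from omnidocs.tasks.text_extraction.qwen):

import os

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import (
    QwenTextAPIConfig,
    QwenTextMLXConfig,
    QwenTextPyTorchConfig,
    QwenTextVLLMConfig,
)

# Any of these configs can be passed as the extractor's backend.
backends = {
    "pytorch": lambda: QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct"),
    "vllm": lambda: QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct"),
    "mlx": lambda: QwenTextMLXConfig(model="mlx-community/Qwen3-VL-8B-Instruct-4bit"),
    "api": lambda: QwenTextAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct"),
}
choice = os.environ.get("OMNIDOCS_QWEN_BACKEND", "pytorch")
extractor = QwenTextExtractor(backend=backends[choice]())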

QwenTextAPIConfig

Bases: BaseModel

API backend configuration for Qwen text extraction.

Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.

API keys can be passed directly or read from environment variables.

Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
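
The API backend drops into the same extract call; a minimal sketch, assuming OPENROUTER_API_KEY is already set in the environment and "document.png" is a placeholder path:

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig

# The API key is read from the environment (OPENROUTER_API_KEY for OpenRouter).
config = QwenTextAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
extractor = QwenTextExtractor(backend=config)

result = extractor.extract("document.png", output_format="html")
print(result.content)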

QwenTextExtractor

QwenTextExtractor(backend: QwenTextBackendConfig)

Bases: BaseTextExtractor

Qwen3-VL Vision-Language Model text extractor.

Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
        backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
    )

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
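
The returned TextOutput carries metadata alongside the formatted content; a short sketch of inspecting it and saving the result (the output filename is a placeholder):

result = extractor.extract(image, output_format="markdown")

# Which model produced the output, and the size of the source image.
print(result.model_name)
print(result.image_width, result.image_height)

# Plain text with formatting stripped, next to the formatted content.
print(result.plain_text[:200])

# Persist the formatted Markdown.
with open("page.md", "w", encoding="utf-8") as f:
    f.write(result.content)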

Initialize Qwen text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)

TYPE: QwenTextBackendConfig

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def __init__(self, backend: QwenTextBackendConfig):
    """
    Initialize Qwen text extractor.

    Args:
        backend: Backend configuration. One of:
            - QwenTextPyTorchConfig: PyTorch/HuggingFace backend
            - QwenTextVLLMConfig: VLLM high-throughput backend
            - QwenTextMLXConfig: MLX backend for Apple Silicon
            - QwenTextAPIConfig: API backend (OpenRouter, etc.)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format or output_format is not supported

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML with div elements
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format or output_format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Get prompt for output format
    prompt = QWEN_PROMPTS[output_format]

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "QwenTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image, prompt)
    elif config_type == "QwenTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image, prompt)
    elif config_type == "QwenTextMLXConfig":
        raw_output = self._infer_mlx(pil_image, prompt)
    elif config_type == "QwenTextAPIConfig":
        raw_output = self._infer_api(pil_image, prompt)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    if output_format == "html":
        cleaned_output = _clean_html_output(raw_output)
    else:
        cleaned_output = _clean_markdown_output(raw_output)

    # Extract plain text
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"Qwen3-VL ({type(self.backend_config).__name__})",
    )

QwenTextMLXConfig

Bases: BaseModel

MLX backend configuration for Qwen text extraction.

This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS with M1 or later chips. Requires: mlx, mlx-vlm

Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Example
config = QwenTextMLXConfig(
        model="mlx-community/Qwen3-VL-8B-Instruct-4bit",
    )

QwenTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Qwen text extraction.

This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils

Example
config = QwenTextPyTorchConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        device="cuda",
        torch_dtype="bfloat16",
    )

QwenTextVLLMConfig

Bases: BaseModel

VLLM backend configuration for Qwen text extraction.

This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils

Example
config = QwenTextVLLMConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
    )

api

API backend configuration for Qwen3-VL text extraction.

Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).

QwenTextAPIConfig

Bases: BaseModel

API backend configuration for Qwen text extraction.

Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.

API keys can be passed directly or read from environment variables.

Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)

extractor

Qwen3-VL text extractor.

A Vision-Language Model for extracting text from document images as structured HTML or Markdown.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
        backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
    )
result = extractor.extract(image, output_format="markdown")
print(result.content)

QwenTextExtractor

QwenTextExtractor(backend: QwenTextBackendConfig)

Bases: BaseTextExtractor

Qwen3-VL Vision-Language Model text extractor.

Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
        backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
    )

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)

Initialize Qwen text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)

TYPE: QwenTextBackendConfig

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def __init__(self, backend: QwenTextBackendConfig):
    """
    Initialize Qwen text extractor.

    Args:
        backend: Backend configuration. One of:
            - QwenTextPyTorchConfig: PyTorch/HuggingFace backend
            - QwenTextVLLMConfig: VLLM high-throughput backend
            - QwenTextMLXConfig: MLX backend for Apple Silicon
            - QwenTextAPIConfig: API backend (OpenRouter, etc.)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format or output_format is not supported

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML with div elements
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format or output_format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Get prompt for output format
    prompt = QWEN_PROMPTS[output_format]

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "QwenTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image, prompt)
    elif config_type == "QwenTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image, prompt)
    elif config_type == "QwenTextMLXConfig":
        raw_output = self._infer_mlx(pil_image, prompt)
    elif config_type == "QwenTextAPIConfig":
        raw_output = self._infer_api(pil_image, prompt)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    if output_format == "html":
        cleaned_output = _clean_html_output(raw_output)
    else:
        cleaned_output = _clean_markdown_output(raw_output)

    # Extract plain text
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"Qwen3-VL ({type(self.backend_config).__name__})",
    )

mlx

MLX backend configuration for Qwen3-VL text extraction.

QwenTextMLXConfig

Bases: BaseModel

MLX backend configuration for Qwen text extraction.

This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS with M1 or later chips. Requires: mlx, mlx-vlm

Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Example
config = QwenTextMLXConfig(
        model="mlx-community/Qwen3-VL-8B-Instruct-4bit",
    )

pytorch

PyTorch/HuggingFace backend configuration for Qwen3-VL text extraction.

QwenTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Qwen text extraction.

This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils

Example
config = QwenTextPyTorchConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        device="cuda",
        torch_dtype="bfloat16",
    )

vllm

VLLM backend configuration for Qwen3-VL text extraction.

QwenTextVLLMConfig

Bases: BaseModel

VLLM backend configuration for Qwen text extraction.

This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils

Example
config = QwenTextVLLMConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
    )

vlm

VLM text extractor.

A provider-agnostic Vision-Language Model text extractor using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.

Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)

# With custom prompt
result = extractor.extract("document.png", prompt="Extract only table data as markdown")

VLMTextExtractor

VLMTextExtractor(config: VLMAPIConfig)

Bases: BaseTextExtractor

Provider-agnostic VLM text extractor using litellm.

Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.

Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)

# Default extraction
result = extractor.extract("document.png", output_format="markdown")

# Custom prompt
result = extractor.extract(
    "document.png",
    prompt="Extract only the table data as markdown",
)
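
Switching providers only changes the litellm model string; a minimal sketch, assuming the relevant API keys (e.g. GOOGLE_API_KEY, OPENROUTER_API_KEY) are set in the environment and reusing the OpenRouter model id shown for the Qwen API backend:

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Same extractor code path for every provider.
for model in ("gemini/gemini-2.5-flash", "openrouter/qwen/qwen3-vl-8b-instruct"):
    extractor = VLMTextExtractor(config=VLMAPIConfig(model=model))
    result = extractor.extract("document.png", output_format="markdown")
    print(model, "->", len(result.plain_text), "characters")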

Initialize VLM text extractor.

PARAMETER DESCRIPTION
config

VLM API configuration with model and provider details.

TYPE: VLMAPIConfig

Source code in omnidocs/tasks/text_extraction/vlm.py
def __init__(self, config: VLMAPIConfig):
    """
    Initialize VLM text extractor.

    Args:
        config: VLM API configuration with model and provider details.
    """
    self.config = config
    self._loaded = True

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
    prompt: Optional[str] = None,
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path).

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format ("html" or "markdown").

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

prompt

Custom prompt. If None, uses a task-specific default prompt.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content.

Source code in omnidocs/tasks/text_extraction/vlm.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
    prompt: Optional[str] = None,
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image (PIL Image, numpy array, or file path).
        output_format: Desired output format ("html" or "markdown").
        prompt: Custom prompt. If None, uses a task-specific default prompt.

    Returns:
        TextOutput containing extracted text content.
    """
    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    final_prompt = prompt or DEFAULT_PROMPTS[output_format]
    raw_output = vlm_completion(self.config, final_prompt, pil_image)
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=raw_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"VLM ({self.config.model})",
    )