
Overview

MinerU VL text extraction module.

MinerU VL is a vision-language model for document layout detection and text/table/equation recognition. It performs two-step extraction:

1. Layout Detection: Detect regions with types (text, table, equation, etc.)
2. Content Recognition: Extract content from each detected region

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

# Initialize with PyTorch backend
extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)

# Extract text
result = extractor.extract(image)
print(result.content)

# Extract with detailed blocks
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
    print(f"{block.type}: {block.content[:50]}...")


api

API backend configuration for MinerU VL text extraction.

MinerUVLTextAPIConfig

Bases: BaseModel

API backend config for MinerU VL text extraction.

Connects to a deployed VLLM server that exposes an OpenAI-compatible API.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextAPIConfig(
        server_url="https://your-server.modal.run"
    )
)
result = extractor.extract(image)

extractor

MinerU VL text extractor with layout-aware two-step extraction.

MinerU VL performs document extraction in two steps:

1. Layout Detection: Detect regions with types (text, table, equation, etc.)
2. Content Recognition: Extract text/table/equation content from each region

MinerUVLTextExtractor

MinerUVLTextExtractor(backend: MinerUVLTextBackendConfig)

Bases: BaseTextExtractor

MinerU VL text extractor with layout-aware extraction.

Performs two-step extraction:

1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:

- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

print(result.content)  # Combined text + tables + equations
print(result.blocks)   # List of ContentBlock objects

Initialize MinerU VL text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration (PyTorch, VLLM, MLX, or API)

TYPE: MinerUVLTextBackendConfig

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def __init__(self, backend: MinerUVLTextBackendConfig):
    """
    Initialize MinerU VL text extractor.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API)
    """
    self.backend_config = backend
    self._client = None
    self._loaded = False
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text with layout-aware two-step extraction.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format ('html' or 'markdown')

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted content and metadata

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text with layout-aware two-step extraction.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ('html' or 'markdown')

    Returns:
        TextOutput with extracted content and metadata
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Step 1: Layout detection
    blocks = self._detect_layout(pil_image)

    # Step 2: Content extraction for each block
    blocks = self._extract_content(pil_image, blocks)

    # Post-process (OTSL to HTML for tables)
    blocks = simple_post_process(blocks)

    # Combine content
    content = self._combine_content(blocks, output_format)

    # Build raw output with blocks info
    raw_output = self._build_raw_output(blocks)

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

extract_with_blocks

extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]

Extract text and return both TextOutput and ContentBlocks.

This method provides access to the detailed block information including bounding boxes and block types.

PARAMETER DESCRIPTION
image

Input image

TYPE: Union[Image, ndarray, str, Path]

output_format

Output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
tuple[TextOutput, List[ContentBlock]]

Tuple of (TextOutput, List[ContentBlock])

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
def extract_with_blocks(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]:
    """
    Extract text and return both TextOutput and ContentBlocks.

    This method provides access to the detailed block information
    including bounding boxes and block types.

    Args:
        image: Input image
        output_format: Output format

    Returns:
        Tuple of (TextOutput, List[ContentBlock])
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Two-step extraction
    blocks = self._detect_layout(pil_image)
    blocks = self._extract_content(pil_image, blocks)
    blocks = simple_post_process(blocks)

    content = self._combine_content(blocks, output_format)
    raw_output = self._build_raw_output(blocks)

    text_output = TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

    return text_output, blocks
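
The bounding boxes on the returned ContentBlocks are normalized to [0, 1]. To overlay or crop the detected regions you can convert them back to pixels with ContentBlock.to_absolute (documented under utils below); a minimal sketch, reusing the extractor and image from the example above:

result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
    x1, y1, x2, y2 = block.to_absolute(result.image_width, result.image_height)
    print(f"{block.type.value} at ({x1}, {y1})-({x2}, {y2})")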

mlx

MLX backend configuration for MinerU VL text extraction (Apple Silicon).

MinerUVLTextMLXConfig

Bases: BaseModel

MLX backend config for MinerU VL text extraction on Apple Silicon.

Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextMLXConfig()
)
result = extractor.extract(image)

pytorch

PyTorch/HuggingFace backend configuration for MinerU VL text extraction.

MinerUVLTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend config for MinerU VL text extraction.

Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)

utils

MinerU VL utilities for document extraction.

Contains data structures, parsing, prompts, and post-processing functions for the MinerU VL document extraction pipeline.

This file contains code adapted from mineru-vl-utils:

  • https://github.com/opendatalab/mineru-vl-utils
  • https://pypi.org/project/mineru-vl-utils/

The original mineru-vl-utils is licensed under AGPL-3.0 (Copyright (c) OpenDataLab):
https://github.com/opendatalab/mineru-vl-utils/blob/main/LICENSE.md

Adapted components
  • BlockType enum (from structs.py)
  • ContentBlock data structure (from structs.py)
  • OTSL to HTML table conversion (from post_process/otsl2html.py)

BlockType

Bases: str, Enum

MinerU VL block types (22 categories).

ContentBlock

Bases: BaseModel

A detected content block with type, bounding box, angle, and content.

Coordinates are normalized to [0, 1] range relative to image dimensions.

to_absolute

to_absolute(
    image_width: int, image_height: int
) -> List[int]

Convert normalized bbox to absolute pixel coordinates.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def to_absolute(self, image_width: int, image_height: int) -> List[int]:
    """Convert normalized bbox to absolute pixel coordinates."""
    x1, y1, x2, y2 = self.bbox
    return [
        int(x1 * image_width),
        int(y1 * image_height),
        int(x2 * image_width),
        int(y2 * image_height),
    ]
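
For instance, assuming "text" is one of the 22 BlockType values (as the overview suggests), a normalized box on a 1000×2000 page converts as follows:

block = ContentBlock(type=BlockType("text"), bbox=[0.1, 0.2, 0.5, 0.8], angle=0)
print(block.to_absolute(1000, 2000))  # [100, 400, 500, 1600]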

SamplingParams dataclass

SamplingParams(
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    no_repeat_ngram_size: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
)

Sampling parameters for text generation.

MinerUSamplingParams

MinerUSamplingParams(
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
)

Bases: SamplingParams

Default sampling parameters optimized for MinerU VL.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def __init__(
    self,
    temperature: Optional[float] = 0.0,
    top_p: Optional[float] = 0.01,
    top_k: Optional[int] = 1,
    presence_penalty: Optional[float] = 0.0,
    frequency_penalty: Optional[float] = 0.0,
    repetition_penalty: Optional[float] = 1.0,
    no_repeat_ngram_size: Optional[int] = 100,
    max_new_tokens: Optional[int] = None,
):
    super().__init__(
        temperature,
        top_p,
        top_k,
        presence_penalty,
        frequency_penalty,
        repetition_penalty,
        no_repeat_ngram_size,
        max_new_tokens,
    )
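
The defaults amount to near-greedy decoding (temperature 0.0, top_k 1), which suits deterministic transcription; any field can be overridden as a keyword argument. A sketch:

params = MinerUSamplingParams()                          # near-greedy defaults
long_output = MinerUSamplingParams(max_new_tokens=4096)  # hypothetical cap for long outputs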

convert_bbox

convert_bbox(bbox: Sequence) -> Optional[List[float]]

Convert bbox from model output (0-1000) to normalized format (0-1).

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def convert_bbox(bbox: Sequence) -> Optional[List[float]]:
    """Convert bbox from model output (0-1000) to normalized format (0-1)."""
    bbox = tuple(map(int, bbox))
    if any(coord < 0 or coord > 1000 for coord in bbox):
        return None
    x1, y1, x2, y2 = bbox
    x1, x2 = (x2, x1) if x2 < x1 else (x1, x2)
    y1, y2 = (y2, y1) if y2 < y1 else (y1, y2)
    if x1 == x2 or y1 == y2:
        return None
    return [coord / 1000.0 for coord in (x1, y1, x2, y2)]
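
Coordinates on the model's 0-1000 grid are divided by 1000, swapped corners are reordered, and out-of-range or zero-area boxes return None:

print(convert_bbox((100, 200, 500, 800)))   # [0.1, 0.2, 0.5, 0.8]
print(convert_bbox((500, 800, 100, 200)))   # swapped corners reordered: [0.1, 0.2, 0.5, 0.8]
print(convert_bbox((0, 0, 0, 500)))         # zero-width box: None
print(convert_bbox((0, 0, 1200, 500)))      # coordinate out of range: None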

parse_angle

parse_angle(tail: str) -> Literal[None, 0, 90, 180, 270]

Parse rotation angle from model output tail string.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def parse_angle(tail: str) -> Literal[None, 0, 90, 180, 270]:
    """Parse rotation angle from model output tail string."""
    for token, angle in ANGLE_MAPPING.items():
        if token in tail:
            return angle
    return None

parse_layout_output

parse_layout_output(output: str) -> List[ContentBlock]

Parse layout detection model output into ContentBlocks.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def parse_layout_output(output: str) -> List[ContentBlock]:
    """Parse layout detection model output into ContentBlocks."""
    blocks = []
    for line in output.split("\n"):
        match = re.match(LAYOUT_REGEX, line)
        if not match:
            continue
        x1, y1, x2, y2, ref_type, tail = match.groups()
        bbox = convert_bbox((x1, y1, x2, y2))
        if bbox is None:
            continue
        ref_type = ref_type.lower()
        if ref_type not in BLOCK_TYPES:
            continue
        angle = parse_angle(tail)
        blocks.append(
            ContentBlock(
                type=BlockType(ref_type),
                bbox=bbox,
                angle=angle,
            )
        )
    return blocks

get_rgb_image

get_rgb_image(image: Image) -> Image.Image

Convert image to RGB mode.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def get_rgb_image(image: Image.Image) -> Image.Image:
    """Convert image to RGB mode."""
    if image.mode == "P":
        image = image.convert("RGBA")
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image
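
Palette images go through an intermediate RGBA conversion so transparency is resolved before the final RGB cast; other non-RGB modes are converted directly:

from PIL import Image

print(get_rgb_image(Image.new("P", (64, 64))).mode)    # "RGB" (via RGBA)
print(get_rgb_image(Image.new("L", (64, 64))).mode)    # "RGB" (grayscale converted directly)
print(get_rgb_image(Image.new("RGB", (64, 64))).mode)  # "RGB" (returned unchanged)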

prepare_for_layout

prepare_for_layout(
    image: Image,
    layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image

Prepare image for layout detection.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def prepare_for_layout(
    image: Image.Image,
    layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image:
    """Prepare image for layout detection."""
    image = get_rgb_image(image)
    image = image.resize(layout_size, Image.Resampling.BICUBIC)
    return image
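
Layout detection runs on a fixed-size canvas (LAYOUT_IMAGE_SIZE, a module constant), so pages are RGB-converted and bicubically resized regardless of aspect ratio. A sketch with an explicit, hypothetical target size and input file:

from PIL import Image

page = Image.open("page.png")  # hypothetical input file
canvas = prepare_for_layout(page, layout_size=(1024, 1024))  # hypothetical size; defaults to LAYOUT_IMAGE_SIZE
print(canvas.size)  # (1024, 1024)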

resize_by_need

resize_by_need(
    image: Image, min_edge: int = 28, max_ratio: float = 50
) -> Image.Image

Resize image if needed based on aspect ratio constraints.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def resize_by_need(
    image: Image.Image,
    min_edge: int = 28,
    max_ratio: float = 50,
) -> Image.Image:
    """Resize image if needed based on aspect ratio constraints."""
    edge_ratio = max(image.size) / min(image.size)
    if edge_ratio > max_ratio:
        width, height = image.size
        if width > height:
            new_w, new_h = width, math.ceil(width / max_ratio)
        else:
            new_w, new_h = math.ceil(height / max_ratio), height
        new_image = Image.new(image.mode, (new_w, new_h), (255, 255, 255))
        new_image.paste(image, (int((new_w - width) / 2), int((new_h - height) / 2)))
        image = new_image
    if min(image.size) < min_edge:
        scale = min_edge / min(image.size)
        new_w, new_h = round(image.width * scale), round(image.height * scale)
        image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
    return image
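
Very elongated crops (e.g. a thin rule or a single text line) are pasted onto a white canvas to cap the aspect ratio at max_ratio, and tiny crops are upscaled so the short edge reaches min_edge. For example:

from PIL import Image

strip = Image.new("RGB", (2000, 10), (0, 0, 0))  # 200:1 aspect ratio
resized = resize_by_need(strip)
print(resized.size)                           # (2000, 40): padded so the ratio is capped at 50
print(max(resized.size) / min(resized.size))  # 50.0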

prepare_for_extract

prepare_for_extract(
    image: Image,
    blocks: List[ContentBlock],
    prompts: Dict[str, str] = None,
    sampling_params: Dict[str, SamplingParams] = None,
    skip_types: set = None,
) -> Tuple[
    List[Image.Image],
    List[str],
    List[SamplingParams],
    List[int],
]

Prepare cropped images for content extraction.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def prepare_for_extract(
    image: Image.Image,
    blocks: List[ContentBlock],
    prompts: Dict[str, str] = None,
    sampling_params: Dict[str, SamplingParams] = None,
    skip_types: set = None,
) -> Tuple[List[Image.Image], List[str], List[SamplingParams], List[int]]:
    """Prepare cropped images for content extraction."""
    if prompts is None:
        prompts = DEFAULT_PROMPTS
    if sampling_params is None:
        sampling_params = DEFAULT_SAMPLING_PARAMS
    if skip_types is None:
        skip_types = {"image", "list", "equation_block"}

    image = get_rgb_image(image)
    width, height = image.size

    block_images = []
    prompt_list = []
    params_list = []
    indices = []

    for idx, block in enumerate(blocks):
        if block.type.value in skip_types:
            continue

        x1, y1, x2, y2 = block.bbox
        scaled_bbox = (x1 * width, y1 * height, x2 * width, y2 * height)
        block_image = image.crop(scaled_bbox)

        if block_image.width < 1 or block_image.height < 1:
            continue

        if block.angle in [90, 180, 270]:
            block_image = block_image.rotate(block.angle, expand=True)

        block_image = resize_by_need(block_image)
        block_images.append(block_image)

        block_type = block.type.value
        prompt = prompts.get(block_type) or prompts.get("[default]")
        prompt_list.append(prompt)

        params = sampling_params.get(block_type) or sampling_params.get("[default]")
        params_list.append(params)
        indices.append(idx)

    return block_images, prompt_list, params_list, indices
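
The returned lists are parallel: the i-th crop is the rotated, resized region for blocks[indices[i]], paired with the prompt and sampling parameters for its block type (falling back to the "[default]" entries). Blocks typed image, list, or equation_block are skipped by default. A sketch, assuming page_image and blocks come from the layout step:

crops, prompts_out, params_out, indices = prepare_for_extract(page_image, blocks)
for i, crop in enumerate(crops):
    original = blocks[indices[i]]
    print(original.type.value, crop.size)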

convert_otsl_to_html

convert_otsl_to_html(otsl_content: str) -> str

Convert OTSL table format to HTML.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def convert_otsl_to_html(otsl_content: str) -> str:
    """Convert OTSL table format to HTML."""
    if otsl_content.startswith("<table") and otsl_content.endswith("</table>"):
        return otsl_content

    pattern = r"(" + r"|".join(ALL_OTSL_TOKENS) + r")"
    tokens = re.findall(pattern, otsl_content)
    text_parts = re.split(pattern, otsl_content)
    text_parts = [part for part in text_parts if part.strip()]

    split_row_tokens = [list(y) for x, y in itertools.groupby(tokens, lambda z: z == OTSL_NL) if not x]
    if not split_row_tokens:
        return ""

    max_cols = max(len(row) for row in split_row_tokens)
    for row in split_row_tokens:
        while len(row) < max_cols:
            row.append(OTSL_ECEL)

    def count_right(tokens_grid, c, r, which_tokens):
        span = 0
        c_iter = c
        while c_iter < len(tokens_grid[r]) and tokens_grid[r][c_iter] in which_tokens:
            c_iter += 1
            span += 1
        return span

    def count_down(tokens_grid, c, r, which_tokens):
        span = 0
        r_iter = r
        while r_iter < len(tokens_grid) and tokens_grid[r_iter][c] in which_tokens:
            r_iter += 1
            span += 1
        return span

    table_cells = []
    r_idx = 0
    c_idx = 0

    for i, text in enumerate(text_parts):
        if text in [OTSL_FCEL, OTSL_ECEL]:
            row_span = 1
            col_span = 1
            cell_text = ""
            right_offset = 1

            if text != OTSL_ECEL and i + 1 < len(text_parts):
                next_text = text_parts[i + 1]
                if next_text not in ALL_OTSL_TOKENS:
                    cell_text = next_text
                    right_offset = 2

            if i + right_offset < len(text_parts):
                next_right = text_parts[i + right_offset]
                if next_right in [OTSL_LCEL, OTSL_XCEL]:
                    col_span += count_right(split_row_tokens, c_idx + 1, r_idx, [OTSL_LCEL, OTSL_XCEL])

            if r_idx + 1 < len(split_row_tokens) and c_idx < len(split_row_tokens[r_idx + 1]):
                next_bottom = split_row_tokens[r_idx + 1][c_idx]
                if next_bottom in [OTSL_UCEL, OTSL_XCEL]:
                    row_span += count_down(split_row_tokens, c_idx, r_idx + 1, [OTSL_UCEL, OTSL_XCEL])

            table_cells.append(
                {
                    "text": cell_text.strip(),
                    "row_span": row_span,
                    "col_span": col_span,
                    "start_row": r_idx,
                    "start_col": c_idx,
                }
            )

        if text in [OTSL_FCEL, OTSL_ECEL, OTSL_LCEL, OTSL_UCEL, OTSL_XCEL]:
            c_idx += 1
        if text == OTSL_NL:
            r_idx += 1
            c_idx = 0

    num_rows = len(split_row_tokens)
    num_cols = max_cols
    grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]

    for cell in table_cells:
        for i in range(cell["start_row"], min(cell["start_row"] + cell["row_span"], num_rows)):
            for j in range(cell["start_col"], min(cell["start_col"] + cell["col_span"], num_cols)):
                grid[i][j] = cell

    html_parts = []
    for i in range(num_rows):
        html_parts.append("<tr>")
        for j in range(num_cols):
            cell = grid[i][j]
            if cell is None:
                continue
            if cell["start_row"] != i or cell["start_col"] != j:
                continue

            content = html.escape(cell["text"])
            tag = "td"
            parts = [f"<{tag}"]
            if cell["row_span"] > 1:
                parts.append(f' rowspan="{cell["row_span"]}"')
            if cell["col_span"] > 1:
                parts.append(f' colspan="{cell["col_span"]}"')
            parts.append(f">{content}</{tag}>")
            html_parts.append("".join(parts))
        html_parts.append("</tr>")

    return f"<table>{''.join(html_parts)}</table>"

simple_post_process

simple_post_process(
    blocks: List[ContentBlock],
) -> List[ContentBlock]

Simple post-processing: convert OTSL tables to HTML.

Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
def simple_post_process(blocks: List[ContentBlock]) -> List[ContentBlock]:
    """Simple post-processing: convert OTSL tables to HTML."""
    for block in blocks:
        if block.type == BlockType.TABLE and block.content:
            try:
                block.content = convert_otsl_to_html(block.content)
            except Exception:
                pass
    return blocks

vllm

VLLM backend configuration for MinerU VL text extraction.

MinerUVLTextVLLMConfig

Bases: BaseModel

VLLM backend config for MinerU VL text extraction.

Uses VLLM for high-throughput GPU inference with:

- PagedAttention for efficient KV cache
- Continuous batching
- Optimized CUDA kernels

Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
    )
)
result = extractor.extract(image)