Extractor

MinerU VL text extractor with layout-aware two-step extraction.

MinerU VL performs document extraction in two steps:

1. Layout Detection: Detect regions with types (text, table, equation, etc.)
2. Content Recognition: Extract text/table/equation content from each region
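
The two steps above can be sketched as a small pipeline. This is a minimal illustration, not the library's implementation: `Region`, `two_step_extract`, `fake_detect`, and `fake_recognize` are hypothetical names, and the toy detector and recognizer stand in for the model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    """Hypothetical stand-in for a detected layout region."""
    kind: str                                # e.g. "text", "table", "equation"
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout
    content: Optional[str] = None            # filled in by step 2


def two_step_extract(
    page: str,
    detect: Callable[[str], List[Region]],
    recognize: Callable[[str, Region], str],
) -> List[Region]:
    """Run layout detection, then content recognition for each region."""
    regions = detect(page)                   # step 1: find typed regions
    for region in regions:                   # step 2: read each region
        region.content = recognize(page, region)
    return regions


# Toy detector and recognizer so the sketch runs without a model.
def fake_detect(page: str) -> List[Region]:
    return [
        Region("text", (0.0, 0.0, 1.0, 0.5)),
        Region("table", (0.0, 0.5, 1.0, 1.0)),
    ]


def fake_recognize(page: str, region: Region) -> str:
    return f"<{region.kind} from {page}>"


regions = two_step_extract("page-1", fake_detect, fake_recognize)
for region in regions:
    print(region.kind, region.content)
```

Keeping detection and recognition as separate callables mirrors the split described above: the region type found in step 1 is available when step 2 decides how to read that region.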

MinerUVLTextExtractor

MinerUVLTextExtractor(backend: MinerUVLTextBackendConfig)

Bases: BaseTextExtractor

MinerU VL text extractor with layout-aware extraction.

Performs two-step extraction:

1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:

- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
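
One common way to support several backends behind a single constructor argument is to dispatch on the type of the config object. The sketch below only illustrates that pattern; `LocalConfig`, `ServerConfig`, and `describe_client` are hypothetical names, and the URL is a placeholder. None of them are part of omnidocs.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class LocalConfig:
    """Hypothetical config for an in-process backend."""
    device: str = "cpu"


@dataclass
class ServerConfig:
    """Hypothetical config for a remote, API-style backend."""
    base_url: str = "http://localhost:8000/v1"  # placeholder URL


BackendConfig = Union[LocalConfig, ServerConfig]


def describe_client(config: BackendConfig) -> str:
    """Pick a client flavor from the config type (illustrative only)."""
    if isinstance(config, LocalConfig):
        return f"in-process model on {config.device}"
    if isinstance(config, ServerConfig):
        return f"HTTP client for {config.base_url}"
    raise TypeError(f"Unsupported backend config: {type(config).__name__}")


print(describe_client(LocalConfig(device="cuda")))
print(describe_client(ServerConfig()))
```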

Example

from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
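image = "document_page.png"  # hypothetical path; a PIL Image or numpy array also works
result, blocks = extractor.extract_with_blocks(image)

print(result.content)  # Combined text + tables + equations
print(blocks)          # List of ContentBlock objects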

Initialize MinerU VL text extractor.

PARAMETER	DESCRIPTION
`backend`	Backend configuration (PyTorch, VLLM, MLX, or API) TYPE: `MinerUVLTextBackendConfig`

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py

def __init__(self, backend: MinerUVLTextBackendConfig):
    """
    Initialize MinerU VL text extractor.

    Args:
        backend: Backend configuration (PyTorch, VLLM, MLX, or API)
    """
    self.backend_config = backend
    self._client = None
    self._loaded = False
    self._load_model()
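
The constructor above loads the model eagerly and records the result in `_loaded`, which the extraction methods check before doing any work. A minimal sketch of that guard pattern follows; `GuardedExtractor` and its `load_now` flag are hypothetical and exist only to make the unloaded state easy to demonstrate.

```python
class GuardedExtractor:
    """Hypothetical extractor showing eager loading plus a loaded-state guard."""

    def __init__(self, backend: str, load_now: bool = True):
        self.backend_config = backend
        self._client = None
        self._loaded = False
        if load_now:  # assumption: the flag exists only for this demo
            self._load_model()

    def _load_model(self) -> None:
        # A real backend would construct its model or HTTP client here.
        self._client = f"client-for-{self.backend_config}"
        self._loaded = True

    def extract(self, image: str) -> str:
        # Fail fast with a clear message instead of hitting a None client.
        if not self._loaded:
            raise RuntimeError("Model not loaded. Call _load_model() first.")
        return f"{self._client} processed {image}"


ready = GuardedExtractor("pytorch")
print(ready.extract("page.png"))

deferred = GuardedExtractor("pytorch", load_now=False)
try:
    deferred.extract("page.png")
except RuntimeError as exc:
    print("guard tripped:", exc)
```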

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text with layout-aware two-step extraction.

PARAMETER	DESCRIPTION
`image`	Input image (PIL Image, numpy array, or file path) TYPE: `Union[Image, ndarray, str, Path]`
`output_format`	Output format ('html' or 'markdown') TYPE: `Literal['html', 'markdown']` DEFAULT: `'markdown'`

RETURNS	DESCRIPTION
`TextOutput`	TextOutput with extracted content and metadata

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py

def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text with layout-aware two-step extraction.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Output format ('html' or 'markdown')

    Returns:
        TextOutput with extracted content and metadata
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Step 1: Layout detection
    blocks = self._detect_layout(pil_image)

    # Step 2: Content extraction for each block
    blocks = self._extract_content(pil_image, blocks)

    # Post-process (OTSL to HTML for tables)
    blocks = simple_post_process(blocks)

    # Combine content
    content = self._combine_content(blocks, output_format)

    # Build raw output with blocks info
    raw_output = self._build_raw_output(blocks)

    return TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )
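
The body of `_combine_content` is not shown on this page. As a rough sketch of how recognized blocks might be merged for each `output_format`, the example below joins non-empty blocks in reading order. `Block`, `combine_content`, the `$$` fencing for equations, and the HTML wrappers are all assumptions. Passing tables through unchanged is based on the post-processing comment above, which says tables are converted from OTSL to HTML.

```python
import html
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a recognized content block."""
    kind: str      # "text", "table", or "equation" (assumed labels)
    content: str


def combine_content(blocks: List[Block], output_format: str = "markdown") -> str:
    """Join block contents in order, formatting each by its kind."""
    if output_format not in ("markdown", "html"):
        raise ValueError(f"Unsupported output format: {output_format!r}")

    parts = []
    for block in blocks:
        if not block.content:
            continue  # skip regions where nothing was recognized
        if block.kind == "table":
            # Assumed to be HTML already after post-processing.
            parts.append(block.content)
        elif block.kind == "equation":
            if output_format == "markdown":
                parts.append(f"$$\n{block.content}\n$$")
            else:
                parts.append(f'<div class="equation">{html.escape(block.content)}</div>')
        else:
            if output_format == "markdown":
                parts.append(block.content)
            else:
                parts.append(f"<p>{html.escape(block.content)}</p>")
    return "\n\n".join(parts)


blocks = [
    Block("text", "Total a < b"),
    Block("table", "<table><tr><td>1</td></tr></table>"),
    Block("equation", "E = mc^2"),
    Block("text", ""),
]
print(combine_content(blocks, "markdown"))
print(combine_content(blocks, "html"))
```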

extract_with_blocks

extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]

Extract text and return both TextOutput and ContentBlocks.

This method provides access to the detailed block information including bounding boxes and block types.

PARAMETER	DESCRIPTION
`image`	Input image (PIL Image, numpy array, or file path) TYPE: `Union[Image, ndarray, str, Path]`
`output_format`	Output format ('html' or 'markdown') TYPE: `Literal['html', 'markdown']` DEFAULT: `'markdown'`

RETURNS	DESCRIPTION
`tuple[TextOutput, List[ContentBlock]]`	Tuple of (TextOutput, List[ContentBlock])

Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py

def extract_with_blocks(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]:
    """
    Extract text and return both TextOutput and ContentBlocks.

    This method provides access to the detailed block information
    including bounding boxes and block types.

    Args:
        image: Input image
        output_format: Output format

    Returns:
        Tuple of (TextOutput, List[ContentBlock])
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Two-step extraction
    blocks = self._detect_layout(pil_image)
    blocks = self._extract_content(pil_image, blocks)
    blocks = simple_post_process(blocks)

    content = self._combine_content(blocks, output_format)
    raw_output = self._build_raw_output(blocks)

    text_output = TextOutput(
        content=content,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        image_width=width,
        image_height=height,
        model_name="MinerU2.5-2509-1.2B",
    )

    return text_output, blocks
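
Once the blocks are available, a typical next step is to filter them by type and map their bounding boxes onto the page. The sketch below uses a hypothetical `Block` stand-in rather than the real `ContentBlock`. The field names `kind`, `bbox`, and `content`, and the choice of coordinates normalized to the range 0 to 1, are assumptions; check the actual `ContentBlock` definition before relying on them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for ContentBlock; field names are assumed."""
    kind: str
    bbox: Tuple[float, float, float, float]  # assumed normalized (x0, y0, x1, y1)
    content: str


def select(blocks: List[Block], kind: str) -> List[Block]:
    """Keep only the blocks of one type, preserving reading order."""
    return [block for block in blocks if block.kind == kind]


def to_pixels(
    bbox: Tuple[float, float, float, float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized box to pixel coordinates for a given page size."""
    x0, y0, x1, y1 = bbox
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


blocks = [
    Block("text", (0.0, 0.0, 1.0, 0.25), "Intro paragraph"),
    Block("table", (0.25, 0.5, 0.75, 0.75), "<table>...</table>"),
]

tables = select(blocks, "table")
# 1000 x 2000 plays the role of image_width x image_height from TextOutput.
print(to_pixels(tables[0].bbox, 1000, 2000))
```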