
Extractor

Nanonets OCR2-3B text extractor.

A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.

Supports PyTorch, VLLM, and MLX backends.

Example
from PIL import Image

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())

# Load a document image and extract its text
image = Image.open("document.png")  # path to your document image
result = extractor.extract(image)
print(result.content)
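
For batch or high-throughput workloads, the VLLM backend can be swapped in without changing the extraction call. A minimal sketch, assuming NanonetsTextVLLMConfig can be constructed with defaults (the directory path is hypothetical):

from pathlib import Path

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextVLLMConfig

# VLLM backend; default construction is an assumption here.
extractor = NanonetsTextExtractor(backend=NanonetsTextVLLMConfig())

# Extract text from every page image in a directory.
for page in sorted(Path("pages").glob("*.png")):
    result = extractor.extract(page)
    print(result.content)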

NanonetsTextExtractor

NanonetsTextExtractor(backend: NanonetsTextBackendConfig)

Bases: BaseTextExtractor

Nanonets OCR2-3B Vision-Language Model text extractor.

Extracts text from document images with support for:

- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in tags)
- Watermarks (wrapped in tags)
- Page numbers (wrapped in tags)
- Checkboxes (using ☐ and ☑ symbols)

Supports PyTorch, VLLM, and MLX backends.
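
Because tagged elements share the plain text stream, downstream code may want to filter them out. A hedged sketch; the tag names (watermark, page_number) are assumptions, since the docs above only say these elements are "wrapped in tags":

import re

def strip_layout_tags(text: str) -> str:
    """Drop watermark and page-number spans from extracted text."""
    # Tag names are assumed; adjust to whatever the model actually emits.
    for tag in ("watermark", "page_number"):
        text = re.sub(rf"<{tag}>.*?</{tag}>", "", text, flags=re.DOTALL)
    return text

sample = "Quarterly report<watermark>DRAFT</watermark><page_number>3</page_number>"
print(strip_layout_tags(sample))  # -> "Quarterly report"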

Example
from PIL import Image

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())

# Load an image and extract text
image = Image.open("document.png")  # path to your document image
result = extractor.extract(image)
print(result.content)

Initialize Nanonets text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:

- NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
- NanonetsTextVLLMConfig: VLLM high-throughput backend
- NanonetsTextMLXConfig: MLX backend for Apple Silicon

TYPE: NanonetsTextBackendConfig
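
Since all three configs share the NanonetsTextBackendConfig interface, the choice can be made at runtime. A sketch, assuming each config can be constructed with defaults:

import platform

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import (
    NanonetsTextMLXConfig,
    NanonetsTextPyTorchConfig,
)

# Prefer MLX on Apple Silicon, otherwise fall back to PyTorch.
# Default construction of each config is an assumption here.
if platform.system() == "Darwin" and platform.machine() == "arm64":
    backend = NanonetsTextMLXConfig()
else:
    backend = NanonetsTextPyTorchConfig()

extractor = NanonetsTextExtractor(backend=backend)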

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def __init__(self, backend: NanonetsTextBackendConfig):
    """
    Initialize Nanonets text extractor.

    Args:
        backend: Backend configuration. One of:
            - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
            - NanonetsTextVLLMConfig: VLLM high-throughput backend
            - NanonetsTextMLXConfig: MLX backend for Apple Silicon
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._device: str = "cpu"

    # MLX-specific helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.

PARAMETER DESCRIPTION
image

Input image as:

- PIL.Image.Image: PIL image object
- np.ndarray: Numpy array (HWC format, RGB)
- str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format is not supported
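
Callers that process untrusted inputs may want to turn these into a soft failure. A small hedged sketch around the documented exceptions:

from pathlib import Path
from typing import Optional

def safe_extract(extractor, image_path: Path) -> Optional[str]:
    """Return extracted text, or None if the image is unsupported."""
    try:
        return extractor.extract(image_path).content
    except ValueError as err:
        # Documented above: unsupported image format.
        print(f"Skipping {image_path}: {err}")
        return None

RuntimeError (model not loaded) is deliberately left to propagate here, since retrying the call will not help.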

Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Note: Nanonets OCR2 produces a unified output format that includes
    tables as HTML and equations as LaTeX inline. The output_format
    parameter is accepted for API compatibility but does not change
    the output structure.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "NanonetsTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image)
    elif config_type == "NanonetsTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image)
    elif config_type == "NanonetsTextMLXConfig":
        raw_output = self._infer_mlx(pil_image)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    cleaned_output = raw_output.replace("<|im_end|>", "").strip()

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned_output,
        image_width=width,
        image_height=height,
        model_name=f"Nanonets-OCR2-3B ({type(self.backend_config).__name__})",
    )
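
The TextOutput fields populated at the end of extract() can be consumed directly. A short usage sketch (field names taken from the source above; the image path is hypothetical):

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())
result = extractor.extract("invoice.png")  # hypothetical path

# Field names match the TextOutput construction shown above.
print(result.model_name)    # "Nanonets-OCR2-3B (NanonetsTextPyTorchConfig)"
print(result.image_width, result.image_height)
print(result.plain_text == result.content)  # True: both hold the cleaned text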