Overview
Nanonets OCR2-3B backend configurations and extractor for text extraction.
Available backends
- NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
- NanonetsTextVLLMConfig: VLLM high-throughput backend
- NanonetsTextMLXConfig: MLX backend for Apple Silicon
Example
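A minimal sketch of selecting a backend, assuming default constructor arguments for each config and a placeholder image path; see NanonetsTextExtractor below for the full API:

```python
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import (
    NanonetsTextMLXConfig,
    NanonetsTextPyTorchConfig,
    NanonetsTextVLLMConfig,
)

# Pick exactly one backend config (defaults assumed here).
extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())
# extractor = NanonetsTextExtractor(backend=NanonetsTextVLLMConfig())  # high-throughput / batch
# extractor = NanonetsTextExtractor(backend=NanonetsTextMLXConfig())   # Apple Silicon only

result = extractor.extract("document.png")  # placeholder path
print(result.content)
```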
NanonetsTextExtractor
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in tags)
- Watermarks (wrapped in tags)
Supports PyTorch, VLLM, and MLX backends.
Example
```python
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
    backend=NanonetsTextPyTorchConfig()
)

# Extract text
result = extractor.extract(image)
print(result.content)
```
Initialize the Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of:<br>- `NanonetsTextPyTorchConfig`: PyTorch/HuggingFace backend<br>- `NanonetsTextVLLMConfig`: VLLM high-throughput backend<br>- `NanonetsTextMLXConfig`: MLX backend for Apple Silicon |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as:<br>- `PIL.Image.Image`: PIL image object<br>- `np.ndarray`: NumPy array (HWC format, RGB)<br>- `str` or `Path`: path to an image file |
| `output_format` | Accepted for API compatibility (default: `"markdown"`) |
| RETURNS | DESCRIPTION |
|---|---|
| `TextOutput` | TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If model is not loaded |
| `ValueError` | If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
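A usage sketch covering the three documented input types and the documented exceptions; file names are placeholders and only the behaviour stated above is assumed:

```python
from pathlib import Path

import numpy as np
from PIL import Image

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())

pil_image = Image.open("invoice.png")                # PIL.Image.Image
array_image = np.asarray(pil_image.convert("RGB"))   # np.ndarray, HWC, RGB
path_image = Path("invoice.png")                     # str or Path

for image in (pil_image, array_image, path_image):
    try:
        # output_format does not change the structure: tables stay HTML,
        # equations stay inline LaTeX.
        result = extractor.extract(image, output_format="markdown")
        print(result.content)
    except (RuntimeError, ValueError) as err:
        # RuntimeError: model not loaded; ValueError: unsupported image format.
        print(f"extraction failed: {err}")
```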
NanonetsTextMLXConfig
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
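A sketch of guarding backend selection at runtime, since this config only works on Apple Silicon; the fallback to the PyTorch config is an illustration of one way to handle other hosts, not library behaviour:

```python
import platform

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import (
    NanonetsTextMLXConfig,
    NanonetsTextPyTorchConfig,
)

# Use MLX only on Apple Silicon macOS; otherwise fall back to the PyTorch backend.
if platform.system() == "Darwin" and platform.machine() == "arm64":
    backend = NanonetsTextMLXConfig()
else:
    backend = NanonetsTextPyTorchConfig()

extractor = NanonetsTextExtractor(backend=backend)
```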
NanonetsTextPyTorchConfig
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
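A minimal sketch assuming the listed requirements (torch, transformers, accelerate) are installed and default config values are used; the CUDA check is only a local sanity check, not part of the config, and the image path is a placeholder:

```python
import torch

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# This backend targets local GPU inference, so confirm a CUDA device is visible.
print("CUDA available:", torch.cuda.is_available())

extractor = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())
result = extractor.extract("page_01.png")
print(result.content)
```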
NanonetsTextVLLMConfig
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
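A sketch of the batch-processing use case this backend targets. No dedicated batch API is documented here, so the loop simply calls the documented `extract` method per page; the `pages/` directory and `.md` output files are placeholders, and default config values are assumed:

```python
from pathlib import Path

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextVLLMConfig

# High-throughput backend; assumes vllm, torch, transformers, qwen-vl-utils are installed.
extractor = NanonetsTextExtractor(backend=NanonetsTextVLLMConfig())

# Process a directory of page images one by one via the documented extract() API.
for page in sorted(Path("pages").glob("*.png")):
    result = extractor.extract(page)
    page.with_suffix(".md").write_text(result.content)
```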
extractor
Nanonets OCR2-3B text extractor.
A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.
Supports PyTorch and VLLM backends.
mlx
MLX backend configuration for Nanonets OCR2-3B text extraction.
pytorch
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
vllm
VLLM backend configuration for Nanonets OCR2-3B text extraction.