Extractor
Nanonets OCR2-3B text extractor.
A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.
Supports PyTorch, VLLM, and MLX backends.
NanonetsTextExtractor
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in <img> tags)
- Watermarks (wrapped in <watermark> tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from PIL import Image

from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
    backend=NanonetsTextPyTorchConfig()
)

# Load the page to process ("document.png" is a placeholder path)
image = Image.open("document.png")

# Extract text
result = extractor.extract(image)
print(result.content)
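Because captions and watermarks arrive wrapped in tags, the returned string can be post-processed with ordinary text tools. The following is a minimal sketch, not part of omnidocs: the helper name `collect_tagged_text` is hypothetical, the tag names `<img>` and `<watermark>` are assumed from the feature list above, and the sample string is hand-written rather than real model output.

```python
import re

# Assumed tag names; confirm them against real model output before relying on this.
TAG_PATTERNS = {
    "image_captions": re.compile(r"<img>(.*?)</img>", re.DOTALL),
    "watermarks": re.compile(r"<watermark>(.*?)</watermark>", re.DOTALL),
}


def collect_tagged_text(content: str) -> dict:
    """Return the text found inside each assumed tag, in document order."""
    return {
        name: [match.strip() for match in pattern.findall(content)]
        for name, pattern in TAG_PATTERNS.items()
    }


# Hand-written sample standing in for result.content (not real model output).
sample = (
    "Quarterly report\n"
    "<img>Bar chart of revenue by region</img>\n"
    "<watermark>DRAFT</watermark>\n"
)
tagged = collect_tagged_text(sample)
print(tagged)
```

In practice the same helper would be called as `collect_tagged_text(result.content)`. A regular expression is enough here only because the tags are assumed not to nest.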
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: NanonetsTextPyTorchConfig (PyTorch/HuggingFace backend), NanonetsTextVLLMConfig (VLLM high-throughput backend), or NanonetsTextMLXConfig (MLX backend for Apple Silicon). |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
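Since tables are embedded as HTML inside the unified output, a caller who needs structured rows has to separate them from the surrounding text. One possible approach, sketched with the standard library's `html.parser`, is shown below. The names `TableCollector` and `extract_tables` are hypothetical, the sample string is hand-written, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a string as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one entry per table: a list of rows
        self._in_table = False
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table":
            self._close_row()
            self._in_table = False

    def handle_data(self, data):
        # Text outside table cells (prose, LaTeX) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
        self._row = None


def extract_tables(content):
    """Return all tables found in content as nested lists of strings."""
    parser = TableCollector()
    parser.feed(content)
    parser.close()
    return parser.tables


# Hand-written sample standing in for result.content (not real model output).
sample = (
    "Intro text\n"
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>\n"
    "Energy: $E = mc^2$"
)
print(extract_tables(sample))
```

A real parser is used instead of a regular expression so that attributes on the table tags and entity-encoded cell text do not break the extraction.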
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image as a PIL.Image.Image object, an np.ndarray (HWC format, RGB), or a str or Path pointing to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Accepted for API compatibility. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing the extracted text content |
| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the model is not loaded |
| ValueError | If the image format is not supported |
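The two documented exceptions can be handled separately by the caller. The sketch below shows one way to do that. `try_extract` and `StubExtractor` are hypothetical names, and the stub only stands in for a real extractor so the example runs without omnidocs or model weights installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def try_extract(extractor: Any, image: Any) -> Optional[str]:
    """Return the extracted content, or None if a documented error is raised."""
    try:
        return extractor.extract(image).content
    except RuntimeError as exc:
        # Documented case: the model is not loaded.
        print(f"model not ready: {exc}")
    except ValueError as exc:
        # Documented case: the image format is not supported.
        print(f"unsupported image: {exc}")
    return None


class StubExtractor:
    """Hypothetical stand-in that mimics only the extract() contract."""

    def extract(self, image):
        if not isinstance(image, str):
            raise ValueError("this stub only accepts path strings")
        return SimpleNamespace(content=f"stub text for {image}")


print(try_extract(StubExtractor(), "page.png"))
print(try_extract(StubExtractor(), 123))
```

With a real extractor the call would be `try_extract(extractor, image)`. Whether to retry, skip the page, or re-raise after logging depends on the pipeline.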