Overview¶
Text Extraction Module.
Provides extractors for converting document images to structured text formats (HTML, Markdown, JSON). Uses Vision-Language Models for accurate text extraction with formatting preservation and optional layout detection.
Available Extractors
- QwenTextExtractor: Qwen3-VL based extractor (multi-backend)
- DotsOCRTextExtractor: Dots OCR with layout-aware extraction (PyTorch/VLLM/API)
- NanonetsTextExtractor: Nanonets OCR2-3B for text extraction (PyTorch/VLLM)
- GraniteDoclingTextExtractor: IBM Granite Docling for document conversion (multi-backend)
- MinerUVLTextExtractor: MinerU VL 1.2B with layout-aware two-step extraction (multi-backend)
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
BaseTextExtractor
¶
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
Example
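A minimal usage sketch with one concrete subclass (QwenTextExtractor; every implementation exposes the same extract interface, and the image path here is illustrative):
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract("page.png", output_format="markdown")
print(result.content)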
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML) or "markdown" (Markdown format). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If image format or output_format is not supported |
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format. TYPE: List[Union[Image, ndarray, str, Path]] |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| progress_callback | Optional function(current, total) for progress reporting. TYPE: Optional[Callable[[int, int], None]] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput in the same order as the input |
Examples:
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
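A progress callback can also be attached; a small sketch (the print format is illustrative):
def on_progress(current, total):
    print(f"Processed {current}/{total} pages")
results = extractor.batch_extract(images, output_format="markdown", progress_callback=on_progress)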
Source code in omnidocs/tasks/text_extraction/base.py
extract_document
¶
extract_document(
document: Document,
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance. TYPE: Document |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| progress_callback | Optional function(current, total) for progress reporting. TYPE: Optional[Callable[[int, int], None]] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput, one per page |
Examples:
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
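The per-page outputs can then be combined; a sketch (joining with blank lines is just one choice):
combined = "\n\n".join(page.content for page in results)
print(f"{len(results)} pages, {len(combined)} characters of markdown")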
Source code in omnidocs/tasks/text_extraction/base.py
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information, including:
- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: DotsOCRPyTorchConfig (PyTorch/HuggingFace backend), DotsOCRVLLMConfig (VLLM high-throughput backend), DotsOCRAPIConfig (API backend, online VLLM server). |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown", "html", or "json"). TYPE: Literal["markdown", "html", "json"] DEFAULT: "markdown" |
| include_layout | Include layout bounding boxes in output. TYPE: bool DEFAULT: False |
| custom_prompt | Override the default extraction prompt. TYPE: Optional[str] DEFAULT: None |
| max_tokens | Maximum tokens for generation. TYPE: int DEFAULT: 8192 |

| RETURNS | DESCRIPTION |
|---|---|
| DotsOCRTextOutput | DotsOCRTextOutput with extracted content and optional layout |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded or inference fails |
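For layout-aware use, a sketch requesting JSON output with bounding boxes (attribute names follow the extractor-module example further down this page):
result = extractor.extract(image, output_format="json", include_layout=True)
print(f"Found {result.num_layout_elements} elements")
for elem in result.layout:
    print(elem.category, elem.bbox, (elem.text or "")[:40])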
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API config). |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown" or "html"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:
1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("html" or "markdown"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[TextOutput, List[ContentBlock]] | Tuple of (TextOutput, List[ContentBlock]) |
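Example usage (mirroring the module-level example further down this page):
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
    print(f"{block.type}: {block.content[:50]}...")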
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with:
- Layout detection (11 categories)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown/LaTeX/HTML)
- Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
Example
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| bbox | Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024) |
| category | Layout category (e.g., "Text", "Title", "Table", "Formula") |
| text | Extracted text content (None for pictures) |
| confidence | Detection confidence score (optional) |
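A short sketch of a LayoutElement as it might appear in DotsOCRTextOutput.layout (the field values are illustrative):
elem = LayoutElement(bbox=[102, 80, 860, 140], category="Title", text="Quarterly Report", confidence=0.98)
if elem.category != "Picture":
    print(elem.category, elem.bbox, elem.text)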
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics:
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
Example
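A minimal sketch of consuming a TextOutput (only the content field is used here; writing it to disk is illustrative):
from pathlib import Path
result = extractor.extract(image, output_format="markdown")
Path("page.md").write_text(result.content)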
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: NanonetsTextPyTorchConfig (PyTorch/HuggingFace backend), NanonetsTextVLLMConfig (VLLM high-throughput backend), NanonetsTextMLXConfig (MLX backend for Apple Silicon). |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Accepted for API compatibility; does not change the output structure. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: QwenTextPyTorchConfig (PyTorch/HuggingFace backend), QwenTextVLLMConfig (VLLM high-throughput backend), QwenTextMLXConfig (MLX backend for Apple Silicon), QwenTextAPIConfig (API backend: OpenRouter, etc.). |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML with div elements) or "markdown" (Markdown format). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
# Default extraction
result = extractor.extract("document.png", output_format="markdown")
# Custom prompt
result = extractor.extract(
"document.png",
prompt="Extract only the table data as markdown",
)
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | VLM API configuration with model and provider details. TYPE: VLMAPIConfig |
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
prompt: Optional[str] = None,
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format ("html" or "markdown"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| prompt | Custom prompt. If None, uses a task-specific default prompt. TYPE: Optional[str] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content. |
Source code in omnidocs/tasks/text_extraction/vlm.py
base
¶
Base class for text extractors.
Defines the abstract interface that all text extractors must implement.
BaseTextExtractor
¶
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
Example
extract
abstractmethod
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML) or "markdown" (Markdown format). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If image format or output_format is not supported |
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
¶
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).
| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format. TYPE: List[Union[Image, ndarray, str, Path]] |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| progress_callback | Optional function(current, total) for progress reporting. TYPE: Optional[Callable[[int, int], None]] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput in the same order as the input |
Examples:
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
extract_document
¶
extract_document(
document: Document,
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance. TYPE: Document |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| progress_callback | Optional function(current, total) for progress reporting. TYPE: Optional[Callable[[int, int], None]] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput, one per page |
Examples:
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
dotsocr
¶
Dots OCR text extractor and backend configurations.
Available backends:
- PyTorch: DotsOCRPyTorchConfig (local GPU inference)
- VLLM: DotsOCRVLLMConfig (offline batch inference)
- API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information, including:
- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: DotsOCRPyTorchConfig (PyTorch/HuggingFace backend), DotsOCRVLLMConfig (VLLM high-throughput backend), DotsOCRAPIConfig (API backend, online VLLM server). |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown", "html", or "json"). TYPE: Literal["markdown", "html", "json"] DEFAULT: "markdown" |
| include_layout | Include layout bounding boxes in output. TYPE: bool DEFAULT: False |
| custom_prompt | Override the default extraction prompt. TYPE: Optional[str] DEFAULT: None |
| max_tokens | Maximum tokens for generation. TYPE: int DEFAULT: 8192 |

| RETURNS | DESCRIPTION |
|---|---|
| DotsOCRTextOutput | DotsOCRTextOutput with extracted content and optional layout |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations such as:
- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels
Example
api
¶
API backend configuration for Dots OCR (VLLM online server).
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
extractor
¶
Dots OCR text extractor with layout-aware extraction.
A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
print(f"{elem.category}: {elem.bbox}")
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information, including:
- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: DotsOCRPyTorchConfig (PyTorch/HuggingFace backend), DotsOCRVLLMConfig (VLLM high-throughput backend), DotsOCRAPIConfig (API backend, online VLLM server). |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown", "html", or "json"). TYPE: Literal["markdown", "html", "json"] DEFAULT: "markdown" |
| include_layout | Include layout bounding boxes in output. TYPE: bool DEFAULT: False |
| custom_prompt | Override the default extraction prompt. TYPE: Optional[str] DEFAULT: None |
| max_tokens | Maximum tokens for generation. TYPE: int DEFAULT: 8192 |

| RETURNS | DESCRIPTION |
|---|---|
| DotsOCRTextOutput | DotsOCRTextOutput with extracted content and optional layout |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
pytorch
¶
PyTorch backend configuration for Dots OCR.
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
vllm
¶
VLLM backend configuration for Dots OCR.
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations such as:
- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels
Example
granitedocling
¶
Granite Docling text extraction with multi-backend support.
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
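A sketch assuming the same litellm-style fields (model, with the API key read from the environment) as the Qwen API config shown later on this page; the exact field names and the model identifier are illustrative:
config = GraniteDoclingTextAPIConfig(model="openrouter/ibm-granite/granite-docling-258m")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")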
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API config). |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown" or "html"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
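A sketch assuming the default fields are sufficient (the default revision="untied" is the part that matters here):
config = GraniteDoclingTextVLLMConfig()  # keeps revision="untied", required for VLLM
extractor = GraniteDoclingTextExtractor(backend=config)
results = extractor.batch_extract(images, output_format="markdown")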
api
¶
API backend configuration for Granite Docling text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
extractor
¶
Granite Docling text extractor with multi-backend support.
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API config). |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("markdown" or "html"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
mlx
¶
MLX backend configuration for Granite Docling text extraction (Apple Silicon).
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
pytorch
¶
PyTorch backend configuration for Granite Docling text extraction.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
vllm
¶
VLLM backend configuration for Granite Docling text extraction.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
mineruvl
¶
MinerU VL text extraction module.
MinerU VL is a vision-language model for document layout detection and text/table/equation recognition. It performs two-step extraction:
1. Layout Detection: detect regions with types (text, table, equation, etc.)
2. Content Recognition: extract content from each detected region
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
# Initialize with PyTorch backend
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
# Extract text
result = extractor.extract(image)
print(result.content)
# Extract with detailed blocks
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
print(f"{block.type}: {block.content[:50]}...")
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:
1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("html" or "markdown"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[TextOutput, List[ContentBlock]] | Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
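A small sketch constructing the MinerU-tuned parameters and overriding a single field (the override value is illustrative):
params = MinerUSamplingParams(max_new_tokens=2048)
print(params.temperature, params.top_k, params.no_repeat_ngram_size)  # 0.0 1 100 per the defaults above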
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with:
- PagedAttention for efficient KV cache
- Continuous batching
- Optimized CUDA kernels
Example
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
api
¶
API backend configuration for MinerU VL text extraction.
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
extractor
¶
MinerU VL text extractor with layout-aware two-step extraction.
MinerU VL performs document extraction in two steps:
1. Layout Detection: detect regions with types (text, table, equation, etc.)
2. Content Recognition: extract text/table/equation content from each region
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:
1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API). |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format ("html" or "markdown"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Output format. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[TextOutput, List[ContentBlock]] | Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
mlx
¶
MLX backend configuration for MinerU VL text extraction (Apple Silicon).
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
pytorch
¶
PyTorch/HuggingFace backend configuration for MinerU VL text extraction.
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
utils
¶
MinerU VL utilities for document extraction.
Contains data structures, parsing, prompts, and post-processing functions for MinerU VL document extraction pipeline.
This file contains code adapted from mineru-vl-utils:
https://github.com/opendatalab/mineru-vl-utils
https://pypi.org/project/mineru-vl-utils/
The original mineru-vl-utils is licensed under AGPL-3.0: Copyright (c) OpenDataLab, https://github.com/opendatalab/mineru-vl-utils/blob/main/LICENSE.md
Adapted components
- BlockType enum (from structs.py)
- ContentBlock data structure (from structs.py)
- OTSL to HTML table conversion (from post_process/otsl2html.py)
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
convert_bbox
¶
Convert bbox from model output (0-1000) to normalized format (0-1).
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
parse_angle
¶
Parse rotation angle from model output tail string.
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
get_rgb_image
¶
Convert image to RGB mode.
prepare_for_layout
¶
prepare_for_layout(
image: Image,
layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image
Prepare image for layout detection.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
resize_by_need
¶
Resize image if needed based on aspect ratio constraints.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
prepare_for_extract
¶
prepare_for_extract(
image: Image,
blocks: List[ContentBlock],
prompts: Dict[str, str] = None,
sampling_params: Dict[str, SamplingParams] = None,
skip_types: set = None,
) -> Tuple[
List[Image.Image],
List[str],
List[SamplingParams],
List[int],
]
Prepare cropped images for content extraction.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
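A sketch of how this fits the two-step pipeline, assuming blocks is the List[ContentBlock] produced by parse_layout_output; the unpacking names are illustrative, and the final element is a List[int] per the signature:
crops, prompts, sampling, block_indices = prepare_for_extract(image, blocks)
print(f"{len(crops)} regions cropped for content recognition")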
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
simple_post_process
¶
Simple post-processing: convert OTSL tables to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
vllm
¶
VLLM backend configuration for MinerU VL text extraction.
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with:
- PagedAttention for efficient KV cache
- Continuous batching
- Optimized CUDA kernels
Example
models
¶
Pydantic models for text extraction outputs.
Defines output types and format enums for text extraction.
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics:
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
Example
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| bbox | Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024) |
| category | Layout category (e.g., "Text", "Title", "Table", "Formula") |
| text | Extracted text content (None for pictures) |
| confidence | Detection confidence score (optional) |
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with:
- Layout detection (11 categories)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown/LaTeX/HTML)
- Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
Example
nanonets
¶
Nanonets OCR2-3B backend configurations and extractor for text extraction.
Available backends
- NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
- NanonetsTextVLLMConfig: VLLM high-throughput backend
- NanonetsTextMLXConfig: MLX backend for Apple Silicon
Example
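A sketch selecting the VLLM backend for batch work (assuming the config defaults are sufficient):
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextVLLMConfig
extractor = NanonetsTextExtractor(backend=NanonetsTextVLLMConfig())
results = extractor.batch_extract(images)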
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: NanonetsTextPyTorchConfig (PyTorch/HuggingFace backend), NanonetsTextVLLMConfig (VLLM high-throughput backend), NanonetsTextMLXConfig (MLX backend for Apple Silicon). |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Accepted for API compatibility; does not change the output structure. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
extractor
¶
Nanonets OCR2-3B text extractor.
A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.
Supports PyTorch and VLLM backends.
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: NanonetsTextPyTorchConfig (PyTorch/HuggingFace backend), NanonetsTextVLLMConfig (VLLM high-throughput backend), NanonetsTextMLXConfig (MLX backend for Apple Silicon). |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Accepted for API compatibility; does not change the output structure. TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
mlx
¶
MLX backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
vllm
¶
VLLM backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
qwen
¶
Qwen3-VL backend configurations and extractor for text extraction.
Available backends
- QwenTextPyTorchConfig: PyTorch/HuggingFace backend
- QwenTextVLLMConfig: VLLM high-throughput backend
- QwenTextMLXConfig: MLX backend for Apple Silicon
- QwenTextAPIConfig: API backend (OpenRouter, etc.)
Example
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: QwenTextPyTorchConfig (PyTorch/HuggingFace backend), QwenTextVLLMConfig (VLLM high-throughput backend), QwenTextMLXConfig (MLX backend for Apple Silicon), QwenTextAPIConfig (API backend: OpenRouter, etc.). |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML with div elements) or "markdown" (Markdown format). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
api
¶
API backend configuration for Qwen3-VL text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
extractor
¶
Qwen3-VL text extractor.
A Vision-Language Model for extracting text from document images as structured HTML or Markdown.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: PIL.Image.Image object, np.ndarray (HWC format, RGB), or str/Path to an image file. TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML with div elements) or "markdown" (Markdown format). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
mlx
¶
MLX backend configuration for Qwen3-VL text extraction.
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL text extraction.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
vllm
¶
VLLM backend configuration for Qwen3-VL text extraction.
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
vlm
¶
VLM text extractor.
A provider-agnostic Vision-Language Model text extractor using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)
# With custom prompt
result = extractor.extract("document.png", prompt="Extract only table data as markdown")
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
# Default extraction
result = extractor.extract("document.png", output_format="markdown")
# Custom prompt
result = extractor.extract(
"document.png",
prompt="Extract only the table data as markdown",
)
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | VLM API configuration with model and provider details. TYPE: VLMAPIConfig |
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
prompt: Optional[str] = None,
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format ("html" or "markdown"). TYPE: Literal["html", "markdown"] DEFAULT: "markdown" |
| prompt | Custom prompt. If None, uses a task-specific default prompt. TYPE: Optional[str] DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content. |