Overview¶
Text Extraction Module.
Provides extractors for converting document images to structured text formats (HTML, Markdown, JSON). Uses Vision-Language Models for accurate text extraction with formatting preservation and optional layout detection.
Available Extractors
- QwenTextExtractor: Qwen3-VL based extractor (multi-backend)
- DotsOCRTextExtractor: Dots OCR with layout-aware extraction (PyTorch/VLLM/API)
- NanonetsTextExtractor: Nanonets OCR2-3B for text extraction (PyTorch/VLLM)
- GraniteDoclingTextExtractor: IBM Granite Docling for document conversion (multi-backend)
- MinerUVLTextExtractor: MinerU VL 1.2B with layout-aware two-step extraction (multi-backend)
- LightOnTextExtractor: LightOn OCR for document text extraction (multi-backend)
- DeepSeekOCRTextExtractor: DeepSeek-OCR-2 ~3B high-accuracy OCR (PyTorch/VLLM/MLX/API)
- GLMOCRTextExtractor: GLM-OCR 0.9B OCR specialist, #1 OmniDocBench (PyTorch/VLLM/API/MLX)
Example

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
```
BaseTextExtractor
¶
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
extract
abstractmethod
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from an image.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: a PIL.Image.Image object, an np.ndarray (HWC format, RGB), or a str/Path to an image file |
| output_format | Desired output format: "html" (structured HTML) or "markdown" (Markdown) |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If image format or output_format is not supported |
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
¶
```python
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]
```

Extract text from multiple images.

The default implementation loops over extract(). Subclasses can override for optimized batching (e.g., VLLM).

| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format |
| output_format | Desired output format |
| progress_callback | Optional function(current, total) for progress reporting |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput in the same order as the input |
Examples:

```python
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
```
Source code in omnidocs/tasks/text_extraction/base.py
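The default batch loop can be pictured in isolation; the sketch below mirrors the documented behavior (loop over the items, then call progress_callback(current, total)) with a stand-in process function rather than a real extractor:

```python
from typing import Callable, List, Optional

def batch_loop(
    items: List[str],
    process: Callable[[str], str],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[str]:
    """Mirror of the documented default: process each item, then report (current, total)."""
    results = []
    total = len(items)
    for i, item in enumerate(items, start=1):
        results.append(process(item))
        if progress_callback is not None:
            progress_callback(i, total)
    return results

seen = []
out = batch_loop(["p1", "p2", "p3"], str.upper, lambda cur, tot: seen.append((cur, tot)))
print(out)   # ['P1', 'P2', 'P3']
print(seen)  # [(1, 3), (2, 3), (3, 3)]
```

Subclasses that batch at the backend level (e.g., VLLM) replace the loop but keep the same callback contract.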
extract_document
¶
```python
extract_document(
    document: Document,
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]
```

Extract text from all pages of a document.

| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance |
| output_format | Desired output format |
| progress_callback | Optional function(current, total) for progress reporting |

| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput, one per page |
Examples:

```python
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
```
Source code in omnidocs/tasks/text_extraction/base.py
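Implementing the interface amounts to subclassing and filling in extract(); the sketch below mirrors the shape of that contract with stand-in types (StubTextOutput, EchoExtractor, and their fields are illustrative, not part of omnidocs):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class StubTextOutput:
    """Stand-in for the library's TextOutput; fields are assumptions."""
    content: str
    format: str

class StubBaseTextExtractor(ABC):
    @abstractmethod
    def extract(self, image, output_format: str = "markdown") -> StubTextOutput: ...

    def batch_extract(self, images, output_format: str = "markdown"):
        # Default implementation: loop over extract(), as documented above
        return [self.extract(img, output_format) for img in images]

class EchoExtractor(StubBaseTextExtractor):
    """Trivial extractor: pretends the image path is the extracted text."""
    def extract(self, image, output_format: str = "markdown") -> StubTextOutput:
        if output_format not in ("html", "markdown"):
            raise ValueError(f"unsupported output_format: {output_format}")
        return StubTextOutput(content=str(image), format=output_format)

results = EchoExtractor().batch_extract(["page1.png", "page2.png"])
print([r.content for r in results])  # ['page1.png', 'page2.png']
```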
DeepSeekOCRTextExtractor
¶
Bases: BaseTextExtractor
DeepSeek-OCR / DeepSeek-OCR-2 text extractor.
High-accuracy OCR model that reads complex real-world documents (PDFs, forms, tables, handwritten/noisy text) and outputs clean Markdown. Uses a hybrid vision encoder + causal text decoder — output is structured by the model itself rather than post-processed from bounding boxes.
DeepSeek-OCR-2 ("Visual Causal Flow") is the default — released Jan 2026.
Supports PyTorch, VLLM (recommended), MLX, and API backends.
Example

```python
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import (
    DeepSeekOCRTextPyTorchConfig,
    DeepSeekOCRTextVLLMConfig,
)

# VLLM: ~2500 tokens/s on A100 (recommended for production)
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()
)
result = extractor.extract(image)
print(result.content)

# PyTorch with crop_mode for dense pages
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextPyTorchConfig(crop_mode=True)
)
```
Initialize DeepSeek-OCR extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend config. One of: DeepSeekOCRTextPyTorchConfig (local GPU), DeepSeekOCRTextVLLMConfig (recommended, high-throughput), DeepSeekOCRTextMLXConfig (Apple Silicon), DeepSeekOCRTextAPIConfig (Novita AI) |
Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from a document image.

DeepSeek-OCR always outputs Markdown-structured text. The output_format parameter is accepted for API compatibility.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Accepted for API compatibility (default: "markdown") |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted Markdown content |
Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information including:

- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
    backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)

# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
```
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: DotsOCRPyTorchConfig (PyTorch/HuggingFace), DotsOCRVLLMConfig (VLLM high-throughput), DotsOCRAPIConfig (API, online VLLM server) |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal[
        "markdown", "html", "json"
    ] = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
    max_tokens: int = 8192,
) -> DotsOCRTextOutput
```

Extract text from an image using Dots OCR.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Output format ("markdown", "html", or "json") |
| include_layout | Include layout bounding boxes in output |
| custom_prompt | Override the default extraction prompt |
| max_tokens | Maximum tokens for generation |

| RETURNS | DESCRIPTION |
|---|---|
| DotsOCRTextOutput | DotsOCRTextOutput with extracted content and optional layout |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
GLMOCRTextExtractor
¶
Bases: BaseTextExtractor
GLM-OCR text extractor (zai-org/GLM-OCR, 0.9B, Feb 2026).
Purpose-built OCR model, #1 on OmniDocBench V1.5.
Faster and cheaper than GLM-V for pure document OCR tasks.
Example:

```python
from omnidocs.tasks.text_extraction import GLMOCRTextExtractor
from omnidocs.tasks.text_extraction.glmocr import GLMOCRPyTorchConfig

extractor = GLMOCRTextExtractor(backend=GLMOCRPyTorchConfig())
result = extractor.extract(image)
print(result.content)
```
Source code in omnidocs/tasks/text_extraction/glmocr/extractor.py
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example

```python
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)

config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
```
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API config) |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from an image using Granite Docling.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Output format ("markdown" or "html") |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
LightOnTextExtractor
¶
Bases: BaseTextExtractor
LightOn text extractor with multi-backend support.
LightOn OCR is optimized for document text extraction with multi-lingual capabilities.
Supports multiple backends:

- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example

```python
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig

# PyTorch backend
extractor = LightOnTextExtractor(
    backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
result = extractor.extract(image)
print(result.content)

# VLLM backend for high-throughput inference
from omnidocs.tasks.text_extraction.lighton import LightOnTextVLLMConfig

extractor = LightOnTextExtractor(
    backend=LightOnTextVLLMConfig(gpu_memory_utilization=0.85)
)
result = extractor.extract(image)
```
Initialize LightOn text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from an image.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Desired output format ("html" or "markdown") |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted text content |
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:

1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)

Supports multiple backends:

- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example

```python
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content)  # Combined text + tables + equations
print(result.blocks)   # List of ContentBlock objects
```
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text with layout-aware two-step extraction.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Output format ("html" or "markdown") |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
```python
extract_with_blocks(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
```

Extract text and return both TextOutput and ContentBlocks.

This method provides access to detailed block information, including bounding boxes and block types.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image |
| output_format | Output format |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[TextOutput, List[ContentBlock]] | Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
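Stitching the returned blocks back into one document in reading order can be sketched with a stand-in block type (the real ContentBlock fields may differ; the sort key here is the crude top-to-bottom, left-to-right heuristic, adequate for single-column pages):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    """Stand-in for ContentBlock; field names are assumptions."""
    bbox: Tuple[float, float, float, float]  # x1, y1, x2, y2
    kind: str                                # "text" | "table" | "equation"
    content: str

def merge_blocks(blocks: List[Block]) -> str:
    """Order by top edge, then left edge, and join into one document."""
    ordered = sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))
    return "\n\n".join(b.content for b in ordered)

page = [
    Block((50, 400, 500, 450), "text", "Body paragraph."),
    Block((50, 10, 500, 40), "text", "# Title"),
    Block((50, 100, 500, 300), "table", "<table>...</table>"),
]
print(merge_blocks(page))  # Title first, then table, then body
```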
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with:

- Layout detection (11 categories)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown/LaTeX/HTML)
- Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
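The per-category formatting above suggests a small dispatch table when post-processing elements; a sketch with stand-in renderer functions (all names here are illustrative, not omnidocs APIs):

```python
def render_markdown(text: str) -> str:
    return text  # Text/Title/Section-header arrive as Markdown already

def render_latex(text: str) -> str:
    return f"$${text}$$"  # wrap Formula content as display math

def render_html_table(text: str) -> str:
    return text  # Table content arrives as HTML already

RENDERERS = {
    "Text": render_markdown,
    "Title": render_markdown,
    "Section-header": render_markdown,
    "Formula": render_latex,
    "Table": render_html_table,
}

def render(category: str, text):
    if text is None:  # Pictures carry no text
        return ""
    return RENDERERS.get(category, render_markdown)(text)

print(render("Formula", "E = mc^2"))  # $$E = mc^2$$
print(render("Picture", None))        # prints an empty line
```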
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| bbox | Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024) |
| category | Layout category (e.g., "Text", "Title", "Table", "Formula") |
| text | Extracted text content (None for pictures) |
| confidence | Detection confidence score (optional) |
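Since bboxes live on a 0-1024 grid, mapping them onto the actual page means scaling by the image dimensions; a minimal sketch (plain arithmetic, no omnidocs calls):

```python
def denormalize_bbox(bbox, width, height, grid=1024):
    """Map a [x1, y1, x2, y2] box from the 0-grid space to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 * width / grid),
        round(y1 * height / grid),
        round(x2 * width / grid),
        round(y2 * height / grid),
    )

# A box covering the top-left quarter of a 2048x1024 page
print(denormalize_bbox([0, 0, 512, 512], width=2048, height=1024))  # (0, 0, 1024, 512)
```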
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics:
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
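A str-backed Enum of this shape compares equal to plain strings, which is what makes it convenient in function signatures; a minimal re-creation (member names follow the description above and are assumptions about the real class):

```python
from enum import Enum

class OutputFormat(str, Enum):
    HTML = "html"
    MARKDOWN = "markdown"
    JSON = "json"

# The str mixin means members compare equal to plain strings
print(OutputFormat.MARKDOWN == "markdown")        # True
print(OutputFormat("html") is OutputFormat.HTML)  # True
print([f.value for f in OutputFormat])            # ['html', 'markdown', 'json']
```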
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:

- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)

Supports PyTorch, VLLM, and MLX backends.
Example

```python
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig

# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
    backend=NanonetsTextPyTorchConfig()
)

# Extract text
result = extractor.extract(image)
print(result.content)
```
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: NanonetsTextPyTorchConfig (PyTorch/HuggingFace), NanonetsTextVLLMConfig (VLLM high-throughput), NanonetsTextMLXConfig (MLX, Apple Silicon) |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from an image.

Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: a PIL.Image.Image object, an np.ndarray (HWC format, RGB), or a str/Path to an image file |
| output_format | Accepted for API compatibility (default: "markdown") |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
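Because tables arrive inline as HTML inside otherwise-Markdown content, downstream code often wants to separate them; a sketch using a simple regex (assumes well-formed, non-nested &lt;table&gt; elements, and is not part of the omnidocs API):

```python
import re

TABLE_RE = re.compile(r"<table>.*?</table>", re.DOTALL)

def split_tables(content: str):
    """Return (content with tables replaced by a placeholder, list of HTML tables)."""
    tables = TABLE_RE.findall(content)
    remainder = TABLE_RE.sub("[TABLE]", content)
    return remainder, tables

doc = "# Report\n\nIntro text.\n\n<table><tr><td>1</td></tr></table>\n\nOutro."
remainder, tables = split_tables(doc)
print(tables)     # ['<table><tr><td>1</td></tr></table>']
print(remainder)  # '# Report\n\nIntro text.\n\n[TABLE]\n\nOutro.'
```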
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
```
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| backend | Backend configuration. One of: QwenTextPyTorchConfig (PyTorch/HuggingFace), QwenTextVLLMConfig (VLLM high-throughput), QwenTextMLXConfig (MLX, Apple Silicon), QwenTextAPIConfig (API, e.g. OpenRouter) |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```

Extract text from an image.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image: a PIL.Image.Image object, an np.ndarray (HWC format, RGB), or a str/Path to an image file |
| output_format | Desired output format: "html" (structured HTML with div elements) or "markdown" |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If model is not loaded |
| ValueError | If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example

```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)

# Default extraction
result = extractor.extract("document.png", output_format="markdown")

# Custom prompt
result = extractor.extract(
    "document.png",
    prompt="Extract only the table data as markdown",
)
```
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| config | VLM API configuration with model and provider details |
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
    prompt: Optional[str] = None,
) -> TextOutput
```

Extract text from an image.

| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image (PIL Image, numpy array, or file path) |
| output_format | Desired output format ("html" or "markdown") |
| prompt | Custom prompt. If None, uses a task-specific default prompt |

| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing extracted text content |
Source code in omnidocs/tasks/text_extraction/vlm.py
base
¶
Base class for text extractors.
Defines the abstract interface that all text extractors must implement.
deepseek
¶
DeepSeek-OCR backend configurations and extractor for text extraction.
Two generations of DeepSeek OCR models from deepseek-ai:
- DeepSeek-OCR (Oct 2024, arXiv:2510.18234) — v1, MIT license, 3B params, ~6.7 GB BF16
- DeepSeek-OCR-2 (Jan 2026, arXiv:2601.20552) — v2, Apache 2.0, 3B params, improved "Visual Causal Flow"
Both share the same inference interface (AutoModel + AutoTokenizer with model.infer()). The default model is DeepSeek-OCR-2 (latest).
Supported prompts

- "<|grounding|>Convert the document to markdown." (structured document conversion)
Available backends
- DeepSeekOCRTextPyTorchConfig: PyTorch/HuggingFace backend
- DeepSeekOCRTextVLLMConfig: VLLM high-throughput backend (recommended, ~2500 tok/s on A100)
- DeepSeekOCRTextMLXConfig: MLX backend for Apple Silicon
- DeepSeekOCRTextAPIConfig: API backend (Novita AI)
Hugging Face models

- deepseek-ai/DeepSeek-OCR-2 (latest, Apache 2.0)
- deepseek-ai/DeepSeek-OCR (v1, MIT)
GitHub: https://github.com/deepseek-ai/DeepSeek-OCR-2
DeepSeekOCRTextAPIConfig
¶
Bases: BaseModel
API backend configuration for DeepSeek-OCR / DeepSeek-OCR-2 text extraction.
Uses litellm for provider-agnostic API access. Primary provider: Novita AI (official hosting).
DeepSeekOCRTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for DeepSeek-OCR text extraction.
Apple Silicon only (M1/M2/M3+). Do NOT deploy to Modal/cloud. Uses standard mlx-vlm generate interface.
Note: MLX variants are currently available only for DeepSeek-OCR v1. Check mlx-community for DeepSeek-OCR-2 variants as they are published.
DeepSeekOCRTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for DeepSeek-OCR / DeepSeek-OCR-2.
Uses AutoModel + AutoTokenizer. Inference via model.infer() — the model handles tiling and multi-page PDF stitching internally.
Models
- deepseek-ai/DeepSeek-OCR-2 (default, latest — Jan 2026, Apache 2.0)
- deepseek-ai/DeepSeek-OCR (v1 — Oct 2024, MIT)
GPU requirements: L4 / A100 (≥16 GB VRAM recommended).
DeepSeekOCRTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for DeepSeek-OCR / DeepSeek-OCR-2 text extraction.
DeepSeek-OCR has official upstream VLLM support (~2500 tokens/s on A100). Recommended for high-throughput batch document processing in production. Requires: vllm>=0.11.1 (or nightly for OCR-2), torch, transformers==4.46.3
Note: Default model is DeepSeek-OCR v1 (not v2) because DeepSeek-OCR-2 VLLM support requires a vllm nightly build. Use PyTorch backend for DeepSeek-OCR-2 until official vllm support is released.
api
¶
API backend configuration for DeepSeek-OCR text extraction.
Note: DeepSeek-OCR-2 API availability may vary by provider — check novita.ai for updated model slugs as providers onboard the new version.
extractor
¶
DeepSeek-OCR / DeepSeek-OCR-2 text extractor.
- DeepSeek-OCR (Oct 2024, arXiv:2510.18234) — v1, MIT, 3B params
- DeepSeek-OCR-2 (Jan 2026, arXiv:2601.20552) — v2, Apache 2.0, 3B params, "Visual Causal Flow"
Supported backends: PyTorch, VLLM (official upstream support), MLX, API.
GitHub
- v1: https://github.com/deepseek-ai/DeepSeek-OCR
- v2: https://github.com/deepseek-ai/DeepSeek-OCR-2
Example

```python
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()  # VLLM backend; see the VLLM config note on the default model
)
result = extractor.extract(image)
print(result.content)
```
mlx
¶
MLX backend configuration for DeepSeek-OCR text extraction.
Available MLX quantized variants (mlx-community):

- mlx-community/DeepSeek-OCR-4bit (4-bit, recommended)
- mlx-community/DeepSeek-OCR-8bit (8-bit, higher fidelity)
Note: DeepSeek-OCR-2 MLX variants may not yet be available — check https://huggingface.co/mlx-community for latest uploads. Fall back to DeepSeek-OCR v1 4bit/8bit for Apple Silicon.
pytorch
¶
PyTorch/HuggingFace backend configuration for DeepSeek-OCR text extraction.
Both DeepSeek-OCR and DeepSeek-OCR-2 use:
- AutoModel (not AutoModelForCausalLM)
- AutoTokenizer (not AutoProcessor)
- model.infer(tokenizer, prompt=..., image_file=...) for inference
Requirements (from official README):

- python==3.12.9, CUDA==11.8
- torch==2.6.0, transformers==4.46.3, tokenizers==0.20.3
- einops, addict, easydict
- flash-attn==2.7.3 (optional, --no-build-isolation)
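The pinned requirements translate to an install sequence along these lines (versions copied from the list above; torch CUDA wheels and the flash-attn build depend on the local toolkit, so treat this as a sketch rather than a verified recipe):

```shell
# Pinned versions from the DeepSeek-OCR README
pip install torch==2.6.0 transformers==4.46.3 tokenizers==0.20.3
pip install einops addict easydict
# Optional: flash attention, compiled against the local CUDA toolkit
pip install flash-attn==2.7.3 --no-build-isolation
```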
DeepSeekOCRTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for DeepSeek-OCR / DeepSeek-OCR-2.
Uses AutoModel + AutoTokenizer. Inference via model.infer() — the model handles tiling and multi-page PDF stitching internally.
Models:
- deepseek-ai/DeepSeek-OCR-2 (default, latest — Jan 2026, Apache 2.0)
- deepseek-ai/DeepSeek-OCR (v1 — Oct 2025, MIT)
GPU requirements: L4 / A100 (≥16 GB VRAM recommended).
Example
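A minimal configuration sketch. The `model` field name mirrors the patterns used by the other backend configs in this module (e.g. the Qwen and Dots OCR examples) and is an assumption; check the actual config schema.

```python
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextPyTorchConfig

# Default model is deepseek-ai/DeepSeek-OCR-2; pass the v1 repo id to pin v1.
config = DeepSeekOCRTextPyTorchConfig(model="deepseek-ai/DeepSeek-OCR-2")
extractor = DeepSeekOCRTextExtractor(backend=config)
result = extractor.extract("page.png")
print(result.content)
```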
vllm
¶
VLLM backend configuration for DeepSeek-OCR text extraction.
DeepSeek-OCR has official upstream VLLM support (announced Oct 23, 2025). It achieves ~2500 tokens/s on an A100-40G, making this the recommended backend for production.
DeepSeek-OCR-2 VLLM support: refer to https://github.com/deepseek-ai/DeepSeek-OCR-2 for the latest vLLM setup instructions (may require nightly build).
DeepSeekOCRTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for DeepSeek-OCR / DeepSeek-OCR-2 text extraction.
DeepSeek-OCR has official upstream VLLM support (~2500 tokens/s on A100). Recommended for high-throughput batch document processing in production. Requires: vllm>=0.11.1 (or nightly for OCR-2), torch, transformers==4.46.3
Note: Default model is DeepSeek-OCR v1 (not v2) because DeepSeek-OCR-2 VLLM support requires a vllm nightly build. Use PyTorch backend for DeepSeek-OCR-2 until official vllm support is released.
Example
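A configuration sketch. The `gpu_memory_utilization` field mirrors the other VLLM configs in this module and is an assumption; consult the actual schema for exact field names.

```python
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

# Default model is DeepSeek-OCR v1 (OCR-2 requires a vllm nightly build).
config = DeepSeekOCRTextVLLMConfig(gpu_memory_utilization=0.85)
```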
dotsocr
¶
Dots OCR text extractor and backend configurations.
Available backends:
- PyTorch: DotsOCRPyTorchConfig (local GPU inference)
- VLLM: DotsOCRVLLMConfig (offline batch inference)
- API: DotsOCRAPIConfig (online VLLM server via OpenAI-compatible API)
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
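A configuration sketch for connecting to a deployed VLLM server. The field names (`api_base`, `api_key`) mirror the GLMOCRAPIConfig example elsewhere in this reference and are assumptions here.

```python
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRAPIConfig

# Points at a VLLM server exposing the OpenAI-compatible API
# (e.g. the modal_dotsocr_vllm_online.py deployment).
config = DotsOCRAPIConfig(
    model="rednote-hilab/dots.ocr",
    api_base="http://localhost:8000/v1",
    api_key="token-abc",
)
```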
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information, including:
- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of:
- DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
- DotsOCRVLLMConfig: VLLM high-throughput backend
- DotsOCRAPIConfig: API backend (online VLLM server)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown", "html", or "json")
TYPE:
|
include_layout
|
Include layout bounding boxes in output
TYPE:
|
custom_prompt
|
Override default extraction prompt
TYPE:
|
max_tokens
|
Maximum tokens for generation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DotsOCRTextOutput
|
DotsOCRTextOutput with extracted content and optional layout |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
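A usage sketch following the extractor example documented above; parameters shown are the ones this reference already describes.

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig

config = DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
extractor = DotsOCRTextExtractor(backend=config)

# include_layout=True returns the 11-category layout elements with bboxes
result = extractor.extract("page.png", include_layout=True)
for elem in result.layout:
    print(elem.category, elem.bbox)
```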
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations such as:
- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels
Example
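A configuration sketch. The `gpu_memory_utilization` field mirrors the other VLLM configs in this module and is an assumption; check the actual schema.

```python
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRVLLMConfig

config = DotsOCRVLLMConfig(
    model="rednote-hilab/dots.ocr",
    gpu_memory_utilization=0.85,
)
```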
api
¶
API backend configuration for Dots OCR (VLLM online server).
DotsOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for Dots OCR.
This config is for accessing a deployed VLLM server via OpenAI-compatible API. Typically used with modal_dotsocr_vllm_online.py deployment.
Example
extractor
¶
Dots OCR text extractor with layout-aware extraction.
A Vision-Language Model optimized for document OCR with structured output containing layout information, bounding boxes, and multi-format text.
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
result = extractor.extract(image, include_layout=True)
print(result.content)
for elem in result.layout:
print(f"{elem.category}: {elem.bbox}")
DotsOCRTextExtractor
¶
Bases: BaseTextExtractor
Dots OCR Vision-Language Model text extractor with layout detection.
Extracts text from document images with layout information, including:
- 11 layout categories (Caption, Footnote, Formula, List-item, etc.)
- Bounding boxes (normalized to 0-1024)
- Multi-format text (Markdown, LaTeX, HTML)
- Reading order preservation
Supports PyTorch, VLLM, and API backends.
Example
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
# Initialize with PyTorch backend
extractor = DotsOCRTextExtractor(
backend=DotsOCRPyTorchConfig(model="rednote-hilab/dots.ocr")
)
# Extract with layout
result = extractor.extract(image, include_layout=True)
print(f"Found {result.num_layout_elements} elements")
print(result.content)
Initialize Dots OCR text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of:
- DotsOCRPyTorchConfig: PyTorch/HuggingFace backend
- DotsOCRVLLMConfig: VLLM high-throughput backend
- DotsOCRAPIConfig: API backend (online VLLM server)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal[
"markdown", "html", "json"
] = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
max_tokens: int = 8192,
) -> DotsOCRTextOutput
Extract text from image using Dots OCR.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown", "html", or "json")
TYPE:
|
include_layout
|
Include layout bounding boxes in output
TYPE:
|
custom_prompt
|
Override default extraction prompt
TYPE:
|
max_tokens
|
Maximum tokens for generation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DotsOCRTextOutput
|
DotsOCRTextOutput with extracted content and optional layout |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded or inference fails |
Source code in omnidocs/tasks/text_extraction/dotsocr/extractor.py
pytorch
¶
PyTorch backend configuration for Dots OCR.
DotsOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Dots OCR.
Dots OCR provides layout-aware text extraction with 11 predefined layout categories (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
Example
vllm
¶
VLLM backend configuration for Dots OCR.
DotsOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Dots OCR.
VLLM provides high-throughput inference with optimizations such as:
- PagedAttention for efficient KV cache management
- Continuous batching for higher throughput
- Optimized CUDA kernels
Example
glmocr
¶
GLM-OCR backend configurations and extractor for text extraction.
GLM-OCR from zai-org (Feb 2026) — 0.9B OCR-specialist model. Architecture: CogViT visual encoder (0.4B) + GLM decoder (0.5B). Scores #1 on OmniDocBench V1.5 (94.62), beating models 10x its size.
Unlike GLM-V (which is a general VLM), GLM-OCR is purpose-built for document OCR. Uses AutoModelForImageTextToText + AutoProcessor (NOT Glm4vForConditionalGeneration). Requires transformers>=5.3.0.
Available backends
- GLMOCRPyTorchConfig: PyTorch/HuggingFace backend
- GLMOCRVLLMConfig: VLLM high-throughput backend (with MTP speculative decoding)
- GLMOCRAPIConfig: API backend
- GLMOCRMLXConfig: MLX backend (Apple Silicon)
HuggingFace: zai-org/GLM-OCR License: Apache 2.0
GLMOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for GLM-OCR.
Primary provider: ZhipuAI / BigModel (official) — get key at open.bigmodel.cn.
Example:
python
# Self-hosted vLLM server
config = GLMOCRAPIConfig(
model="zai-org/GLM-OCR",
api_base="http://localhost:8000/v1",
api_key="token-abc",
)
GLMOCRTextExtractor
¶
Bases: BaseTextExtractor
GLM-OCR text extractor (zai-org/GLM-OCR, 0.9B, Feb 2026).
Purpose-built OCR model, #1 on OmniDocBench V1.5.
Faster and cheaper than GLM-V for pure document OCR tasks.
Example:
python
from omnidocs.tasks.text_extraction import GLMOCRTextExtractor
from omnidocs.tasks.text_extraction.glmocr import GLMOCRPyTorchConfig
extractor = GLMOCRTextExtractor(backend=GLMOCRPyTorchConfig())
result = extractor.extract(image)
print(result.content)
Source code in omnidocs/tasks/text_extraction/glmocr/extractor.py
GLMOCRMLXConfig
¶
Bases: BaseModel
MLX backend configuration for GLM-OCR.
Uses mlx-vlm for Apple Silicon native inference.
GLM-OCR at 0.9B runs comfortably on any M-series Mac with 8GB+ unified memory.
Requires: mlx, mlx-vlm>=0.3.11
Note: Only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
Available models:
mlx-community/GLM-OCR-bf16 (default — full precision, 2.21 GB)
mlx-community/GLM-OCR-6bit (quantized, smaller)
Example:
python
config = GLMOCRMLXConfig() # bf16, default
config = GLMOCRMLXConfig(model="mlx-community/GLM-OCR-6bit") # quantized
GLMOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for GLM-OCR.
GLM-OCR uses AutoModelForImageTextToText + AutoProcessor.
Requires transformers>=5.3.0.
Example:
python
config = GLMOCRPyTorchConfig() # zai-org/GLM-OCR, default
config = GLMOCRPyTorchConfig(device="mps") # Apple Silicon
GLMOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for GLM-OCR.
GLM-OCR supports VLLM with MTP (Multi-Token Prediction) speculative decoding
for significantly higher throughput. Requires vllm>=0.17.0 and transformers>=5.3.0.
Example:
python
config = GLMOCRVLLMConfig(gpu_memory_utilization=0.85)
api
¶
API backend configuration for GLM-OCR text extraction.
GLMOCRAPIConfig
¶
Bases: BaseModel
API backend configuration for GLM-OCR.
Primary provider: ZhipuAI / BigModel (official) — get key at open.bigmodel.cn.
Example:
python
# Self-hosted vLLM server
config = GLMOCRAPIConfig(
model="zai-org/GLM-OCR",
api_base="http://localhost:8000/v1",
api_key="token-abc",
)
extractor
¶
GLM-OCR text extractor.
GLM-OCR from zai-org (Feb 2026) — 0.9B OCR-specialist model. Architecture: CogViT visual encoder (0.4B) + GLM decoder (0.5B). Scores #1 on OmniDocBench V1.5 (94.62).
Key differences from GLM-V
- Uses AutoModelForImageTextToText (NOT Glm4vForConditionalGeneration)
- Uses AutoProcessor with direct image input (no chat template URL trick)
- Much smaller (0.9B vs 9B) — faster, lower VRAM
- Requires transformers>=5.3.0
- No special control tokens, no <|begin_of_box|> wrapper — clean output
GLMOCRTextExtractor
¶
Bases: BaseTextExtractor
GLM-OCR text extractor (zai-org/GLM-OCR, 0.9B, Feb 2026).
Purpose-built OCR model, #1 on OmniDocBench V1.5.
Faster and cheaper than GLM-V for pure document OCR tasks.
Example:
python
from omnidocs.tasks.text_extraction import GLMOCRTextExtractor
from omnidocs.tasks.text_extraction.glmocr import GLMOCRPyTorchConfig
extractor = GLMOCRTextExtractor(backend=GLMOCRPyTorchConfig())
result = extractor.extract(image)
print(result.content)
Source code in omnidocs/tasks/text_extraction/glmocr/extractor.py
mlx
¶
MLX backend configuration for GLM-OCR text extraction.
GLMOCRMLXConfig
¶
Bases: BaseModel
MLX backend configuration for GLM-OCR.
Uses mlx-vlm for Apple Silicon native inference.
GLM-OCR at 0.9B runs comfortably on any M-series Mac with 8GB+ unified memory.
Requires: mlx, mlx-vlm>=0.3.11
Note: Only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
Available models:
mlx-community/GLM-OCR-bf16 (default — full precision, 2.21 GB)
mlx-community/GLM-OCR-6bit (quantized, smaller)
Example:
python
config = GLMOCRMLXConfig() # bf16, default
config = GLMOCRMLXConfig(model="mlx-community/GLM-OCR-6bit") # quantized
pytorch
¶
PyTorch backend configuration for GLM-OCR text extraction.
GLMOCRPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for GLM-OCR.
GLM-OCR uses AutoModelForImageTextToText + AutoProcessor.
Requires transformers>=5.3.0.
Example:
python
config = GLMOCRPyTorchConfig() # zai-org/GLM-OCR, default
config = GLMOCRPyTorchConfig(device="mps") # Apple Silicon
vllm
¶
VLLM backend configuration for GLM-OCR text extraction.
GLMOCRVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for GLM-OCR.
GLM-OCR supports VLLM with MTP (Multi-Token Prediction) speculative decoding
for significantly higher throughput. Requires vllm>=0.17.0 and transformers>=5.3.0.
Example:
python
config = GLMOCRVLLMConfig(gpu_memory_utilization=0.85)
granitedocling
¶
Granite Docling text extraction with multi-backend support.
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
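A configuration sketch for the litellm-based API backend. The model string and field names here are illustrative assumptions (consult the config schema and your provider's model catalog for exact values).

```python
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextAPIConfig,
    GraniteDoclingTextExtractor,
)

# litellm routes by model-string prefix; the API key can also come
# from the provider's environment variable instead of being passed here.
config = GraniteDoclingTextAPIConfig(
    model="openrouter/ibm-granite/granite-docling-258m",
    api_key="sk-or-...",
)
extractor = GraniteDoclingTextExtractor(backend=config)
```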
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API config)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown" or "html")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
api
¶
API backend configuration for Granite Docling text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
GraniteDoclingTextAPIConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction via API.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
extractor
¶
Granite Docling text extractor with multi-backend support.
GraniteDoclingTextExtractor
¶
Bases: BaseTextExtractor
Granite Docling text extractor supporting PyTorch, VLLM, MLX, and API backends.
Granite Docling is IBM's compact vision-language model optimized for document conversion. It outputs DocTags format which is converted to Markdown using the docling_core library.
Example
from omnidocs.tasks.text_extraction.granitedocling import (
    GraniteDoclingTextExtractor,
    GraniteDoclingTextPyTorchConfig,
)
config = GraniteDoclingTextPyTorchConfig(device="cuda")
extractor = GraniteDoclingTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")
print(result.content)
Initialize Granite Docling extractor with backend configuration.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API config)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image using Granite Docling.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ("markdown" or "html")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content |
Source code in omnidocs/tasks/text_extraction/granitedocling/extractor.py
mlx
¶
MLX backend configuration for Granite Docling text extraction (Apple Silicon).
GraniteDoclingTextMLXConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with MLX backend.
This backend is optimized for Apple Silicon Macs (M1/M2/M3/M4). Uses the MLX-optimized model variant.
pytorch
¶
PyTorch backend configuration for Granite Docling text extraction.
GraniteDoclingTextPyTorchConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with PyTorch backend.
vllm
¶
VLLM backend configuration for Granite Docling text extraction.
GraniteDoclingTextVLLMConfig
¶
Bases: BaseModel
Configuration for Granite Docling text extraction with VLLM backend.
IMPORTANT: This config uses revision="untied" by default, which is required for VLLM compatibility with Granite Docling's tied weights.
lighton
¶
LightOn text extraction module.
LightOn OCR is optimized for document text extraction with multi-lingual support. Supports multiple backends: PyTorch, VLLM, MLX, and API.
Example
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig
# Initialize with PyTorch backend
extractor = LightOnTextExtractor(
backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
# Extract text
result = extractor.extract(image)
print(result.content)
print(f"Format: {result.format}")
LightOnTextExtractor
¶
Bases: BaseTextExtractor
LightOn text extractor with multi-backend support.
LightOn OCR is optimized for document text extraction with multi-lingual capabilities.
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig
# PyTorch backend
extractor = LightOnTextExtractor(
backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
result = extractor.extract(image)
print(result.content)
# VLLM backend for high-throughput inference
from omnidocs.tasks.text_extraction.lighton import LightOnTextVLLMConfig
extractor = LightOnTextExtractor(
backend=LightOnTextVLLMConfig(gpu_memory_utilization=0.85)
)
result = extractor.extract(image)
Initialize LightOn text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Desired output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted text content |
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
LightOnTextMLXConfig
¶
Bases: BaseModel
MLX backend config for LightOn text extraction.
Uses MLX for efficient inference on Apple Silicon.
Example
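A configuration sketch for Apple Silicon. Field defaults are assumed; check the config schema for the actual model field.

```python
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextMLXConfig

# MLX runs natively on M-series Macs; the default model is used
# when none is specified (assumption).
extractor = LightOnTextExtractor(backend=LightOnTextMLXConfig())
```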
LightOnTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for LightOn text extraction.
Uses HuggingFace Transformers with LightOnOcrForConditionalGeneration.
Example
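A usage sketch following the extractor example documented above, with the same `device` and `torch_dtype` parameters.

```python
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig

extractor = LightOnTextExtractor(
    backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
result = extractor.extract("page.png")
print(result.content)
```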
LightOnTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for LightOn text extraction.
Uses VLLM for high-throughput GPU inference with:
- PagedAttention for efficient KV cache
- Continuous batching
- Optimized CUDA kernels
Example
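A usage sketch following the VLLM extractor example documented above.

```python
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextVLLMConfig

# High-throughput batch inference on GPU
extractor = LightOnTextExtractor(
    backend=LightOnTextVLLMConfig(gpu_memory_utilization=0.85)
)
```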
extractor
¶
LightOn text extractor with multi-backend support.
LightOn OCR is optimized for document text extraction and recognition. Supports multiple backends: PyTorch, VLLM, and MLX.
LightOnTextExtractor
¶
Bases: BaseTextExtractor
LightOn text extractor with multi-backend support.
LightOn OCR is optimized for document text extraction with multi-lingual capabilities.
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig
# PyTorch backend
extractor = LightOnTextExtractor(
backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
result = extractor.extract(image)
print(result.content)
# VLLM backend for high-throughput inference
from omnidocs.tasks.text_extraction.lighton import LightOnTextVLLMConfig
extractor = LightOnTextExtractor(
backend=LightOnTextVLLMConfig(gpu_memory_utilization=0.85)
)
result = extractor.extract(image)
Initialize LightOn text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Desired output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted text content |
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
mlx
¶
MLX backend configuration for LightOn text extraction.
LightOnTextMLXConfig
¶
Bases: BaseModel
MLX backend config for LightOn text extraction.
Uses MLX for efficient inference on Apple Silicon.
Example
pytorch
¶
PyTorch/HuggingFace backend configuration for LightOn text extraction.
LightOnTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for LightOn text extraction.
Uses HuggingFace Transformers with LightOnOcrForConditionalGeneration.
Example
mineruvl
¶
MinerU VL text extraction module.
MinerU VL is a vision-language model for document layout detection and text/table/equation recognition. It performs two-step extraction:
1. Layout Detection: detect regions with types (text, table, equation, etc.)
2. Content Recognition: extract content from each detected region
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
# Initialize with PyTorch backend
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
# Extract text
result = extractor.extract(image)
print(result.content)
# Extract with detailed blocks
result, blocks = extractor.extract_with_blocks(image)
for block in blocks:
print(f"{block.type}: {block.content[:50]}...")
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
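A configuration sketch for the API backend. Field names (`api_base`, `api_key`) mirror the other API configs in this reference and are assumptions here.

```python
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextAPIConfig

# Connects to a deployed VLLM server with OpenAI-compatible API
config = MinerUVLTextAPIConfig(
    api_base="http://localhost:8000/v1",
    api_key="token-abc",
)
```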
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:
1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image
TYPE:
|
output_format
|
Output format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[TextOutput, List[ContentBlock]]
|
Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
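A configuration sketch for Apple Silicon. Defaults are assumed; check the schema for the actual MLX model field.

```python
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextMLXConfig

# MLX-VLM inference on M1-M4; uses the default MLX model variant (assumption)
extractor = MinerUVLTextExtractor(backend=MinerUVLTextMLXConfig())
```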
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
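A usage sketch following the extractor example documented above, including block-level access via extract_with_blocks.

```python
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)

# Two-step extraction with detailed block info (types + bboxes)
result, blocks = extractor.extract_with_blocks("page.png")
for block in blocks:
    print(block.type, block.bbox)
```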
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
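Since ContentBlock coordinates are normalized to [0, 1], to_absolute reduces to scaling by the image dimensions. A standalone sketch of the arithmetic (the function name is illustrative, not the actual method):

```python
def bbox_to_absolute(bbox, width, height):
    """Scale a normalized (x0, y0, x1, y1) bbox to pixel coordinates."""
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * width),   # left edge in pixels
        round(y0 * height),  # top edge in pixels
        round(x1 * width),   # right edge in pixels
        round(y1 * height),  # bottom edge in pixels
    )
```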
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with:
- PagedAttention for efficient KV cache
- Continuous batching
- Optimized CUDA kernels
Example
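A configuration sketch. The `gpu_memory_utilization` field mirrors the other VLLM configs in this reference and is an assumption.

```python
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextVLLMConfig

config = MinerUVLTextVLLMConfig(gpu_memory_utilization=0.85)
```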
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
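To illustrate the idea behind the OTSL-to-HTML conversion (not the actual implementation, which also handles span and header tokens such as lcel/ucel), a minimal converter for the basic filled-cell, empty-cell, and new-line tokens might look like:

```python
def simple_otsl_to_html(tokens, cells):
    """Convert a flat OTSL token stream into an HTML table.

    tokens: structure tokens ("fcel" = filled cell, "ecel" = empty cell,
            "nl" = end of row); cells: the text for each "fcel", in order.
    """
    rows, row, texts = [], [], iter(cells)
    for tok in tokens:
        if tok == "fcel":        # filled cell: consume the next text
            row.append(next(texts))
        elif tok == "ecel":      # empty cell
            row.append("")
        elif tok == "nl":        # row break
            rows.append(row)
            row = []
    if row:                      # flush a trailing unterminated row
        rows.append(row)
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows
    )
    return f"<table>{body}</table>"
```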
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
api
¶
API backend configuration for MinerU VL text extraction.
MinerUVLTextAPIConfig
¶
Bases: BaseModel
API backend config for MinerU VL text extraction.
Connects to a deployed VLLM server with OpenAI-compatible API.
Example
extractor
¶
MinerU VL text extractor with layout-aware two-step extraction.
MinerU VL performs document extraction in two steps:
1. Layout Detection: detect regions with types (text, table, equation, etc.)
2. Content Recognition: extract text/table/equation content from each region
MinerUVLTextExtractor
¶
Bases: BaseTextExtractor
MinerU VL text extractor with layout-aware extraction.
Performs two-step extraction:
1. Layout detection (detect regions)
2. Content recognition (extract text/table/equation from each region)
Supports multiple backends:
- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
Example
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig
extractor = MinerUVLTextExtractor(
backend=MinerUVLTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content) # Combined text + tables + equations
print(result.blocks) # List of ContentBlock objects
Initialize MinerU VL text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration (PyTorch, VLLM, MLX, or API)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text with layout-aware two-step extraction.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path)
TYPE:
|
output_format
|
Output format ('html' or 'markdown')
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput with extracted content and metadata |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
extract_with_blocks
¶
extract_with_blocks(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> tuple[TextOutput, List[ContentBlock]]
Extract text and return both TextOutput and ContentBlocks.
This method provides access to the detailed block information including bounding boxes and block types.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image
TYPE:
|
output_format
|
Output format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[TextOutput, List[ContentBlock]]
|
Tuple of (TextOutput, List[ContentBlock]) |
Source code in omnidocs/tasks/text_extraction/mineruvl/extractor.py
mlx
¶
MLX backend configuration for MinerU VL text extraction (Apple Silicon).
MinerUVLTextMLXConfig
¶
Bases: BaseModel
MLX backend config for MinerU VL text extraction on Apple Silicon.
Uses MLX-VLM for efficient inference on M1/M2/M3/M4 chips.
Example
pytorch
¶
PyTorch/HuggingFace backend configuration for MinerU VL text extraction.
MinerUVLTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend config for MinerU VL text extraction.
Uses HuggingFace Transformers with Qwen2VLForConditionalGeneration.
Example
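An illustrative instantiation (only the `device` field is attested elsewhere on this page; other fields are omitted rather than guessed):

```python
from omnidocs.tasks.text_extraction import MinerUVLTextExtractor
from omnidocs.tasks.text_extraction.mineruvl import MinerUVLTextPyTorchConfig

# Run the two-step layout + recognition pipeline on a local GPU
extractor = MinerUVLTextExtractor(
    backend=MinerUVLTextPyTorchConfig(device="cuda")
)
```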
utils
¶
MinerU VL utilities for document extraction.
Contains data structures, parsing, prompts, and post-processing functions for MinerU VL document extraction pipeline.
This file contains code adapted from mineru-vl-utils
https://github.com/opendatalab/mineru-vl-utils https://pypi.org/project/mineru-vl-utils/
The original mineru-vl-utils is licensed under AGPL-3.0: Copyright (c) OpenDataLab https://github.com/opendatalab/mineru-vl-utils/blob/main/LICENSE.md
Adapted components
- BlockType enum (from structs.py)
- ContentBlock data structure (from structs.py)
- OTSL to HTML table conversion (from post_process/otsl2html.py)
BlockType
¶
Bases: str, Enum
MinerU VL block types (22 categories).
ContentBlock
¶
Bases: BaseModel
A detected content block with type, bounding box, angle, and content.
Coordinates are normalized to [0, 1] range relative to image dimensions.
to_absolute
¶
Convert normalized bbox to absolute pixel coordinates.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
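Since ContentBlock coordinates are normalized to [0, 1], the conversion amounts to scaling by the image dimensions. A minimal sketch of the arithmetic (the helper below is illustrative, not the library's exact signature):

```python
def to_absolute(bbox, width, height):
    """Scale a normalized [x1, y1, x2, y2] bbox to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 * width), round(y1 * height),
        round(x2 * width), round(y2 * height),
    )

print(to_absolute([0.1, 0.2, 0.5, 0.6], 1000, 800))  # (100, 160, 500, 480)
```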
SamplingParams
dataclass
¶
SamplingParams(
temperature: Optional[float] = None,
top_p: Optional[float] = None,
top_k: Optional[int] = None,
presence_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
repetition_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
max_new_tokens: Optional[int] = None,
)
Sampling parameters for text generation.
MinerUSamplingParams
¶
MinerUSamplingParams(
temperature: Optional[float] = 0.0,
top_p: Optional[float] = 0.01,
top_k: Optional[int] = 1,
presence_penalty: Optional[float] = 0.0,
frequency_penalty: Optional[float] = 0.0,
repetition_penalty: Optional[float] = 1.0,
no_repeat_ngram_size: Optional[int] = 100,
max_new_tokens: Optional[int] = None,
)
Bases: SamplingParams
Default sampling parameters optimized for MinerU VL.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
convert_bbox
¶
Convert bbox from model output (0-1000) to normalized format (0-1).
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
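The model emits coordinates on a 0-1000 grid, so normalization is a division by 1000. A sketch of the mapping (the clamping of out-of-range values is an assumption, not confirmed from the source):

```python
def convert_bbox(bbox):
    """Map model-output coords (0-1000 grid) to the normalized [0, 1] range."""
    return [min(max(v, 0), 1000) / 1000 for v in bbox]

print(convert_bbox([125, 250, 980, 1000]))  # [0.125, 0.25, 0.98, 1.0]
```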
parse_angle
¶
Parse rotation angle from model output tail string.
parse_layout_output
¶
Parse layout detection model output into ContentBlocks.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
get_rgb_image
¶
Convert image to RGB mode.
prepare_for_layout
¶
prepare_for_layout(
image: Image,
layout_size: Tuple[int, int] = LAYOUT_IMAGE_SIZE,
) -> Image.Image
Prepare image for layout detection.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
resize_by_need
¶
Resize image if needed based on aspect ratio constraints.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
prepare_for_extract
¶
prepare_for_extract(
image: Image,
blocks: List[ContentBlock],
prompts: Dict[str, str] = None,
sampling_params: Dict[str, SamplingParams] = None,
skip_types: set = None,
) -> Tuple[
List[Image.Image],
List[str],
List[SamplingParams],
List[int],
]
Prepare cropped images for content extraction.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
convert_otsl_to_html
¶
Convert OTSL table format to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
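As a sketch of what OTSL-to-HTML conversion involves, here is a minimal converter for a simplified token stream using the `fcel` (filled cell), `ecel` (empty cell), and `nl` (row break) tokens from the OTSL vocabulary; the real converter in mineru-vl-utils also handles the span tokens (`lcel`, `ucel`, `xcel`) and a different serialization, so treat this as illustrative only:

```python
def otsl_to_html(tokens):
    """Convert a simplified OTSL token stream to an HTML table.

    tokens: ("fcel", text) for filled cells, "ecel" for empty cells,
    "nl" for row breaks. Span tokens are omitted in this sketch.
    """
    rows, row = [], []
    for tok in tokens:
        if tok == "nl":          # row break: flush the current row
            rows.append(row)
            row = []
        elif tok == "ecel":      # empty cell
            row.append("")
        else:                    # ("fcel", text): filled cell
            row.append(tok[1])
    if row:                      # trailing row without a final "nl"
        rows.append(row)
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows
    )
    return f"<table>{body}</table>"

html = otsl_to_html([("fcel", "A"), ("fcel", "B"), "nl", ("fcel", "1"), "ecel", "nl"])
# <table><tr><td>A</td><td>B</td></tr><tr><td>1</td><td></td></tr></table>
```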
simple_post_process
¶
Simple post-processing: convert OTSL tables to HTML.
Source code in omnidocs/tasks/text_extraction/mineruvl/utils.py
vllm
¶
VLLM backend configuration for MinerU VL text extraction.
MinerUVLTextVLLMConfig
¶
Bases: BaseModel
VLLM backend config for MinerU VL text extraction.
Uses VLLM for high-throughput GPU inference with: - PagedAttention for efficient KV cache - Continuous batching - Optimized CUDA kernels
Example
models
¶
Pydantic models for text extraction outputs.
Defines output types and format enums for text extraction.
OutputFormat
¶
Bases: str, Enum
Supported text extraction output formats.
Each format has different characteristics
- HTML: Structured with div elements, preserves layout semantics
- MARKDOWN: Portable, human-readable, good for documentation
- JSON: Structured data with layout information (Dots OCR)
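Because the enum mixes in `str`, members compare directly against plain strings. A minimal sketch of the pattern (member values are assumed to be the lowercase format names):

```python
from enum import Enum

class OutputFormat(str, Enum):
    HTML = "html"
    MARKDOWN = "markdown"
    JSON = "json"

# str inheritance lets members interoperate with plain strings
assert OutputFormat.MARKDOWN == "markdown"
assert OutputFormat("json") is OutputFormat.JSON
```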
TextOutput
¶
Bases: BaseModel
Text extraction output from a document image.
Contains the extracted text content in the requested format, along with optional raw output and plain text versions.
Example
LayoutElement
¶
Bases: BaseModel
Single layout element from document layout detection.
Represents a detected region in the document with its bounding box, category label, and extracted text content.
| ATTRIBUTE | DESCRIPTION |
|---|---|
bbox |
Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024)
TYPE:
|
category |
Layout category (e.g., "Text", "Title", "Table", "Formula")
TYPE:
|
text |
Extracted text content (None for pictures)
TYPE:
|
confidence |
Detection confidence score (optional)
TYPE:
|
DotsOCRTextOutput
¶
Bases: BaseModel
Text extraction output from Dots OCR with layout information.
Dots OCR provides structured output with: - Layout detection (11 categories) - Bounding boxes (normalized to 0-1024) - Multi-format text (Markdown/LaTeX/HTML) - Reading order preservation
Layout Categories
Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
Text Formatting
- Text/Title/Section-header: Markdown
- Formula: LaTeX
- Table: HTML
- Picture: (text omitted)
Example
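A sketch of consuming such layout output, assuming a plain list of dicts with the documented fields (bbox normalized to 0-1024, category, text); the exact container type is not shown here, and Picture regions carry no text, so they are skipped:

```python
def layout_to_markdown(elements):
    """Assemble Markdown from layout elements, top-to-bottom then left-to-right."""
    readable = [e for e in elements if e["text"] is not None]  # Pictures have text=None
    readable.sort(key=lambda e: (e["bbox"][1], e["bbox"][0]))  # sort by y1, then x1
    return "\n\n".join(e["text"] for e in readable)

elements = [
    {"bbox": [40, 300, 980, 400], "category": "Text", "text": "Body paragraph."},
    {"bbox": [40, 100, 980, 160], "category": "Title", "text": "# Report"},
    {"bbox": [40, 500, 512, 800], "category": "Picture", "text": None},
]
print(layout_to_markdown(elements))  # the title first, then the body paragraph
```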
nanonets
¶
Nanonets OCR2-3B backend configurations and extractor for text extraction.
Available backends
- NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend
- NanonetsTextVLLMConfig: VLLM high-throughput backend
- NanonetsTextMLXConfig: MLX backend for Apple Silicon
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon
TYPE:
|
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Accepted for API compatibility (default: "markdown")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
extractor
¶
Nanonets OCR2-3B text extractor.
A Vision-Language Model for extracting text from document images with support for tables (HTML), equations (LaTeX), and image captions.
Supports PyTorch and VLLM backends.
Example
NanonetsTextExtractor
¶
Bases: BaseTextExtractor
Nanonets OCR2-3B Vision-Language Model text extractor.
Extracts text from document images with support for:
- Tables (output as HTML)
- Equations (output as LaTeX)
- Image captions (wrapped in dedicated tags)
- Watermarks (wrapped in dedicated tags)
Supports PyTorch, VLLM, and MLX backends.
Example
from omnidocs.tasks.text_extraction import NanonetsTextExtractor
from omnidocs.tasks.text_extraction.nanonets import NanonetsTextPyTorchConfig
# Initialize with PyTorch backend
extractor = NanonetsTextExtractor(
backend=NanonetsTextPyTorchConfig()
)
# Extract text
result = extractor.extract(image)
print(result.content)
Initialize Nanonets text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - NanonetsTextPyTorchConfig: PyTorch/HuggingFace backend - NanonetsTextVLLMConfig: VLLM high-throughput backend - NanonetsTextMLXConfig: MLX backend for Apple Silicon
TYPE:
|
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
Note: Nanonets OCR2 produces a unified output format that includes tables as HTML and equations as LaTeX inline. The output_format parameter is accepted for API compatibility but does not change the output structure.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Accepted for API compatibility (default: "markdown")
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format is not supported |
Source code in omnidocs/tasks/text_extraction/nanonets/extractor.py
mlx
¶
MLX backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Nanonets OCR2-3B text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3/M4+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Nanonets OCR2-3B text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate
vllm
¶
VLLM backend configuration for Nanonets OCR2-3B text extraction.
NanonetsTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Nanonets OCR2-3B text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
qwen
¶
Qwen3-VL backend configurations and extractor for text extraction.
Available backends
- QwenTextPyTorchConfig: PyTorch/HuggingFace backend
- QwenTextVLLMConfig: VLLM high-throughput backend
- QwenTextMLXConfig: MLX backend for Apple Silicon
- QwenTextAPIConfig: API backend (OpenRouter, etc.)
Example
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
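An illustrative instantiation (only the `model` field is attested elsewhere on this page; VLLM-specific knobs exist but are omitted rather than guessed):

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

# High-throughput GPU backend for batch processing
extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
```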
api
¶
API backend configuration for Qwen3-VL text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
)
# With explicit key
config = QwenTextAPIConfig(
model="openrouter/qwen/qwen3-vl-8b-instruct",
api_key=os.environ["OPENROUTER_API_KEY"],
api_base="https://openrouter.ai/api/v1",
)
extractor
¶
Qwen3-VL text extractor.
A Vision-Language Model for extracting text from document images as structured HTML or Markdown.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and outputs as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
# Initialize with PyTorch backend
extractor = QwenTextExtractor(
backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
backend
|
Backend configuration. One of: - QwenTextPyTorchConfig: PyTorch/HuggingFace backend - QwenTextVLLMConfig: VLLM high-throughput backend - QwenTextMLXConfig: MLX backend for Apple Silicon - QwenTextAPIConfig: API backend (OpenRouter, etc.)
TYPE:
|
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image as: - PIL.Image.Image: PIL image object - np.ndarray: Numpy array (HWC format, RGB) - str or Path: Path to image file
TYPE:
|
output_format
|
Desired output format: - "html": Structured HTML with div elements - "markdown": Markdown format
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If model is not loaded |
ValueError
|
If image format or output_format is not supported |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
mlx
¶
MLX backend configuration for Qwen3-VL text extraction.
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL text extraction.
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
vllm
¶
VLLM backend configuration for Qwen3-VL text extraction.
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
vlm
¶
VLM text extractor.
A provider-agnostic Vision-Language Model text extractor using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)
# With custom prompt
result = extractor.extract("document.png", prompt="Extract only table data as markdown")
VLMTextExtractor
¶
Bases: BaseTextExtractor
Provider-agnostic VLM text extractor using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom prompts for specialized extraction.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMTextExtractor(config=config)
# Default extraction
result = extractor.extract("document.png", output_format="markdown")
# Custom prompt
result = extractor.extract(
"document.png",
prompt="Extract only the table data as markdown",
)
Initialize VLM text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
VLM API configuration with model and provider details.
TYPE:
|
Source code in omnidocs/tasks/text_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
prompt: Optional[str] = None,
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
output_format
|
Desired output format ("html" or "markdown").
TYPE:
|
prompt
|
Custom prompt. If None, uses a task-specific default prompt.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TextOutput
|
TextOutput containing extracted text content. |