Extractor
LightOn text extractor with multi-backend support.
LightOn OCR is optimized for document text extraction and recognition. Supported backends are PyTorch, VLLM, MLX, and API (a VLLM OpenAI-compatible server).
LightOnTextExtractor
Bases: BaseTextExtractor
LightOn OCR is optimized for document text extraction with multi-lingual capabilities.
Supports multiple backends:

- PyTorch (HuggingFace Transformers)
- VLLM (high-throughput GPU)
- MLX (Apple Silicon)
- API (VLLM OpenAI-compatible server)
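As a rough illustration of how one might choose among these four backends, the sketch below maps a machine description to a backend name. The helper `pick_backend_name` and its parameters are hypothetical and not part of omnidocs; the decision rules only restate the descriptions above (MLX for Apple Silicon, VLLM for high-throughput GPU work, API when a VLLM OpenAI-compatible server is available) and may not match how you want to deploy.

```python
import platform
from typing import Optional


def pick_backend_name(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    has_cuda: bool = False,
    high_throughput: bool = False,
    server_url: Optional[str] = None,
) -> str:
    """Return "api", "mlx", "vllm", or "pytorch" (hypothetical helper, not an omnidocs API)."""
    # Fall back to the current machine when no description is passed in.
    system = system or platform.system()
    machine = machine or platform.machine()

    # Assumption: a configured server URL means inference is delegated to it.
    if server_url:
        return "api"
    # Apple Silicon reports Darwin/arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    # VLLM is described as the high-throughput GPU option.
    if has_cuda and high_throughput:
        return "vllm"
    # PyTorch (Transformers) is treated here as the general-purpose default.
    return "pytorch"
```

The returned name would then be used to pick the matching backend configuration class, for example `LightOnTextPyTorchConfig` or `LightOnTextVLLMConfig`.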
Example
from omnidocs.tasks.text_extraction import LightOnTextExtractor
from omnidocs.tasks.text_extraction.lighton import LightOnTextPyTorchConfig
# PyTorch backend
extractor = LightOnTextExtractor(
    backend=LightOnTextPyTorchConfig(device="cuda", torch_dtype="bfloat16")
)
result = extractor.extract(image)
print(result.content)

# VLLM backend for high-throughput inference
from omnidocs.tasks.text_extraction.lighton import LightOnTextVLLMConfig

extractor = LightOnTextExtractor(
    backend=LightOnTextVLLMConfig(gpu_memory_utilization=0.85)
)
result = extractor.extract(image)
Initialize LightOn text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration (PyTorch, VLLM, MLX, or API) |
Source code in omnidocs/tasks/text_extraction/lighton/extractor.py
extract
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or file path). TYPE: `Union[Image, ndarray, str, Path]` |
| `output_format` | Desired output format ('html' or 'markdown'). TYPE: `Literal["html", "markdown"]`. DEFAULT: `"markdown"` |
| RETURNS | DESCRIPTION |
|---|---|
| `TextOutput` | TextOutput with extracted text content |