Base
Base class for text extractors.
Defines the abstract interface that all text extractors must implement.
BaseTextExtractor
Bases: ABC
Abstract base class for text extractors.
All text extraction models must inherit from this class and implement the required methods.
Example
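A minimal sketch of a subclass, written against stand-in definitions so that it runs without omnidocs installed. The `TextOutput` fields (`content`, `format`) and the `EchoExtractor` class are hypothetical; only the `extract` signature follows the one documented on this page, and real code would import `BaseTextExtractor` and `TextOutput` from the library instead of redefining them.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Literal, Union


@dataclass
class TextOutput:
    """Stand-in for the library's TextOutput; these fields are assumptions."""

    content: str
    format: str


class BaseTextExtractor(ABC):
    """Stand-in that mirrors the documented abstract interface."""

    @abstractmethod
    def extract(
        self,
        image: Union[Any, str, Path],
        output_format: Literal["html", "markdown"] = "markdown",
    ) -> TextOutput:
        ...


class EchoExtractor(BaseTextExtractor):
    """Hypothetical extractor that echoes the file name instead of running a model."""

    def extract(self, image, output_format="markdown"):
        if output_format not in ("html", "markdown"):
            raise ValueError(f"Unsupported output_format: {output_format!r}")
        # Paths are echoed by file name; anything else is treated as in-memory data.
        name = Path(image).name if isinstance(image, (str, Path)) else "in-memory image"
        text = f"<p>{name}</p>" if output_format == "html" else name
        return TextOutput(content=text, format=output_format)


extractor = EchoExtractor()
result = extractor.extract("scans/page_1.png", output_format="html")
```

Because `extract` is abstract, instantiating `BaseTextExtractor` directly raises `TypeError`; only subclasses that implement it can be created.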
extract
abstractmethod
extract(
image: Union[Image, ndarray, str, Path],
output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| image | Input image as one of: PIL.Image.Image (PIL image object); np.ndarray (NumPy array, HWC format, RGB); str or Path (path to an image file). TYPE: Union[Image, ndarray, str, Path] |
| output_format | Desired output format: "html" (structured HTML) or "markdown" (Markdown). TYPE: Literal["html", "markdown"] |
| RETURNS | DESCRIPTION |
|---|---|
| TextOutput | TextOutput containing the extracted text content |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the image format or output_format is not supported |
| RuntimeError | If the model is not loaded or inference fails |
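As an illustration of the `ValueError` case, here is one way an implementation might validate `output_format` before running inference. The helper name `validate_output_format` is hypothetical and not part of the documented API; the set of supported values is taken from the signature.

```python
from typing import Literal, get_args

OutputFormat = Literal["html", "markdown"]


def validate_output_format(output_format: str) -> str:
    """Return output_format unchanged, or raise ValueError if it is unsupported."""
    supported = get_args(OutputFormat)  # the values listed in the Literal type
    if output_format not in supported:
        raise ValueError(
            f"Unsupported output_format {output_format!r}; expected one of {supported}"
        )
    return output_format
```

Deriving the allowed values from the `Literal` type keeps the runtime check and the type hint in sync.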
Source code in omnidocs/tasks/text_extraction/base.py
batch_extract
batch_extract(
images: List[Union[Image, ndarray, str, Path]],
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from multiple images.
The default implementation calls extract() on each image in turn. Subclasses can override it for optimized batching (e.g., with vLLM).
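| PARAMETER | DESCRIPTION |
|---|---|
| images | List of images in any supported format. TYPE: List[Union[Image, ndarray, str, Path]] |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] |
| progress_callback | Optional function(current, total) for reporting progress. TYPE: Optional[Callable[[int, int], None]] |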
| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput in the same order as the input |
Examples:
images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
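The looping behaviour described for the default implementation could look roughly like the sketch here. It is not the library's source: `batch_extract_sketch` is a hypothetical free function, and reporting progress as a 1-based count after each image is an assumption about how `progress_callback(current, total)` is called.

```python
from typing import Any, Callable, List, Optional


def batch_extract_sketch(
    extractor: Any,
    images: List[Any],
    output_format: str = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[Any]:
    """Call extractor.extract() once per image, preserving input order."""
    total = len(images)
    results = []
    for index, image in enumerate(images, start=1):
        results.append(extractor.extract(image, output_format=output_format))
        if progress_callback is not None:
            # Assumption: report the number of images completed so far (1-based).
            progress_callback(index, total)
    return results
```

A callback such as `lambda current, total: print(f"{current}/{total}")` would then print one line per processed image.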
extract_document
extract_document(
document: Document,
output_format: Literal["html", "markdown"] = "markdown",
progress_callback: Optional[
Callable[[int, int], None]
] = None,
) -> List[TextOutput]
Extract text from all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance. TYPE: Document |
| output_format | Desired output format. TYPE: Literal["html", "markdown"] |
| progress_callback | Optional function(current, total) for reporting progress. TYPE: Optional[Callable[[int, int], None]] |
| RETURNS | DESCRIPTION |
|---|---|
| List[TextOutput] | List of TextOutput, one per page |
Examples:
doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
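One plausible reading, which this page does not confirm, is that `extract_document` collects the pages and hands them to `batch_extract`. The sketch here illustrates that reading only: `extract_document_sketch` is a hypothetical helper, and it relies on the `get_page(i)` and `page_count` members that appear in the `batch_extract` example.

```python
from typing import Any, Callable, List, Optional


def extract_document_sketch(
    extractor: Any,
    document: Any,
    output_format: str = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[Any]:
    """Extract every page of a document by delegating to batch_extract()."""
    # Assumption: pages are addressed by zero-based index via get_page().
    pages = [document.get_page(i) for i in range(document.page_count)]
    return extractor.batch_extract(
        pages,
        output_format=output_format,
        progress_callback=progress_callback,
    )
```

Under this reading, the result has one entry per page, in page order, and an extractor that overrides `batch_extract` for optimized batching gets the same speed-up for whole documents.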