
Base

Base class for OCR extractors.

Defines the abstract interface that all OCR extractors must implement.

BaseOCRExtractor

Bases: ABC

Abstract base class for OCR extractors.

All OCR extraction models must inherit from this class and implement the required methods.

Example
class MyOCRExtractor(BaseOCRExtractor):
    def __init__(self, config: MyConfig):
        self.config = config
        self._load_model()

    def _load_model(self):
        # Initialize OCR engine
        pass

    def extract(self, image):
        # Run OCR extraction
        return OCROutput(...)

extract abstractmethod

extract(
    image: Union[Image, ndarray, str, Path],
) -> OCROutput

Run OCR extraction on an image.

PARAMETER DESCRIPTION
image

Input image, given as one of:
- PIL.Image.Image: PIL image object
- np.ndarray: NumPy array (HWC format, RGB)
- str or Path: path to an image file

TYPE: Union[Image, ndarray, str, Path]

RETURNS DESCRIPTION
OCROutput

OCROutput containing detected text blocks with bounding boxes

RAISES DESCRIPTION
ValueError

If image format is not supported

RuntimeError

If OCR engine is not initialized or extraction fails
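
One way a subclass might honor the `ValueError` contract is to classify the input before handing it to the engine. The sketch below is only an illustration: `classify_image_input` is a hypothetical helper, not part of the library, and it uses duck-typed attribute checks (an assumption made here so the snippet runs without Pillow or NumPy) where a real implementation would more likely use `isinstance` against `PIL.Image.Image` and `np.ndarray`.

```python
from pathlib import Path
from typing import Any


def classify_image_input(image: Any) -> str:
    """Hypothetical helper: name which documented input kind `image` is."""
    if isinstance(image, (str, Path)):
        return "path"
    # Duck-typed stand-ins for isinstance checks (assumption of this sketch).
    if hasattr(image, "shape") and hasattr(image, "dtype"):
        return "ndarray"
    if hasattr(image, "convert") and hasattr(image, "size"):
        return "pil"
    # Anything else falls outside the documented input types.
    raise ValueError(f"Unsupported image type: {type(image).__name__}")
```

An `extract()` implementation could call such a helper first and branch on the returned label, so that unsupported inputs fail early with a clear message.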

Source code in omnidocs/tasks/ocr_extraction/base.py
@abstractmethod
def extract(self, image: Union[Image.Image, np.ndarray, str, Path]) -> OCROutput:
    """
    Run OCR extraction on an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file

    Returns:
        OCROutput containing detected text blocks with bounding boxes

    Raises:
        ValueError: If image format is not supported
        RuntimeError: If OCR engine is not initialized or extraction fails
    """
    pass

batch_extract

batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[OCROutput]

Run OCR extraction on multiple images.

Default implementation loops over extract(). Subclasses can override for optimized batching.
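
As a rough sketch of what such an override could look like, the example below processes images in fixed-size chunks. Everything in it is an assumption for illustration: `BaseOCRExtractor` is a reduced stand-in rather than the real import, `ChunkedExtractor`, `chunk_size`, and `_run_engine` are hypothetical names, and plain strings stand in for `OCROutput`.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List, Optional


class BaseOCRExtractor(ABC):
    """Stand-in for the real base class, reduced to what this sketch needs."""

    @abstractmethod
    def extract(self, image: Any) -> Any: ...


class ChunkedExtractor(BaseOCRExtractor):
    """Hypothetical extractor whose engine accepts several images per call."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size

    def _run_engine(self, images: List[Any]) -> List[str]:
        # Placeholder for a batched engine call; one result per image.
        return [f"text:{image}" for image in images]

    def extract(self, image: Any) -> str:
        return self._run_engine([image])[0]

    def batch_extract(
        self,
        images: List[Any],
        progress_callback: Optional[Callable[[int, int], None]] = None,
    ) -> List[str]:
        results: List[str] = []
        total = len(images)
        for start in range(0, total, self.chunk_size):
            chunk = images[start : start + self.chunk_size]
            results.extend(self._run_engine(chunk))
            if progress_callback:
                # Report how many images have been processed so far.
                progress_callback(len(results), total)
        return results
```

The override keeps the documented contract (results in input order, same callback signature) while reporting progress once per chunk instead of once per image.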

PARAMETER DESCRIPTION
images

List of images in any supported format

TYPE: List[Union[Image, ndarray, str, Path]]

progress_callback

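Optional callable invoked as progress_callback(current, total) once per image, where current is the 1-based index of the image and total is the number of images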

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[OCROutput]

List of OCROutput in same order as input

Examples:

images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images)
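
To illustrate the `progress_callback` parameter, here is a minimal sketch. `EchoExtractor` is a hypothetical stand-in that copies the default loop from the source listing on this page and returns strings instead of `OCROutput`; `report` and `progress_log` are likewise illustrative names.

```python
from typing import Any, Callable, List, Optional


class EchoExtractor:
    """Stand-in extractor that mirrors the default batch_extract loop."""

    def extract(self, image: Any) -> str:
        return f"text:{image}"

    def batch_extract(
        self,
        images: List[Any],
        progress_callback: Optional[Callable[[int, int], None]] = None,
    ) -> List[str]:
        results = []
        total = len(images)
        for i, image in enumerate(images):
            if progress_callback:
                progress_callback(i + 1, total)
            results.append(self.extract(image))
        return results


progress_log: List[str] = []


def report(current: int, total: int) -> None:
    # Record each progress update as "current/total".
    progress_log.append(f"{current}/{total}")


results = EchoExtractor().batch_extract(["p1.png", "p2.png"], progress_callback=report)
```

Note that in the default loop the callback fires before each image is extracted, so a `(total, total)` update means the last image has started, not that it has finished.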
Source code in omnidocs/tasks/ocr_extraction/base.py
def batch_extract(
    self,
    images: List[Union[Image.Image, np.ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]:
    """
    Run OCR extraction on multiple images.

    Default implementation loops over extract(). Subclasses can override
    for optimized batching.

    Args:
        images: List of images in any supported format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of OCROutput in same order as input

    Examples:
        ```python
        images = [doc.get_page(i) for i in range(doc.page_count)]
        results = extractor.batch_extract(images)
        ```
    """
    results = []
    total = len(images)

    for i, image in enumerate(images):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(image)
        results.append(result)

    return results

extract_document

extract_document(
    document: Document,
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[OCROutput]

Run OCR extraction on all pages of a document.

PARAMETER DESCRIPTION
document

Document instance

TYPE: Document

progress_callback

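Optional callable invoked as progress_callback(current, total) once per page, where current is the 1-based page index and total is document.page_count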

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[OCROutput]

List of OCROutput, one per page

Examples:

doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc)
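
Because the returned list has one entry per page in page order, results can be keyed by page number afterwards. The sketch below assumes stand-ins throughout: `FakeDocument` is hypothetical and exposes only the two members the source listing uses (`page_count` and `iter_pages()`), and `EchoExtractor` copies the documented loop while returning strings instead of `OCROutput`.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


class FakeDocument:
    """Hypothetical stand-in exposing page_count and iter_pages()."""

    def __init__(self, pages: List[Any]):
        self._pages = pages

    @property
    def page_count(self) -> int:
        return len(self._pages)

    def iter_pages(self) -> Iterator[Any]:
        return iter(self._pages)


class EchoExtractor:
    """Stand-in extractor that mirrors the extract_document loop."""

    def extract(self, image: Any) -> str:
        return f"text:{image}"

    def extract_document(
        self,
        document: FakeDocument,
        progress_callback: Optional[Callable[[int, int], None]] = None,
    ) -> List[str]:
        results = []
        total = document.page_count
        for i, page in enumerate(document.iter_pages()):
            if progress_callback:
                progress_callback(i + 1, total)
            results.append(self.extract(page))
        return results


doc = FakeDocument(["page-a", "page-b"])
results = EchoExtractor().extract_document(doc)

# Key each result by its 1-based page number.
by_page: Dict[int, str] = dict(enumerate(results, start=1))
```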
Source code in omnidocs/tasks/ocr_extraction/base.py
def extract_document(
    self,
    document: "Document",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]:
    """
    Run OCR extraction on all pages of a document.

    Args:
        document: Document instance
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of OCROutput, one per page

    Examples:
        ```python
        doc = Document.from_pdf("paper.pdf")
        results = extractor.extract_document(doc)
        ```
    """
    results = []
    total = document.page_count

    for i, page in enumerate(document.iter_pages()):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(page)
        results.append(result)

    return results