Base

Base class for text extractors.

Defines the abstract interface that all text extractors must implement.

BaseTextExtractor

Bases: ABC

Abstract base class for text extractors.

All text extraction models must inherit from this class and implement the required methods.

Example
class MyTextExtractor(BaseTextExtractor):
    def __init__(self, config: MyConfig):
        self.config = config
        self._load_model()

    def _load_model(self):
        # Load model weights
        pass

    def extract(self, image, output_format="markdown"):
        # Run extraction
        return TextOutput(...)
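
The example above is a skeleton. To see what the abstract `extract` method enforces, here is a minimal, self-contained sketch that does not import omnidocs. `BaseTextExtractorSketch`, `TextOutputSketch`, `IncompleteExtractor`, and `EchoExtractor` are hypothetical stand-ins that only mirror the shape described here, and the `content` and `format` fields are assumptions rather than the real `TextOutput` fields.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TextOutputSketch:
    """Hypothetical stand-in for TextOutput; the real class may differ."""
    content: str
    format: str


class BaseTextExtractorSketch(ABC):
    """Hypothetical stand-in that mirrors the abstract interface."""

    @abstractmethod
    def extract(self, image, output_format="markdown"):
        ...


class IncompleteExtractor(BaseTextExtractorSketch):
    """Does not implement extract(), so it cannot be instantiated."""


class EchoExtractor(BaseTextExtractorSketch):
    """Toy extractor: treats the 'image' as already-recognized text."""

    def extract(self, image, output_format="markdown"):
        return TextOutputSketch(content=str(image), format=output_format)


try:
    IncompleteExtractor()
    instantiated = True
except TypeError:
    # ABC refuses to instantiate classes with unimplemented abstract methods.
    instantiated = False

result = EchoExtractor().extract("hello", output_format="html")
```

Because the check happens at instantiation time, a subclass that forgets `extract` fails as soon as it is constructed rather than on its first call.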

extract abstractmethod

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as:
- PIL.Image.Image: PIL image object
- np.ndarray: NumPy array (HWC format, RGB)
- str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format:
- "html": Structured HTML
- "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
ValueError

If image format or output_format is not supported

RuntimeError

If model is not loaded or inference fails
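
How a subclass raises the documented `ValueError` is left to the implementation. The following is one possible sketch, not the library's own validation logic. The helpers `check_output_format` and `describe_image_input` are hypothetical, and the duck-typed attribute checks are an assumption made so the sketch runs without importing PIL or NumPy.

```python
from pathlib import Path

# Allowed values, taken from the Literal in the documented signature.
SUPPORTED_FORMATS = ("html", "markdown")


def check_output_format(output_format: str) -> str:
    """Hypothetical helper: reject formats outside the documented Literal."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported output_format {output_format!r}; "
            f"expected one of {SUPPORTED_FORMATS}"
        )
    return output_format


def describe_image_input(image) -> str:
    """Hypothetical helper: classify the input or raise ValueError.

    Uses duck typing (an assumption) instead of isinstance checks
    against PIL.Image.Image and np.ndarray.
    """
    if isinstance(image, (str, Path)):
        return "path"
    if hasattr(image, "shape"):  # ndarray-like object
        return "array"
    if hasattr(image, "size") and hasattr(image, "mode"):  # PIL-like object
        return "pil"
    raise ValueError(f"Unsupported image type: {type(image).__name__}")
```

A concrete `extract` could call both helpers first, so unsupported inputs fail before any model inference starts.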

Source code in omnidocs/tasks/text_extraction/base.py
@abstractmethod
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        ValueError: If image format or output_format is not supported
        RuntimeError: If model is not loaded or inference fails
    """
    pass

batch_extract

batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from multiple images.

The default implementation loops over extract(), one image at a time. Subclasses can override it for optimized batching (e.g., with vLLM).

PARAMETER DESCRIPTION
images

List of images in any supported format

TYPE: List[Union[Image, ndarray, str, Path]]

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

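Optional callback invoked as progress_callback(current, total) before each image is processed, where current is the 1-based index of that image and total is the number of images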

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput in same order as input

Examples:

images = [doc.get_page(i) for i in range(doc.page_count)]
results = extractor.batch_extract(images, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def batch_extract(
    self,
    images: List[Union[Image.Image, np.ndarray, str, Path]],
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from multiple images.

    Default implementation loops over extract(). Subclasses can override
    for optimized batching (e.g., VLLM).

    Args:
        images: List of images in any supported format
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput in same order as input

    Examples:
        ```python
        images = [doc.get_page(i) for i in range(doc.page_count)]
        results = extractor.batch_extract(images, output_format="markdown")
        ```
    """
    results = []
    total = len(images)

    for i, image in enumerate(images):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(image, output_format=output_format)
        results.append(result)

    return results
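
To see what the callback receives, here is a small self-contained sketch. `LoopingExtractorSketch` is a hypothetical stand-in whose `batch_extract` copies the default loop shown above, and its `extract` returns a plain string in place of a real `TextOutput`.

```python
from typing import Callable, List, Optional, Tuple


class LoopingExtractorSketch:
    """Hypothetical stand-in; batch_extract copies the default loop."""

    def extract(self, image: str, output_format: str = "markdown") -> str:
        # Placeholder for real inference; returns a string, not a TextOutput.
        return f"[{output_format}] {image}"

    def batch_extract(
        self,
        images: List[str],
        output_format: str = "markdown",
        progress_callback: Optional[Callable[[int, int], None]] = None,
    ) -> List[str]:
        results = []
        total = len(images)
        for i, image in enumerate(images):
            if progress_callback:
                progress_callback(i + 1, total)  # fires before extraction
            results.append(self.extract(image, output_format=output_format))
        return results


progress_log: List[Tuple[int, int]] = []


def record_progress(current: int, total: int) -> None:
    """Collect (current, total) pairs; a real callback might update a progress bar."""
    progress_log.append((current, total))


results = LoopingExtractorSketch().batch_extract(
    ["page-a.png", "page-b.png", "page-c.png"],
    progress_callback=record_progress,
)
```

Since the loop invokes the callback before calling `extract`, a report of `(n, total)` means item `n` is about to be processed, not that it has finished.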

extract_document

extract_document(
    document: Document,
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
) -> List[TextOutput]

Extract text from all pages of a document.

PARAMETER DESCRIPTION
document

Document instance

TYPE: Document

output_format

Desired output format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

progress_callback

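Optional callback invoked as progress_callback(current, total) before each page is processed, where current is the 1-based page number and total is document.page_count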

TYPE: Optional[Callable[[int, int], None]] DEFAULT: None

RETURNS DESCRIPTION
List[TextOutput]

List of TextOutput, one per page

Examples:

doc = Document.from_pdf("paper.pdf")
results = extractor.extract_document(doc, output_format="markdown")
Source code in omnidocs/tasks/text_extraction/base.py
def extract_document(
    self,
    document: "Document",
    output_format: Literal["html", "markdown"] = "markdown",
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[TextOutput]:
    """
    Extract text from all pages of a document.

    Args:
        document: Document instance
        output_format: Desired output format
        progress_callback: Optional function(current, total) for progress

    Returns:
        List of TextOutput, one per page

    Examples:
        ```python
        doc = Document.from_pdf("paper.pdf")
        results = extractor.extract_document(doc, output_format="markdown")
        ```
    """
    results = []
    total = document.page_count

    for i, page in enumerate(document.iter_pages()):
        if progress_callback:
            progress_callback(i + 1, total)

        result = self.extract(page, output_format=output_format)
        results.append(result)

    return results
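
The per-page results can then be combined by the caller. The sketch below runs without omnidocs. `FakeDocument` is a hypothetical stand-in exposing only `page_count` and `iter_pages()`, the two members used by the loop above, and the `content` field on `PageTextSketch` is an assumed name, not necessarily a field of the real `TextOutput`.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class PageTextSketch:
    """Hypothetical stand-in for TextOutput; `content` is an assumed field."""
    content: str


class FakeDocument:
    """Hypothetical stand-in exposing the two members the loop relies on."""

    def __init__(self, pages: List[str]):
        self._pages = pages

    @property
    def page_count(self) -> int:
        return len(self._pages)

    def iter_pages(self) -> Iterator[str]:
        return iter(self._pages)


def extract_document_sketch(
    extract: Callable[..., PageTextSketch],
    document: FakeDocument,
    output_format: str = "markdown",
) -> List[PageTextSketch]:
    """Same per-page loop as above, with `extract` passed in as a function."""
    return [
        extract(page, output_format=output_format)
        for page in document.iter_pages()
    ]


def fake_extract(page: str, output_format: str = "markdown") -> PageTextSketch:
    # Placeholder for real inference: wraps the page text in a heading.
    return PageTextSketch(content=f"# {page}")


doc = FakeDocument(["Title page", "Introduction"])
pages = extract_document_sketch(fake_extract, doc)

# One result per page, in page order, so joining preserves reading order.
full_text = "\n\n".join(page.content for page in pages)
```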