
🧩 Core Classes

This section documents the core base classes and fundamental components that power all OmniDocs extractors.

Base Extractor Classes

BaseOCRExtractor

The foundation for all OCR (Optical Character Recognition) extractors.

omnidocs.tasks.ocr_extraction.base.BaseOCRExtractor

BaseOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, engine_name: Optional[str] = None)

Bases: ABC

Base class for OCR text extraction models.

Initialize the OCR extractor.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| languages | Optional[List[str]] | List of language codes to support (e.g., ['en', 'zh']) | None |
| engine_name | Optional[str] | Name of the OCR engine for language mapping | None |

extract abstractmethod

extract(input_path: Union[str, Path, Image], **kwargs) -> OCROutput

Extract text from input image.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| OCROutput | OCROutput containing extracted text |

extract_all

extract_all(input_paths: List[Union[str, Path, Image]], **kwargs) -> List[OCROutput]

Extract text from multiple images.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| List[OCROutput] | List of OCROutput objects |

extract_with_layout

extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> OCROutput

Extract text with optional layout information.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions to focus OCR on | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| OCROutput | OCROutput containing extracted text |

preprocess_input

preprocess_input(input_path: Union[str, Path, Image, ndarray]) -> List[Image.Image]

Convert input to list of PIL Images.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|------|-------------|
| List[Image.Image] | List of PIL Images |

postprocess_output

postprocess_output(raw_output: Any, img_size: Tuple[int, int]) -> OCROutput

Convert raw OCR output to standardized OCROutput format.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| raw_output | Any | Raw output from OCR engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|------|-------------|
| OCROutput | Standardized OCROutput object |

visualize

visualize(ocr_result: OCROutput, image_path: Union[str, Path, Image], output_path: str = 'visualized.png', box_color: str = 'red', box_width: int = 2, show_text: bool = False, text_color: str = 'blue', font_size: int = 12) -> None

Visualize OCR results by drawing bounding boxes on the original image.

This method makes it easy to compare extractors by visualizing the detected text regions with bounding boxes.

get_supported_languages

get_supported_languages() -> List[str]

Get list of supported language codes.

set_languages

set_languages(languages: List[str]) -> None

Update supported languages for OCR extraction.

Key Features

  • Unified Interface: Consistent API across all OCR engines
  • Language Support: Multi-language text recognition
  • Batch Processing: Process multiple documents efficiently
  • Visualization: Built-in result visualization
  • Device Management: CPU/GPU support

Usage Example

from omnidocs.tasks.ocr_extraction.extractors.easy_ocr import EasyOCRExtractor

# Initialize extractor
extractor = EasyOCRExtractor(
    languages=['en', 'fr'],
    device='cuda',
    show_log=True
)

# Extract text
result = extractor.extract("document.png")
print(f"Extracted: {result.full_text}")

# Visualize results
extractor.visualize(
    ocr_result=result,
    image_path="document.png",
    output_path="visualization.png"
)
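The idea behind extract_with_layout can also be sketched without the library: restrict OCR detections to a layout region by bounding-box containment. The helper names below (box_in_region, filter_texts_by_region) are illustrative, not OmniDocs APIs:

```python
# Illustrative sketch (not the OmniDocs implementation): keep only
# detections whose bounding box falls inside a given layout region.

def box_in_region(bbox, region, tol=2.0):
    """True if bbox [x1, y1, x2, y2] lies inside region, with a small tolerance."""
    x1, y1, x2, y2 = bbox
    rx1, ry1, rx2, ry2 = region
    return (x1 >= rx1 - tol and y1 >= ry1 - tol and
            x2 <= rx2 + tol and y2 <= ry2 + tol)

def filter_texts_by_region(detections, region):
    """Keep detections scoped to the layout region."""
    return [d for d in detections if box_in_region(d["bbox"], region)]

detections = [
    {"text": "Title", "bbox": [10, 10, 120, 30]},
    {"text": "Footer", "bbox": [10, 900, 120, 920]},
]
header = filter_texts_by_region(detections, region=[0, 0, 200, 100])
print([d["text"] for d in header])  # ['Title']
```

A real implementation would instead crop the image to each region before running the OCR engine, but the containment test above is the same bookkeeping step.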

BaseTableExtractor

The foundation for all table extraction implementations.

omnidocs.tasks.table_extraction.base.BaseTableExtractor

BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)

Bases: ABC

Base class for table extraction models.

Initialize the table extractor.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the table extraction engine | None |

extract abstractmethod

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables from input image.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| TableOutput | TableOutput containing extracted tables |

extract_all

extract_all(input_paths: List[Union[str, Path, Image]], **kwargs) -> List[TableOutput]

Extract tables from multiple images.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| List[TableOutput] | List of TableOutput objects |

extract_with_layout

extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> TableOutput

Extract tables with optional layout information.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions containing tables | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| TableOutput | TableOutput containing extracted tables |

preprocess_input

preprocess_input(input_path: Union[str, Path, Image, ndarray]) -> List[Image.Image]

Convert input to list of PIL Images.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|------|-------------|
| List[Image.Image] | List of PIL Images |

postprocess_output

postprocess_output(raw_output: Any, img_size: Tuple[int, int]) -> TableOutput

Convert raw table extraction output to standardized TableOutput format.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| raw_output | Any | Raw output from table extraction engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|------|-------------|
| TableOutput | Standardized TableOutput object |

visualize

visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None

Visualize table extraction results by drawing bounding boxes on the original image.

This method makes it easy to compare extractors by visualizing the detected tables and cells with bounding boxes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table_result | TableOutput | TableOutput containing extracted tables | required |
| image_path | Union[str, Path, Image] | Path to original image or PIL Image object | required |
| output_path | str | Path to save the annotated image | 'visualized_tables.png' |
| table_color | str | Color for table bounding boxes | 'red' |
| cell_color | str | Color for cell bounding boxes | 'blue' |
| box_width | int | Width of bounding box lines | 2 |
| show_text | bool | Whether to overlay cell text | False |
| text_color | str | Color for text overlay | 'green' |
| font_size | int | Font size for text overlay | 12 |
| show_table_ids | bool | Whether to show table IDs | True |

Key Features

  • Multiple Formats: Support for PDF and image inputs
  • Structured Output: Returns pandas DataFrames
  • Coordinate Transformation: Handles PDF to image coordinate mapping
  • Batch Processing: Process multiple documents
  • Visualization: Table detection visualization

Usage Example

from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor

# Initialize extractor
extractor = CamelotExtractor(
    flavor='lattice',
    pages='all'
)

# Extract tables
result = extractor.extract("report.pdf")

# Access tables as DataFrames
for i, table in enumerate(result.tables):
    print(f"Table {i} shape: {table.df.shape}")
    table.df.to_csv(f"table_{i}.csv", index=False)

BaseTextExtractor

The foundation for text extraction from documents.

omnidocs.tasks.text_extraction.base.BaseTextExtractor

BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)

Bases: ABC

Base class for text extraction models.

Initialize the text extractor.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the text extraction engine | None |
| extract_images | bool | Whether to extract images alongside text | False |

extract abstractmethod

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from input document.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path] | Path to input document | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| TextOutput | TextOutput containing extracted text |

extract_all

extract_all(input_paths: List[Union[str, Path]], **kwargs) -> List[TextOutput]

Extract text from multiple documents.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_paths | List[Union[str, Path]] | List of document paths | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|------|-------------|
| List[TextOutput] | List of TextOutput objects |

preprocess_input

preprocess_input(input_path: Union[str, Path]) -> Any

Preprocess input document for text extraction.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_path | Union[str, Path] | Path to input document | required |

Returns:

| Type | Description |
|------|-------------|
| Any | Preprocessed document object |

postprocess_output

postprocess_output(raw_output: Any, source_info: Optional[Dict] = None) -> TextOutput

Convert raw text extraction output to standardized TextOutput format.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| raw_output | Any | Raw output from text extraction engine | required |
| source_info | Optional[Dict] | Optional source document information | None |

Returns:

| Type | Description |
|------|-------------|
| TextOutput | Standardized TextOutput object |

Key Features

  • Multi-format Support: PDF, DOCX, HTML, and more
  • Layout Preservation: Maintains document structure
  • Metadata Extraction: Document properties and formatting
  • Batch Processing: Handle multiple documents

Usage Example

from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFExtractor

# Initialize extractor
extractor = PyMuPDFExtractor()

# Extract text with layout
result = extractor.extract("document.pdf")

# Access structured text
print(f"Full text: {result.full_text}")
for block in result.text_blocks:
    print(f"Block: {block.text[:50]}...")
    print(f"Position: {block.bbox}")

Data Models

OCRText

Represents a single text region detected by OCR.

omnidocs.tasks.ocr_extraction.base.OCRText

Bases: BaseModel

Container for individual OCR text detection.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| text | str | Extracted text content |
| confidence | Optional[float] | Confidence score for the text detection |
| bbox | Optional[List[float]] | Bounding box coordinates [x1, y1, x2, y2] |
| polygon | Optional[List[List[float]]] | Optional polygon coordinates for irregular text regions |
| language | Optional[str] | Detected language code (e.g., 'en', 'zh', 'fr') |
| reading_order | Optional[int] | Optional reading order index for text sequencing |

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Attributes

  • text (str): The recognized text content
  • confidence (float): Recognition confidence score (0.0-1.0)
  • bbox (List[float]): Bounding box coordinates [x1, y1, x2, y2]
  • polygon (List[List[float]]): Precise polygon coordinates
  • language (Optional[str]): Detected language code
  • reading_order (int): Reading order index

Example

# Access OCR text regions
for text_region in ocr_result.texts:
    print(f"Text: {text_region.text}")
    print(f"Confidence: {text_region.confidence:.3f}")
    print(f"Bbox: {text_region.bbox}")
    print(f"Language: {text_region.language}")

OCROutput

Complete OCR extraction result.

omnidocs.tasks.ocr_extraction.base.OCROutput

Bases: BaseModel

Container for OCR extraction results.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| texts | List[OCRText] | List of detected text objects |
| full_text | str | Combined text from all detections |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for OCR processing |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the OCR engine |

get_sorted_by_reading_order

get_sorted_by_reading_order() -> List[OCRText]

Get texts sorted by reading order (top-to-bottom, left-to-right if no reading_order).

get_text_by_confidence

get_text_by_confidence(min_confidence: float = 0.5) -> List[OCRText]

Filter texts by minimum confidence threshold.

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Methods

  • get_text_by_confidence(min_confidence): Filter by confidence threshold
  • get_sorted_by_reading_order(): Sort by reading order
  • save_json(output_path): Save results to JSON
  • to_dict(): Convert to dictionary

Example

result = extractor.extract("image.png")

# Filter high-confidence text
high_conf_texts = result.get_text_by_confidence(0.8)
print(f"High confidence regions: {len(high_conf_texts)}")

# Save results
result.save_json("ocr_results.json")
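The reading-order fallback mentioned above (top-to-bottom, left-to-right when no explicit reading_order is set) can be sketched as a two-level sort key. This is an illustrative sketch, not the library's exact algorithm; the line_height tolerance is an assumption:

```python
# Sketch of a reading-order fallback sort: group boxes into rough lines by
# quantizing the top coordinate, then order left-to-right within each line.

def reading_order_key(bbox, line_height=10.0):
    """Sort key: quantized top coordinate first, then left coordinate."""
    x1, y1, _, _ = bbox
    return (round(y1 / line_height), x1)

boxes = [[200, 12, 260, 30], [10, 10, 80, 30], [10, 55, 90, 75]]
ordered = sorted(boxes, key=reading_order_key)
print(ordered)  # [[10, 10, 80, 30], [200, 12, 260, 30], [10, 55, 90, 75]]
```

Quantizing the y coordinate keeps boxes on the same visual line together even when their tops differ by a few pixels.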

Table

Represents an extracted table with structure and data.

omnidocs.tasks.table_extraction.base.Table

Bases: BaseModel

Container for extracted table.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| cells | List[TableCell] | List of table cells |
| num_rows | int | Number of rows in the table |
| num_cols | int | Number of columns in the table |
| bbox | Optional[List[float]] | Bounding box of the entire table [x1, y1, x2, y2] |
| confidence | Optional[float] | Overall table detection confidence |
| table_id | Optional[str] | Optional table identifier |
| caption | Optional[str] | Optional table caption |
| structure_confidence | Optional[float] | Confidence score for table structure detection |

to_csv

to_csv() -> str

Convert table to CSV format.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

to_html

to_html() -> str

Convert table to HTML format.

Key Properties

  • df (pandas.DataFrame): Table data as DataFrame
  • bbox (List[float]): Table bounding box
  • confidence (float): Extraction confidence
  • page_number (int): Source page number

Key Methods

  • to_csv(): Export as CSV string
  • to_html(): Export as HTML string
  • to_dict(): Convert to dictionary

Example

for table in table_result.tables:
    # Access as DataFrame
    df = table.df
    print(f"Table shape: {df.shape}")

    # Export formats
    csv_content = table.to_csv()
    html_content = table.to_html()

    # Save to file
    df.to_excel(f"table_page_{table.page_number}.xlsx")

TableOutput

Complete table extraction result.

omnidocs.tasks.table_extraction.base.TableOutput

Bases: BaseModel

Container for table extraction results.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| tables | List[Table] | List of extracted tables |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for table extraction |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the extraction engine |

get_tables_by_confidence

get_tables_by_confidence(min_confidence: float = 0.5) -> List[Table]

Filter tables by minimum confidence threshold.

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

save_tables_as_csv

save_tables_as_csv(output_dir: Union[str, Path]) -> List[Path]

Save all tables as separate CSV files.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Methods

  • get_tables_by_confidence(min_confidence): Filter by confidence
  • save_tables_as_csv(output_dir): Save all tables as CSV files
  • save_json(output_path): Save metadata to JSON

Example

result = extractor.extract("document.pdf")

# Filter high-confidence tables
good_tables = result.get_tables_by_confidence(0.7)

# Save all tables
csv_files = result.save_tables_as_csv("output_tables/")
print(f"Saved {len(csv_files)} CSV files")

Mapper Classes

BaseOCRMapper

Handles language code mapping and normalization for OCR engines.

omnidocs.tasks.ocr_extraction.base.BaseOCRMapper

BaseOCRMapper(engine_name: str)

Base class for mapping OCR engine-specific outputs to standardized format.

Initialize mapper for specific OCR engine.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| engine_name | str | Name of the OCR engine (e.g., 'tesseract', 'paddle', 'easyocr') | required |

detect_text_language

detect_text_language(text: str) -> Optional[str]

Detect language of extracted text.

from_standard_language

from_standard_language(standard_language: str) -> str

Convert standard ISO 639-1 language code to engine-specific format.

get_supported_languages

get_supported_languages() -> List[str]

Get list of supported languages for this engine.

normalize_bbox

normalize_bbox(bbox: List[float], img_width: int, img_height: int) -> List[float]

Normalize bounding box coordinates to absolute pixel values.

to_standard_language

to_standard_language(engine_language: str) -> str

Convert engine-specific language code to standard ISO 639-1.

Key Methods

  • to_standard_language(engine_language): Convert to standard language code
  • from_standard_language(standard_language): Convert from standard language code
  • get_supported_languages(): List supported languages
  • normalize_bbox(bbox, img_width, img_height): Normalize bounding box coordinates
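The language round-trip can be illustrated with a small dict-based mapper. The class and the code table below are hypothetical stand-ins, not the actual OmniDocs mapping tables:

```python
# Sketch of an engine-specific language mapper. The engine codes here
# (e.g., 'eng', 'chi_sim') are illustrative, Tesseract-style examples.

class TesseractLikeMapper:
    # ISO 639-1 -> hypothetical engine-specific codes
    _TO_ENGINE = {"en": "eng", "fr": "fra", "zh": "chi_sim"}

    def from_standard_language(self, standard_language: str) -> str:
        """Convert ISO 639-1 code to the engine's format (pass through unknowns)."""
        return self._TO_ENGINE.get(standard_language, standard_language)

    def to_standard_language(self, engine_language: str) -> str:
        """Convert an engine code back to ISO 639-1 (pass through unknowns)."""
        reverse = {v: k for k, v in self._TO_ENGINE.items()}
        return reverse.get(engine_language, engine_language)

    def get_supported_languages(self):
        return sorted(self._TO_ENGINE)

mapper = TesseractLikeMapper()
print(mapper.from_standard_language("en"))  # eng
print(mapper.to_standard_language("fra"))   # fr
```

Passing unknown codes through unchanged is a common design choice: it keeps the mapper from silently dropping languages an engine happens to support under its standard name.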

BaseTableMapper

Handles coordinate transformation and table structure mapping.

omnidocs.tasks.table_extraction.base.BaseTableMapper

BaseTableMapper(engine_name: str)

Base class for mapping table extraction engine-specific outputs to standardized format.

Initialize mapper for specific table extraction engine.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| engine_name | str | Name of the table extraction engine | required |

detect_header_rows

detect_header_rows(cells: List[TableCell]) -> List[TableCell]

Detect and mark header cells based on position and formatting.

normalize_bbox

normalize_bbox(bbox: List[float], img_width: int, img_height: int) -> List[float]

Normalize bounding box coordinates to absolute pixel values.

Key Methods

  • normalize_bbox(bbox, img_width, img_height): Normalize coordinates
  • detect_header_rows(cells): Identify header rows
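One plausible normalize_bbox behavior is converting boxes given in relative [0, 1] coordinates to absolute pixels while passing pixel boxes through. This sketch and its heuristic are assumptions for illustration, not the OmniDocs implementation:

```python
# Sketch: normalize a bounding box to absolute pixel coordinates.
# Heuristic: values all within [0, 1] are treated as relative coordinates.

def normalize_bbox(bbox, img_width, img_height):
    if all(0.0 <= v <= 1.0 for v in bbox):
        x1, y1, x2, y2 = bbox
        return [x1 * img_width, y1 * img_height, x2 * img_width, y2 * img_height]
    return list(bbox)  # already in pixels

print(normalize_bbox([0.1, 0.2, 0.5, 0.4], 1000, 500))  # [100.0, 100.0, 500.0, 200.0]
```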

Abstract Base Classes

All extractors inherit from abstract base classes that enforce a consistent interface, following this general shape:

from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Abstract base class for all extractors."""

    @abstractmethod
    def extract(self, input_path: Union[str, Path]) -> Any:
        """Extract data from input document."""
        pass

    @abstractmethod
    def preprocess_input(self, input_path: Union[str, Path]) -> Any:
        """Preprocess input for extraction."""
        pass

    @abstractmethod
    def postprocess_output(self, raw_output: Any) -> Any:
        """Convert raw output to standardized format."""
        pass

Common Patterns

Initialization Pattern

All extractors follow this initialization pattern:

class SomeExtractor(BaseExtractor):
    def __init__(
        self,
        device: Optional[str] = None,
        show_log: bool = False,
        languages: Optional[List[str]] = None,
        **kwargs
    ):
        super().__init__(device, show_log, languages)
        # Extractor-specific initialization
        self._load_model()

Processing Pipeline

Standard processing flow:

  1. Input Validation: Check file existence and format
  2. Preprocessing: Convert to required format (PIL Image, etc.)
  3. Model Inference: Run the actual extraction
  4. Postprocessing: Convert to standardized output format
  5. Result Packaging: Create result object with metadata
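The five steps above can be sketched as a template method. This is a simplified, self-contained sketch, not the actual base-class code (the real classes also handle devices, engines, and typed outputs):

```python
from pathlib import Path
import time

class SketchExtractor:
    """Minimal illustration of the standard extraction pipeline."""

    def extract(self, input_path):
        path = Path(input_path)
        if not path.exists():                       # 1. input validation
            raise FileNotFoundError(path)
        data = self.preprocess_input(path)          # 2. preprocessing
        start = time.perf_counter()
        raw = self.run_inference(data)              # 3. model inference
        result = self.postprocess_output(raw)       # 4. postprocessing
        result["processing_time"] = time.perf_counter() - start  # 5. packaging
        return result

    def preprocess_input(self, path):
        return path.read_bytes()

    def run_inference(self, data):
        return {"n_bytes": len(data)}  # stand-in for a real model

    def postprocess_output(self, raw):
        return dict(raw)

import tempfile, os
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
out = SketchExtractor().extract(f.name)
os.unlink(f.name)
print(out["n_bytes"])  # 5
```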

Error Handling

Consistent error handling across extractors:

try:
    result = extractor.extract("document.pdf")
except FileNotFoundError:
    print("Document not found")
except ImportError:
    print("Required dependencies not installed")
except Exception as e:
    print(f"Extraction failed: {e}")

Performance Considerations

Memory Management

  • Use generators for batch processing large datasets
  • Clear GPU memory between large operations
  • Implement proper cleanup in __del__ methods
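The generator advice above can be sketched as a simple batching helper: only one batch of inputs (and its results) is materialized at a time.

```python
# Sketch: stream a large list of documents in fixed-size batches.
from itertools import islice

def batched(paths, batch_size):
    it = iter(paths)
    while batch := list(islice(it, batch_size)):
        yield batch

paths = [f"doc_{i}.png" for i in range(7)]
sizes = [len(b) for b in batched(paths, 3)]
print(sizes)  # [3, 3, 1]
```

In practice each yielded batch would be passed to extract_all and its results written out before the next batch is read.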

GPU Utilization

  • Check GPU availability before initialization
  • Batch operations when possible
  • Use appropriate tensor data types
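Checking GPU availability before initialization can be done defensively, falling back to CPU when torch (or a GPU) is absent. A minimal sketch, assuming a torch-based backend:

```python
# Sketch: pick a device safely before constructing an extractor.

def pick_device(preferred: str = "cuda") -> str:
    try:
        import torch
        if preferred == "cuda" and torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch not installed: CPU is the only option
    return "cpu"

device = pick_device()
print(device)  # 'cuda' on a GPU machine, otherwise 'cpu'
```

The resulting string can then be passed as the device argument of any extractor.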

Caching

  • Cache model loading where appropriate
  • Implement result caching for repeated operations
  • Use memory-mapped files for large datasets
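Model-load caching can be as simple as memoizing the loader, so repeated extractor construction reuses one model object. A sketch with a stand-in loader:

```python
# Sketch: cache expensive model loads with functools.lru_cache.
from functools import lru_cache

@lru_cache(maxsize=4)
def load_model(model_name: str):
    # stand-in for an expensive load from disk or network
    return {"name": model_name}

a = load_model("ocr-small")
b = load_model("ocr-small")
print(a is b)  # True: the second call is served from the cache
```

Keyed caching like this only works when the arguments are hashable and fully determine the loaded model.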

Extension Points

Custom Extractors

Create custom extractors by inheriting from base classes:

from omnidocs.tasks.ocr_extraction.base import BaseOCRExtractor

class CustomOCRExtractor(BaseOCRExtractor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._load_model()

    def _load_model(self):
        # Load your custom model here
        ...

    def extract(self, input_path, **kwargs):
        # Required: extract is abstract in BaseOCRExtractor
        images = self.preprocess_input(input_path)
        raw_output = ...  # run your model on the images
        return self.postprocess_output(raw_output, images[0].size)

    def postprocess_output(self, raw_output, img_size):
        # Convert raw predictions to an OCROutput
        ...

Custom Mappers

Implement custom language or coordinate mappers:

from omnidocs.tasks.ocr_extraction.base import BaseOCRMapper

class CustomMapper(BaseOCRMapper):
    def __init__(self):
        super().__init__('custom_engine')
        self._setup_custom_mapping()

    def _setup_custom_mapping(self):
        # Define your language mappings
        pass

This core architecture ensures consistency, extensibility, and maintainability across all OmniDocs extractors.