🧩 Core Classes
This section documents the core base classes and fundamental components that power all OmniDocs extractors.
Base Extractor Classes
BaseOCRExtractor
The foundation for all OCR (Optical Character Recognition) extractors.
omnidocs.tasks.ocr_extraction.base.BaseOCRExtractor
BaseOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, engine_name: Optional[str] = None)
Bases: ABC
Base class for OCR text extraction models.
Initialize the OCR extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| languages | Optional[List[str]] | List of language codes to support (e.g., ['en', 'zh']) | None |
| engine_name | Optional[str] | Name of the OCR engine for language mapping | None |
extract
abstractmethod
Extract text from input image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| OCROutput | OCROutput containing extracted text |
extract_all
Extract text from multiple images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[OCROutput] | List of OCROutput objects |
extract_with_layout
extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> OCROutput
Extract text with optional layout information.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions to focus OCR on | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| OCROutput | OCROutput containing extracted text |
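Before OCR is restricted to `layout_regions`, each region box has to be clamped to the image bounds and degenerate boxes dropped. A minimal sketch of that preparation step — the `{'bbox': [x1, y1, x2, y2]}` region schema is an assumption for illustration, not a documented contract:

```python
# Clamp layout-region boxes to the image bounds before cropping.
# The {'bbox': [x1, y1, x2, y2]} schema is an illustrative assumption;
# check your extractor's documentation for the exact region format.
def clamp_regions(layout_regions, img_width, img_height):
    clamped = []
    for region in layout_regions:
        x1, y1, x2, y2 = region["bbox"]
        box = [max(0, x1), max(0, y1), min(img_width, x2), min(img_height, y2)]
        if box[2] > box[0] and box[3] > box[1]:  # drop empty/degenerate boxes
            clamped.append({**region, "bbox": box})
    return clamped

regions = [{"bbox": [-5, 10, 120, 40]}, {"bbox": [300, 300, 310, 310]}]
print(clamp_regions(regions, img_width=200, img_height=100))
# → [{'bbox': [0, 10, 120, 40]}]
```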
preprocess_input
Convert input to list of PIL Images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|---|---|
| List[Image] | List of PIL Images |
postprocess_output
Convert raw OCR output to standardized OCROutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from OCR engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|---|---|
| OCROutput | Standardized OCROutput object |
visualize
visualize(ocr_result: OCROutput, image_path: Union[str, Path, Image], output_path: str = 'visualized.png', box_color: str = 'red', box_width: int = 2, show_text: bool = False, text_color: str = 'blue', font_size: int = 12) -> None
Visualize OCR results by drawing bounding boxes on the original image.
This method makes it easy to compare extractors by drawing the detected text regions as bounding boxes on the original image.
get_supported_languages
Get list of supported language codes.
Key Features
- Unified Interface: Consistent API across all OCR engines
- Language Support: Multi-language text recognition
- Batch Processing: Process multiple documents efficiently
- Visualization: Built-in result visualization
- Device Management: CPU/GPU support
Usage Example
```python
from omnidocs.tasks.ocr_extraction.extractors.easy_ocr import EasyOCRExtractor

# Initialize extractor
extractor = EasyOCRExtractor(
    languages=['en', 'fr'],
    device='cuda',
    show_log=True
)

# Extract text
result = extractor.extract("document.png")
print(f"Extracted: {result.full_text}")

# Visualize results
extractor.visualize(
    ocr_result=result,
    image_path="document.png",
    output_path="visualization.png"
)
```
BaseTableExtractor
The foundation for all table extraction implementations.
omnidocs.tasks.table_extraction.base.BaseTableExtractor
BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)
Bases: ABC
Base class for table extraction models.
Initialize the table extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the table extraction engine | None |
extract
abstractmethod
Extract tables from input image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TableOutput | TableOutput containing extracted tables |
extract_all
Extract tables from multiple images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[TableOutput] | List of TableOutput objects |
extract_with_layout
extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> TableOutput
Extract tables with optional layout information.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions containing tables | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TableOutput | TableOutput containing extracted tables |
preprocess_input
Convert input to list of PIL Images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|---|---|
| List[Image] | List of PIL Images |
postprocess_output
Convert raw table extraction output to standardized TableOutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from table extraction engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|---|---|
| TableOutput | Standardized TableOutput object |
visualize
visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None
Visualize table extraction results by drawing bounding boxes on the original image.
This method makes it easy to compare extractors by drawing the detected tables and cells as bounding boxes on the original image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_result | TableOutput | TableOutput containing extracted tables | required |
| image_path | Union[str, Path, Image] | Path to original image or PIL Image object | required |
| output_path | str | Path to save the annotated image | 'visualized_tables.png' |
| table_color | str | Color for table bounding boxes | 'red' |
| cell_color | str | Color for cell bounding boxes | 'blue' |
| box_width | int | Width of bounding box lines | 2 |
| show_text | bool | Whether to overlay cell text | False |
| text_color | str | Color for text overlay | 'green' |
| font_size | int | Font size for text overlay | 12 |
| show_table_ids | bool | Whether to show table IDs | True |
Key Features
- Multiple Formats: Support for PDF and image inputs
- Structured Output: Returns pandas DataFrames
- Coordinate Transformation: Handles PDF to image coordinate mapping
- Batch Processing: Process multiple documents
- Visualization: Table detection visualization
Usage Example
```python
from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor

# Initialize extractor
extractor = CamelotExtractor(
    flavor='lattice',
    pages='all'
)

# Extract tables
result = extractor.extract("report.pdf")

# Access tables as DataFrames
for i, table in enumerate(result.tables):
    print(f"Table {i} shape: {table.df.shape}")
    table.df.to_csv(f"table_{i}.csv", index=False)
```
BaseTextExtractor
The foundation for text extraction from documents.
omnidocs.tasks.text_extraction.base.BaseTextExtractor
BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)
Bases: ABC
Base class for text extraction models.
Initialize the text extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the text extraction engine | None |
| extract_images | bool | Whether to extract images alongside text | False |
extract
abstractmethod
Extract text from input document.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path] | Path to input document | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TextOutput | TextOutput containing extracted text |
extract_all
Extract text from multiple documents.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path]] | List of document paths | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[TextOutput] | List of TextOutput objects |
preprocess_input
Preprocess input document for text extraction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path] | Path to input document | required |

Returns:

| Type | Description |
|---|---|
| Any | Preprocessed document object |
postprocess_output
Convert raw text extraction output to standardized TextOutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from text extraction engine | required |
| source_info | Optional[Dict] | Optional source document information | None |

Returns:

| Type | Description |
|---|---|
| TextOutput | Standardized TextOutput object |
Key Features
- Multi-format Support: PDF, DOCX, HTML, and more
- Layout Preservation: Maintains document structure
- Metadata Extraction: Document properties and formatting
- Batch Processing: Handle multiple documents
Usage Example
```python
from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFExtractor

# Initialize extractor
extractor = PyMuPDFExtractor()

# Extract text with layout
result = extractor.extract("document.pdf")

# Access structured text
print(f"Full text: {result.full_text}")
for block in result.text_blocks:
    print(f"Block: {block.text[:50]}...")
    print(f"Position: {block.bbox}")
```
Data Models
OCRText
Represents a single text region detected by OCR.
omnidocs.tasks.ocr_extraction.base.OCRText
Bases: BaseModel
Container for individual OCR text detection.
Attributes:

| Name | Type | Description |
|---|---|---|
| text | str | Extracted text content |
| confidence | Optional[float] | Confidence score for the text detection |
| bbox | Optional[List[float]] | Bounding box coordinates [x1, y1, x2, y2] |
| polygon | Optional[List[List[float]]] | Optional polygon coordinates for irregular text regions |
| language | Optional[str] | Detected language code (e.g., 'en', 'zh', 'fr') |
| reading_order | Optional[int] | Optional reading order index for text sequencing |
Attributes
- text (str): The recognized text content
- confidence (float): Recognition confidence score (0.0-1.0)
- bbox (List[float]): Bounding box coordinates [x1, y1, x2, y2]
- polygon (List[List[float]]): Precise polygon coordinates
- language (Optional[str]): Detected language code
- reading_order (int): Reading order index
Example
```python
# Access OCR text regions
for text_region in ocr_result.texts:
    print(f"Text: {text_region.text}")
    print(f"Confidence: {text_region.confidence:.3f}")
    print(f"Bbox: {text_region.bbox}")
    print(f"Language: {text_region.language}")
```
OCROutput
Complete OCR extraction result.
omnidocs.tasks.ocr_extraction.base.OCROutput
Bases: BaseModel
Container for OCR extraction results.
Attributes:

| Name | Type | Description |
|---|---|---|
| texts | List[OCRText] | List of detected text objects |
| full_text | str | Combined text from all detections |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for OCR processing |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the OCR engine |
get_sorted_by_reading_order
Get texts sorted by reading order (top-to-bottom, left-to-right if no reading_order).
get_text_by_confidence
Filter texts by minimum confidence threshold.
Key Methods
- get_text_by_confidence(min_confidence): Filter by confidence threshold
- get_sorted_by_reading_order(): Sort by reading order
- save_json(output_path): Save results to JSON
- to_dict(): Convert to dictionary
Example
```python
result = extractor.extract("image.png")

# Filter high-confidence text
high_conf_texts = result.get_text_by_confidence(0.8)
print(f"High confidence regions: {len(high_conf_texts)}")

# Save results
result.save_json("ocr_results.json")
```
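When `reading_order` is not set, sorting falls back to position: top-to-bottom, then left-to-right. A self-contained sketch of that fallback — the function name and row tolerance are illustrative, not the library's internals:

```python
# Positional fallback sort: top-to-bottom, then left-to-right.
# `texts` is a list of (text, bbox) pairs with bbox = [x1, y1, x2, y2];
# the names and tolerance here are illustrative, not OmniDocs internals.
def sort_by_position(texts, row_tolerance=10):
    # Bucket boxes whose top edges fall within `row_tolerance` pixels into
    # the same visual row, then order rows by y and cells within a row by x.
    def key(item):
        _, bbox = item
        x1, y1 = bbox[0], bbox[1]
        return (round(y1 / row_tolerance), x1)
    return sorted(texts, key=key)

regions = [
    ("world", [200, 12, 260, 30]),
    ("hello", [10, 10, 80, 30]),
    ("below", [10, 60, 80, 80]),
]
print([t for t, _ in sort_by_position(regions)])
# → ['hello', 'world', 'below']
```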
Table
Represents an extracted table with structure and data.
omnidocs.tasks.table_extraction.base.Table
Bases: BaseModel
Container for extracted table.
Attributes:

| Name | Type | Description |
|---|---|---|
| cells | List[TableCell] | List of table cells |
| num_rows | int | Number of rows in the table |
| num_cols | int | Number of columns in the table |
| bbox | Optional[List[float]] | Bounding box of the entire table [x1, y1, x2, y2] |
| confidence | Optional[float] | Overall table detection confidence |
| table_id | Optional[str] | Optional table identifier |
| caption | Optional[str] | Optional table caption |
| structure_confidence | Optional[float] | Confidence score for table structure detection |
Key Properties
- df (pandas.DataFrame): Table data as DataFrame
- bbox (List[float]): Table bounding box
- confidence (float): Extraction confidence
- page_number (int): Source page number
Key Methods
- to_csv(): Export as CSV string
- to_html(): Export as HTML string
- to_dict(): Convert to dictionary
Example
```python
for table in table_result.tables:
    # Access as DataFrame
    df = table.df
    print(f"Table shape: {df.shape}")

    # Export formats
    csv_content = table.to_csv()
    html_content = table.to_html()

    # Save to file
    df.to_excel(f"table_page_{table.page_number}.xlsx")
```
TableOutput
Complete table extraction result.
omnidocs.tasks.table_extraction.base.TableOutput
Bases: BaseModel
Container for table extraction results.
Attributes:

| Name | Type | Description |
|---|---|---|
| tables | List[Table] | List of extracted tables |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for table extraction |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the extraction engine |
get_tables_by_confidence
Filter tables by minimum confidence threshold.
save_tables_as_csv
Save all tables as separate CSV files.
Key Methods
- get_tables_by_confidence(min_confidence): Filter by confidence
- save_tables_as_csv(output_dir): Save all tables as CSV files
- save_json(output_path): Save metadata to JSON
Example
```python
result = extractor.extract("document.pdf")

# Filter high-confidence tables
good_tables = result.get_tables_by_confidence(0.7)

# Save all tables
csv_files = result.save_tables_as_csv("output_tables/")
print(f"Saved {len(csv_files)} CSV files")
```
Mapper Classes
BaseOCRMapper
Handles language code mapping and normalization for OCR engines.
omnidocs.tasks.ocr_extraction.base.BaseOCRMapper
Base class for mapping OCR engine-specific outputs to standardized format.
Initialize mapper for specific OCR engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| engine_name | str | Name of the OCR engine (e.g., 'tesseract', 'paddle', 'easyocr') | required |
detect_text_language
Detect language of extracted text.
from_standard_language
Convert standard ISO 639-1 language code to engine-specific format.
get_supported_languages
Get list of supported languages for this engine.
normalize_bbox
Normalize bounding box coordinates to absolute pixel values.
Key Methods
- to_standard_language(engine_language): Convert to standard language code
- from_standard_language(standard_language): Convert from standard language code
- get_supported_languages(): List supported languages
- normalize_bbox(bbox, img_width, img_height): Normalize bounding box coordinates
BaseTableMapper
Handles coordinate transformation and table structure mapping.
omnidocs.tasks.table_extraction.base.BaseTableMapper
Base class for mapping table extraction engine-specific outputs to standardized format.
Initialize mapper for specific table extraction engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| engine_name | str | Name of the table extraction engine | required |
detect_header_rows
Detect and mark header cells based on position and formatting.
Key Methods
- normalize_bbox(bbox, img_width, img_height): Normalize coordinates
- detect_header_rows(cells): Identify header rows
Abstract Base Classes
All extractors inherit from these abstract base classes, ensuring consistent interfaces:
```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union

class BaseExtractor(ABC):
    """Abstract base class for all extractors."""

    @abstractmethod
    def extract(self, input_path: Union[str, Path]) -> Any:
        """Extract data from input document."""
        pass

    @abstractmethod
    def preprocess_input(self, input_path: Union[str, Path]) -> Any:
        """Preprocess input for extraction."""
        pass

    @abstractmethod
    def postprocess_output(self, raw_output: Any) -> Any:
        """Convert raw output to standardized format."""
        pass
```
Common Patterns
Initialization Pattern
All extractors follow this initialization pattern:
```python
from typing import List, Optional

class SomeExtractor(BaseExtractor):
    def __init__(
        self,
        device: Optional[str] = None,
        show_log: bool = False,
        languages: Optional[List[str]] = None,
        **kwargs
    ):
        super().__init__(device, show_log, languages)
        # Extractor-specific initialization
        self._load_model()
```
Processing Pipeline
Standard processing flow:
- Input Validation: Check file existence and format
- Preprocessing: Convert to required format (PIL Image, etc.)
- Model Inference: Run the actual extraction
- Postprocessing: Convert to standardized output format
- Result Packaging: Create result object with metadata
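The steps above can be sketched as a template method. The hook bodies below are hypothetical stand-ins for a concrete extractor; only the method names follow this page:

```python
from pathlib import Path

# Template-method sketch of the standard processing pipeline described above.
# The hook implementations are illustrative placeholders, not OmniDocs code.
class PipelineSketch:
    def extract(self, input_path):
        path = Path(input_path)
        if not path.exists():                 # 1. input validation
            raise FileNotFoundError(path)
        data = self.preprocess_input(path)    # 2. preprocessing
        raw = self._run_inference(data)       # 3. model inference
        return self.postprocess_output(raw)   # 4-5. postprocessing + packaging

    def preprocess_input(self, path):
        return path.read_bytes()

    def _run_inference(self, data):
        return {"n_bytes": len(data)}

    def postprocess_output(self, raw):
        return {"result": raw, "metadata": {"engine": "sketch"}}
```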
Error Handling
Consistent error handling across extractors:
```python
try:
    result = extractor.extract("document.pdf")
except FileNotFoundError:
    print("Document not found")
except ImportError:
    print("Required dependencies not installed")
except Exception as e:
    print(f"Extraction failed: {e}")
```
Performance Considerations
Memory Management
- Use generators for batch processing large datasets
- Clear GPU memory between large operations
- Implement proper cleanup in __del__ methods
GPU Utilization
- Check GPU availability before initialization
- Batch operations when possible
- Use appropriate tensor data types
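A typical device-selection helper for the first bullet, assuming a PyTorch-backed model; the import guard keeps the helper safe when torch is absent, and the function name is illustrative:

```python
# Pick 'cuda' when a GPU is visible, otherwise fall back to 'cpu'.
# torch is an assumed optional dependency; without it we default to CPU.
def resolve_device(requested=None):
    if requested is not None:
        return requested  # an explicit request always wins
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(resolve_device("cpu"))  # → cpu
print(resolve_device())       # 'cuda' or 'cpu', depending on the machine
```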
Caching
- Cache model loading where appropriate
- Implement result caching for repeated operations
- Use memory-mapped files for large datasets
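Model-load caching can lean on functools.lru_cache, keyed by configuration. The loader below is hypothetical; only the caching pattern is the point:

```python
from functools import lru_cache

# Cache model loading per (engine, device) pair so repeated extractor
# construction reuses the loaded model. load_model is a hypothetical loader.
@lru_cache(maxsize=4)
def load_model(engine_name, device):
    # ...the expensive load would happen here...
    return {"engine": engine_name, "device": device}

m1 = load_model("easyocr", "cpu")
m2 = load_model("easyocr", "cpu")
print(m1 is m2)  # second call is a cache hit, same object
# → True
```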
Extension Points
Custom Extractors
Create custom extractors by inheriting from base classes:
```python
from omnidocs.tasks.ocr_extraction.base import BaseOCRExtractor

class CustomOCRExtractor(BaseOCRExtractor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Custom initialization

    def _load_model(self):
        # Load your custom model
        pass

    def postprocess_output(self, raw_output, img_size):
        # Convert to OCROutput format
        pass
```
Custom Mappers
Implement custom language or coordinate mappers:
```python
from omnidocs.tasks.ocr_extraction.base import BaseOCRMapper

class CustomMapper(BaseOCRMapper):
    def __init__(self):
        super().__init__('custom_engine')
        self._setup_custom_mapping()

    def _setup_custom_mapping(self):
        # Define your language mappings
        pass
```
This core architecture ensures consistency, extensibility, and maintainability across all OmniDocs extractors.