🧩 Core Classes
This section documents the core base classes and fundamental components that power all OmniDocs extractors.
Base Extractor Classes
BaseOCRExtractor
The foundation for all OCR (Optical Character Recognition) extractors.
omnidocs.tasks.ocr_extraction.base.BaseOCRExtractor
BaseOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, engine_name: Optional[str] = None)
Bases: ABC
Base class for OCR text extraction models.
Initialize the OCR extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| languages | Optional[List[str]] | List of language codes to support (e.g., ['en', 'zh']) | None |
| engine_name | Optional[str] | Name of the OCR engine for language mapping | None |
extract
abstractmethod
Extract text from input image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| OCROutput | OCROutput containing extracted text |
extract_all
Extract text from multiple images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[OCROutput] | List of OCROutput objects |
extract_with_layout
extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> OCROutput
Extract text with optional layout information.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions to focus OCR on | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| OCROutput | OCROutput containing extracted text |
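Before OCR is restricted to `layout_regions`, each region box has to be clamped to the image bounds and degenerate boxes dropped. A minimal sketch of that preparation step — the `{'bbox': [x1, y1, x2, y2]}` region schema is an assumption for illustration, not a documented contract:

```python
# Clamp layout-region boxes to the image bounds before cropping.
# The {'bbox': [x1, y1, x2, y2]} schema is an illustrative assumption;
# check your extractor's documentation for the exact region format.
def clamp_regions(layout_regions, img_width, img_height):
    clamped = []
    for region in layout_regions:
        x1, y1, x2, y2 = region["bbox"]
        box = [max(0, x1), max(0, y1), min(img_width, x2), min(img_height, y2)]
        if box[2] > box[0] and box[3] > box[1]:  # drop empty/degenerate boxes
            clamped.append({**region, "bbox": box})
    return clamped

regions = [{"bbox": [-5, 10, 120, 40]}, {"bbox": [300, 300, 310, 310]}]
print(clamp_regions(regions, img_width=200, img_height=100))
# → [{'bbox': [0, 10, 120, 40]}]
```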
preprocess_input
Convert input to list of PIL Images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|---|---|
| List[Image] | List of PIL Images |
postprocess_output
Convert raw OCR output to standardized OCROutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from OCR engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|---|---|
| OCROutput | Standardized OCROutput object |
visualize
visualize(ocr_result: OCROutput, image_path: Union[str, Path, Image], output_path: str = 'visualized.png', box_color: str = 'red', box_width: int = 2, show_text: bool = False, text_color: str = 'blue', font_size: int = 12) -> None
Visualize OCR results by drawing bounding boxes on the original image.
This method makes it easy to compare extractors by drawing the detected text regions as bounding boxes on the original image.
get_supported_languages
Get list of supported language codes.
Key Features
- Unified Interface: Consistent API across all OCR engines
- Language Support: Multi-language text recognition
- Batch Processing: Process multiple documents efficiently
- Visualization: Built-in result visualization
- Device Management: CPU/GPU support
Usage Example
```python
from omnidocs.tasks.ocr_extraction.extractors.easy_ocr import EasyOCRExtractor

# Initialize extractor
extractor = EasyOCRExtractor(
    languages=['en', 'fr'],
    device='cuda',
    show_log=True
)

# Extract text
result = extractor.extract("document.png")
print(f"Extracted: {result.full_text}")

# Visualize results
extractor.visualize(
    ocr_result=result,
    image_path="document.png",
    output_path="visualization.png"
)
```
BaseTableExtractor
The foundation for all table extraction implementations.
omnidocs.tasks.table_extraction.base.BaseTableExtractor
BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)
Bases: ABC
Base class for table extraction models.
Initialize the table extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the table extraction engine | None |
extract
abstractmethod
Extract tables from input image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TableOutput | TableOutput containing extracted tables |
extract_all
Extract tables from multiple images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path, Image]] | List of image paths or image data | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[TableOutput] | List of TableOutput objects |
extract_with_layout
extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> TableOutput
Extract tables with optional layout information.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image] | Path to input image or image data | required |
| layout_regions | Optional[List[Dict]] | Optional list of layout regions containing tables | None |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TableOutput | TableOutput containing extracted tables |
preprocess_input
Convert input to list of PIL Images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |

Returns:

| Type | Description |
|---|---|
| List[Image] | List of PIL Images |
postprocess_output
Convert raw table extraction output to standardized TableOutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from table extraction engine | required |
| img_size | Tuple[int, int] | Original image size (width, height) | required |

Returns:

| Type | Description |
|---|---|
| TableOutput | Standardized TableOutput object |
visualize
visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None
Visualize table extraction results by drawing bounding boxes on the original image.
This method makes it easy to compare extractors by drawing the detected tables and cells as bounding boxes on the original image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_result | TableOutput | TableOutput containing extracted tables | required |
| image_path | Union[str, Path, Image] | Path to original image or PIL Image object | required |
| output_path | str | Path to save the annotated image | 'visualized_tables.png' |
| table_color | str | Color for table bounding boxes | 'red' |
| cell_color | str | Color for cell bounding boxes | 'blue' |
| box_width | int | Width of bounding box lines | 2 |
| show_text | bool | Whether to overlay cell text | False |
| text_color | str | Color for text overlay | 'green' |
| font_size | int | Font size for text overlay | 12 |
| show_table_ids | bool | Whether to show table IDs | True |
Key Features
- Multiple Formats: Support for PDF and image inputs
- Structured Output: Returns pandas DataFrames
- Coordinate Transformation: Handles PDF to image coordinate mapping
- Batch Processing: Process multiple documents
- Visualization: Table detection visualization
Usage Example
```python
from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor

# Initialize extractor
extractor = CamelotExtractor(
    flavor='lattice',
    pages='all'
)

# Extract tables
result = extractor.extract("report.pdf")

# Access tables as DataFrames
for i, table in enumerate(result.tables):
    print(f"Table {i} shape: {table.df.shape}")
    table.df.to_csv(f"table_{i}.csv", index=False)
```
BaseTextExtractor
The foundation for text extraction from documents.
omnidocs.tasks.text_extraction.base.BaseTextExtractor
BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)
Bases: ABC
Base class for text extraction models.
Initialize the text extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
| show_log | bool | Whether to show detailed logs | False |
| engine_name | Optional[str] | Name of the text extraction engine | None |
| extract_images | bool | Whether to extract images alongside text | False |
extract
abstractmethod
Extract text from input document.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path] | Path to input document | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| TextOutput | TextOutput containing extracted text |
extract_all
Extract text from multiple documents.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | List[Union[str, Path]] | List of document paths | required |
| **kwargs | | Additional model-specific parameters | {} |

Returns:

| Type | Description |
|---|---|
| List[TextOutput] | List of TextOutput objects |
preprocess_input
Preprocess input document for text extraction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path] | Path to input document | required |

Returns:

| Type | Description |
|---|---|
| Any | Preprocessed document object |
postprocess_output
Convert raw text extraction output to standardized TextOutput format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_output | Any | Raw output from text extraction engine | required |
| source_info | Optional[Dict] | Optional source document information | None |

Returns:

| Type | Description |
|---|---|
| TextOutput | Standardized TextOutput object |
Key Features
- Multi-format Support: PDF, DOCX, HTML, and more
- Layout Preservation: Maintains document structure
- Metadata Extraction: Document properties and formatting
- Batch Processing: Handle multiple documents
Usage Example
```python
from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFExtractor

# Initialize extractor
extractor = PyMuPDFExtractor()

# Extract text with layout
result = extractor.extract("document.pdf")

# Access structured text
print(f"Full text: {result.full_text}")
for block in result.text_blocks:
    print(f"Block: {block.text[:50]}...")
    print(f"Position: {block.bbox}")
```
Data Models
OCRText
Represents a single text region detected by OCR.
omnidocs.tasks.ocr_extraction.base.OCRText
Bases: BaseModel
Container for individual OCR text detection.
Attributes:

| Name | Type | Description |
|---|---|---|
| text | str | Extracted text content |
| confidence | Optional[float] | Confidence score for the text detection |
| bbox | Optional[List[float]] | Bounding box coordinates [x1, y1, x2, y2] |
| polygon | Optional[List[List[float]]] | Optional polygon coordinates for irregular text regions |
| language | Optional[str] | Detected language code (e.g., 'en', 'zh', 'fr') |
| reading_order | Optional[int] | Optional reading order index for text sequencing |
Attributes
- text (str): The recognized text content
- confidence (float): Recognition confidence score (0.0-1.0)
- bbox (List[float]): Bounding box coordinates [x1, y1, x2, y2]
- polygon (List[List[float]]): Precise polygon coordinates
- language (Optional[str]): Detected language code
- reading_order (int): Reading order index
Example
```python
# Access OCR text regions
for text_region in ocr_result.texts:
    print(f"Text: {text_region.text}")
    print(f"Confidence: {text_region.confidence:.3f}")
    print(f"Bbox: {text_region.bbox}")
    print(f"Language: {text_region.language}")
```
OCROutput
Complete OCR extraction result.
omnidocs.tasks.ocr_extraction.base.OCROutput
Bases: BaseModel
Container for OCR extraction results.
Attributes:

| Name | Type | Description |
|---|---|---|
| texts | List[OCRText] | List of detected text objects |
| full_text | str | Combined text from all detections |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for OCR processing |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the OCR engine |
get_sorted_by_reading_order
Get texts sorted by reading order (top-to-bottom, left-to-right if no reading_order).
get_text_by_confidence
Filter texts by minimum confidence threshold.
Key Methods
- get_text_by_confidence(min_confidence): Filter by confidence threshold
- get_sorted_by_reading_order(): Sort by reading order
- save_json(output_path): Save results to JSON
- to_dict(): Convert to dictionary
Example
```python
result = extractor.extract("image.png")

# Filter high-confidence text
high_conf_texts = result.get_text_by_confidence(0.8)
print(f"High confidence regions: {len(high_conf_texts)}")

# Save results
result.save_json("ocr_results.json")
```
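When `reading_order` is not set, sorting falls back to position: top-to-bottom, then left-to-right. A self-contained sketch of that fallback — the function name and row tolerance are illustrative, not the library's internals:

```python
# Positional fallback sort: top-to-bottom, then left-to-right.
# `texts` is a list of (text, bbox) pairs with bbox = [x1, y1, x2, y2];
# the names and tolerance here are illustrative, not OmniDocs internals.
def sort_by_position(texts, row_tolerance=10):
    # Bucket boxes whose top edges fall within `row_tolerance` pixels into
    # the same visual row, then order rows by y and cells within a row by x.
    def key(item):
        _, bbox = item
        x1, y1 = bbox[0], bbox[1]
        return (round(y1 / row_tolerance), x1)
    return sorted(texts, key=key)

regions = [
    ("world", [200, 12, 260, 30]),
    ("hello", [10, 10, 80, 30]),
    ("below", [10, 60, 80, 80]),
]
print([t for t, _ in sort_by_position(regions)])
# → ['hello', 'world', 'below']
```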
Table
Represents an extracted table with structure and data.
omnidocs.tasks.table_extraction.base.Table
Bases: BaseModel
Container for extracted table.
Attributes:

| Name | Type | Description |
|---|---|---|
| cells | List[TableCell] | List of table cells |
| num_rows | int | Number of rows in the table |
| num_cols | int | Number of columns in the table |
| bbox | Optional[List[float]] | Bounding box of the entire table [x1, y1, x2, y2] |
| confidence | Optional[float] | Overall table detection confidence |
| table_id | Optional[str] | Optional table identifier |
| caption | Optional[str] | Optional table caption |
| structure_confidence | Optional[float] | Confidence score for table structure detection |
Key Properties
- df (pandas.DataFrame): Table data as DataFrame
- bbox (List[float]): Table bounding box
- confidence (float): Extraction confidence
- page_number (int): Source page number
Key Methods
- to_csv(): Export as CSV string
- to_html(): Export as HTML string
- to_dict(): Convert to dictionary
Example
```python
for table in table_result.tables:
    # Access as DataFrame
    df = table.df
    print(f"Table shape: {df.shape}")

    # Export formats
    csv_content = table.to_csv()
    html_content = table.to_html()

    # Save to file
    df.to_excel(f"table_page_{table.page_number}.xlsx")
```
TableOutput
Complete table extraction result.
omnidocs.tasks.table_extraction.base.TableOutput
Bases: BaseModel
Container for table extraction results.
Attributes:

| Name | Type | Description |
|---|---|---|
| tables | List[Table] | List of extracted tables |
| source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
| processing_time | Optional[float] | Time taken for table extraction |
| metadata | Optional[Dict[str, Any]] | Additional metadata from the extraction engine |
get_tables_by_confidence
Filter tables by minimum confidence threshold.
save_tables_as_csv
Save all tables as separate CSV files.
Key Methods
- get_tables_by_confidence(min_confidence): Filter by confidence
- save_tables_as_csv(output_dir): Save all tables as CSV files
- save_json(output_path): Save metadata to JSON
Example
```python
result = extractor.extract("document.pdf")

# Filter high-confidence tables
good_tables = result.get_tables_by_confidence(0.7)

# Save all tables
csv_files = result.save_tables_as_csv("output_tables/")
print(f"Saved {len(csv_files)} CSV files")
```
Mapper Classes
BaseOCRMapper
Handles language code mapping and normalization for OCR engines.
omnidocs.tasks.ocr_extraction.base.BaseOCRMapper
Base class for mapping OCR engine-specific outputs to standardized format.
Initialize mapper for specific OCR engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| engine_name | str | Name of the OCR engine (e.g., 'tesseract', 'paddle', 'easyocr') | required |
detect_text_language
Detect language of extracted text.
from_standard_language
Convert standard ISO 639-1 language code to engine-specific format.
get_supported_languages
Get list of supported languages for this engine.
normalize_bbox
Normalize bounding box coordinates to absolute pixel values.
Key Methods
- to_standard_language(engine_language): Convert to standard language code
- from_standard_language(standard_language): Convert from standard language code
- get_supported_languages(): List supported languages
- normalize_bbox(bbox, img_width, img_height): Normalize bounding box coordinates
BaseTableMapper
Handles coordinate transformation and table structure mapping.
omnidocs.tasks.table_extraction.base.BaseTableMapper
Base class for mapping table extraction engine-specific outputs to standardized format.
Initialize mapper for specific table extraction engine.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| engine_name | str | Name of the table extraction engine | required |
detect_header_rows
Detect and mark header cells based on position and formatting.
Key Methods
- normalize_bbox(bbox, img_width, img_height): Normalize coordinates
- detect_header_rows(cells): Identify header rows
Abstract Base Classes
All extractors inherit from these abstract base classes, ensuring consistent interfaces:
```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union

class BaseExtractor(ABC):
    """Abstract base class for all extractors."""

    @abstractmethod
    def extract(self, input_path: Union[str, Path]) -> Any:
        """Extract data from input document."""
        pass

    @abstractmethod
    def preprocess_input(self, input_path: Union[str, Path]) -> Any:
        """Preprocess input for extraction."""
        pass

    @abstractmethod
    def postprocess_output(self, raw_output: Any) -> Any:
        """Convert raw output to standardized format."""
        pass
```
Common Patterns
Initialization Pattern
All extractors follow this initialization pattern:
```python
from typing import List, Optional

class SomeExtractor(BaseExtractor):
    def __init__(
        self,
        device: Optional[str] = None,
        show_log: bool = False,
        languages: Optional[List[str]] = None,
        **kwargs
    ):
        super().__init__(device, show_log, languages)
        # Extractor-specific initialization
        self._load_model()
```
Processing Pipeline
Standard processing flow:
- Input Validation: Check file existence and format
- Preprocessing: Convert to required format (PIL Image, etc.)
- Model Inference: Run the actual extraction
- Postprocessing: Convert to standardized output format
- Result Packaging: Create result object with metadata
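The steps above can be sketched as a template method. The hook bodies below are hypothetical stand-ins for a concrete extractor; only the method names follow this page:

```python
from pathlib import Path

# Template-method sketch of the standard processing pipeline described above.
# The hook implementations are illustrative placeholders, not OmniDocs code.
class PipelineSketch:
    def extract(self, input_path):
        path = Path(input_path)
        if not path.exists():                 # 1. input validation
            raise FileNotFoundError(path)
        data = self.preprocess_input(path)    # 2. preprocessing
        raw = self._run_inference(data)       # 3. model inference
        return self.postprocess_output(raw)   # 4-5. postprocessing + packaging

    def preprocess_input(self, path):
        return path.read_bytes()

    def _run_inference(self, data):
        return {"n_bytes": len(data)}

    def postprocess_output(self, raw):
        return {"result": raw, "metadata": {"engine": "sketch"}}
```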
Error Handling
Consistent error handling across extractors:
```python
try:
    result = extractor.extract("document.pdf")
except FileNotFoundError:
    print("Document not found")
except ImportError:
    print("Required dependencies not installed")
except Exception as e:
    print(f"Extraction failed: {e}")
```
Performance Considerations
Memory Management
- Use generators for batch processing large datasets
- Clear GPU memory between large operations
- Implement proper cleanup in __del__ methods
GPU Utilization
- Check GPU availability before initialization
- Batch operations when possible
- Use appropriate tensor data types
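A typical device-selection helper for the first bullet, assuming a PyTorch-backed model; the import guard keeps the helper safe when torch is absent, and the function name is illustrative:

```python
# Pick 'cuda' when a GPU is visible, otherwise fall back to 'cpu'.
# torch is an assumed optional dependency; without it we default to CPU.
def resolve_device(requested=None):
    if requested is not None:
        return requested  # an explicit request always wins
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(resolve_device("cpu"))  # → cpu
print(resolve_device())       # 'cuda' or 'cpu', depending on the machine
```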
Caching
- Cache model loading where appropriate
- Implement result caching for repeated operations
- Use memory-mapped files for large datasets
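Model-load caching can lean on functools.lru_cache, keyed by configuration. The loader below is hypothetical; only the caching pattern is the point:

```python
from functools import lru_cache

# Cache model loading per (engine, device) pair so repeated extractor
# construction reuses the loaded model. load_model is a hypothetical loader.
@lru_cache(maxsize=4)
def load_model(engine_name, device):
    # ...the expensive load would happen here...
    return {"engine": engine_name, "device": device}

m1 = load_model("easyocr", "cpu")
m2 = load_model("easyocr", "cpu")
print(m1 is m2)  # second call is a cache hit, same object
# → True
```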
Extension Points
Custom Extractors
Create custom extractors by inheriting from base classes:
```python
from omnidocs.tasks.ocr_extraction.base import BaseOCRExtractor

class CustomOCRExtractor(BaseOCRExtractor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Custom initialization

    def _load_model(self):
        # Load your custom model
        pass

    def postprocess_output(self, raw_output, img_size):
        # Convert to OCROutput format
        pass
```
Custom Mappers
Implement custom language or coordinate mappers:
```python
from omnidocs.tasks.ocr_extraction.base import BaseOCRMapper

class CustomMapper(BaseOCRMapper):
    def __init__(self):
        super().__init__('custom_engine')
        self._setup_custom_mapping()

    def _setup_custom_mapping(self):
        # Define your language mappings
        pass
```
This core architecture ensures consistency, extensibility, and maintainability across all OmniDocs extractors.