📊 Table Extraction
This section documents the table extraction API: the available extractors, the output types they return, and the base classes they share.
Overview
Table extraction in OmniDocs focuses on accurately identifying and extracting structured data from tables within PDFs and images. This is crucial for converting unstructured document data into usable formats like DataFrames.
Available Extractors
CamelotExtractor
Accurate table extraction from PDFs, supporting both lattice (for tables with lines) and stream (for tables without lines) modes.
omnidocs.tasks.table_extraction.extractors.camelot.CamelotExtractor
CamelotExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: str = '1', flavor: str = 'lattice', **kwargs)
Bases: BaseTableExtractor
Camelot based table extraction implementation.
TODO: Bbox coordinate transformation from PDF to image space is still broken. Current issues:
- Coordinate transformation accuracy issues between PDF points and image pixels
- Cell bbox estimation doesn't account for actual cell sizes from Camelot
- Need better integration with Camelot's internal coordinate data
- Grid-based estimation fallback is inaccurate for real table layouts
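For orientation, the kind of PDF-to-image transformation the TODO refers to can be sketched as follows. This is a minimal illustration, not the library's implementation. It assumes the PDF bbox is given as (x1, y1, x2, y2) in points with the origin at the bottom-left of the page, and that the rendered image covers the whole page. The function name `pdf_bbox_to_image` and its parameters are hypothetical.

```python
from typing import List, Tuple


def pdf_bbox_to_image(
    bbox: Tuple[float, float, float, float],
    page_size: Tuple[float, float],
    img_size: Tuple[int, int],
) -> List[float]:
    """Map a PDF-space bbox (origin bottom-left) to image space (origin top-left)."""
    x1, y1, x2, y2 = bbox
    page_w, page_h = page_size
    img_w, img_h = img_size

    # Scale factors from PDF points to image pixels.
    sx = img_w / page_w
    sy = img_h / page_h

    # Flip the y axis: PDF y grows upward, image y grows downward.
    left, right = sorted((x1 * sx, x2 * sx))
    top, bottom = sorted(((page_h - y1) * sy, (page_h - y2) * sy))
    return [left, top, right, bottom]


# A US Letter page (612 x 792 pt) rendered at 2 px per point.
box = pdf_bbox_to_image((72, 72, 540, 720), (612, 792), (1224, 1584))
print(box)
```

Cropped or rotated pages would need an extra offset or rotation step, which this sketch leaves out.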
Initialize Camelot Table Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor
extractor = CamelotExtractor(flavor='lattice') # or 'stream'
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
    print(table.df.head())
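When it is not known in advance whether a PDF's tables have ruling lines, one option is to try `lattice` first and fall back to `stream`. The sketch below assumes only what the usage example shows: a `flavor` keyword, an `extract()` method, and a `tables` list on the result. The helper `extract_with_fallback` is hypothetical and not part of OmniDocs.

```python
from typing import Any, Callable


def extract_with_fallback(make_extractor: Callable[..., Any], pdf_path: str) -> Any:
    """Try the 'lattice' flavor first, then 'stream' if no tables were found."""
    result = None
    for flavor in ("lattice", "stream"):
        result = make_extractor(flavor=flavor).extract(pdf_path)
        if result.tables:
            break  # Stop at the first flavor that yields any table.
    return result


# Intended use (assumes CamelotExtractor is imported as in the example above):
# result = extract_with_fallback(CamelotExtractor, "document.pdf")
```

Passing the class as a factory keeps the helper independent of any one extractor, so it can be exercised with a test double.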
PDFPlumberExtractor
A lightweight and fast PDF table extraction library.
omnidocs.tasks.table_extraction.extractors.pdfplumber.PDFPlumberExtractor
PDFPlumberExtractor(device: Optional[str] = None, show_log: bool = False, table_settings: Optional[Dict] = None, **kwargs)
Bases: BaseTableExtractor
PDFPlumber based table extraction implementation.
Initialize PDFPlumber Table Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.pdfplumber import PDFPlumberExtractor
extractor = PDFPlumberExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
PPStructureExtractor
An OCR tool that supports multiple languages and provides table recognition capabilities.
omnidocs.tasks.table_extraction.extractors.ppstructure.PPStructureExtractor
PPStructureExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, use_gpu: bool = True, layout_model: Optional[str] = None, table_model: Optional[str] = None, return_ocr_result_in_table: bool = True, **kwargs)
Bases: BaseTableExtractor
PaddleOCR PPStructure based table extraction implementation.
Initialize PPStructure Table Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.ppstructure import PPStructureExtractor
extractor = PPStructureExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
SuryaTableExtractor
Deep learning-based table structure recognition, part of the Surya library.
omnidocs.tasks.table_extraction.extractors.surya_table.SuryaTableExtractor
SuryaTableExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)
Bases: BaseTableExtractor
Surya-based table extraction implementation.
Initialize Surya Table Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.surya_table import SuryaTableExtractor
extractor = SuryaTableExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
TableTransformerExtractor
A transformer-based model for table detection and extraction.
omnidocs.tasks.table_extraction.extractors.table_transformer.TableTransformerExtractor
TableTransformerExtractor(device: Optional[str] = None, show_log: bool = False, detection_model_path: Optional[str] = None, structure_model_path: Optional[str] = None, detection_threshold: float = 0.7, structure_threshold: float = 0.7, **kwargs)
Bases: BaseTableExtractor
Table Transformer based table extraction implementation.
Initialize Table Transformer Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.table_transformer import TableTransformerExtractor
extractor = TableTransformerExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
TableFormerExtractor
An advanced deep learning model for table structure parsing.
omnidocs.tasks.table_extraction.extractors.tableformer.TableFormerExtractor
TableFormerExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[str] = None, model_type: str = 'structure', confidence_threshold: float = 0.7, max_size: int = 1000, **kwargs)
Bases: BaseTableExtractor
TableFormer based table extraction implementation.
Initialize TableFormer Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.tableformer import TableFormerExtractor
extractor = TableFormerExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
TabulaExtractor
A Java-based tool for extracting tables from PDFs. Requires Java runtime installed.
omnidocs.tasks.table_extraction.extractors.tabula.TabulaExtractor
TabulaExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: Optional[Union[str, List[int]]] = None, multiple_tables: bool = True, guess: bool = True, area: Optional[List[float]] = None, columns: Optional[List[float]] = None, **kwargs)
Bases: BaseTableExtractor
Tabula based table extraction implementation.
Initialize Tabula Table Extractor.
Usage Example
from omnidocs.tasks.table_extraction.extractors.tabula import TabulaExtractor
extractor = TabulaExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
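Because Tabula needs a Java runtime, it can help to check for one before constructing the extractor. The following is a small sketch using only the standard library. The helper name `java_available` is hypothetical, and it only checks that a `java` executable is on `PATH`, not which version it is.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if not java_available():
    print("Java runtime not found; TabulaExtractor is unlikely to work.")
```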
TableOutput
The standardized output format for table extraction results.
omnidocs.tasks.table_extraction.base.TableOutput
Bases: BaseModel
Container for table extraction results.
Attributes:
Name | Type | Description |
---|---|---|
tables | List[Table] | List of extracted tables |
source_img_size | Optional[Tuple[int, int]] | Original image dimensions (width, height) |
processing_time | Optional[float] | Time taken for table extraction |
metadata | Optional[Dict[str, Any]] | Additional metadata from the extraction engine |
get_tables_by_confidence
Filter tables by minimum confidence threshold.
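The filtering idea can be sketched as below. `TableStub` is a hypothetical stand-in for `Table` that carries only the fields used here, and dropping tables whose confidence is `None` is an assumption; the actual method may treat missing scores differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableStub:
    """Hypothetical stand-in for Table; only the fields used in this sketch."""
    table_id: str
    confidence: Optional[float] = None


def filter_by_confidence(tables: List[TableStub], min_confidence: float) -> List[TableStub]:
    # Assumption: tables without a confidence score are dropped.
    return [
        t for t in tables
        if t.confidence is not None and t.confidence >= min_confidence
    ]


tables = [TableStub("a", 0.95), TableStub("b", 0.40), TableStub("c")]
kept = filter_by_confidence(tables, 0.7)
print([t.table_id for t in kept])
```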
save_tables_as_csv
Save all tables as separate CSV files.
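A rough standard-library sketch of writing one CSV file per table is shown below. Tables are represented as plain lists of rows rather than `Table` objects, and the `table_{i}.csv` naming scheme is an assumption, not necessarily what the library produces.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Sequence


def save_grids_as_csv(grids: Sequence[List[List[str]]], output_dir: str) -> List[Path]:
    """Write each grid (a list of rows) to its own CSV file in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(grids, start=1):
        path = out / f"table_{i}.csv"  # Hypothetical naming scheme.
        with path.open("w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = save_grids_as_csv([[["a", "b"], ["1", "2"]]], tmp)
    names = [p.name for p in written]
print(names)
```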
Key Properties
- tables (List[Table]): List of extracted tables.
- source_file (str): Path to the processed file.
- processing_time (Optional[float]): Time taken for extraction.
Key Methods
- save_json(output_path): Save results metadata to a JSON file.
- save_tables_as_csv(output_dir): Save all extracted tables as individual CSV files.
- get_tables_by_confidence(min_confidence): Filter tables by confidence score.
Table
Represents a single extracted table.
omnidocs.tasks.table_extraction.base.Table
Bases: BaseModel
Container for extracted table.
Attributes:
Name | Type | Description |
---|---|---|
cells | List[TableCell] | List of table cells |
num_rows | int | Number of rows in the table |
num_cols | int | Number of columns in the table |
bbox | Optional[List[float]] | Bounding box of the entire table [x1, y1, x2, y2] |
confidence | Optional[float] | Overall table detection confidence |
table_id | Optional[str] | Optional table identifier |
caption | Optional[str] | Optional table caption |
structure_confidence | Optional[float] | Confidence score for table structure detection |
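To show how `cells`, `num_rows`, and `num_cols` relate, here is a minimal sketch that arranges a flat cell list into a 2D grid. The fields of `TableCell` are not listed in this section, so `CellStub` and its `row`, `col`, and `text` fields are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellStub:
    """Hypothetical stand-in for TableCell; field names are assumptions."""
    row: int
    col: int
    text: str


def cells_to_grid(cells: List[CellStub], num_rows: int, num_cols: int) -> List[List[str]]:
    """Place each cell's text at its (row, col) position; missing cells stay empty."""
    grid = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for cell in cells:
        # Skip cells that fall outside the declared table shape.
        if 0 <= cell.row < num_rows and 0 <= cell.col < num_cols:
            grid[cell.row][cell.col] = cell.text
    return grid


grid = cells_to_grid(
    [CellStub(0, 0, "Name"), CellStub(0, 1, "Qty"), CellStub(1, 0, "Bolt")],
    num_rows=2,
    num_cols=2,
)
print(grid)
```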
Attributes
- df (pandas.DataFrame): The extracted table data as a DataFrame.
- bbox (List[float]): Bounding box coordinates of the table.
- page_number (int): The page number where the table is found.
- confidence (Optional[float]): Confidence score of the table extraction.
Key Methods
- to_csv(): Convert the table DataFrame to a CSV string.
- to_html(): Convert the table DataFrame to an HTML string.
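The two conversions can be illustrated without pandas. The sketch below renders a grid of strings to CSV and HTML with the standard library. It is a stand-in for the idea, not the library's code, and the function names are hypothetical. The output of a DataFrame-based implementation would differ in detail (index column, header markup).

```python
import csv
import html
import io
from typing import List


def grid_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows to a CSV string with '\n' line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def grid_to_html(rows: List[List[str]]) -> str:
    """Render rows as a bare HTML table, escaping cell text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


rows = [["Item", "Note"], ["Bolt", "M6 <short>"]]
csv_text = grid_to_csv(rows)
html_text = grid_to_html(rows)
print(csv_text)
print(html_text)
```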
BaseTableExtractor
The abstract base class for all table extraction extractors.
omnidocs.tasks.table_extraction.base.BaseTableExtractor
BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)
Bases: ABC
Base class for table extraction models.
Initialize the table extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
show_log | bool | Whether to show detailed logs | False |
engine_name | Optional[str] | Name of the table extraction engine | None |
extract (abstract method)
Extract tables from input image.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path, Image] | Path to input image or image data | required |
**kwargs | | Additional model-specific parameters | {} |
Returns:
Type | Description |
---|---|
TableOutput | TableOutput containing extracted tables |
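The abstract-method contract can be sketched with a stand-in base class. `ExtractorBaseStub` below only mirrors the documented constructor arguments and is not the real `BaseTableExtractor`. `EchoExtractor` is a hypothetical subclass, and it returns a plain list where a real extractor would return a `TableOutput`.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class ExtractorBaseStub(ABC):
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    def __init__(self, device: Optional[str] = None, show_log: bool = False,
                 engine_name: Optional[str] = None):
        self.device = device
        self.show_log = show_log
        self.engine_name = engine_name

    @abstractmethod
    def extract(self, input_path: Any, **kwargs) -> List[Any]:
        """Subclasses must implement extraction."""


class EchoExtractor(ExtractorBaseStub):
    def __init__(self, **kwargs):
        super().__init__(engine_name="echo", **kwargs)

    def extract(self, input_path: Any, **kwargs) -> List[Any]:
        # A real engine would run table detection here; this returns no tables.
        return []


extractor = EchoExtractor(device="cpu")
print(extractor.engine_name, extractor.extract("page.png"))
```

Instantiating the base class directly raises `TypeError`, which is how `ABC` enforces that every extractor provides its own `extract`.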
preprocess_input
Convert input to list of PIL Images.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path, Image, ndarray] | Input image path or image data | required |
Returns:
Type | Description |
---|---|
List[Image] | List of PIL Images |
postprocess_output
Convert raw table extraction output to standardized TableOutput format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_output | Any | Raw output from table extraction engine | required |
img_size | Tuple[int, int] | Original image size (width, height) | required |
Returns:
Type | Description |
---|---|
TableOutput | Standardized TableOutput object |
visualize
visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None
Visualize table extraction results by drawing bounding boxes on the original image.
Use this method to compare extractors visually: detected tables and cells are drawn as bounding boxes, so differences between extractors are easy to spot.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_result | TableOutput | TableOutput containing extracted tables | required |
image_path | Union[str, Path, Image] | Path to original image or PIL Image object | required |
output_path | str | Path to save the annotated image | 'visualized_tables.png' |
table_color | str | Color for table bounding boxes | 'red' |
cell_color | str | Color for cell bounding boxes | 'blue' |
box_width | int | Width of bounding box lines | 2 |
show_text | bool | Whether to overlay cell text | False |
text_color | str | Color for text overlay | 'green' |
font_size | int | Font size for text overlay | 12 |
show_table_ids | bool | Whether to show table IDs | True |
BaseTableMapper
Handles mapping of table-related labels and normalization of bounding boxes.
omnidocs.tasks.table_extraction.base.BaseTableMapper
Base class for mapping table extraction engine-specific outputs to standardized format.
Initialize mapper for specific table extraction engine.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine_name | str | Name of the table extraction engine | required |
detect_header_rows
Detect and mark header cells based on position and formatting.
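One simple way to approximate header detection is a content heuristic on the first row. The sketch below is an illustration under stated assumptions, not the method's actual logic: it uses position and cell content only (no formatting cues), works on a plain grid of strings, and the names `looks_numeric` and `first_row_is_header` are hypothetical.

```python
from typing import List


def looks_numeric(text: str) -> bool:
    """Return True if text parses as a number, ignoring thousands separators."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def first_row_is_header(grid: List[List[str]]) -> bool:
    """Heuristic: the first row has no numeric cells, but some later row does."""
    if len(grid) < 2:
        return False
    first_has_numbers = any(looks_numeric(c) for c in grid[0])
    body_has_numbers = any(looks_numeric(c) for row in grid[1:] for c in row)
    return not first_has_numbers and body_has_numbers


print(first_row_is_header([["Item", "Qty"], ["Bolt", "1,200"]]))
print(first_row_is_header([["1", "2"], ["3", "4"]]))
```

A heuristic like this fails on all-text tables, where formatting cues such as bold text or position would be needed instead.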