📊 Table Extraction

This section documents the table extraction API: the available extractors, the output models they return, and the base classes they build on.

Overview

Table extraction in OmniDocs focuses on accurately identifying and extracting structured data from tables within PDFs and images. This is crucial for converting unstructured document data into usable formats like DataFrames.

Available Extractors

CamelotExtractor

Accurate table extraction from PDFs, supporting both lattice (for tables with lines) and stream (for tables without lines) modes.

omnidocs.tasks.table_extraction.extractors.camelot.CamelotExtractor

CamelotExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: str = '1', flavor: str = 'lattice', **kwargs)

Bases: BaseTableExtractor

Camelot based table extraction implementation.

TODO: Bbox coordinate transformation from PDF to image space is still broken. Current issues:

- Coordinate transformation accuracy issues between PDF points and image pixels
- Cell bbox estimation doesn't account for actual cell sizes from Camelot
- Need better integration with Camelot's internal coordinate data
- Grid-based estimation fallback is inaccurate for real table layouts
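For orientation, the kind of PDF-to-image mapping the TODO refers to can be sketched as follows. This is a minimal illustration, not the extractor's actual code. It assumes the source box is in PDF user space (points, origin at the bottom-left) and that the page image was rendered at a known DPI with no cropping or rotation. The function name `pdf_bbox_to_image_bbox` and its parameters are hypothetical.

```python
from typing import List, Tuple

PDF_POINTS_PER_INCH = 72.0  # PDF user space unit: 1 point = 1/72 inch


def pdf_bbox_to_image_bbox(
    bbox_pts: Tuple[float, float, float, float],
    page_height_pts: float,
    dpi: float,
) -> List[float]:
    """Map a PDF-space box (origin bottom-left) to pixel space (origin top-left).

    Hypothetical helper: assumes an uncropped, unrotated page rendered at `dpi`.
    """
    scale = dpi / PDF_POINTS_PER_INCH
    x1, y1, x2, y2 = bbox_pts
    # Flip the y axis: PDF y grows upward, image y grows downward.
    top = (page_height_pts - max(y1, y2)) * scale
    bottom = (page_height_pts - min(y1, y2)) * scale
    left = min(x1, x2) * scale
    right = max(x1, x2) * scale
    return [left, top, right, bottom]


# A 612 x 792 pt page rendered at 144 DPI gives a scale factor of 2.0.
image_bbox = pdf_bbox_to_image_bbox(
    (72.0, 692.0, 300.0, 720.0), page_height_pts=792.0, dpi=144.0
)
print(image_bbox)
```

Pages with a crop box offset or a rotation entry would need extra handling that this sketch leaves out.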

Initialize Camelot Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Camelot.

Usage Example

from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor

extractor = CamelotExtractor(flavor='lattice') # or 'stream'
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
    print(table.df.head())

PDFPlumberExtractor

Lightweight, fast table extraction from PDFs, built on PDFPlumber.

omnidocs.tasks.table_extraction.extractors.pdfplumber.PDFPlumberExtractor

PDFPlumberExtractor(device: Optional[str] = None, show_log: bool = False, table_settings: Optional[Dict] = None, **kwargs)

Bases: BaseTableExtractor

PDFPlumber based table extraction implementation.

Initialize PDFPlumber Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using PDFPlumber.

Usage Example

from omnidocs.tasks.table_extraction.extractors.pdfplumber import PDFPlumberExtractor

extractor = PDFPlumberExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

PPStructureExtractor

Table recognition built on PaddleOCR's PPStructure, with multi-language OCR support.

omnidocs.tasks.table_extraction.extractors.ppstructure.PPStructureExtractor

PPStructureExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, use_gpu: bool = True, layout_model: Optional[str] = None, table_model: Optional[str] = None, return_ocr_result_in_table: bool = True, **kwargs)

Bases: BaseTableExtractor

PaddleOCR PPStructure based table extraction implementation.

Initialize PPStructure Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using PPStructure.

Usage Example

from omnidocs.tasks.table_extraction.extractors.ppstructure import PPStructureExtractor

extractor = PPStructureExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

SuryaTableExtractor

Deep learning-based table structure recognition, part of the Surya library.

omnidocs.tasks.table_extraction.extractors.surya_table.SuryaTableExtractor

SuryaTableExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)

Bases: BaseTableExtractor

Surya-based table extraction implementation.

Initialize Surya Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Surya.

Usage Example

from omnidocs.tasks.table_extraction.extractors.surya_table import SuryaTableExtractor

extractor = SuryaTableExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableTransformerExtractor

A transformer-based model for table detection and extraction.

omnidocs.tasks.table_extraction.extractors.table_transformer.TableTransformerExtractor

TableTransformerExtractor(device: Optional[str] = None, show_log: bool = False, detection_model_path: Optional[str] = None, structure_model_path: Optional[str] = None, detection_threshold: float = 0.7, structure_threshold: float = 0.7, **kwargs)

Bases: BaseTableExtractor

Table Transformer based table extraction implementation.

Initialize Table Transformer Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Table Transformer.

Usage Example

from omnidocs.tasks.table_extraction.extractors.table_transformer import TableTransformerExtractor

extractor = TableTransformerExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableFormerExtractor

An advanced deep learning model for table structure parsing.

omnidocs.tasks.table_extraction.extractors.tableformer.TableFormerExtractor

TableFormerExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[str] = None, model_type: str = 'structure', confidence_threshold: float = 0.7, max_size: int = 1000, **kwargs)

Bases: BaseTableExtractor

TableFormer based table extraction implementation.

Initialize TableFormer Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using TableFormer.

Usage Example

from omnidocs.tasks.table_extraction.extractors.tableformer import TableFormerExtractor

extractor = TableFormerExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TabulaExtractor

A Java-based tool for extracting tables from PDFs. Requires a Java runtime to be installed.

omnidocs.tasks.table_extraction.extractors.tabula.TabulaExtractor

TabulaExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: Optional[Union[str, List[int]]] = None, multiple_tables: bool = True, guess: bool = True, area: Optional[List[float]] = None, columns: Optional[List[float]] = None, **kwargs)

Bases: BaseTableExtractor

Tabula based table extraction implementation.

Initialize Tabula Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Tabula.

Usage Example

from omnidocs.tasks.table_extraction.extractors.tabula import TabulaExtractor

extractor = TabulaExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableOutput

The standardized output format for table extraction results.

omnidocs.tasks.table_extraction.base.TableOutput

Bases: BaseModel

Container for table extraction results.

Attributes:

Name	Type	Description
`tables`	`List[Table]`	List of extracted tables
`source_img_size`	`Optional[Tuple[int, int]]`	Original image dimensions (width, height)
`processing_time`	`Optional[float]`	Time taken for table extraction
`metadata`	`Optional[Dict[str, Any]]`	Additional metadata from the extraction engine

get_tables_by_confidence

get_tables_by_confidence(min_confidence: float = 0.5) -> List[Table]

Filter tables by minimum confidence threshold.
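The filtering behaviour can be illustrated with a small standalone sketch. `TableStub` is a hypothetical stand-in for `Table` that carries only the fields used here. Excluding tables that have no confidence score is an assumption, not confirmed behaviour of the library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableStub:
    """Hypothetical stand-in for Table; only the fields needed for filtering."""
    table_id: str
    confidence: Optional[float] = None


def filter_by_confidence(
    tables: List[TableStub], min_confidence: float = 0.5
) -> List[TableStub]:
    # Assumption: tables without a confidence score are excluded.
    return [
        t for t in tables
        if t.confidence is not None and t.confidence >= min_confidence
    ]


tables = [TableStub("t1", 0.92), TableStub("t2", 0.31), TableStub("t3", None)]
kept_ids = [t.table_id for t in filter_by_confidence(tables, min_confidence=0.5)]
print(kept_ids)
```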

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

save_tables_as_csv

save_tables_as_csv(output_dir: Union[str, Path]) -> List[Path]

Save all tables as separate CSV files.
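As a rough sketch of what "one CSV file per table" can look like, the following uses only the standard library. The helper name `save_grids_as_csv`, the row-list input, and the `table_<n>.csv` naming scheme are all assumptions made for illustration; the actual file names produced by `save_tables_as_csv` may differ.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Sequence, Union


def save_grids_as_csv(
    grids: Sequence[List[List[str]]], output_dir: Union[str, Path]
) -> List[Path]:
    """Write each table (a list of rows) to its own CSV file and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(grids, start=1):
        path = out / f"table_{i}.csv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = save_grids_as_csv([[["a", "b"], ["1", "2"]], [["x"], ["y"]]], tmp)
    written_names = [p.name for p in written]
    first_lines = written[0].read_text(encoding="utf-8").splitlines()

print(written_names, first_lines)
```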

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Properties

tables (List[Table]): List of extracted tables.
source_file (str): Path to the processed file.
processing_time (Optional[float]): Time taken for extraction.

Key Methods

save_json(output_path): Save results metadata to a JSON file.
save_tables_as_csv(output_dir): Save all extracted tables as individual CSV files.
get_tables_by_confidence(min_confidence): Filter tables by confidence score.

Table

Represents a single extracted table.

omnidocs.tasks.table_extraction.base.Table

Bases: BaseModel

Container for extracted table.

Attributes:

Name	Type	Description
`cells`	`List[TableCell]`	List of table cells
`num_rows`	`int`	Number of rows in the table
`num_cols`	`int`	Number of columns in the table
`bbox`	`Optional[List[float]]`	Bounding box of the entire table [x1, y1, x2, y2]
`confidence`	`Optional[float]`	Overall table detection confidence
`table_id`	`Optional[str]`	Optional table identifier
`caption`	`Optional[str]`	Optional table caption
`structure_confidence`	`Optional[float]`	Confidence score for table structure detection

to_csv

to_csv() -> str

Convert table to CSV format.
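One possible way to turn a list of cells into CSV text is sketched below. `CellStub` and its `row`, `col`, and `text` fields are hypothetical; the real `TableCell` model is not documented in this section, and the library's own conversion may work differently.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class CellStub:
    """Hypothetical stand-in for TableCell; field names are assumptions."""
    row: int
    col: int
    text: str


def cells_to_csv(cells: List[CellStub], num_rows: int, num_cols: int) -> str:
    # Start from an empty grid so missing cells become empty fields.
    grid = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for cell in cells:
        grid[cell.row][cell.col] = cell.text
    buffer = io.StringIO()
    # The csv module quotes fields that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


cells = [
    CellStub(0, 0, "Item"),
    CellStub(0, 1, "Price"),
    CellStub(1, 0, "Tea, green"),
    CellStub(1, 1, "4.50"),
]
csv_text = cells_to_csv(cells, num_rows=2, num_cols=2)
print(csv_text)
```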

to_dict

to_dict() -> Dict

Convert to dictionary representation.

to_html

to_html() -> str

Convert table to HTML format.
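A minimal sketch of an HTML rendering follows, assuming the table is available as a list of rows and that the first row is the header. The helper `grid_to_html` and its `header_rows` parameter are hypothetical. Cell text is escaped so that characters such as `<` and `&` do not break the markup.

```python
from html import escape
from typing import List


def grid_to_html(rows: List[List[str]], header_rows: int = 1) -> str:
    """Render rows as a plain HTML table; the first `header_rows` rows use <th>."""
    parts = ["<table>"]
    for r, row in enumerate(rows):
        tag = "th" if r < header_rows else "td"
        cells_html = "".join(f"<{tag}>{escape(text)}</{tag}>" for text in row)
        parts.append(f"<tr>{cells_html}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


html_text = grid_to_html([["Name", "Qty"], ["A&B", "<3"]])
print(html_text)
```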

Attributes

df (pandas.DataFrame): The extracted table data as a DataFrame.
bbox (List[float]): Bounding box coordinates of the table.
page_number (int): The page number where the table is found.
confidence (Optional[float]): Confidence score of the table extraction.

Key Methods

to_csv(): Convert the table DataFrame to a CSV string.
to_html(): Convert the table DataFrame to an HTML string.

BaseTableExtractor

The abstract base class for all table extraction extractors.

omnidocs.tasks.table_extraction.base.BaseTableExtractor

BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)

Bases: ABC

Base class for table extraction models.

Initialize the table extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run model on ('cuda' or 'cpu')	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`engine_name`	`Optional[str]`	Name of the table extraction engine	`None`

extract `abstractmethod`

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables from input image.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path, Image]`	Path to input image or image data	required
`**kwargs`		Additional model-specific parameters	`{}`

Returns:

Type	Description
`TableOutput`	TableOutput containing extracted tables

preprocess_input

preprocess_input(input_path: Union[str, Path, Image, ndarray]) -> List[Image.Image]

Convert input to list of PIL Images.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path, Image, ndarray]`	Input image path or image data	required

Returns:

Type	Description
`List[Image]`	List of PIL Images

postprocess_output

postprocess_output(raw_output: Any, img_size: Tuple[int, int]) -> TableOutput

Convert raw table extraction output to standardized TableOutput format.

Parameters:

Name	Type	Description	Default
`raw_output`	`Any`	Raw output from table extraction engine	required
`img_size`	`Tuple[int, int]`	Original image size (width, height)	required

Returns:

Type	Description
`TableOutput`	Standardized TableOutput object

visualize

visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None

Visualize table extraction results by drawing bounding boxes on the original image.

Drawing the detected tables and cells on the image makes it easy to compare extractors visually.

Parameters:

Name	Type	Description	Default
`table_result`	`TableOutput`	TableOutput containing extracted tables	required
`image_path`	`Union[str, Path, Image]`	Path to original image or PIL Image object	required
`output_path`	`str`	Path to save the annotated image	`'visualized_tables.png'`
`table_color`	`str`	Color for table bounding boxes	`'red'`
`cell_color`	`str`	Color for cell bounding boxes	`'blue'`
`box_width`	`int`	Width of bounding box lines	`2`
`show_text`	`bool`	Whether to overlay cell text	`False`
`text_color`	`str`	Color for text overlay	`'green'`
`font_size`	`int`	Font size for text overlay	`12`
`show_table_ids`	`bool`	Whether to show table IDs	`True`

BaseTableMapper

Handles mapping of table-related labels and normalization of bounding boxes.

omnidocs.tasks.table_extraction.base.BaseTableMapper

BaseTableMapper(engine_name: str)

Base class for mapping table extraction engine-specific outputs to standardized format.

Initialize mapper for specific table extraction engine.

Parameters:

Name	Type	Description	Default
`engine_name`	`str`	Name of the table extraction engine	required

detect_header_rows

detect_header_rows(cells: List[TableCell]) -> List[TableCell]

Detect and mark header cells based on position and formatting.
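The simplest position-based heuristic is to treat the topmost row as the header, sketched below. This ignores formatting cues entirely. `CellStub` and its `is_header` flag are hypothetical stand-ins; the mapper's real logic may be more involved.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class CellStub:
    """Hypothetical stand-in for TableCell; field names are assumptions."""
    row: int
    col: int
    text: str
    is_header: bool = False


def mark_first_row_as_header(cells: List[CellStub]) -> List[CellStub]:
    """Return copies of the cells with the topmost row flagged as header."""
    if not cells:
        return []
    first_row = min(c.row for c in cells)
    return [replace(c, is_header=(c.row == first_row)) for c in cells]


cells = [
    CellStub(0, 0, "Name"),
    CellStub(0, 1, "Qty"),
    CellStub(1, 0, "Bolt"),
    CellStub(1, 1, "12"),
]
header_texts = [c.text for c in mark_first_row_as_header(cells) if c.is_header]
print(header_texts)
```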

normalize_bbox

normalize_bbox(bbox: List[float], img_width: int, img_height: int) -> List[float]

Normalize bounding box coordinates to absolute pixel values.
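As an illustration, the sketch below converts a box given in relative coordinates to absolute pixels. It assumes the input is expressed as fractions of the image size in the range 0 to 1, which is only one of the formats an engine might emit. Clamping to the image bounds is likewise an assumption, and `relative_to_pixel_bbox` is a hypothetical name.

```python
from typing import List


def relative_to_pixel_bbox(
    bbox: List[float], img_width: int, img_height: int
) -> List[float]:
    """Scale a relative [x1, y1, x2, y2] box to pixels and clamp it to the image."""

    def clamp(value: float, upper: int) -> float:
        return max(0.0, min(float(upper), value))

    x1, y1, x2, y2 = bbox
    return [
        clamp(x1 * img_width, img_width),
        clamp(y1 * img_height, img_height),
        clamp(x2 * img_width, img_width),
        clamp(y2 * img_height, img_height),
    ]


# The last coordinate overshoots the image and is clamped to its height.
pixel_bbox = relative_to_pixel_bbox([0.25, 0.5, 0.75, 1.5], img_width=1000, img_height=800)
print(pixel_bbox)
```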
