📊 Table Extraction

This section documents the table extraction API: the available extractors, the output types they return, and the base classes they build on.

Overview

Table extraction in OmniDocs identifies tables in PDFs and images and extracts their contents as structured data, so that unstructured document content can be converted into usable formats such as DataFrames.

Available Extractors

CamelotExtractor

Accurate table extraction from PDFs, supporting both lattice (for tables with lines) and stream (for tables without lines) modes.

omnidocs.tasks.table_extraction.extractors.camelot.CamelotExtractor

CamelotExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: str = '1', flavor: str = 'lattice', **kwargs)

Bases: BaseTableExtractor

Camelot based table extraction implementation.

TODO: Bbox coordinate transformation from PDF to image space is still broken. Current issues:

  • Coordinate transformation accuracy issues between PDF points and image pixels
  • Cell bbox estimation doesn't account for actual cell sizes from Camelot
  • Need better integration with Camelot's internal coordinate data
  • Grid-based estimation fallback is inaccurate for real table layouts
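The coordinate problem named in the TODO comes from two differences between the spaces: PDF coordinates are measured in points (1/72 inch) with the origin at the bottom-left of the page, while image pixels have the origin at the top-left. The following is a minimal sketch of the usual conversion, not the OmniDocs implementation. The helper name `pdf_bbox_to_image` is hypothetical, and it assumes an unrotated page rendered at a known DPI with no crop-box offset.

```python
from typing import List


def pdf_bbox_to_image(
    bbox: List[float], page_height_pt: float, dpi: float = 72.0
) -> List[float]:
    """Map a PDF-space bbox [x1, y1, x2, y2] (points, origin bottom-left)
    to image space (pixels, origin top-left).

    Hypothetical helper: assumes an unrotated page rendered at `dpi`
    with no crop-box offset.
    """
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x1, y1, x2, y2 = bbox
    left, right = sorted((x1 * scale, x2 * scale))
    # Flip the y axis: PDF y grows upward, image y grows downward.
    top, bottom = sorted(
        ((page_height_pt - y1) * scale, (page_height_pt - y2) * scale)
    )
    return [left, top, right, bottom]
```

For a US Letter page (792 points tall) rendered at 144 DPI, a box near the bottom of the PDF page maps to a box near the bottom of the image, with every coordinate doubled.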

Initialize Camelot Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Camelot.

Usage Example

from omnidocs.tasks.table_extraction.extractors.camelot import CamelotExtractor

extractor = CamelotExtractor(flavor='lattice') # or 'stream'
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")
    print(table.df.head())

PDFPlumberExtractor

A lightweight and fast PDF table extraction library.

omnidocs.tasks.table_extraction.extractors.pdfplumber.PDFPlumberExtractor

PDFPlumberExtractor(device: Optional[str] = None, show_log: bool = False, table_settings: Optional[Dict] = None, **kwargs)

Bases: BaseTableExtractor

PDFPlumber based table extraction implementation.

Initialize PDFPlumber Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using PDFPlumber.

Usage Example

from omnidocs.tasks.table_extraction.extractors.pdfplumber import PDFPlumberExtractor

extractor = PDFPlumberExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

PPStructureExtractor

An OCR tool that supports multiple languages and provides table recognition capabilities.

omnidocs.tasks.table_extraction.extractors.ppstructure.PPStructureExtractor

PPStructureExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, use_gpu: bool = True, layout_model: Optional[str] = None, table_model: Optional[str] = None, return_ocr_result_in_table: bool = True, **kwargs)

Bases: BaseTableExtractor

PaddleOCR PPStructure based table extraction implementation.

Initialize PPStructure Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using PPStructure.

Usage Example

from omnidocs.tasks.table_extraction.extractors.ppstructure import PPStructureExtractor

extractor = PPStructureExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

SuryaTableExtractor

Deep learning-based table structure recognition, part of the Surya library.

omnidocs.tasks.table_extraction.extractors.surya_table.SuryaTableExtractor

SuryaTableExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)

Bases: BaseTableExtractor

Surya-based table extraction implementation.

Initialize Surya Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Surya.

Usage Example

from omnidocs.tasks.table_extraction.extractors.surya_table import SuryaTableExtractor

extractor = SuryaTableExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableTransformerExtractor

A transformer-based model for table detection and extraction.

omnidocs.tasks.table_extraction.extractors.table_transformer.TableTransformerExtractor

TableTransformerExtractor(device: Optional[str] = None, show_log: bool = False, detection_model_path: Optional[str] = None, structure_model_path: Optional[str] = None, detection_threshold: float = 0.7, structure_threshold: float = 0.7, **kwargs)

Bases: BaseTableExtractor

Table Transformer based table extraction implementation.

Initialize Table Transformer Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Table Transformer.

Usage Example

from omnidocs.tasks.table_extraction.extractors.table_transformer import TableTransformerExtractor

extractor = TableTransformerExtractor()
result = extractor.extract("image.png")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableFormerExtractor

An advanced deep learning model for table structure parsing.

omnidocs.tasks.table_extraction.extractors.tableformer.TableFormerExtractor

TableFormerExtractor(device: Optional[str] = None, show_log: bool = False, model_path: Optional[str] = None, model_type: str = 'structure', confidence_threshold: float = 0.7, max_size: int = 1000, **kwargs)

Bases: BaseTableExtractor

TableFormer based table extraction implementation.

Initialize TableFormer Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using TableFormer.

Usage Example

from omnidocs.tasks.table_extraction.extractors.tableformer import TableFormerExtractor

extractor = TableFormerExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TabulaExtractor

A Java-based tool for extracting tables from PDFs. Requires an installed Java runtime.

omnidocs.tasks.table_extraction.extractors.tabula.TabulaExtractor

TabulaExtractor(device: Optional[str] = None, show_log: bool = False, method: str = 'lattice', pages: Optional[Union[str, List[int]]] = None, multiple_tables: bool = True, guess: bool = True, area: Optional[List[float]] = None, columns: Optional[List[float]] = None, **kwargs)

Bases: BaseTableExtractor

Tabula based table extraction implementation.

Initialize Tabula Table Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables using Tabula.

Usage Example

from omnidocs.tasks.table_extraction.extractors.tabula import TabulaExtractor

extractor = TabulaExtractor()
result = extractor.extract("document.pdf")
for i, table in enumerate(result.tables):
    print(f"Table {i+1} shape: {table.df.shape}")

TableOutput

The standardized output format for table extraction results.

omnidocs.tasks.table_extraction.base.TableOutput

Bases: BaseModel

Container for table extraction results.

Attributes:

  • tables (List[Table]): List of extracted tables.
  • source_img_size (Optional[Tuple[int, int]]): Original image dimensions (width, height).
  • processing_time (Optional[float]): Time taken for table extraction.
  • metadata (Optional[Dict[str, Any]]): Additional metadata from the extraction engine.

get_tables_by_confidence

get_tables_by_confidence(min_confidence: float = 0.5) -> List[Table]

Filter tables by minimum confidence threshold.
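A minimal sketch of what confidence filtering amounts to, using a stand-in `Table` dataclass rather than the OmniDocs model. How the real method treats tables whose `confidence` is `None` is not stated here; this sketch assumes they are excluded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    """Stand-in for the OmniDocs Table with only the fields used here."""
    table_id: str
    confidence: Optional[float] = None


def get_tables_by_confidence(
    tables: List[Table], min_confidence: float = 0.5
) -> List[Table]:
    # Assumption: tables without a confidence score are dropped.
    return [
        t for t in tables
        if t.confidence is not None and t.confidence >= min_confidence
    ]
```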

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

save_tables_as_csv

save_tables_as_csv(output_dir: Union[str, Path]) -> List[Path]

Save all tables as separate CSV files.
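The one-file-per-table behavior can be sketched as follows. This is an illustration only: the function `save_csv_strings` and the `table_<n>.csv` naming scheme are assumptions, and the input is a plain list of CSV strings (such as those returned by `Table.to_csv()`) rather than a `TableOutput`.

```python
from pathlib import Path
from typing import List, Union


def save_csv_strings(
    csv_tables: List[str], output_dir: Union[str, Path]
) -> List[Path]:
    """Write each CSV string to its own file and return the written paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    paths: List[Path] = []
    for i, csv_text in enumerate(csv_tables, start=1):
        path = out / f"table_{i}.csv"  # hypothetical naming scheme
        path.write_text(csv_text, encoding="utf-8")
        paths.append(path)
    return paths
```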

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Properties

  • tables (List[Table]): List of extracted tables.
  • source_img_size (Optional[Tuple[int, int]]): Original image dimensions (width, height).
  • processing_time (Optional[float]): Time taken for extraction.

Key Methods

  • save_json(output_path): Save results metadata to a JSON file.
  • save_tables_as_csv(output_dir): Save all extracted tables as individual CSV files.
  • get_tables_by_confidence(min_confidence): Filter tables by confidence score.

Table

Represents a single extracted table.

omnidocs.tasks.table_extraction.base.Table

Bases: BaseModel

Container for extracted table.

Attributes:

  • cells (List[TableCell]): List of table cells.
  • num_rows (int): Number of rows in the table.
  • num_cols (int): Number of columns in the table.
  • bbox (Optional[List[float]]): Bounding box of the entire table [x1, y1, x2, y2].
  • confidence (Optional[float]): Overall table detection confidence.
  • table_id (Optional[str]): Optional table identifier.
  • caption (Optional[str]): Optional table caption.
  • structure_confidence (Optional[float]): Confidence score for table structure detection.

to_csv

to_csv() -> str

Convert table to CSV format.
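Since a table is stored as a flat list of cells plus `num_rows` and `num_cols`, converting it to CSV means placing each cell on a grid first. The sketch below shows one way this could work; the `Cell` class and its `row`, `col`, and `text` fields are assumptions, because the fields of `TableCell` are not documented in this section.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for TableCell; field names are assumptions."""
    row: int
    col: int
    text: str


def cells_to_csv(cells: List[Cell], num_rows: int, num_cols: int) -> str:
    # Start from an empty grid so missing cells become empty strings.
    grid = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for cell in cells:
        grid[cell.row][cell.col] = cell.text
    buffer = io.StringIO()
    # The csv module handles quoting of commas and quotes inside cell text.
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()
```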

to_dict

to_dict() -> Dict

Convert to dictionary representation.

to_html

to_html() -> str

Convert table to HTML format.

Attributes

  • df (pandas.DataFrame): The extracted table data as a DataFrame.
  • bbox (List[float]): Bounding box coordinates of the table.
  • page_number (int): The page number where the table is found.
  • confidence (Optional[float]): Confidence score of the table extraction.

Key Methods

  • to_csv(): Convert the table DataFrame to a CSV string.
  • to_html(): Convert the table DataFrame to an HTML string.

BaseTableExtractor

The abstract base class for all table extraction extractors.

omnidocs.tasks.table_extraction.base.BaseTableExtractor

BaseTableExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None)

Bases: ABC

Base class for table extraction models.

Initialize the table extractor.

Parameters:

  • device (Optional[str], default None): Device to run model on ('cuda' or 'cpu').
  • show_log (bool, default False): Whether to show detailed logs.
  • engine_name (Optional[str], default None): Name of the table extraction engine.

extract abstractmethod

extract(input_path: Union[str, Path, Image], **kwargs) -> TableOutput

Extract tables from input image.

Parameters:

  • input_path (Union[str, Path, Image], required): Path to input image or image data.
  • **kwargs (default {}): Additional model-specific parameters.

Returns:

  • TableOutput: TableOutput containing extracted tables.
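Because `extract` is abstract, every extractor must subclass the base class and implement it. The pattern can be sketched with stand-in classes, so that the snippet runs without OmniDocs installed: `BaseExtractorSketch` mirrors the documented constructor signature, and `EchoExtractor` and its dictionary return value are purely illustrative.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class BaseExtractorSketch(ABC):
    """Stand-in mirroring the documented constructor; not the OmniDocs class."""

    def __init__(
        self,
        device: Optional[str] = None,
        show_log: bool = False,
        engine_name: Optional[str] = None,
    ):
        self.device = device
        self.show_log = show_log
        self.engine_name = engine_name

    @abstractmethod
    def extract(self, input_path: Any, **kwargs) -> List[dict]:
        """Subclasses must implement extraction."""


class EchoExtractor(BaseExtractorSketch):
    """Illustrative subclass that finds no tables and echoes its input."""

    def __init__(self, **kwargs):
        super().__init__(engine_name="echo", **kwargs)

    def extract(self, input_path: Any, **kwargs) -> List[dict]:
        return [{"source": str(input_path), "tables": []}]
```

Instantiating the base class directly raises `TypeError`, which is how Python enforces that each engine supplies its own `extract`.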

preprocess_input

preprocess_input(input_path: Union[str, Path, Image, ndarray]) -> List[Image.Image]

Convert input to list of PIL Images.

Parameters:

  • input_path (Union[str, Path, Image, ndarray], required): Input image path or image data.

Returns:

  • List[Image]: List of PIL Images.

postprocess_output

postprocess_output(raw_output: Any, img_size: Tuple[int, int]) -> TableOutput

Convert raw table extraction output to standardized TableOutput format.

Parameters:

  • raw_output (Any, required): Raw output from table extraction engine.
  • img_size (Tuple[int, int], required): Original image size (width, height).

Returns:

  • TableOutput: Standardized TableOutput object.

visualize

visualize(table_result: TableOutput, image_path: Union[str, Path, Image], output_path: str = 'visualized_tables.png', table_color: str = 'red', cell_color: str = 'blue', box_width: int = 2, show_text: bool = False, text_color: str = 'green', font_size: int = 12, show_table_ids: bool = True) -> None

Visualize table extraction results by drawing bounding boxes on the original image.

Drawing the detected tables and cells as bounding boxes makes it easy to compare extractors visually and see which one performs better on a given document.

Parameters:

  • table_result (TableOutput, required): TableOutput containing extracted tables.
  • image_path (Union[str, Path, Image], required): Path to original image or PIL Image object.
  • output_path (str, default 'visualized_tables.png'): Path to save the annotated image.
  • table_color (str, default 'red'): Color for table bounding boxes.
  • cell_color (str, default 'blue'): Color for cell bounding boxes.
  • box_width (int, default 2): Width of bounding box lines.
  • show_text (bool, default False): Whether to overlay cell text.
  • text_color (str, default 'green'): Color for text overlay.
  • font_size (int, default 12): Font size for text overlay.
  • show_table_ids (bool, default True): Whether to show table IDs.

BaseTableMapper

Handles mapping of table-related labels and normalization of bounding boxes.

omnidocs.tasks.table_extraction.base.BaseTableMapper

BaseTableMapper(engine_name: str)

Base class for mapping table extraction engine-specific outputs to standardized format.

Initialize mapper for specific table extraction engine.

Parameters:

  • engine_name (str, required): Name of the table extraction engine.

detect_header_rows

detect_header_rows(cells: List[TableCell]) -> List[TableCell]

Detect and mark header cells based on position and formatting.
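A position-only heuristic for header detection can be sketched as below. It treats the topmost row as the header and ignores formatting entirely, so it is simpler than what the method describes. The `Cell` class and its `row`, `col`, `text`, and `is_header` fields are assumptions, since `TableCell` is not documented in this section.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for TableCell; field names are assumptions."""
    row: int
    col: int
    text: str
    is_header: bool = False


def mark_first_row_as_header(cells: List[Cell]) -> List[Cell]:
    """Return copies of the cells with the topmost row flagged as header."""
    if not cells:
        return []
    first_row = min(c.row for c in cells)
    return [replace(c, is_header=(c.row == first_row)) for c in cells]
```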

normalize_bbox

normalize_bbox(bbox: List[float], img_width: int, img_height: int) -> List[float]

Normalize bounding box coordinates to absolute pixel values.
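One plausible reading of this method is that engines may report boxes as fractions of the image size, which then need scaling to pixels. The sketch below follows that reading and also clamps the result to the image bounds; both the relative-input assumption and the clamping are guesses rather than documented behavior, and `to_absolute_bbox` is a hypothetical name.

```python
from typing import List


def to_absolute_bbox(
    bbox: List[float], img_width: int, img_height: int
) -> List[float]:
    """Scale a relative [x1, y1, x2, y2] box (values in 0..1) to pixels.

    Assumption: inputs are fractions of the image size. Results are
    clamped so the box never extends past the image edges.
    """
    x1, y1, x2, y2 = bbox
    scaled = [x1 * img_width, y1 * img_height, x2 * img_width, y2 * img_height]
    limits = [img_width, img_height, img_width, img_height]
    return [min(max(v, 0.0), float(lim)) for v, lim in zip(scaled, limits)]
```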