📝 Text Extraction

This section documents the text extraction API: the available extractors, the TextOutput and TextBlock result containers, and the BaseTextExtractor base class.

Overview

Text extraction in OmniDocs focuses on accurately pulling out text from different document formats (PDFs, images, etc.), often preserving layout and structural information. This is a fundamental step for many document understanding applications.

Available Extractors

DoclingTextExtractor

A unified parsing library for PDF, DOCX, PPTX, HTML, and Markdown, with OCR and structure capabilities.

omnidocs.tasks.text_extraction.extractors.docling_parse.DoclingTextExtractor

DoclingTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ocr_enabled: bool = True, table_structure_enabled: bool = True)

Bases: BaseTextExtractor

Text extractor using Docling.

Initialize Docling text extractor.

Parameters:

  • device (Optional[str], default None): Device to run on (not used for Docling).
  • show_log (bool, default False): Whether to show detailed logs.
  • extract_images (bool, default False): Whether to extract images alongside text.
  • ocr_enabled (bool, default True): Whether to enable OCR for scanned documents.
  • table_structure_enabled (bool, default True): Whether to enable table structure detection.

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from document using Docling.

Parameters:

  • input_path (Union[str, Path], required): Path to input document.
  • **kwargs (default {}): Additional parameters (ignored for Docling).

Returns:

  • TextOutput: TextOutput containing extracted text.

Usage Example

from omnidocs.tasks.text_extraction.extractors.docling_parse import DoclingTextExtractor

extractor = DoclingTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PdfplumberTextExtractor

A library for extracting text and tables from PDFs with layout details.

omnidocs.tasks.text_extraction.extractors.pdfplumber.PdfplumberTextExtractor

PdfplumberTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, use_layout: bool = True)

Bases: BaseTextExtractor

Text extractor using pdfplumber.

Initialize pdfplumber text extractor.

Parameters:

  • device (Optional[str], default None): Device to run on (not used for pdfplumber).
  • show_log (bool, default False): Whether to show detailed logs.
  • extract_images (bool, default False): Whether to extract images alongside text.
  • extract_tables (bool, default False): Whether to extract tables.
  • use_layout (bool, default True): Whether to use layout information for text extraction.

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from PDF using pdfplumber.

Parameters:

  • input_path (Union[str, Path], required): Path to input PDF.
  • **kwargs (default {}): Additional parameters (ignored for pdfplumber).

Returns:

  • TextOutput: TextOutput containing extracted text.

Usage Example

from omnidocs.tasks.text_extraction.extractors.pdfplumber import PdfplumberTextExtractor

extractor = PdfplumberTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PdftextTextExtractor

Simple, fast PDF text extraction with layout options.

omnidocs.tasks.text_extraction.extractors.pdftext.PdftextTextExtractor

PdftextTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, keep_layout: bool = False, physical_layout: bool = False)

Bases: BaseTextExtractor

Text extractor using pdftext.

Initialize pdftext text extractor.

Parameters:

  • device (Optional[str], default None): Device to run on (not used for pdftext).
  • show_log (bool, default False): Whether to show detailed logs.
  • extract_images (bool, default False): Whether to extract images alongside text.
  • keep_layout (bool, default False): Whether to keep original layout formatting.
  • physical_layout (bool, default False): Whether to use physical layout analysis.

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from PDF using pdftext.

Parameters:

  • input_path (Union[str, Path], required): Path to input PDF.
  • **kwargs (default {}): Additional parameters (ignored for pdftext).

Returns:

  • TextOutput: TextOutput containing extracted text.

Usage Example

from omnidocs.tasks.text_extraction.extractors.pdftext import PdftextTextExtractor

extractor = PdftextTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PyMuPDFTextExtractor

A fast, multi-format text extraction library with layout and font information.

omnidocs.tasks.text_extraction.extractors.pymupdf.PyMuPDFTextExtractor

PyMuPDFTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, flags: int = 0, clip: Optional[tuple] = None)

Bases: BaseTextExtractor

Text extractor using PyMuPDF (fitz).

Initialize PyMuPDF text extractor.

Parameters:

  • device (Optional[str], default None): Device to run on (not used for PyMuPDF).
  • show_log (bool, default False): Whether to show detailed logs.
  • extract_images (bool, default False): Whether to extract images alongside text.
  • extract_tables (bool, default False): Whether to extract tables.
  • flags (int, default 0): Text extraction flags (fitz.TEXT_PRESERVE_LIGATURES, etc.).
  • clip (Optional[tuple], default None): Optional clipping rectangle (x0, y0, x1, y1).

extract

extract(input_path: Union[str, Path], use_layout: bool = True, **kwargs) -> TextOutput

Extract text from document using PyMuPDF.

Parameters:

  • input_path (Union[str, Path], required): Path to input document.
  • use_layout (bool, default True): Whether to use layout information for extraction.
  • **kwargs (default {}): Additional parameters.

Returns:

  • TextOutput: TextOutput containing extracted text.

Usage Example

from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFTextExtractor

extractor = PyMuPDFTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
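
The flags and clip parameters are easier to read with concrete values. The following is a minimal sketch, not taken from the OmniDocs source: it assumes flags are ordinary PyMuPDF bit masks that combine with bitwise OR, and it falls back to stand-in values when PyMuPDF is not installed. The helper has_flag and the clip dimensions are illustrative only.

```python
# Combine PyMuPDF text flags with bitwise OR and build a clip rectangle.
try:
    import fitz  # PyMuPDF

    LIGATURES = fitz.TEXT_PRESERVE_LIGATURES
    WHITESPACE = fitz.TEXT_PRESERVE_WHITESPACE
except (ImportError, AttributeError):
    # Stand-in values so the sketch runs without PyMuPDF installed;
    # real code should always take the constants from fitz.
    LIGATURES, WHITESPACE = 1, 2

# Flags are bit masks, so several options combine with `|`.
flags = LIGATURES | WHITESPACE

# Clip rectangle as (x0, y0, x1, y1): a 612 x 396 point region anchored
# at the origin (an illustrative choice, half of a US Letter page).
clip = (0.0, 0.0, 612.0, 396.0)


def has_flag(value: int, flag: int) -> bool:
    """Hypothetical helper: True when every bit of `flag` is set in `value`."""
    return value & flag == flag


# With omnidocs installed, these values would be passed as documented above:
#   extractor = PyMuPDFTextExtractor(flags=flags, clip=clip)
```

Leaving flags at its default of 0 and clip at None keeps the extractor's default behavior for the whole page.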

PyPDF2TextExtractor

A pure Python library for extracting text from PDFs, supporting encrypted PDFs and form fields.

omnidocs.tasks.text_extraction.extractors.pypdf2.PyPDF2TextExtractor

PyPDF2TextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ignore_images: bool = True, extract_forms: bool = False)

Bases: BaseTextExtractor

Text extractor using PyPDF2.

Initialize PyPDF2 text extractor.

Parameters:

  • device (Optional[str], default None): Device to run on (not used for PyPDF2).
  • show_log (bool, default False): Whether to show detailed logs.
  • extract_images (bool, default False): Whether to extract images alongside text.
  • ignore_images (bool, default True): Whether to ignore images during text extraction.
  • extract_forms (bool, default False): Whether to extract form fields.

extract

extract(input_path: Union[str, Path], password: Optional[str] = None, **kwargs) -> TextOutput

Extract text from PDF using PyPDF2.

Parameters:

  • input_path (Union[str, Path], required): Path to input PDF.
  • password (Optional[str], default None): Optional password for encrypted PDFs.
  • **kwargs (default {}): Additional parameters (ignored for PyPDF2).

Returns:

  • TextOutput: TextOutput containing extracted text.

Usage Example

from omnidocs.tasks.text_extraction.extractors.pypdf2 import PyPDF2TextExtractor

extractor = PyPDF2TextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

SuryaTextExtractor

Surya-based text extraction for images and documents.

omnidocs.tasks.text_extraction.extractors.surya_text.SuryaTextExtractor

SuryaTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)

Bases: BaseTextExtractor

Surya-based text extraction implementation for images and documents.

Initialize Surya Text Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TextOutput

Extract text using Surya OCR.

Usage Example

from omnidocs.tasks.text_extraction.extractors.surya_text import SuryaTextExtractor

extractor = SuryaTextExtractor()
result = extractor.extract("image.png")
print(f"Extracted text: {result.full_text[:200]}...")

TextOutput

The standardized output format for text extraction results.

omnidocs.tasks.text_extraction.base.TextOutput

Bases: BaseModel

Container for text extraction results.

Attributes:

  • text_blocks (List[TextBlock]): List of extracted text blocks.
  • full_text (str): Combined text from all blocks.
  • metadata (Optional[Dict[str, Any]]): Additional metadata from extraction.
  • source_info (Optional[Dict[str, Any]]): Information about the source document.
  • processing_time (Optional[float]): Time taken for text extraction.
  • page_count (int): Number of pages in the document.

get_sorted_by_reading_order

get_sorted_by_reading_order() -> List[TextBlock]

Get text blocks sorted by reading order.

get_text_by_confidence

get_text_by_confidence(min_confidence: float = 0.5) -> List[TextBlock]

Filter text blocks by minimum confidence threshold.

get_text_by_page

get_text_by_page(page_num: int) -> List[TextBlock]

Get text blocks from a specific page.

get_text_by_type

get_text_by_type(block_type: str) -> List[TextBlock]

Get text blocks of a specific type.

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

save_markdown

save_markdown(output_path: Union[str, Path]) -> None

Save text as markdown with basic formatting.

save_text

save_text(output_path: Union[str, Path]) -> None

Save full text to a text file.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Properties

  • text_blocks (List[TextBlock]): List of extracted text blocks with positions.
  • full_text (str): The complete extracted text content.
  • source_info (Optional[Dict[str, Any]]): Information about the source document.

Key Methods

  • save_json(output_path): Save results to a JSON file.
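
To make the filtering methods concrete, here is a minimal sketch of the semantics their names suggest. It does not use the OmniDocs classes: Block is a stand-in dataclass with a few TextBlock fields, and the handling of missing confidence or reading_order values is an assumption, not confirmed library behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Stand-in for TextBlock with only the fields used below."""
    text: str
    page_num: int = 1
    confidence: Optional[float] = None
    reading_order: Optional[int] = None


def by_confidence(blocks: List[Block], min_confidence: float = 0.5) -> List[Block]:
    # Assumption: blocks without a confidence score are dropped.
    return [b for b in blocks
            if b.confidence is not None and b.confidence >= min_confidence]


def by_page(blocks: List[Block], page_num: int) -> List[Block]:
    return [b for b in blocks if b.page_num == page_num]


def in_reading_order(blocks: List[Block]) -> List[Block]:
    # Assumption: blocks without a reading_order sort after all others.
    return sorted(blocks, key=lambda b: (b.reading_order is None, b.reading_order or 0))


blocks = [
    Block("Body text", page_num=1, confidence=0.92, reading_order=2),
    Block("Title", page_num=1, confidence=0.99, reading_order=1),
    Block("Smudged footer", page_num=2, confidence=0.31),
]

print([b.text for b in by_confidence(blocks, 0.9)])  # ['Body text', 'Title']
print([b.text for b in by_page(blocks, 2)])          # ['Smudged footer']
print([b.text for b in in_reading_order(blocks)])    # ['Title', 'Body text', 'Smudged footer']
```

With a real TextOutput, the equivalent calls would be get_text_by_confidence, get_text_by_page, and get_sorted_by_reading_order as documented above.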

TextBlock

Represents a single block of text with its bounding box.

omnidocs.tasks.text_extraction.base.TextBlock

Bases: BaseModel

Container for individual text block.

Attributes:

  • text (str): Text content.
  • bbox (Optional[List[float]]): Bounding box coordinates [x1, y1, x2, y2].
  • confidence (Optional[float]): Confidence score for text extraction.
  • page_num (int): Page number (for multi-page documents).
  • block_type (Optional[str]): Type of text block (paragraph, heading, list, etc.).
  • font_info (Optional[Dict[str, Any]]): Optional font information.
  • reading_order (Optional[int]): Reading order index.
  • language (Optional[str]): Detected language of the text.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Attributes

  • text (str): The text content of the block.
  • bbox (Optional[List[float]]): Bounding box coordinates [x1, y1, x2, y2].
  • page_num (int): The page number where the text block is found.
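
Because bbox is optional, code that consumes it should handle a missing box. The sketch below shows one way to derive a block's size from the [x1, y1, x2, y2] layout. bbox_size is a hypothetical helper, not part of OmniDocs, and it assumes (x1, y1) and (x2, y2) are opposite corners with x2 >= x1 and y2 >= y1.

```python
from typing import List, Optional, Tuple


def bbox_size(bbox: Optional[List[float]]) -> Optional[Tuple[float, float]]:
    """Return (width, height) for an [x1, y1, x2, y2] box, or None if absent."""
    if bbox is None:
        return None
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


print(bbox_size([72.0, 100.0, 300.0, 120.0]))  # (228.0, 20.0)
print(bbox_size(None))                         # None
```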

BaseTextExtractor

The abstract base class for all text extraction extractors.

omnidocs.tasks.text_extraction.base.BaseTextExtractor

BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)

Bases: ABC

Base class for text extraction models.

Initialize the text extractor.

Parameters:

  • device (Optional[str], default None): Device to run model on ('cuda' or 'cpu').
  • show_log (bool, default False): Whether to show detailed logs.
  • engine_name (Optional[str], default None): Name of the text extraction engine.
  • extract_images (bool, default False): Whether to extract images alongside text.

extract (abstract method)

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from input document.

Parameters:

  • input_path (Union[str, Path], required): Path to input document.
  • **kwargs (default {}): Additional model-specific parameters.

Returns:

  • TextOutput: TextOutput containing extracted text.

preprocess_input

preprocess_input(input_path: Union[str, Path]) -> Any

Preprocess input document for text extraction.

Parameters:

  • input_path (Union[str, Path], required): Path to input document.

Returns:

  • Any: Preprocessed document object.

postprocess_output

postprocess_output(raw_output: Any, source_info: Optional[Dict] = None) -> TextOutput

Convert raw text extraction output to standardized TextOutput format.

Parameters:

  • raw_output (Any, required): Raw output from text extraction engine.
  • source_info (Optional[Dict], default None): Optional source document information.

Returns:

  • TextOutput: Standardized TextOutput object.
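
The three methods above suggest a preprocess, extract, postprocess pipeline. The following sketch shows how a custom extractor might follow that pattern. To stay runnable without OmniDocs installed it uses stand-ins: SketchBaseExtractor and SketchOutput only mirror the documented signatures, and PlainTextExtractor is a hypothetical example class. How the real base class wires these methods together is an assumption here.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Optional, Union


@dataclass
class SketchOutput:
    """Stand-in for TextOutput with a subset of its fields."""
    full_text: str
    page_count: int = 1
    source_info: Optional[Dict[str, Any]] = None


class SketchBaseExtractor(ABC):
    """Stand-in mirroring the documented BaseTextExtractor constructor."""

    def __init__(self, device: Optional[str] = None, show_log: bool = False,
                 engine_name: Optional[str] = None, extract_images: bool = False):
        self.device = device
        self.show_log = show_log
        self.engine_name = engine_name
        self.extract_images = extract_images

    @abstractmethod
    def extract(self, input_path: Union[str, Path], **kwargs) -> SketchOutput:
        """Subclasses must implement extraction."""


class PlainTextExtractor(SketchBaseExtractor):
    """Hypothetical extractor that reads UTF-8 text files."""

    def __init__(self, show_log: bool = False):
        super().__init__(show_log=show_log, engine_name="plain-text")

    def preprocess_input(self, input_path: Union[str, Path]) -> str:
        # Load the raw document content.
        return Path(input_path).read_text(encoding="utf-8")

    def postprocess_output(self, raw_output: str,
                           source_info: Optional[Dict] = None) -> SketchOutput:
        # Normalize the raw text into the standardized container.
        return SketchOutput(full_text=raw_output.strip(), source_info=source_info)

    def extract(self, input_path: Union[str, Path], **kwargs) -> SketchOutput:
        raw = self.preprocess_input(input_path)
        return self.postprocess_output(raw, source_info={"path": str(input_path)})


# Usage: write a small file, then run it through the pipeline.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("Hello, OmniDocs!\n", encoding="utf-8")
    result = PlainTextExtractor().extract(sample)

print(result.full_text)  # Hello, OmniDocs!
```

A real subclass would inherit from omnidocs.tasks.text_extraction.base.BaseTextExtractor and return a TextOutput instead of the stand-in container.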