📝 Text Extraction

This section documents the text extraction API: the available extractors, the `TextOutput` and `TextBlock` containers they return, and the `BaseTextExtractor` class they share.

Overview

Text extraction in OmniDocs focuses on accurately pulling out text from different document formats (PDFs, images, etc.), often preserving layout and structural information. This is a fundamental step for many document understanding applications.

Available Extractors

DoclingTextExtractor

Built on Docling, a unified parsing library for PDF, DOCX, PPTX, HTML, and Markdown, with OCR and table-structure detection.

omnidocs.tasks.text_extraction.extractors.docling_parse.DoclingTextExtractor

DoclingTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ocr_enabled: bool = True, table_structure_enabled: bool = True)

Bases: BaseTextExtractor

Text extractor using Docling.

Initialize Docling text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run on (not used for Docling)	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`extract_images`	`bool`	Whether to extract images alongside text	`False`
`ocr_enabled`	`bool`	Whether to enable OCR for scanned documents	`True`
`table_structure_enabled`	`bool`	Whether to enable table structure detection	`True`

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from document using Docling.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input document	required
`**kwargs`		Additional parameters (ignored for Docling)	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

Usage Example

from omnidocs.tasks.text_extraction.extractors.docling_parse import DoclingTextExtractor

extractor = DoclingTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PdfplumberTextExtractor

A library for extracting text and tables from PDFs with layout details.

omnidocs.tasks.text_extraction.extractors.pdfplumber.PdfplumberTextExtractor

PdfplumberTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, use_layout: bool = True)

Bases: BaseTextExtractor

Text extractor using pdfplumber.

Initialize pdfplumber text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run on (not used for pdfplumber)	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`extract_images`	`bool`	Whether to extract images alongside text	`False`
`extract_tables`	`bool`	Whether to extract tables	`False`
`use_layout`	`bool`	Whether to use layout information for text extraction	`True`

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from PDF using pdfplumber.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input PDF	required
`**kwargs`		Additional parameters (ignored for pdfplumber)	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

Usage Example

from omnidocs.tasks.text_extraction.extractors.pdfplumber import PdfplumberTextExtractor

extractor = PdfplumberTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PdftextTextExtractor

A simple, fast PDF text extractor with layout options.

omnidocs.tasks.text_extraction.extractors.pdftext.PdftextTextExtractor

PdftextTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, keep_layout: bool = False, physical_layout: bool = False)

Bases: BaseTextExtractor

Text extractor using pdftext.

Initialize pdftext text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run on (not used for pdftext)	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`extract_images`	`bool`	Whether to extract images alongside text	`False`
`keep_layout`	`bool`	Whether to keep original layout formatting	`False`
`physical_layout`	`bool`	Whether to use physical layout analysis	`False`

extract

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from PDF using pdftext.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input PDF	required
`**kwargs`		Additional parameters (ignored for pdftext)	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

Usage Example

from omnidocs.tasks.text_extraction.extractors.pdftext import PdftextTextExtractor

extractor = PdftextTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")

PyMuPDFTextExtractor

A fast, multi-format text extraction library with layout and font information.

omnidocs.tasks.text_extraction.extractors.pymupdf.PyMuPDFTextExtractor

PyMuPDFTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, flags: int = 0, clip: Optional[tuple] = None)

Bases: BaseTextExtractor

Text extractor using PyMuPDF (fitz).

Initialize PyMuPDF text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run on (not used for PyMuPDF)	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`extract_images`	`bool`	Whether to extract images alongside text	`False`
`extract_tables`	`bool`	Whether to extract tables	`False`
`flags`	`int`	Text extraction flags (fitz.TEXT_PRESERVE_LIGATURES, etc.)	`0`
`clip`	`Optional[tuple]`	Optional clipping rectangle (x0, y0, x1, y1)	`None`

extract

extract(input_path: Union[str, Path], use_layout: bool = True, **kwargs) -> TextOutput

Extract text from document using PyMuPDF.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input document	required
`use_layout`	`bool`	Whether to use layout information for extraction	`True`
`**kwargs`		Additional parameters	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

Usage Example

from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFTextExtractor

extractor = PyMuPDFTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
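The `clip` constructor argument and the `use_layout` argument of `extract` can be combined to pull text from one region of a page. The sketch below is a minimal illustration, assuming `clip` is read as `(x0, y0, x1, y1)` as the parameter table states. `validate_clip` and `extract_region` are hypothetical helper names, not part of OmniDocs, and the extractor class is passed in as an argument so the helper itself has no OmniDocs import.

```python
from typing import Any, Tuple

Clip = Tuple[float, float, float, float]


def validate_clip(clip: Clip) -> Clip:
    """Check that a clip rectangle is (x0, y0, x1, y1) with positive area."""
    if len(clip) != 4:
        raise ValueError("clip must have four values: (x0, y0, x1, y1)")
    x0, y0, x1, y1 = (float(v) for v in clip)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("clip must satisfy x0 < x1 and y0 < y1")
    return (x0, y0, x1, y1)


def extract_region(extractor_cls: Any, input_path: str, clip: Clip,
                   use_layout: bool = True) -> Any:
    """Build an extractor restricted to `clip` and run it on `input_path`.

    Hypothetical helper: `extractor_cls` is expected to be
    PyMuPDFTextExtractor, and only the documented `clip` and `use_layout`
    parameters are used.
    """
    extractor = extractor_cls(clip=validate_clip(clip))
    return extractor.extract(input_path, use_layout=use_layout)
```

With OmniDocs installed, a call might look like `extract_region(PyMuPDFTextExtractor, "document.pdf", (0, 0, 300, 200))`; the rectangle values here are arbitrary.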

PyPDF2TextExtractor

A pure Python library for extracting text from PDFs, supporting encrypted PDFs and form fields.

omnidocs.tasks.text_extraction.extractors.pypdf2.PyPDF2TextExtractor

PyPDF2TextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ignore_images: bool = True, extract_forms: bool = False)

Bases: BaseTextExtractor

Text extractor using PyPDF2.

Initialize PyPDF2 text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run on (not used for PyPDF2)	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`extract_images`	`bool`	Whether to extract images alongside text	`False`
`ignore_images`	`bool`	Whether to ignore images during text extraction	`True`
`extract_forms`	`bool`	Whether to extract form fields	`False`

extract

extract(input_path: Union[str, Path], password: Optional[str] = None, **kwargs) -> TextOutput

Extract text from PDF using PyPDF2.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input PDF	required
`password`	`Optional[str]`	Optional password for encrypted PDFs	`None`
`**kwargs`		Additional parameters (ignored for PyPDF2)	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

Usage Example

from omnidocs.tasks.text_extraction.extractors.pypdf2 import PyPDF2TextExtractor

extractor = PyPDF2TextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
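For encrypted PDFs, the documented `password` argument of `extract` carries the password. The sketch below reads it from an environment variable instead of hard-coding it. The variable name `PDF_PASSWORD` and the helper `extract_encrypted` are assumptions made for illustration, and what happens on a wrong password is not specified in this reference.

```python
import os
from typing import Any, Optional


def extract_encrypted(extractor: Any, input_path: str,
                      env_var: str = "PDF_PASSWORD") -> Any:
    """Call extractor.extract with a password taken from the environment.

    `extractor` is expected to be a PyPDF2TextExtractor instance. If the
    variable is unset, password=None is passed, which matches the
    documented default.
    """
    password: Optional[str] = os.environ.get(env_var)
    return extractor.extract(input_path, password=password)
```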

SuryaTextExtractor

Surya-based text extraction for images and documents.

omnidocs.tasks.text_extraction.extractors.surya_text.SuryaTextExtractor

SuryaTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)

Bases: BaseTextExtractor

Surya-based text extraction implementation for images and documents.

Initialize Surya Text Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> TextOutput

Extract text using Surya OCR.

Usage Example

from omnidocs.tasks.text_extraction.extractors.surya_text import SuryaTextExtractor

extractor = SuryaTextExtractor()
result = extractor.extract("image.png")
print(f"Extracted text: {result.full_text[:200]}...")
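Since the extractors above cover different input types, a caller may want to pick one from the file extension. The routing below is one possible choice, not an OmniDocs feature: it sends the office and web formats listed for Docling to `DoclingTextExtractor`, images to `SuryaTextExtractor`, and PDFs to `PyMuPDFTextExtractor`. The names `pick_extractor` and `load_extractor`, the set of image extensions, and the PDF default are assumptions; the module paths are the ones documented above.

```python
import importlib
from pathlib import Path
from typing import Any, Tuple, Union

_BASE = "omnidocs.tasks.text_extraction.extractors"

# Extension -> (module path, class name). The routing itself is an assumption.
_ROUTES = {
    ".pdf": (f"{_BASE}.pymupdf", "PyMuPDFTextExtractor"),
    ".docx": (f"{_BASE}.docling_parse", "DoclingTextExtractor"),
    ".pptx": (f"{_BASE}.docling_parse", "DoclingTextExtractor"),
    ".html": (f"{_BASE}.docling_parse", "DoclingTextExtractor"),
    ".md": (f"{_BASE}.docling_parse", "DoclingTextExtractor"),
    ".png": (f"{_BASE}.surya_text", "SuryaTextExtractor"),
    ".jpg": (f"{_BASE}.surya_text", "SuryaTextExtractor"),
    ".jpeg": (f"{_BASE}.surya_text", "SuryaTextExtractor"),
}


def pick_extractor(input_path: Union[str, Path]) -> Tuple[str, str]:
    """Return (module path, class name) for the file's extension."""
    suffix = Path(input_path).suffix.lower()
    try:
        return _ROUTES[suffix]
    except KeyError:
        raise ValueError(f"no extractor configured for {suffix!r}") from None


def load_extractor(input_path: Union[str, Path], **init_kwargs: Any) -> Any:
    """Import the chosen class lazily and instantiate it (requires OmniDocs)."""
    module_path, class_name = pick_extractor(input_path)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**init_kwargs)
```

With OmniDocs installed, `load_extractor("document.pdf").extract("document.pdf")` would then run the chosen extractor.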

TextOutput

The standardized output format for text extraction results.

omnidocs.tasks.text_extraction.base.TextOutput

Bases: BaseModel

Container for text extraction results.

Attributes:

Name	Type	Description
`text_blocks`	`List[TextBlock]`	List of extracted text blocks
`full_text`	`str`	Combined text from all blocks
`metadata`	`Optional[Dict[str, Any]]`	Additional metadata from extraction
`source_info`	`Optional[Dict[str, Any]]`	Information about the source document
`processing_time`	`Optional[float]`	Time taken for text extraction
`page_count`	`int`	Number of pages in the document

get_sorted_by_reading_order

get_sorted_by_reading_order() -> List[TextBlock]

Get text blocks sorted by reading order.

get_text_by_confidence

get_text_by_confidence(min_confidence: float = 0.5) -> List[TextBlock]

Filter text blocks by minimum confidence threshold.

get_text_by_page

get_text_by_page(page_num: int) -> List[TextBlock]

Get text blocks from a specific page.

get_text_by_type

get_text_by_type(block_type: str) -> List[TextBlock]

Get text blocks of a specific type.

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

save_markdown

save_markdown(output_path: Union[str, Path]) -> None

Save text as markdown with basic formatting.

save_text

save_text(output_path: Union[str, Path]) -> None

Save full text to a text file.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Key Properties

text_blocks (List[TextBlock]): List of extracted text blocks with positions.
full_text (str): The complete extracted text content.
source_info (Optional[Dict[str, Any]]): Information about the source document.

Key Methods

save_json(output_path): Save results to a JSON file.
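The filtering helpers listed for `TextOutput` can be pictured with a small stand-in. The sketch below does not import OmniDocs: `Block` and `Output` are hypothetical dataclasses that mirror a few of the documented fields, and the method bodies are an assumed reading of the one-line descriptions, not the library's implementation. In particular, how blocks with `confidence=None` or `reading_order=None` are treated is a guess.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Stand-in for TextBlock with a subset of the documented attributes."""
    text: str
    page_num: int = 1
    confidence: Optional[float] = None
    block_type: Optional[str] = None
    reading_order: Optional[int] = None


@dataclass
class Output:
    """Stand-in for TextOutput; the method semantics are assumptions."""
    text_blocks: List[Block] = field(default_factory=list)

    def get_text_by_confidence(self, min_confidence: float = 0.5) -> List[Block]:
        # Assumption: blocks without a score are excluded.
        return [b for b in self.text_blocks
                if b.confidence is not None and b.confidence >= min_confidence]

    def get_text_by_page(self, page_num: int) -> List[Block]:
        return [b for b in self.text_blocks if b.page_num == page_num]

    def get_text_by_type(self, block_type: str) -> List[Block]:
        return [b for b in self.text_blocks if b.block_type == block_type]

    def get_sorted_by_reading_order(self) -> List[Block]:
        # Assumption: blocks without an index sort last, in original order.
        return sorted(self.text_blocks,
                      key=lambda b: (b.reading_order is None, b.reading_order or 0))


out = Output([
    Block("Body text", page_num=1, confidence=0.9, block_type="paragraph", reading_order=1),
    Block("Title", page_num=1, confidence=0.4, block_type="heading", reading_order=0),
    Block("Footnote", page_num=2, confidence=None, block_type="paragraph"),
])
ordered = [b.text for b in out.get_sorted_by_reading_order()]
confident = [b.text for b in out.get_text_by_confidence(0.5)]
```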

TextBlock

Represents a single block of text with its bounding box.

omnidocs.tasks.text_extraction.base.TextBlock

Bases: BaseModel

Container for individual text block.

Attributes:

Name	Type	Description
`text`	`str`	Text content
`bbox`	`Optional[List[float]]`	Bounding box coordinates [x1, y1, x2, y2]
`confidence`	`Optional[float]`	Confidence score for text extraction
`page_num`	`int`	Page number (for multi-page documents)
`block_type`	`Optional[str]`	Type of text block (paragraph, heading, list, etc.)
`font_info`	`Optional[Dict[str, Any]]`	Optional font information
`reading_order`	`Optional[int]`	Reading order index
`language`	`Optional[str]`	Detected language of the text

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Attributes

text (str): The text content of the block.
bbox (Optional[List[float]]): Bounding box coordinates [x1, y1, x2, y2].
page_num (int): The page number where the text block is found.
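Because `bbox` is documented as `[x1, y1, x2, y2]`, simple geometry such as block size follows directly. A minimal sketch, assuming `(x1, y1)` and `(x2, y2)` are opposite corners with `x2 >= x1` and `y2 >= y1`; the coordinate units and origin are not stated in this reference, and `bbox_size` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def bbox_size(bbox: Optional[List[float]]) -> Optional[Tuple[float, float]]:
    """Return (width, height) for [x1, y1, x2, y2], or None if bbox is missing."""
    if bbox is None:
        return None
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


# Example values are arbitrary.
width, height = bbox_size([10.0, 20.0, 110.0, 45.0])
```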

BaseTextExtractor

The abstract base class for all text extractors.

omnidocs.tasks.text_extraction.base.BaseTextExtractor

BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)

Bases: ABC

Base class for text extraction models.

Initialize the text extractor.

Parameters:

Name	Type	Description	Default
`device`	`Optional[str]`	Device to run model on ('cuda' or 'cpu')	`None`
`show_log`	`bool`	Whether to show detailed logs	`False`
`engine_name`	`Optional[str]`	Name of the text extraction engine	`None`
`extract_images`	`bool`	Whether to extract images alongside text	`False`

extract `abstractmethod`

extract(input_path: Union[str, Path], **kwargs) -> TextOutput

Extract text from input document.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input document	required
`**kwargs`		Additional model-specific parameters	`{}`

Returns:

Type	Description
`TextOutput`	TextOutput containing extracted text

preprocess_input

preprocess_input(input_path: Union[str, Path]) -> Any

Preprocess input document for text extraction.

Parameters:

Name	Type	Description	Default
`input_path`	`Union[str, Path]`	Path to input document	required

Returns:

Type	Description
`Any`	Preprocessed document object

postprocess_output

postprocess_output(raw_output: Any, source_info: Optional[Dict] = None) -> TextOutput

Convert raw text extraction output to standardized TextOutput format.

Parameters:

Name	Type	Description	Default
`raw_output`	`Any`	Raw output from text extraction engine	required
`source_info`	`Optional[Dict]`	Optional source document information	`None`

Returns:

Type	Description
`TextOutput`	Standardized TextOutput object
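The three methods above suggest a preprocess, extract, postprocess flow for custom extractors. The sketch below illustrates that flow with a stand-in base class so that it runs without OmniDocs. `SketchBase`, `PlainTextExtractor`, and the dict returned in place of `TextOutput` are all hypothetical; a real subclass would inherit from `BaseTextExtractor` and return a `TextOutput`.

```python
import tempfile
import time
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Optional, Union


class SketchBase(ABC):
    """Stand-in mirroring the documented BaseTextExtractor constructor."""

    def __init__(self, device: Optional[str] = None, show_log: bool = False,
                 engine_name: Optional[str] = None, extract_images: bool = False):
        self.device = device
        self.show_log = show_log
        self.engine_name = engine_name
        self.extract_images = extract_images

    @abstractmethod
    def extract(self, input_path: Union[str, Path], **kwargs) -> Dict[str, Any]:
        ...


class PlainTextExtractor(SketchBase):
    """Toy extractor for .txt files: one block per non-empty line."""

    def __init__(self, **kwargs):
        super().__init__(engine_name="plain", **kwargs)

    def preprocess_input(self, input_path: Union[str, Path]) -> str:
        return Path(input_path).read_text(encoding="utf-8")

    def postprocess_output(self, raw_output: str,
                           source_info: Optional[Dict] = None) -> Dict[str, Any]:
        lines = [ln.strip() for ln in raw_output.splitlines() if ln.strip()]
        # Assumption: pages are numbered from 1; a text file counts as one page.
        blocks = [{"text": ln, "page_num": 1, "reading_order": i}
                  for i, ln in enumerate(lines)]
        return {"text_blocks": blocks, "full_text": "\n".join(lines),
                "source_info": source_info, "page_count": 1}

    def extract(self, input_path: Union[str, Path], **kwargs) -> Dict[str, Any]:
        start = time.perf_counter()
        raw = self.preprocess_input(input_path)
        result = self.postprocess_output(raw, {"path": str(input_path)})
        result["processing_time"] = time.perf_counter() - start
        return result


# Run the toy extractor on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("Hello\n\nWorld\n", encoding="utf-8")
    result = PlainTextExtractor().extract(sample)
```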
