📝 Text Extraction
This section documents the text extraction API: the available extractors, the TextOutput and TextBlock result types they share, and the BaseTextExtractor class they inherit from.
Overview
Text extraction in OmniDocs focuses on accurately pulling out text from different document formats (PDFs, images, etc.), often preserving layout and structural information. This is a fundamental step for many document understanding applications.
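As a rough illustration of how format coverage might guide the choice of extractor, the sketch below maps a file suffix to one of the extractor class names documented in this section. The `pick_extractor_name` helper and its routing policy are hypothetical and not part of OmniDocs. The document suffixes come from the Docling description in this section, the image suffixes are an assumption, and sending PDFs to Docling rather than one of the PDF-only extractors is an arbitrary choice.

```python
from pathlib import Path

# Suffixes taken from the DoclingTextExtractor description (PDF, DOCX, PPTX, HTML, MD).
DOCLING_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".md"}
# Assumed image suffixes; the section only says Surya handles "images and documents".
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pick_extractor_name(path: str) -> str:
    """Return the name of an extractor class that plausibly handles `path`."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "SuryaTextExtractor"
    if suffix in DOCLING_SUFFIXES:
        return "DoclingTextExtractor"
    raise ValueError(f"No extractor known for suffix {suffix!r}")


print(pick_extractor_name("report.PDF"))  # DoclingTextExtractor
print(pick_extractor_name("scan.png"))    # SuryaTextExtractor
```

Returning a class name rather than an instance keeps the sketch free of heavy imports; a real router would map names to constructors.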
Available Extractors
DoclingTextExtractor
A unified parsing library for PDF, DOCX, PPTX, HTML, and MD, with OCR and structure capabilities.
omnidocs.tasks.text_extraction.extractors.docling_parse.DoclingTextExtractor
DoclingTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ocr_enabled: bool = True, table_structure_enabled: bool = True)
Bases: BaseTextExtractor
Text extractor using Docling.
Initialize Docling text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run on (not used for Docling) | None |
show_log | bool | Whether to show detailed logs | False |
extract_images | bool | Whether to extract images alongside text | False |
ocr_enabled | bool | Whether to enable OCR for scanned documents | True |
table_structure_enabled | bool | Whether to enable table structure detection | True |
extract
Extract text from document using Docling.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input document | required |
**kwargs | | Additional parameters (ignored for Docling) | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
Usage Example
from omnidocs.tasks.text_extraction.extractors.docling_parse import DoclingTextExtractor
extractor = DoclingTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
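The constructor flags documented above can be chosen per document. The following is a minimal sketch, assuming that OCR is only worth enabling for scanned input and that table detection can be switched off when no tables are expected; the `docling_options` helper is hypothetical, while the two keyword names come from the parameter table.

```python
def docling_options(scanned: bool, has_tables: bool) -> dict:
    """Build keyword arguments for DoclingTextExtractor from two document traits."""
    return {
        "ocr_enabled": scanned,                 # documented as OCR for scanned documents
        "table_structure_enabled": has_tables,  # documented as table structure detection
    }


options = docling_options(scanned=False, has_tables=True)
print(options)

# With OmniDocs installed, the options would be passed to the constructor:
# extractor = DoclingTextExtractor(**options)
```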
PdfplumberTextExtractor
A library for extracting text and tables from PDFs with layout details.
omnidocs.tasks.text_extraction.extractors.pdfplumber.PdfplumberTextExtractor
PdfplumberTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, use_layout: bool = True)
Bases: BaseTextExtractor
Text extractor using pdfplumber.
Initialize pdfplumber text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run on (not used for pdfplumber) | None |
show_log | bool | Whether to show detailed logs | False |
extract_images | bool | Whether to extract images alongside text | False |
extract_tables | bool | Whether to extract tables | False |
use_layout | bool | Whether to use layout information for text extraction | True |
extract
Extract text from PDF using pdfplumber.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input PDF | required |
**kwargs | | Additional parameters (ignored for pdfplumber) | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
Usage Example
from omnidocs.tasks.text_extraction.extractors.pdfplumber import PdfplumberTextExtractor
extractor = PdfplumberTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
PdftextTextExtractor
Simple, fast PDF text extraction with layout options.
omnidocs.tasks.text_extraction.extractors.pdftext.PdftextTextExtractor
PdftextTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, keep_layout: bool = False, physical_layout: bool = False)
Bases: BaseTextExtractor
Text extractor using pdftext.
Initialize pdftext text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run on (not used for pdftext) | None |
show_log | bool | Whether to show detailed logs | False |
extract_images | bool | Whether to extract images alongside text | False |
keep_layout | bool | Whether to keep original layout formatting | False |
physical_layout | bool | Whether to use physical layout analysis | False |
extract
Extract text from PDF using pdftext.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input PDF | required |
**kwargs | | Additional parameters (ignored for pdftext) | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
Usage Example
from omnidocs.tasks.text_extraction.extractors.pdftext import PdftextTextExtractor
extractor = PdftextTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
PyMuPDFTextExtractor
A fast, multi-format text extraction library with layout and font information.
omnidocs.tasks.text_extraction.extractors.pymupdf.PyMuPDFTextExtractor
PyMuPDFTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, extract_tables: bool = False, flags: int = 0, clip: Optional[tuple] = None)
Bases: BaseTextExtractor
Text extractor using PyMuPDF (fitz).
Initialize PyMuPDF text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run on (not used for PyMuPDF) | None |
show_log | bool | Whether to show detailed logs | False |
extract_images | bool | Whether to extract images alongside text | False |
extract_tables | bool | Whether to extract tables | False |
flags | int | Text extraction flags (fitz.TEXT_PRESERVE_LIGATURES, etc.) | 0 |
clip | Optional[tuple] | Optional clipping rectangle (x0, y0, x1, y1) | None |
extract
Extract text from document using PyMuPDF.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input document | required |
use_layout | bool | Whether to use layout information for extraction | True |
**kwargs | | Additional parameters | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
Usage Example
from omnidocs.tasks.text_extraction.extractors.pymupdf import PyMuPDFTextExtractor
extractor = PyMuPDFTextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
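The `clip` parameter takes a rectangle as (x0, y0, x1, y1). The sketch below builds such a rectangle for the top part of a page. It assumes the PyMuPDF convention of PDF points with the origin at the top-left corner, and it assumes the extractor passes the tuple through to PyMuPDF unchanged; the `top_fraction_clip` helper is hypothetical.

```python
from typing import Tuple


def top_fraction_clip(
    page_width: float, page_height: float, fraction: float
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) covering the top `fraction` of a page."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return (0.0, 0.0, float(page_width), page_height * fraction)


# A4 portrait is roughly 595 x 842 PDF points.
clip = top_fraction_clip(595, 842, 0.5)
print(clip)  # (0.0, 0.0, 595.0, 421.0)

# With OmniDocs installed, the rectangle would be passed to the constructor:
# extractor = PyMuPDFTextExtractor(clip=clip)
```

Because `clip` is a constructor argument here, the same rectangle applies to every call on that extractor instance.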
PyPDF2TextExtractor
A pure Python library for extracting text from PDFs, supporting encrypted PDFs and form fields.
omnidocs.tasks.text_extraction.extractors.pypdf2.PyPDF2TextExtractor
PyPDF2TextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, ignore_images: bool = True, extract_forms: bool = False)
Bases: BaseTextExtractor
Text extractor using PyPDF2.
Initialize PyPDF2 text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run on (not used for PyPDF2) | None |
show_log | bool | Whether to show detailed logs | False |
extract_images | bool | Whether to extract images alongside text | False |
ignore_images | bool | Whether to ignore images during text extraction | True |
extract_forms | bool | Whether to extract form fields | False |
extract
Extract text from PDF using PyPDF2.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input PDF | required |
password | Optional[str] | Optional password for encrypted PDFs | None |
**kwargs | | Additional parameters (ignored for PyPDF2) | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
Usage Example
from omnidocs.tasks.text_extraction.extractors.pypdf2 import PyPDF2TextExtractor
extractor = PyPDF2TextExtractor()
result = extractor.extract("document.pdf")
print(f"Extracted text: {result.full_text[:200]}...")
SuryaTextExtractor
Surya-based text extraction for images and documents.
omnidocs.tasks.text_extraction.extractors.surya_text.SuryaTextExtractor
SuryaTextExtractor(device: Optional[str] = None, show_log: bool = False, extract_images: bool = False, model_path: Optional[Union[str, Path]] = None, **kwargs)
Bases: BaseTextExtractor
Surya-based text extraction implementation for images and documents.
Initialize Surya Text Extractor.
Usage Example
from omnidocs.tasks.text_extraction.extractors.surya_text import SuryaTextExtractor
extractor = SuryaTextExtractor()
result = extractor.extract("image.png")
print(f"Extracted text: {result.full_text[:200]}...")
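Since every extractor in this section exposes `extract(input_path)` and returns an object with `full_text`, they can be compared on the same file through one loop. The sketch below is an illustration only: `compare_text_lengths` and the `FixedTextExtractor` stand-in are hypothetical, and real extractor instances would be placed in the dictionary instead.

```python
from types import SimpleNamespace
from typing import Any, Dict


def compare_text_lengths(extractors: Dict[str, Any], path: str) -> Dict[str, int]:
    """Run each extractor on `path` and report the length of its full_text."""
    lengths = {}
    for name, extractor in extractors.items():
        result = extractor.extract(path)
        lengths[name] = len(result.full_text)
    return lengths


class FixedTextExtractor:
    """Stand-in extractor that always returns the same text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract(self, input_path, **kwargs):
        return SimpleNamespace(full_text=self._text)


report = compare_text_lengths(
    {"a": FixedTextExtractor("hello"), "b": FixedTextExtractor("hello world")},
    "document.pdf",
)
print(report)  # {'a': 5, 'b': 11}
```

Text length is a crude signal, but a large gap between two extractors on the same file often points to missed pages or a scanned document that needs OCR.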
TextOutput
The standardized output format for text extraction results.
omnidocs.tasks.text_extraction.base.TextOutput
Bases: BaseModel
Container for text extraction results.
Attributes:
Name | Type | Description |
---|---|---|
text_blocks | List[TextBlock] | List of extracted text blocks |
full_text | str | Combined text from all blocks |
metadata | Optional[Dict[str, Any]] | Additional metadata from extraction |
source_info | Optional[Dict[str, Any]] | Information about the source document |
processing_time | Optional[float] | Time taken for text extraction |
page_count | int | Number of pages in the document |
get_sorted_by_reading_order
Get text blocks sorted by reading order.
get_text_by_confidence
Filter text blocks by minimum confidence threshold.
get_text_by_page
Get text blocks from a specific page.
get_text_by_type
Get text blocks of a specific type.
save_markdown
Save text as markdown with basic formatting.
Key Properties
- text_blocks (List[TextBlock]): List of extracted text blocks with positions.
- full_text (str): The complete extracted text content.
- source_file (str): Path to the processed file.
Key Methods
- save_json(output_path): Save results to a JSON file.
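The filtering and ordering methods of TextOutput are listed here without signatures, so the sketch below reproduces similar behavior directly from the documented attributes instead of calling them. The `Block` dataclass is a stand-in for TextBlock that carries only four of its documented fields, and `confident_text_in_order` is a hypothetical helper. Dropping blocks that have no confidence score is a choice made for this example; how the built-in methods treat them is not stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Stand-in for TextBlock, limited to fields documented in this section."""
    text: str
    page_num: int = 1
    confidence: Optional[float] = None
    reading_order: Optional[int] = None


def confident_text_in_order(blocks: List[Block], min_confidence: float) -> str:
    """Join blocks at or above a confidence threshold, by page then reading order."""
    kept = [
        b for b in blocks
        if b.confidence is not None and b.confidence >= min_confidence
    ]
    # Blocks without a reading order sort after ordered blocks on the same page.
    kept.sort(key=lambda b: (b.page_num, b.reading_order is None, b.reading_order or 0))
    return "\n".join(b.text for b in kept)


blocks = [
    Block("Body text", page_num=1, confidence=0.95, reading_order=1),
    Block("Title", page_num=1, confidence=0.99, reading_order=0),
    Block("smudge", page_num=1, confidence=0.20, reading_order=2),
    Block("Page two", page_num=2, confidence=0.90, reading_order=0),
]
print(confident_text_in_order(blocks, min_confidence=0.5))
```

With real results, `result.text_blocks` would take the place of the hand-built list.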
TextBlock
Represents a single block of text with its bounding box.
omnidocs.tasks.text_extraction.base.TextBlock
Bases: BaseModel
Container for individual text block.
Attributes:
Name | Type | Description |
---|---|---|
text | str | Text content |
bbox | Optional[List[float]] | Bounding box coordinates [x1, y1, x2, y2] |
confidence | Optional[float] | Confidence score for text extraction |
page_num | int | Page number (for multi-page documents) |
block_type | Optional[str] | Type of text block (paragraph, heading, list, etc.) |
font_info | Optional[Dict[str, Any]] | Optional font information |
reading_order | Optional[int] | Reading order index |
language | Optional[str] | Detected language of the text |
Attributes
- text (str): The text content of the block.
- bbox (List[float]): Bounding box coordinates [x1, y1, x2, y2].
- page_num (int): The page number where the text block is found.
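Because `bbox` is documented as [x1, y1, x2, y2] and may be absent, geometry helpers need to handle `None`. The following sketch computes a block's area and picks the largest block. The `bbox_area` helper is hypothetical, the blocks are `SimpleNamespace` stand-ins rather than real TextBlock objects, and treating a missing box as zero area is an assumption made for the example.

```python
from types import SimpleNamespace
from typing import List, Optional


def bbox_area(bbox: Optional[List[float]]) -> float:
    """Return the area of an [x1, y1, x2, y2] box, or 0.0 when it is missing."""
    if bbox is None:
        return 0.0
    x1, y1, x2, y2 = bbox
    # Clamp at zero so a malformed box never yields a negative side length.
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


blocks = [
    SimpleNamespace(text="Heading", bbox=[50.0, 40.0, 300.0, 70.0]),     # 250 x 30
    SimpleNamespace(text="Paragraph", bbox=[50.0, 90.0, 500.0, 290.0]),  # 450 x 200
    SimpleNamespace(text="No position", bbox=None),
]
largest = max(blocks, key=lambda b: bbox_area(b.bbox))
print(largest.text)  # Paragraph
```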
BaseTextExtractor
The abstract base class for all text extraction extractors.
omnidocs.tasks.text_extraction.base.BaseTextExtractor
BaseTextExtractor(device: Optional[str] = None, show_log: bool = False, engine_name: Optional[str] = None, extract_images: bool = False)
Bases: ABC
Base class for text extraction models.
Initialize the text extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | Optional[str] | Device to run model on ('cuda' or 'cpu') | None |
show_log | bool | Whether to show detailed logs | False |
engine_name | Optional[str] | Name of the text extraction engine | None |
extract_images | bool | Whether to extract images alongside text | False |
extract (abstract method)
Extract text from input document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input document | required |
**kwargs | | Additional model-specific parameters | {} |
Returns:
Type | Description |
---|---|
TextOutput | TextOutput containing extracted text |
preprocess_input
Preprocess input document for text extraction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Union[str, Path] | Path to input document | required |
Returns:
Type | Description |
---|---|
Any | Preprocessed document object |
postprocess_output
Convert raw text extraction output to standardized TextOutput format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_output | Any | Raw output from text extraction engine | required |
source_info | Optional[Dict] | Optional source document information | None |
Returns:
Type | Description |
---|---|
TextOutput | Standardized TextOutput object |
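The three methods documented for BaseTextExtractor suggest a preprocess, extract, postprocess flow. The sketch below illustrates that flow with local stand-ins so it runs without OmniDocs installed. `SketchBaseExtractor`, `SketchTextOutput`, and `PlainTextExtractor` are hypothetical names. Having `extract` call the other two hooks in this order is an assumption, and the real base class may require more than what is shown.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Optional, Union


@dataclass
class SketchTextOutput:
    """Stand-in for TextOutput with three of its documented attributes."""
    full_text: str
    page_count: int
    source_info: Optional[Dict[str, Any]] = None


class SketchBaseExtractor(ABC):
    """Stand-in mirroring the method names documented for BaseTextExtractor."""

    def preprocess_input(self, input_path: Union[str, Path]) -> Any:
        return Path(input_path)

    def postprocess_output(
        self, raw_output: Any, source_info: Optional[Dict] = None
    ) -> SketchTextOutput:
        return SketchTextOutput(str(raw_output), page_count=1, source_info=source_info)

    @abstractmethod
    def extract(self, input_path: Union[str, Path], **kwargs) -> SketchTextOutput:
        ...


class PlainTextExtractor(SketchBaseExtractor):
    """Toy extractor that reads a UTF-8 text file as a single page."""

    def extract(self, input_path: Union[str, Path], **kwargs) -> SketchTextOutput:
        path = self.preprocess_input(input_path)
        raw = path.read_text(encoding="utf-8")
        return self.postprocess_output(raw, source_info={"file_name": path.name})


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello from a text file", encoding="utf-8")
    result = PlainTextExtractor().extract(sample)

print(result.full_text)
```

Keeping file handling in `preprocess_input` and result shaping in `postprocess_output` leaves `extract` as a thin engine-specific step, which appears to be the intent of the base class.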