# 🖹 OCR (Optical Character Recognition)

This section documents the API for OCR tasks, providing extractors that recognize and extract text from images and scanned documents.

## Overview

OCR in OmniDocs converts images (e.g., scanned documents, photos) into machine-readable text. It supports multiple engines, letting you choose the best balance of speed, accuracy, and language support for your needs.

## Available Extractors
### EasyOCRExtractor

An easy-to-use OCR library, built on PyTorch, that supports multiple languages.

`omnidocs.tasks.ocr_extraction.extractors.easy_ocr.EasyOCRExtractor`

```python
EasyOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, gpu: bool = True, **kwargs)
```

**Usage Example**

```python
from omnidocs.tasks.ocr_extraction.extractors.easy_ocr import EasyOCRExtractor

extractor = EasyOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")
```
### PaddleOCRExtractor

An OCR toolkit that supports multiple languages and provides layout detection capabilities.

`omnidocs.tasks.ocr_extraction.extractors.paddle.PaddleOCRExtractor`

```python
PaddleOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, use_angle_cls: bool = True, use_gpu: bool = True, drop_score: float = 0.5, model_path: Optional[str] = None, **kwargs)
```

**Bases:** `BaseOCRExtractor`

PaddleOCR-based text extraction implementation.

**Usage Example**

```python
from omnidocs.tasks.ocr_extraction.extractors.paddle import PaddleOCRExtractor

extractor = PaddleOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")
```
### SuryaOCRExtractor

A modern, high-accuracy OCR engine from the Surya library, with strong support for Indian languages.

`omnidocs.tasks.ocr_extraction.extractors.surya_ocr.SuryaOCRExtractor`

```python
SuryaOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, **kwargs)
```

**Bases:** `BaseOCRExtractor`

Surya OCR-based text extraction implementation.

**Usage Example**

```python
from omnidocs.tasks.ocr_extraction.extractors.surya_ocr import SuryaOCRExtractor

extractor = SuryaOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")
```
### TesseractOCRExtractor

A widely used, open-source OCR engine that supports multiple languages.

`omnidocs.tasks.ocr_extraction.extractors.tesseract_ocr.TesseractOCRExtractor`

```python
TesseractOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, psm: int = 6, oem: int = 3, config: str = '', **kwargs)
```

**Bases:** `BaseOCRExtractor`

Tesseract OCR-based text extraction implementation.

**Usage Example**

```python
from omnidocs.tasks.ocr_extraction.extractors.tesseract_ocr import TesseractOCRExtractor

extractor = TesseractOCRExtractor(languages=['eng'])  # Tesseract uses 'eng' for English
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")
```
## OCROutput

The standardized output format for OCR results.

`omnidocs.tasks.ocr_extraction.base.OCROutput`

**Bases:** `BaseModel`

Container for OCR extraction results.

**Attributes:**

| Name | Type | Description |
|---|---|---|
| `texts` | `List[OCRText]` | List of detected text objects |
| `full_text` | `str` | Combined text from all detections |
| `source_img_size` | `Optional[Tuple[int, int]]` | Original image dimensions (width, height) |
| `processing_time` | `Optional[float]` | Time taken for OCR processing |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata from the OCR engine |
### Key Properties

- `texts` (`List[OCRText]`): List of individual text regions detected.
- `full_text` (`str`): The combined text from all detected regions.
- `source_img_size` (`Optional[Tuple[int, int]]`): Dimensions of the source image.

### Key Methods

- `save_json(output_path)`: Save results to a JSON file.
- `visualize(image_path, output_path)`: Visualize OCR results with bounding boxes on the source image.
- `get_text_by_confidence(min_confidence)`: Filter text regions by a minimum confidence threshold.
- `get_sorted_by_reading_order()`: Sort text regions by reading order (top-to-bottom, left-to-right when no explicit `reading_order` is set).
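The filtering and sorting helpers can be sketched in plain Python to show the expected behavior. This is an illustrative stand-in using `(text, confidence, (x1, y1))` tuples rather than real `OCRText` objects, not the OmniDocs implementation:

```python
# Hypothetical detections: (text, confidence, (x1, y1)).
detections = [
    ("World", 0.95, (120.0, 10.0)),
    ("Hello", 0.90, (10.0, 10.0)),
    ("noise", 0.30, (5.0, 200.0)),
]

def filter_by_confidence(texts, min_confidence):
    # Mirrors get_text_by_confidence: keep detections at or above the threshold.
    return [t for t in texts if t[1] >= min_confidence]

def sort_by_reading_order(texts):
    # Mirrors the fallback ordering: top-to-bottom, then left-to-right.
    return sorted(texts, key=lambda t: (t[2][1], t[2][0]))

confident = filter_by_confidence(detections, 0.5)
ordered = sort_by_reading_order(confident)
print([t[0] for t in ordered])  # ['Hello', 'World']
```

On the real `OCROutput`, you would chain the methods instead: `result.get_text_by_confidence(0.5)` followed by `result.get_sorted_by_reading_order()`.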
## OCRText

Represents a single text region detected by OCR.

`omnidocs.tasks.ocr_extraction.base.OCRText`

**Bases:** `BaseModel`

Container for an individual OCR text detection.

**Attributes:**

| Name | Type | Description |
|---|---|---|
| `text` | `str` | Extracted text content |
| `confidence` | `Optional[float]` | Confidence score for the text detection |
| `bbox` | `Optional[List[float]]` | Bounding box coordinates [x1, y1, x2, y2] |
| `polygon` | `Optional[List[List[float]]]` | Polygon coordinates for irregular text regions |
| `language` | `Optional[str]` | Detected language code (e.g., 'en', 'zh', 'fr') |
| `reading_order` | `Optional[int]` | Reading order index for text sequencing |
### Attributes

- `text` (`str`): The recognized text content.
- `confidence` (`Optional[float]`): Confidence score of the recognition (0.0-1.0).
- `bbox` (`Optional[List[float]]`): Bounding box coordinates [x1, y1, x2, y2].
- `polygon` (`Optional[List[List[float]]]`): Precise polygon coordinates of the text region.
- `language` (`Optional[str]`): Detected language code.
- `reading_order` (`Optional[int]`): Reading order index of the text region.
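For illustration, the fields above can be mirrored with a plain dataclass. This is only a sketch of the field layout (the real class is a pydantic `BaseModel`); the sample values are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

# Sketch of the OCRText fields; names and types follow the table above.
@dataclass
class OCRTextSketch:
    text: str
    confidence: Optional[float] = None
    bbox: Optional[List[float]] = None
    polygon: Optional[List[List[float]]] = None
    language: Optional[str] = None
    reading_order: Optional[int] = None

region = OCRTextSketch(
    text="Invoice #1234",
    confidence=0.97,
    bbox=[34.0, 50.0, 210.0, 78.0],  # [x1, y1, x2, y2]
    language="en",
    reading_order=0,
)
print(region.text, region.confidence)  # Invoice #1234 0.97
```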
## BaseOCRExtractor

The abstract base class for all OCR extractors.

`omnidocs.tasks.ocr_extraction.base.BaseOCRExtractor`

```python
BaseOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, engine_name: Optional[str] = None)
```

**Bases:** `ABC`

Base class for OCR text extraction models.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `device` | `Optional[str]` | Device to run the model on (`'cuda'` or `'cpu'`) | `None` |
| `show_log` | `bool` | Whether to show detailed logs | `False` |
| `languages` | `Optional[List[str]]` | List of language codes to support (e.g., `['en', 'zh']`) | `None` |
| `engine_name` | `Optional[str]` | Name of the OCR engine, used for language mapping | `None` |
### extract

*abstractmethod*

Extract text from an input image.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input_path` | `Union[str, Path, Image]` | Path to input image, or image data | *required* |
| `**kwargs` | | Additional model-specific parameters | `{}` |

**Returns:**

| Type | Description |
|---|---|
| `OCROutput` | `OCROutput` containing extracted text |
### extract_all

Extract text from multiple images.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input_paths` | `List[Union[str, Path, Image]]` | List of image paths or image data | *required* |
| `**kwargs` | | Additional model-specific parameters | `{}` |

**Returns:**

| Type | Description |
|---|---|
| `List[OCROutput]` | List of `OCROutput` objects |
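Conceptually, a batch helper like this loops single-image extraction over the inputs. The sketch below uses hypothetical stand-in functions to show the shape of the result list; it is not the OmniDocs implementation:

```python
# Stand-in for a single-image extract() call.
def extract(path):
    return {"source": path, "full_text": f"text from {path}"}

# A batch helper processes each input independently and collects the results.
def extract_all(paths):
    return [extract(p) for p in paths]

results = extract_all(["page_1.png", "page_2.png"])
print(len(results))  # 2
```

With a real extractor, the call is simply `extractor.extract_all(["page_1.png", "page_2.png"])`, returning one `OCROutput` per input.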
### extract_with_layout

```python
extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> OCROutput
```

Extract text with optional layout information.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input_path` | `Union[str, Path, Image]` | Path to input image, or image data | *required* |
| `layout_regions` | `Optional[List[Dict]]` | Optional list of layout regions to focus OCR on | `None` |
| `**kwargs` | | Additional model-specific parameters | `{}` |

**Returns:**

| Type | Description |
|---|---|
| `OCROutput` | `OCROutput` containing extracted text |
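The region-focused behavior can be sketched as follows. All names here are hypothetical stand-ins (the real method delegates to the engine's OCR call); the sketch only shows how a missing `layout_regions` falls back to the whole image:

```python
# Stand-in for running OCR on one cropped region of an image.
def ocr_region(image, bbox):
    x1, y1, x2, y2 = bbox
    return f"text@({x1},{y1})-({x2},{y2})"

def extract_with_layout(image, layout_regions=None):
    # With no layout information, treat the whole image as one region.
    if not layout_regions:
        layout_regions = [{"bbox": (0, 0, image["width"], image["height"])}]
    return [ocr_region(image, r["bbox"]) for r in layout_regions]

page = {"width": 800, "height": 1000}
texts = extract_with_layout(page, [{"bbox": (0, 0, 800, 120)},
                                   {"bbox": (0, 120, 800, 1000)}])
print(len(texts))  # 2
```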
### preprocess_input

Convert input to a list of PIL Images.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input_path` | `Union[str, Path, Image, ndarray]` | Input image path or image data | *required* |

**Returns:**

| Type | Description |
|---|---|
| `List[Image]` | List of PIL Images |
### postprocess_output

Convert raw OCR output to the standardized `OCROutput` format.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_output` | `Any` | Raw output from the OCR engine | *required* |
| `img_size` | `Tuple[int, int]` | Original image size (width, height) | *required* |

**Returns:**

| Type | Description |
|---|---|
| `OCROutput` | Standardized `OCROutput` object |
### visualize

```python
visualize(ocr_result: OCROutput, image_path: Union[str, Path, Image], output_path: str = 'visualized.png', box_color: str = 'red', box_width: int = 2, show_text: bool = False, text_color: str = 'blue', font_size: int = 12) -> None
```

Visualize OCR results by drawing bounding boxes on the original image. This makes it easy to compare extractors by inspecting the detected text regions.

### get_supported_languages

Get the list of supported language codes.
## BaseOCRMapper

Handles language-code mapping and normalization for OCR engines.

`omnidocs.tasks.ocr_extraction.base.BaseOCRMapper`

Base class for mapping OCR engine-specific outputs to a standardized format.

**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `engine_name` | `str` | Name of the OCR engine (e.g., `'tesseract'`, `'paddle'`, `'easyocr'`) | *required* |

**Methods:**

- `detect_text_language`: Detect the language of extracted text.
- `from_standard_language`: Convert a standard ISO 639-1 language code to the engine-specific format.
- `get_supported_languages`: Get the list of supported languages for this engine.
- `normalize_bbox`: Normalize bounding-box coordinates to absolute pixel values.
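As an illustration of what bounding-box normalization involves, the sketch below scales relative (0-1) coordinates to absolute pixels. The heuristic and function body are assumptions for illustration only, not the actual `normalize_bbox` implementation:

```python
# Convert relative (0-1) bbox coordinates to absolute pixel values;
# coordinates that already look absolute are passed through unchanged.
def normalize_bbox(bbox, img_size):
    width, height = img_size
    x1, y1, x2, y2 = bbox
    if max(bbox) <= 1.0:  # heuristic: values in [0, 1] are treated as relative
        return [x1 * width, y1 * height, x2 * width, y2 * height]
    return list(bbox)

print(normalize_bbox([0.1, 0.2, 0.5, 0.4], (1000, 500)))
# [100.0, 100.0, 500.0, 200.0]
```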