🖹 OCR (Optical Character Recognition)

This section documents the OCR API: the extractors that recognize and extract text from images and scanned documents, the standardized output types they return, and the base classes they build on.

Overview

OCR in OmniDocs enables the conversion of images (e.g., scanned documents, photos) into machine-readable text. It supports multiple engines, allowing you to choose the best balance of speed, accuracy, and language support for your needs.

Available Extractors

EasyOCRExtractor

An easy-to-use OCR library built on PyTorch, with support for multiple languages.

omnidocs.tasks.ocr_extraction.extractors.easy_ocr.EasyOCRExtractor

EasyOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, gpu: bool = True, **kwargs)

Bases: BaseOCRExtractor

EasyOCR-based text extraction implementation.

Initialize EasyOCR Extractor.

extract

extract(input_path: Union[str, Path, Image], detail: int = 1, paragraph: bool = False, width_ths: float = 0.7, height_ths: float = 0.7, **kwargs) -> OCROutput

Extract text using EasyOCR.

Usage Example

from omnidocs.tasks.ocr_extraction.extractors.easy_ocr import EasyOCRExtractor

extractor = EasyOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")

PaddleOCRExtractor

An OCR tool that supports multiple languages and provides layout detection capabilities.

omnidocs.tasks.ocr_extraction.extractors.paddle.PaddleOCRExtractor

PaddleOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, use_angle_cls: bool = True, use_gpu: bool = True, drop_score: float = 0.5, model_path: Optional[str] = None, **kwargs)

Bases: BaseOCRExtractor

PaddleOCR-based text extraction implementation.

Initialize PaddleOCR Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> OCROutput

Extract text using PaddleOCR.

Usage Example

from omnidocs.tasks.ocr_extraction.extractors.paddle import PaddleOCRExtractor

extractor = PaddleOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")

SuryaOCRExtractor

A modern, high-accuracy OCR engine, part of the Surya library, with strong support for Indian languages.

omnidocs.tasks.ocr_extraction.extractors.surya_ocr.SuryaOCRExtractor

SuryaOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, **kwargs)

Bases: BaseOCRExtractor

Surya OCR-based text extraction implementation.

Initialize Surya OCR Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> OCROutput

Extract text using Surya OCR.

Usage Example

from omnidocs.tasks.ocr_extraction.extractors.surya_ocr import SuryaOCRExtractor

extractor = SuryaOCRExtractor(languages=['en'])
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")

TesseractOCRExtractor

An open-source OCR engine that supports multiple languages and is widely used for text extraction from images.

omnidocs.tasks.ocr_extraction.extractors.tesseract_ocr.TesseractOCRExtractor

TesseractOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, psm: int = 6, oem: int = 3, config: str = '', **kwargs)

Bases: BaseOCRExtractor

Tesseract OCR-based text extraction implementation.

Initialize Tesseract OCR Extractor.

extract

extract(input_path: Union[str, Path, Image], **kwargs) -> OCROutput

Extract text using Tesseract OCR.

Usage Example

from omnidocs.tasks.ocr_extraction.extractors.tesseract_ocr import TesseractOCRExtractor

extractor = TesseractOCRExtractor(languages=['eng']) # Tesseract uses 'eng' for English
result = extractor.extract("scanned_document.png")
print(f"Extracted text: {result.full_text[:200]}...")
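
Because all four extractors expose the same extract(...) -> OCROutput interface, they can be compared on one image with a small helper. The following is a minimal sketch, not part of OmniDocs: compare_extractors, _FakeResult, and _FakeExtractor are hypothetical names, and the fake classes exist only so the example runs without any OCR engine installed. In practice you would pass real extractor instances instead.

```python
import time
from typing import Any, Dict, Iterable, Tuple


def compare_extractors(
    extractors: Iterable[Tuple[str, Any]], image_path: str
) -> Dict[str, Dict[str, Any]]:
    """Run each (name, extractor) pair on the same image and collect simple stats.

    Hypothetical helper: it relies only on `extract()` returning an object
    with `full_text` and `texts`, as OCROutput is documented to have.
    """
    report: Dict[str, Dict[str, Any]] = {}
    for name, extractor in extractors:
        start = time.perf_counter()
        result = extractor.extract(image_path)
        elapsed = time.perf_counter() - start
        report[name] = {
            "chars": len(result.full_text),   # amount of text recovered
            "regions": len(result.texts),     # number of detected regions
            "seconds": elapsed,               # wall-clock time for this call
        }
    return report


# Stand-ins (assumptions) so the sketch runs without OmniDocs installed.
class _FakeResult:
    def __init__(self, full_text: str, texts: list):
        self.full_text = full_text
        self.texts = texts


class _FakeExtractor:
    def __init__(self, text: str):
        self._text = text

    def extract(self, input_path: str, **kwargs) -> _FakeResult:
        return _FakeResult(self._text, self._text.split())


report = compare_extractors(
    [("fake_a", _FakeExtractor("hello world")), ("fake_b", _FakeExtractor("hi"))],
    "scanned_document.png",
)
for name, stats in report.items():
    print(name, stats["chars"], stats["regions"])
```

Character and region counts are only rough proxies for quality; comparing against a known transcription gives a more reliable picture.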

OCROutput

The standardized output format for OCR results.

omnidocs.tasks.ocr_extraction.base.OCROutput

Bases: BaseModel

Container for OCR extraction results.

Attributes:

  • texts (List[OCRText]): List of detected text objects.
  • full_text (str): Combined text from all detections.
  • source_img_size (Optional[Tuple[int, int]]): Original image dimensions (width, height).
  • processing_time (Optional[float]): Time taken for OCR processing.
  • metadata (Optional[Dict[str, Any]]): Additional metadata from the OCR engine.

get_sorted_by_reading_order

get_sorted_by_reading_order() -> List[OCRText]

Get texts sorted by reading order. If reading_order is not set, texts are sorted top-to-bottom, then left-to-right.
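
The ordering rule can be illustrated with a small stand-alone sketch. This is not the library's implementation: Region is a hypothetical stand-in for OCRText, and the sketch assumes that reading_order is used only when every region has one, and that the fallback sorts by the top-left corner of bbox.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for OCRText with only the fields needed here."""
    text: str
    bbox: Optional[List[float]] = None      # [x1, y1, x2, y2]
    reading_order: Optional[int] = None


def sort_by_reading_order(regions: List[Region]) -> List[Region]:
    # Assumption: trust reading_order only if every region provides it.
    if regions and all(r.reading_order is not None for r in regions):
        return sorted(regions, key=lambda r: r.reading_order)

    # Fallback: top-to-bottom (y1), then left-to-right (x1).
    def position(r: Region):
        if r.bbox is None:
            return (float("inf"), float("inf"))  # regions without a box go last
        return (r.bbox[1], r.bbox[0])

    return sorted(regions, key=position)


by_position = sort_by_reading_order([
    Region("c", bbox=[0, 50, 10, 60]),
    Region("b", bbox=[30, 0, 40, 10]),
    Region("a", bbox=[0, 0, 10, 10]),
])
by_index = sort_by_reading_order([
    Region("second", reading_order=2),
    Region("first", reading_order=1),
])
print([r.text for r in by_position])
print([r.text for r in by_index])
```

A strict (y1, x1) sort can misorder regions on the same visual line when their top edges differ by a pixel or two; grouping regions into lines with a tolerance is a common refinement.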

get_text_by_confidence

get_text_by_confidence(min_confidence: float = 0.5) -> List[OCRText]

Filter texts by minimum confidence threshold.
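
As a rough illustration of what threshold filtering does, the sketch below keeps only regions whose confidence meets the minimum. Region is a hypothetical stand-in for OCRText, and dropping regions whose confidence is None is an assumption of this sketch; the library may treat missing scores differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for OCRText."""
    text: str
    confidence: Optional[float] = None


def filter_by_confidence(regions: List[Region], min_confidence: float = 0.5) -> List[Region]:
    # Assumption: a region with no confidence score is excluded.
    return [
        r for r in regions
        if r.confidence is not None and r.confidence >= min_confidence
    ]


regions = [
    Region("Invoice", 0.98),
    Region("l0rem", 0.31),
    Region("Total", 0.5),
    Region("unknown", None),
]
kept = filter_by_confidence(regions, min_confidence=0.5)
print([r.text for r in kept])
```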

save_json

save_json(output_path: Union[str, Path]) -> None

Save output to JSON file.

to_dict

to_dict() -> Dict

Convert to dictionary representation.
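
To give a feel for what a dictionary and JSON export of an OCR result could look like, here is a self-contained sketch using dataclasses instead of the library's BaseModel classes. Region and Output are hypothetical stand-ins, and the assumption that the JSON keys mirror the attribute names is not confirmed by this reference.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional, Tuple, Union


@dataclass
class Region:
    """Hypothetical stand-in for OCRText."""
    text: str
    confidence: Optional[float] = None
    bbox: Optional[List[float]] = None


@dataclass
class Output:
    """Hypothetical stand-in for OCROutput."""
    texts: List[Region]
    full_text: str
    source_img_size: Optional[Tuple[int, int]] = None

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses.
        return asdict(self)

    def save_json(self, output_path: Union[str, Path]) -> None:
        payload = json.dumps(self.to_dict(), indent=2, ensure_ascii=False)
        Path(output_path).write_text(payload, encoding="utf-8")


output = Output(
    texts=[Region("Invoice 42", 0.97, [10.0, 20.0, 110.0, 40.0])],
    full_text="Invoice 42",
    source_img_size=(640, 480),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ocr_result.json"
    output.save_json(path)
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["full_text"], loaded["source_img_size"])
```

Note that JSON has no tuple type, so a (width, height) tuple comes back as a two-element list after a round trip.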

Key Properties

  • texts (List[OCRText]): List of individual text regions detected.
  • full_text (str): The combined text from all detected regions.
  • source_img_size (Optional[Tuple[int, int]]): Dimensions of the source image as (width, height).

Key Methods

  • save_json(output_path): Save results to a JSON file.
  • visualize(image_path, output_path): Visualize OCR results with bounding boxes on the source image.
  • get_text_by_confidence(min_confidence): Filter text regions by confidence score.
  • get_sorted_by_reading_order(): Sort text regions by reading order.

OCRText

Represents a single text region detected by OCR.

omnidocs.tasks.ocr_extraction.base.OCRText

Bases: BaseModel

Container for individual OCR text detection.

Attributes:

  • text (str): Extracted text content.
  • confidence (Optional[float]): Confidence score for the text detection.
  • bbox (Optional[List[float]]): Bounding box coordinates [x1, y1, x2, y2].
  • polygon (Optional[List[List[float]]]): Optional polygon coordinates for irregular text regions.
  • language (Optional[str]): Detected language code (e.g., 'en', 'zh', 'fr').
  • reading_order (Optional[int]): Optional reading order index for text sequencing.

to_dict

to_dict() -> Dict

Convert to dictionary representation.

Attributes

  • text (str): The recognized text content.
  • confidence (Optional[float]): Confidence score of the recognition (0.0-1.0).
  • bbox (Optional[List[float]]): Bounding box coordinates [x1, y1, x2, y2].
  • polygon (Optional[List[List[float]]]): Precise polygon coordinates of the text region.
  • language (Optional[str]): Detected language code.
  • reading_order (Optional[int]): Reading order index of the text region.
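
The bbox and polygon fields describe the same region at different precision. When only a polygon is available, an axis-aligned box in the [x1, y1, x2, y2] form can be derived from it. The helper below is a hypothetical sketch, not a function provided by OmniDocs.

```python
from typing import List


def polygon_to_bbox(polygon: List[List[float]]) -> List[float]:
    """Return the smallest axis-aligned box [x1, y1, x2, y2] enclosing the polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


# A slightly rotated quadrilateral, as an OCR engine might report for skewed text.
skewed = [[10.0, 5.0], [50.0, 8.0], [48.0, 30.0], [9.0, 27.0]]
print(polygon_to_bbox(skewed))
```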

BaseOCRExtractor

The abstract base class for all OCR extractors.

omnidocs.tasks.ocr_extraction.base.BaseOCRExtractor

BaseOCRExtractor(device: Optional[str] = None, show_log: bool = False, languages: Optional[List[str]] = None, engine_name: Optional[str] = None)

Bases: ABC

Base class for OCR text extraction models.

Initialize the OCR extractor.

Parameters:

  • device (Optional[str], default: None): Device to run model on ('cuda' or 'cpu').
  • show_log (bool, default: False): Whether to show detailed logs.
  • languages (Optional[List[str]], default: None): List of language codes to support (e.g., ['en', 'zh']).
  • engine_name (Optional[str], default: None): Name of the OCR engine for language mapping.

extract (abstract method)

extract(input_path: Union[str, Path, Image], **kwargs) -> OCROutput

Extract text from input image.

Parameters:

  • input_path (Union[str, Path, Image], required): Path to input image or image data.
  • **kwargs (default: {}): Additional model-specific parameters.

Returns:

  • OCROutput: OCROutput containing extracted text.

extract_all

extract_all(input_paths: List[Union[str, Path, Image]], **kwargs) -> List[OCROutput]

Extract text from multiple images.

Parameters:

  • input_paths (List[Union[str, Path, Image]], required): List of image paths or image data.
  • **kwargs (default: {}): Additional model-specific parameters.

Returns:

  • List[OCROutput]: List of OCROutput objects.
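
The relationship between the abstract extract method and the batch helper extract_all can be sketched as follows. This is an illustration under assumptions, not the library's code: SketchExtractor and EchoExtractor are hypothetical classes, results are plain strings rather than OCROutput objects, and the real extract_all may do more than call extract once per input.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchExtractor(ABC):
    """Hypothetical miniature of an extractor base class."""

    @abstractmethod
    def extract(self, input_path: Any, **kwargs) -> Any:
        """Subclasses implement single-image extraction."""

    def extract_all(self, input_paths: List[Any], **kwargs) -> List[Any]:
        # Assumption: batch extraction is a simple loop over extract(),
        # forwarding the same keyword arguments to every call.
        return [self.extract(path, **kwargs) for path in input_paths]


class EchoExtractor(SketchExtractor):
    """Toy subclass that returns a string instead of running OCR."""

    def extract(self, input_path: Any, **kwargs) -> str:
        return f"text from {input_path}"


results = EchoExtractor().extract_all(["a.png", "b.png"])
print(results)
```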

extract_with_layout

extract_with_layout(input_path: Union[str, Path, Image], layout_regions: Optional[List[Dict]] = None, **kwargs) -> OCROutput

Extract text with optional layout information.

Parameters:

  • input_path (Union[str, Path, Image], required): Path to input image or image data.
  • layout_regions (Optional[List[Dict]], default: None): Optional list of layout regions to focus OCR on.
  • **kwargs (default: {}): Additional model-specific parameters.

Returns:

  • OCROutput: OCROutput containing extracted text.

preprocess_input

preprocess_input(input_path: Union[str, Path, Image, ndarray]) -> List[Image.Image]

Convert input to list of PIL Images.

Parameters:

  • input_path (Union[str, Path, Image, ndarray], required): Input image path or image data.

Returns:

  • List[Image]: List of PIL Images.

postprocess_output

postprocess_output(raw_output: Any, img_size: Tuple[int, int]) -> OCROutput

Convert raw OCR output to standardized OCROutput format.

Parameters:

  • raw_output (Any, required): Raw output from OCR engine.
  • img_size (Tuple[int, int], required): Original image size (width, height).

Returns:

  • OCROutput: Standardized OCROutput object.

visualize

visualize(ocr_result: OCROutput, image_path: Union[str, Path, Image], output_path: str = 'visualized.png', box_color: str = 'red', box_width: int = 2, show_text: bool = False, text_color: str = 'blue', font_size: int = 12) -> None

Visualize OCR results by drawing bounding boxes on the original image.

Drawing the detected text regions as bounding boxes makes it easy to compare extractors visually and see which one performs better on a given image.

get_supported_languages

get_supported_languages() -> List[str]

Get list of supported language codes.

set_languages

set_languages(languages: List[str]) -> None

Update supported languages for OCR extraction.

BaseOCRMapper

Handles language code mapping and normalization for OCR engines.

omnidocs.tasks.ocr_extraction.base.BaseOCRMapper

BaseOCRMapper(engine_name: str)

Base class for mapping OCR engine-specific outputs to standardized format.

Initialize mapper for specific OCR engine.

Parameters:

  • engine_name (str, required): Name of the OCR engine (e.g., 'tesseract', 'paddle', 'easyocr').

detect_text_language

detect_text_language(text: str) -> Optional[str]

Detect language of extracted text.

from_standard_language

from_standard_language(standard_language: str) -> str

Convert standard ISO 639-1 language code to engine-specific format.

get_supported_languages

get_supported_languages() -> List[str]

Get list of supported languages for this engine.

normalize_bbox

normalize_bbox(bbox: List[float], img_width: int, img_height: int) -> List[float]

Normalize bounding box coordinates to absolute pixel values.
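
One plausible reading of this step is sketched below. The reference does not state how input coordinates are interpreted, so the rules here are assumptions: a box whose four values all lie in [0, 1] is treated as relative and scaled by the image size, anything else is treated as pixels, and the result is ordered and clamped to the image bounds. The function is a stand-alone illustration, not the library's method.

```python
from typing import List


def normalize_bbox_sketch(bbox: List[float], img_width: int, img_height: int) -> List[float]:
    """Return [x1, y1, x2, y2] in absolute pixels, ordered and clamped to the image."""
    x1, y1, x2, y2 = bbox

    # Assumption: all values within [0, 1] means relative coordinates.
    # A genuine 1-pixel box at the origin would be misread by this heuristic.
    if all(0.0 <= value <= 1.0 for value in bbox):
        x1, x2 = x1 * img_width, x2 * img_width
        y1, y2 = y1 * img_height, y2 * img_height

    # Make sure (x1, y1) is the top-left and (x2, y2) the bottom-right corner.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))

    def clamp(value: float, upper: int) -> float:
        return max(0.0, min(float(value), float(upper)))

    return [
        clamp(left, img_width),
        clamp(top, img_height),
        clamp(right, img_width),
        clamp(bottom, img_height),
    ]


relative = normalize_bbox_sketch([0.25, 0.5, 0.75, 1.0], 1000, 500)
pixels = normalize_bbox_sketch([120, -5, 40, 900], 100, 50)
print(relative)
print(pixels)
```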

to_standard_language

to_standard_language(engine_language: str) -> str

Convert engine-specific language code to standard ISO 639-1.
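
The two conversion methods are inverses of each other for a given engine. A minimal sketch for Tesseract, which uses three-letter codes such as 'eng' rather than ISO 639-1 codes such as 'en', might look like this. The table covers only a handful of languages, the function names mirror the methods above purely for readability, and returning unknown codes unchanged is an assumption of this sketch rather than documented behavior.

```python
# Partial ISO 639-1 -> Tesseract traineddata code table (illustrative subset).
ISO_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "zh": "chi_sim",  # Simplified Chinese
}
# Reverse lookup built from the same table so the two stay in sync.
TESSERACT_TO_ISO = {engine: iso for iso, engine in ISO_TO_TESSERACT.items()}


def from_standard_language(standard_language: str) -> str:
    """ISO 639-1 code -> Tesseract code; unknown codes pass through unchanged."""
    return ISO_TO_TESSERACT.get(standard_language.lower(), standard_language)


def to_standard_language(engine_language: str) -> str:
    """Tesseract code -> ISO 639-1 code; unknown codes pass through unchanged."""
    return TESSERACT_TO_ISO.get(engine_language.lower(), engine_language)


print(from_standard_language("en"))
print(to_standard_language("chi_sim"))
```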