Tesseract¶

Traditional OCR engine with 100+ language support.

Overview¶


Tasks	OCR
Backends	CPU only
Speed	0.5-1s/page
Quality	Good
Memory	Minimal

Why Tesseract¶

No GPU required - Runs on any machine
100+ languages - Best multilingual support
Free & open source - Apache 2.0 license
Battle-tested - Decades of production use
Lightweight - Minimal dependencies

Basic Usage¶

from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractConfig
from PIL import Image

image = Image.open("document.png")

ocr = TesseractOCR(
    config=TesseractConfig(languages=["eng"])
)

result = ocr.extract(image)

for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox}")

Configuration¶

config = TesseractConfig(
    languages=["eng"],           # Language codes
    config="--psm 3",            # Page segmentation mode
)

Page Segmentation Modes (PSM)¶

Mode	Description
`--psm 0`	Orientation and script detection only
`--psm 1`	Automatic with OSD
`--psm 3`	Fully automatic (default)
`--psm 6`	Assume uniform block of text
`--psm 11`	Sparse text, no order
`--psm 13`	Raw line, single text line

Multi-Language Support¶

# Single language
config = TesseractConfig(languages=["eng"])

# Multiple languages
config = TesseractConfig(languages=["eng", "fra", "deu"])

# All available languages
config = TesseractConfig(languages=["eng", "chi_sim", "jpn", "ara"])

Common Language Codes¶

Code	Language
`eng`	English
`chi_sim`	Chinese (Simplified)
`chi_tra`	Chinese (Traditional)
`jpn`	Japanese
`kor`	Korean
`ara`	Arabic
`hin`	Hindi
`fra`	French
`deu`	German
`spa`	Spanish
`por`	Portuguese
`rus`	Russian

Full list: Tesseract Languages

Filtering Results¶

# By confidence
confident = [b for b in result.text_blocks if b.confidence >= 0.9]

# By text length
words = [b for b in result.text_blocks if len(b.text) >= 2]

# By region
top_half = [b for b in result.text_blocks if b.bbox.y1 < image.height / 2]

Installation¶

Tesseract must be installed on your system:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

# Install additional languages
sudo apt install tesseract-ocr-chi-sim  # Chinese
sudo apt install tesseract-ocr-jpn      # Japanese

Troubleshooting¶

"tesseract not found"

# Install Tesseract system package
brew install tesseract  # macOS
sudo apt install tesseract-ocr  # Linux

Low accuracy - Increase image resolution (300 DPI recommended) - Improve image contrast - Use single language mode - Try different PSM mode

Missing language

# Install language data
sudo apt install tesseract-ocr-[lang]