Tesseract
Traditional OCR engine with 100+ language support.
Overview
| Property | Value |
|---|---|
| Tasks | OCR |
| Backends | CPU only |
| Speed | 0.5-1 s/page |
| Quality | Good |
| Memory | Minimal |
Why Tesseract
- No GPU required - Runs on any machine
- 100+ languages - Broad multilingual support
- Free & open source - Apache 2.0 license
- Battle-tested - Decades of production use
- Lightweight - Minimal dependencies
Basic Usage
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractConfig
from PIL import Image
# Load the page image to recognize
image = Image.open("document.png")
# Create an OCR extractor configured for English
ocr = TesseractOCR(
    config=TesseractConfig(languages=["eng"])
)
# Run OCR and print each text block with its bounding box
result = ocr.extract(image)
for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox}")
Configuration
config = TesseractConfig(
    languages=["eng"],  # Language codes
    config="--psm 3",   # Page segmentation mode
)
Page Segmentation Modes (PSM)
| Mode | Description |
|---|---|
| `--psm 0` | Orientation and script detection only |
| `--psm 1` | Automatic with OSD |
| `--psm 3` | Fully automatic (default) |
| `--psm 6` | Assume uniform block of text |
| `--psm 11` | Sparse text, no order |
| `--psm 13` | Raw line, single text line |
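The value passed through `config` is a plain string, so it can be built and validated before it reaches `TesseractConfig`. The sketch below is a hypothetical helper (`psm_flag` and `PSM_MODES` are not part of OmniDocs) that only accepts the modes listed in the table above.

```python
# Modes listed in the table above; this mapping and the helper are illustrative only.
PSM_MODES = {
    0: "Orientation and script detection only",
    1: "Automatic with OSD",
    3: "Fully automatic (default)",
    6: "Assume uniform block of text",
    11: "Sparse text, no order",
    13: "Raw line, single text line",
}


def psm_flag(mode: int) -> str:
    """Return a flag string such as '--psm 6' for a mode listed in PSM_MODES."""
    if mode not in PSM_MODES:
        raise ValueError(
            f"Unsupported PSM mode {mode}; choose from {sorted(PSM_MODES)}"
        )
    return f"--psm {mode}"


print(psm_flag(6))
```

The returned string could then be used as `TesseractConfig(languages=["eng"], config=psm_flag(6))`, following the Configuration example.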
Multi-Language Support
# Single language
config = TesseractConfig(languages=["eng"])
# Multiple languages
config = TesseractConfig(languages=["eng", "fra", "deu"])
# Several scripts at once (each language pack must be installed)
config = TesseractConfig(languages=["eng", "chi_sim", "jpn", "ara"])
Common Language Codes
| Code | Language |
|---|---|
| `eng` | English |
| `chi_sim` | Chinese (Simplified) |
| `chi_tra` | Chinese (Traditional) |
| `jpn` | Japanese |
| `kor` | Korean |
| `ara` | Arabic |
| `hin` | Hindi |
| `fra` | French |
| `deu` | German |
| `spa` | Spanish |
| `por` | Portuguese |
| `rus` | Russian |
Full list: Tesseract Languages
Filtering Results
# By confidence
confident = [b for b in result.text_blocks if b.confidence >= 0.9]
# By text length
words = [b for b in result.text_blocks if len(b.text) >= 2]
# By region
top_half = [b for b in result.text_blocks if b.bbox.y1 < image.height / 2]
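The three filters can also be combined in a single pass. The sketch below uses stand-in `TextBlock` and `BBox` dataclasses so it runs without Tesseract installed; only `text`, `confidence`, and `bbox.y1` come from the examples above, while the other fields and the `filter_blocks` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    # Stand-in type: only y1 appears in the examples above; the rest is assumed.
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class TextBlock:
    # Stand-in for the items found in result.text_blocks.
    text: str
    confidence: float
    bbox: BBox


def filter_blocks(blocks, min_confidence=0.9, min_length=2, max_y1=None):
    """Keep blocks that pass the confidence, length, and optional region checks."""
    kept = []
    for b in blocks:
        if b.confidence < min_confidence:
            continue
        if len(b.text) < min_length:
            continue
        if max_y1 is not None and b.bbox.y1 >= max_y1:
            continue
        kept.append(b)
    return kept


blocks = [
    TextBlock("Invoice", 0.97, BBox(10, 20, 120, 40)),
    TextBlock("l", 0.95, BBox(10, 60, 15, 80)),          # too short
    TextBlock("Total", 0.55, BBox(10, 100, 80, 120)),    # low confidence
    TextBlock("Footer", 0.93, BBox(10, 900, 90, 920)),   # outside the top half
]
page_height = 1000
top_confident = filter_blocks(blocks, max_y1=page_height / 2)
print([b.text for b in top_confident])
```

With real results, `blocks` would be `result.text_blocks` and `page_height` would be `image.height`.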
Installation
Tesseract must be installed on your system:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt install tesseract-ocr
# Install additional languages
sudo apt install tesseract-ocr-chi-sim # Chinese (Simplified)
sudo apt install tesseract-ocr-jpn # Japanese
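Before adding a code to `languages`, it can help to confirm that the matching language pack is installed. The Tesseract CLI lists installed packs with `tesseract --list-langs`; the sketch below parses that output and reports any requested codes that are missing. The helper names and the sample output path are illustrative and not part of OmniDocs.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        codes.add(line)
    return codes


def installed_languages() -> set[str]:
    """Run the Tesseract CLI and return the installed language codes."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: fall back to stderr if a given build prints nothing to stdout.
    return parse_list_langs(proc.stdout or proc.stderr)


def missing_languages(requested, installed):
    """Return the requested codes that are not installed, in request order."""
    return [code for code in requested if code not in installed]


# Hypothetical sample output, so the example runs without Tesseract installed.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
print(missing_languages(["eng", "chi_sim"], parse_list_langs(sample)))
```

On a machine with Tesseract installed, `missing_languages(["eng", "chi_sim"], installed_languages())` would perform the same check against the real installation.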
Troubleshooting
"tesseract not found"
# Install Tesseract system package
brew install tesseract # macOS
sudo apt install tesseract-ocr # Ubuntu/Debian
Low accuracy
- Increase image resolution (300 DPI recommended)
- Improve image contrast
- Use a single language
- Try a different PSM mode
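For the resolution tip, the target pixel size can be estimated from the scan's current DPI. This is a minimal sketch with a hypothetical helper, `size_for_target_dpi`; it assumes the current DPI is known (for example from the scanner settings) and never downscales.

```python
def size_for_target_dpi(size, current_dpi, target_dpi=300):
    """Return the (width, height) needed to reach target_dpi without downscaling."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # A scale below 1.0 would shrink the image, so clamp it to 1.0.
    scale = max(1.0, target_dpi / current_dpi)
    width, height = size
    return round(width * scale), round(height * scale)


# A US Letter page scanned at 100 DPI is 850 x 1100 pixels.
new_size = size_for_target_dpi((850, 1100), current_dpi=100)
print(new_size)
```

The resulting size could be passed to Pillow's `Image.resize` before calling `ocr.extract`. Upscaling cannot restore detail that was never captured, so rescanning at a higher resolution is preferable when possible.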
Missing language