Skip to content

Tesseract OCR

Model Overview

Tesseract is the leading open-source Optical Character Recognition (OCR) engine, maintained by Google since 2006. It's the most widely deployed OCR solution and excels at printed text in 100+ languages.

Project: GitHub tesseract-ocr Architecture: Traditional OCR (legacy and LSTM-based) Training Focus: Printed documents in all major languages Framework: C/C++ with Python bindings

Key Capabilities

  • Language Support: 100+ languages with high quality
  • Multilingual Documents: Seamlessly handle mixed-language text
  • Word-Level Bounding Boxes: Get exact position of each word
  • Line-Level Grouping: Option to return line-level blocks
  • CPU-Only: No GPU required, runs anywhere
  • Free & Open Source: No license or API costs
  • Configurable: Fine-tuned via OCR engine modes and page segmentation

Limitations

  • Printed Text Only: Struggles with handwriting (see Surya for handwritten)
  • CPU-Bound: Slower than GPU-based OCR (2-5 seconds per page)
  • Quality Variance: Heavily dependent on image quality and preprocessing
  • Skewed Documents: Needs de-skewing for rotated documents
  • Low Contrast: Performs poorly on light text or images
  • No Layout Analysis: Returns text only, no structural information (use DocLayout-YOLO for layout)

System Installation

Required System Dependencies

Tesseract must be installed at the operating system level before Python can use it.

macOS (using Homebrew):

brew install tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows: Download and install from GitHub releases

Verify Installation:

tesseract --version
# Should output version and supported languages

Python Package Installation

# Install OmniDocs with OCR support
pip install omnidocs[pytorch]

# Or install pytesseract directly
pip install pytesseract

# Verify
python -c "import pytesseract; print(pytesseract.get_languages())"

Configuration

Basic Configuration

from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

config = TesseractOCRConfig(
    languages=["eng"],           # Single language
    oem=3,                       # OCR Engine Mode (default)
    psm=3,                       # Page Segmentation Mode
)

ocr = TesseractOCR(config=config)

Config Parameters:

Parameter Type Default Description
languages List[str] ["eng"] Language codes (e.g., ["eng", "fra", "deu"])
tessdata_dir str None Custom tessdata directory path
oem int 3 OCR Engine Mode (0-3)
psm int 3 Page Segmentation Mode (0-13)
config_params Dict None Additional Tesseract config options

Available Languages

# List all installed languages
tesseract --list-langs

# Sample common languages:
# eng (English)        fra (French)         deu (German)
# spa (Spanish)        ita (Italian)        por (Portuguese)
# chi_sim (Simplified Chinese)  jpn (Japanese)  kor (Korean)
# ara (Arabic)         rus (Russian)        hin (Hindi)

OCR Engine Modes (OEM)

OEM Name Best For Speed
0 Legacy Old documents Fast
1 LSTM Modern text Accurate
2 Legacy+LSTM Mixed quality Medium
3 Default Auto-detect Medium

Recommendation: Use OEM=3 (automatic, recommended for most documents)

Page Segmentation Modes (PSM)

PSM Description Use Case
0 OSD only Orientation detection only
3 Fully automatic (default) Mixed layouts, images, text
6 Uniform block Single column of text
7 Single text line Single line input
11 Sparse text Scattered text, forms
13 Raw line Treat each line as a word

Recommendation: Use PSM=3 for documents, PSM=11 for forms


Usage Examples

Basic Text Extraction

from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig
from PIL import Image

# Initialize
config = TesseractOCRConfig(languages=["eng"])
ocr = TesseractOCR(config=config)

# Extract text
image = Image.open("document.png")
result = ocr.extract(image)

# Access results
print(result.full_text)           # Complete extracted text
print(result.text_blocks)         # List of TextBlock objects

for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox.to_list()} ({block.confidence:.2%})")

Line-Level Extraction

# Extract at line level (grouped words)
result = ocr.extract_lines(image)

for block in result.text_blocks:
    print(f"{block.text}")

# Useful for:
# - Preserving line breaks
# - Document structure
# - Form processing

Multilingual Documents

# Extract from document with mixed languages
config = TesseractOCRConfig(
    languages=["eng", "fra", "deu"],  # English, French, German
    oem=2,  # Legacy+LSTM for better multilingual support
)
ocr = TesseractOCR(config=config)

result = ocr.extract(image)
print(f"Languages detected: {result.languages_detected}")
print(f"Text: {result.full_text}")

Specialized Document Configuration

# For forms with sparse text
form_config = TesseractOCRConfig(
    languages=["eng"],
    psm=11,  # Sparse text mode
    config_params={
        "tessedit_char_whitelist": "0123456789/-.()",  # Digits, symbols only
    },
)
ocr_form = TesseractOCR(config=form_config)

result = ocr_form.extract(form_image)

Batch Processing

from pathlib import Path
import json

# Process multiple images
doc_dir = Path("documents/")
results = {}

config = TesseractOCRConfig(languages=["eng"])
ocr = TesseractOCR(config=config)

for img_path in sorted(doc_dir.glob("*.png"))[:10]:
    print(f"Processing {img_path.name}...")
    image = Image.open(img_path)
    result = ocr.extract(image)

    results[img_path.name] = {
        "text": result.full_text,
        "word_count": len(result.text_blocks),
        "confidence": sum(
            b.confidence for b in result.text_blocks
        ) / len(result.text_blocks) if result.text_blocks else 0,
    }

# Save results
with open("ocr_results.json", "w") as f:
    json.dump(results, f, indent=2)

Image Preprocessing for Better Results

OCR quality depends heavily on image quality. Pre-process images for best results:

Contrast Enhancement

from PIL import Image, ImageEnhance

image = Image.open("document.png")

# Increase contrast
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.5)  # 1.5x contrast

result = ocr.extract(image)

Grayscale Conversion

# Convert to grayscale (Tesseract prefers grayscale)
image = Image.open("document.png").convert("L")

result = ocr.extract(image)

Deskew (Rotate)

from PIL import Image
import numpy as np

# For skewed documents, rotate to horizontal
image = Image.open("skewed_document.png")

# Simple 90-degree rotations
image = image.rotate(90, expand=True)

# For arbitrary angles (requires deskew library)
from deskew import determine_skew
angle = determine_skew(np.array(image))
if angle:
    image = image.rotate(angle, expand=True)

result = ocr.extract(image)

Upscale Small Text

from PIL import Image

image = Image.open("document.png")

# If text is very small, upscale
if image.size[0] < 1000:
    scale = 2
    new_size = (image.size[0] * scale, image.size[1] * scale)
    image = image.resize(new_size, Image.Resampling.LANCZOS)

result = ocr.extract(image)

Complete Preprocessing Pipeline

from PIL import Image, ImageEnhance, ImageFilter
import numpy as np

def preprocess_image(image_path):
    img = Image.open(image_path)

    # 1. Convert to grayscale
    img = img.convert("L")

    # 2. Enhance contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(1.5)

    # 3. Sharpen
    img = img.filter(ImageFilter.SHARPEN)

    # 4. Upscale if small
    if img.size[0] < 1000:
        new_size = (img.size[0] * 2, img.size[1] * 2)
        img = img.resize(new_size, Image.Resampling.LANCZOS)

    return img

# Use in OCR
preprocessed = preprocess_image("low_quality.png")
result = ocr.extract(preprocessed)

Performance & Accuracy

Speed Characteristics

Setup Image Size Speed Device
Single-threaded 1024x768 2-3s CPU
4-threaded 1024x768 0.5-1s CPU (4 cores)
GPU-accelerated 1024x768 0.2-0.5s GPU (if compiled)

Accuracy by Document Type

Document Type Quality Accuracy Notes
Printed text High 95-99% Best case scenario
Scanned PDF Medium 85-95% Needs preprocessing
Handwriting High 30-60% Poor, use Surya instead
Low contrast Low 20-50% Needs enhancement
Multiple languages Medium 80-92% OEM 2 recommended

Language Accuracy

Language Accuracy Notes
English (Latin) 95-99% Excellent
European languages 92-98% Very good
Asian languages 80-90% Good (requires language pack)
Mixed script 75-85% Challenging

Troubleshooting

Installation Issues

Symptom: ModuleNotFoundError: No module named 'tesseract'

Solution:

# Install system Tesseract first (OS-specific)
# macOS
brew install tesseract

# Then install Python package
pip install pytesseract

Symptom: Python can't find Tesseract binary

Solution:

import pytesseract
from pathlib import Path

# Option 1: Specify path in code
pytesseract.pytesseract.pytesseract_cmd = r'/usr/local/bin/tesseract'

# Option 2: Configure in TesseractOCRConfig
config = TesseractOCRConfig(
    tessdata_dir="/path/to/tessdata",
)

Language Pack Issues

Symptom: Language not found when trying to use non-English

Solution:

# Check installed languages
tesseract --list-langs

# Install additional language data (macOS)
brew install tesseract-lang

# Verify after installation
tesseract --list-langs | grep fra  # Check for French

Poor OCR Quality

Symptom: Garbled or incomplete text output

Solutions (in order of likelihood):

# 1. Preprocess image (most common fix)
image = image.convert("L")  # Grayscale
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.5)

# 2. Try different PSM
config = TesseractOCRConfig(psm=6)  # Uniform block

# 3. Try different OEM
config = TesseractOCRConfig(oem=1)  # LSTM only

# 4. Upscale image
image = image.resize(
    (image.size[0] * 2, image.size[1] * 2),
    Image.Resampling.LANCZOS
)

# 5. Try line-level (may preserve structure)
result = ocr.extract_lines(image)

Performance Issues (Slow)

Symptom: OCR takes 5+ seconds per page

Solutions:

# 1. Use simpler PSM (fewer segmentation steps)
config = TesseractOCRConfig(psm=6)  # Faster

# 2. Reduce image size
image.thumbnail((2048, 2048))

# 3. Use faster OEM
config = TesseractOCRConfig(oem=0)  # Legacy (faster)

# 4. Process on machine with more CPU cores
# (Tesseract can use multiple cores)

Tesseract vs Other OCR Models

Feature Tesseract EasyOCR PaddleOCR Surya
Speed Medium Fast Very Fast Medium
Language Support 100+ 80+ 80+ Multi
Handwriting Poor Medium Medium Excellent
GPU Required No Yes Yes Yes
Setup System install Python only Python only Python only
Cost Free Free Free Free
Best For Printed docs General Asian languages Handwriting

Choose Tesseract if: - You need CPU-only processing - Processing printed text in 100+ languages - Want zero GPU dependency

Not ideal for: - Handwritten documents (use Surya) - Real-time processing (use PaddleOCR) - Asian documents only (use PaddleOCR)


API Reference

TesseractOCR.extract()

def extract(image: Union[Image.Image, np.ndarray, str, Path]) -> OCROutput:
    """
    Run word-level OCR on an image.

    Args:
        image: Input image (PIL Image, numpy array, or path)

    Returns:
        OCROutput with word-level text blocks
    """

TesseractOCR.extract_lines()

def extract_lines(image: Union[Image.Image, np.ndarray, str, Path]) -> OCROutput:
    """
    Run line-level OCR on an image.

    Groups words into lines based on Tesseract's line detection.

    Args:
        image: Input image

    Returns:
        OCROutput with line-level text blocks
    """

OCROutput Properties

result = ocr.extract(image)

# Text content
result.full_text            # Complete text (word-separated)
result.text_blocks          # List[TextBlock] objects
result.model_name           # "tesseract"
result.languages_detected   # Languages used

# Image info
result.image_width          # Source width in pixels
result.image_height         # Source height in pixels

# Statistics
len(result.text_blocks)     # Number of detected words/lines

TextBlock Properties

for block in result.text_blocks:
    block.text              # Word or line text
    block.bbox              # BoundingBox object
    block.confidence        # float (0-1)
    block.granularity       # WORD or LINE
    block.language          # Detected language code

BoundingBox Methods

bbox = block.bbox

# Access coordinates
bbox.x1, bbox.y1           # Top-left corner
bbox.x2, bbox.y2           # Bottom-right corner
bbox.width                 # Width
bbox.height                # Height

# Convert formats
bbox.to_list()             # [x1, y1, x2, y2]
bbox.to_xyxy()             # (x1, y1, x2, y2)
bbox.to_xywh()             # (x, y, width, height)

Advanced Configuration

Custom Tesseract Parameters

# Additional config parameters
config = TesseractOCRConfig(
    languages=["eng"],
    config_params={
        # Whitelist specific characters
        "tessedit_char_whitelist": "0123456789ABCDEFabcdef",

        # Ignore words shorter than N characters
        "min_characters_to_try": 3,

        # Set segmentation to all caps
        "tessedit_create_pdf": 0,  # Don't create PDF
    },
)

Parallel Processing

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_image(image_path):
    image = Image.open(image_path)
    return ocr.extract(image)

# Process multiple images in parallel
doc_dir = Path("documents/")
images = list(doc_dir.glob("*.png"))

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_image, images))

print(f"Processed {len(results)} images")

See Also