OCR Extraction Guide¶
Extract text with precise bounding boxes at character, word, line, or block level. This guide covers when to use OCR, available models, multi-language support, and performance considerations.
Table of Contents¶
- OCR vs Text Extraction vs Layout
- Available Models
- Basic Usage
- Extracting Bounding Boxes
- Filtering Results
- Multi-Language Support
- Performance Comparison
- Troubleshooting
OCR vs Text Extraction vs Layout¶
| Feature | OCR | Text Extraction | Layout Detection |
|---|---|---|---|
| Returns | Text + bounding boxes | Formatted text only | Bounding boxes only |
| Granularity | Character/word/line | Full document | Element-level |
| Location Info | Yes (precise) | No | Yes (element regions) |
| Output Type | List of text blocks | Single formatted string | List of elements |
| Use Case | Word spotting, re-OCR, handwriting | Document parsing | Structure analysis |
| Latency | 1-2 sec per page | 2-5 sec per page | 0.5-1 sec per page |
| Example Output | [{"text": "Hello", "bbox": [10, 20, 50, 35]}] |
"# Hello\n\nWorld" |
[{"label": "title", "bbox": [...]}] |
Choose OCR when:
- You need precise character locations
- Building re-OCR or correction pipelines
- Extracting structured data from tables (get cell coordinates first)
- Analyzing handwriting
- Building word spotting systems
Choose Text Extraction when:
- Converting documents to a readable format
- Extracting full document content
- Building markdown/HTML outputs
- Content quality matters more than location
Choose Layout Detection when:
- Understanding document structure
- Filtering unwanted elements
- Multi-stage processing
Available Models¶
Model Comparison¶
| Model | Speed | Accuracy | Languages | GPU Req | Best For |
|---|---|---|---|---|---|
| Tesseract | ⭐⭐⭐⭐⭐ (Fastest on CPU) | ⭐⭐⭐ (Good) | 100+ | None | Legacy, CPU-only |
| EasyOCR | ⭐⭐⭐ (Medium) | ⭐⭐⭐⭐ (Very Good) | 80+ | Optional | Production use |
| PaddleOCR | ⭐⭐⭐⭐ (Fastest on GPU) | ⭐⭐⭐⭐ (Very Good) | 11 | Optional | Speed-critical, Asian text |
| CRAFT | ⭐⭐⭐ (Medium) | ⭐⭐⭐⭐ (Very Good) | English | Optional | Scene text detection |
1. Tesseract (CPU-only)¶
Traditional OCR engine, excellent for clean printed text.
Strengths:
- No GPU required; runs entirely on CPU
- Extremely fast
- Supports 100+ languages
- Proven and reliable
- Open source (Apache 2.0)
Weaknesses:
- Lower accuracy on complex layouts
- Struggles with handwriting
- Needs training data for custom fonts
When to use:
- CPU-only systems (Raspberry Pi, servers)
- Clean printed documents
- Cost-sensitive applications
- Multi-language documents
Languages: 100+ (English, Chinese, Arabic, Hindi, etc.)
2. EasyOCR (GPU-recommended)¶
Deep learning OCR with excellent accuracy.
Strengths:
- Very high accuracy on diverse text
- Supports 80+ languages
- Works with or without GPU
- Easy API
- Good on real-world documents
Weaknesses:
- Slower than PaddleOCR
- Higher memory usage
- Requires downloading large models
When to use:
- High accuracy needed
- Mixed-language documents
- Production systems
- Irregular text layouts
Languages: English, Chinese, Japanese, Korean, Arabic, Hindi, etc. (80+ total)
3. PaddleOCR (Fastest with GPU)¶
Lightweight OCR optimized for speed.
Strengths:
- Fastest inference speed
- Small model size
- Excellent Asian language support
- Works on CPU and GPU
- Very efficient
Weaknesses:
- Fewer languages than EasyOCR
- Slightly lower accuracy on English
- Limited handwriting support
When to use:
- Performance-critical applications
- Asian language documents
- Resource-constrained environments
- High-throughput pipelines
Languages: English, Chinese, Japanese, Korean, Arabic (main languages)
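If the deployment target is not fixed, one pattern is to pick the engine at runtime based on whether a GPU is available. A minimal sketch using the configs above; the torch-based GPU check and the build_ocr helper are assumptions for illustration, not part of the library:
import torch
from omnidocs.tasks.ocr_extraction import (
    EasyOCR, EasyOCRConfig,
    Tesseract, TesseractConfig,
)

def build_ocr():
    """Pick an OCR backend for the current hardware (hypothetical helper)."""
    if torch.cuda.is_available():
        # GPU present: favor accuracy with EasyOCR
        return EasyOCR(config=EasyOCRConfig(languages=["en"], gpu=True))
    # CPU only: favor speed with Tesseract
    return Tesseract(config=TesseractConfig(language="eng"))

ocr = build_ocr()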
Basic Usage¶
Example 1: Simple Word-Level OCR¶
Extract text with word-level bounding boxes.
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
from PIL import Image
image = Image.open("document_page.png")
# Initialize EasyOCR for high accuracy
config = EasyOCRConfig(
    languages=["en"],  # English only (faster)
    gpu=True,          # Use GPU if available
)
ocr = EasyOCR(config=config)
# Extract text with bounding boxes
result = ocr.extract(image)
print(f"Extracted {len(result.text_blocks)} text blocks")
# Access text and locations
for block in result.text_blocks:
    print(f"Text: '{block.text}'")
    print(f"Bbox: {block.bbox}")
    print(f"Confidence: {block.confidence:.2f}")
    print()
Output Example:
Extracted 5 text blocks
Text: 'Document'
Bbox: BoundingBox(x1=10, y1=5, x2=120, y2=30)
Confidence: 0.98
Text: 'Title'
Bbox: BoundingBox(x1=10, y1=35, x2=100, y2=55)
Confidence: 0.97
...
Example 2: Fast CPU-Only OCR (Tesseract)¶
Use Tesseract for fast CPU-only extraction.
from omnidocs.tasks.ocr_extraction import Tesseract, TesseractConfig
from PIL import Image
image = Image.open("document_page.png")
# Initialize Tesseract (CPU only, no GPU)
config = TesseractConfig(
    language="eng",    # Single language for speed
    config="--psm 3",  # Page segmentation mode
)
ocr = Tesseract(config=config)
# Extract
result = ocr.extract(image)
print(f"Found {len(result.text_blocks)} words")
# Display results with confidence
high_confidence = [b for b in result.text_blocks if b.confidence > 0.9]
print(f"High confidence blocks: {len(high_confidence)}")
# Get plain text
print("\nExtracted text:")
print(" ".join(block.text for block in result.text_blocks))
Example 3: Multi-Language OCR¶
Extract from documents with multiple languages.
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
from PIL import Image
image = Image.open("multilingual_document.png")
# Support multiple languages
config = EasyOCRConfig(
    languages=["en", "zh", "ar"],  # English, Chinese, Arabic
    gpu=True,
)
ocr = EasyOCR(config=config)
result = ocr.extract(image)
# Group by detected language (if available)
for block in result.text_blocks:
    print(f"[{block.language}] {block.text}")
Example 4: PDF with Character-Level Extraction¶
Extract at character granularity from PDF.
from omnidocs import Document
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig
# Load PDF
doc = Document.from_pdf("document.pdf")
# Initialize PaddleOCR for character-level extraction
config = PaddleOCRConfig(
    languages=["en", "ch"],  # English and Chinese
    gpu=True,
)
ocr = PaddleOCR(config=config)
# Process first page
page_image = doc.get_page(0)
result = ocr.extract(page_image, granularity="character")
# Access character-level data
char_count = len(result.text_blocks)
print(f"Extracted {char_count} characters")
# Find coordinates of specific character
for block in result.text_blocks:
    if block.text == "A":
        print(f"Found 'A' at {block.bbox}")
        break
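To run the same extraction over every page, loop over the document. A minimal sketch; the page_count attribute used below is an assumption, so substitute whatever page count your Document object actually exposes:
all_blocks = []
num_pages = getattr(doc, "page_count", 1)  # assumed attribute; adjust to your API
for page_index in range(num_pages):
    page_image = doc.get_page(page_index)
    page_result = ocr.extract(page_image)
    # Tag each block with its page so downstream code can locate it
    for block in page_result.text_blocks:
        all_blocks.append((page_index, block))
print(f"Collected {len(all_blocks)} blocks across {num_pages} pages")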
Extracting Bounding Boxes¶
Get Text Blocks with Coordinates¶
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
from PIL import Image
image = Image.open("document.png")
config = EasyOCRConfig(languages=["en"], gpu=True)
ocr = EasyOCR(config=config)
result = ocr.extract(image)
# Print detailed block information
for block in result.text_blocks:
    x1, y1, x2, y2 = block.bbox.x1, block.bbox.y1, block.bbox.x2, block.bbox.y2
    width = x2 - x1
    height = y2 - y1
    print(f"'{block.text}' @ ({x1:.0f}, {y1:.0f}) "
          f"size: {width:.0f}x{height:.0f} "
          f"conf: {block.confidence:.2f}")
Convert to Normalized Coordinates¶
Convert pixel coordinates to a normalized 0-1024 range.
# Normalize bounding boxes to 0-1024 range
image_width, image_height = image.size
normalized_blocks = result.get_normalized_blocks()
for block in normalized_blocks:
    # Coordinates now in 0-1024 range
    print(f"'{block.text}' @ {block.bbox} (normalized)")
# Manual normalization
NORM_SIZE = 1024
def normalize_bbox(bbox, image_size):
    """Convert pixel bbox to normalized 0-1024."""
    img_w, img_h = image_size
    x1 = int(bbox.x1 * NORM_SIZE / img_w)
    y1 = int(bbox.y1 * NORM_SIZE / img_h)
    x2 = int(bbox.x2 * NORM_SIZE / img_w)
    y2 = int(bbox.y2 * NORM_SIZE / img_h)
    return (x1, y1, x2, y2)
for block in result.text_blocks:
    norm_bbox = normalize_bbox(block.bbox, (image_width, image_height))
    print(f"Normalized: {norm_bbox}")
Extract from Specific Regions¶
Get OCR results from a cropped region.
# Crop image to specific region
region_bbox = (100, 100, 500, 400) # x1, y1, x2, y2
cropped = image.crop(region_bbox)
# Run OCR on crop
result_crop = ocr.extract(cropped)
# Adjust bboxes back to original image coordinates
x1_offset, y1_offset = region_bbox[0], region_bbox[1]
for block in result_crop.text_blocks:
    # Shift coordinates
    adjusted_bbox = (
        block.bbox.x1 + x1_offset,
        block.bbox.y1 + y1_offset,
        block.bbox.x2 + x1_offset,
        block.bbox.y2 + y1_offset,
    )
    print(f"'{block.text}' @ {adjusted_bbox}")
Filtering Results¶
Filter by Confidence¶
Keep only high-confidence extractions.
# Filter by confidence threshold
min_confidence = 0.85
confident_blocks = [
    b for b in result.text_blocks
    if b.confidence >= min_confidence
]
print(f"Original: {len(result.text_blocks)} blocks")
print(f"Filtered (conf >= {min_confidence}): {len(confident_blocks)} blocks")
# Display confidence distribution
confidences = [b.confidence for b in result.text_blocks]
print(f"Confidence range: {min(confidences):.2f} - {max(confidences):.2f}")
Filter by Region¶
Extract OCR results from specific image regions.
def is_in_region(bbox, region):
    """Check if bbox overlaps with region."""
    rx1, ry1, rx2, ry2 = region
    return not (bbox.x2 < rx1 or bbox.x1 > rx2 or
                bbox.y2 < ry1 or bbox.y1 > ry2)
# Top-left region
top_left = (0, 0, image.width//2, image.height//2)
top_left_blocks = [b for b in result.text_blocks if is_in_region(b.bbox, top_left)]
# Sidebar region
sidebar = (0, 0, 200, image.height)
sidebar_blocks = [b for b in result.text_blocks if is_in_region(b.bbox, sidebar)]
print(f"Top-left blocks: {len(top_left_blocks)}")
print(f"Sidebar blocks: {len(sidebar_blocks)}")
Filter by Text Content¶
Find blocks matching patterns.
import re
# Find numbers
number_blocks = [
    b for b in result.text_blocks
    if re.match(r'^\d+$', b.text.strip())
]
# Find email addresses
email_blocks = [
    b for b in result.text_blocks
    if re.match(r'^[^\s@]+@[^\s@]+\.[^\s@]+$', b.text.strip())
]
# Find specific phrases
phrase_blocks = [
    b for b in result.text_blocks
    if "important" in b.text.lower()
]
print(f"Numbers: {len(number_blocks)}")
print(f"Emails: {len(email_blocks)}")
print(f"'Important' mentions: {len(phrase_blocks)}")
Filter by Size¶
Exclude very small or very large blocks.
# Calculate block dimensions
def get_size(bbox):
    return (bbox.x2 - bbox.x1, bbox.y2 - bbox.y1)
# Keep medium-sized blocks
medium_blocks = []
for b in result.text_blocks:
    width, height = get_size(b.bbox)
    if 30 < width < 500 and 10 < height < 100:
        medium_blocks.append(b)
print(f"Medium-sized blocks: {len(medium_blocks)}/{len(result.text_blocks)}")
# Analyze size distribution
sizes = [get_size(b.bbox) for b in result.text_blocks]
avg_width = sum(w for w, h in sizes) / len(sizes)
avg_height = sum(h for w, h in sizes) / len(sizes)
print(f"Average block size: {avg_width:.0f}x{avg_height:.0f}")
Multi-Language Support¶
Auto-Detect Language¶
EasyOCR can auto-detect language.
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
# Auto-detect (leave empty or None)
config = EasyOCRConfig(
    languages=None,  # Auto-detect all languages
    gpu=True,
)
ocr = EasyOCR(config=config)
result = ocr.extract(image)
# Check detected languages
detected_langs = set()
for block in result.text_blocks:
    if hasattr(block, 'language'):
        detected_langs.add(block.language)
print(f"Detected languages: {detected_langs}")
Process Mixed-Language Documents¶
Handle documents with multiple languages.
# Support common languages
config = EasyOCRConfig(
    languages=["en", "zh", "ar", "hi", "ja"],  # English, Chinese, Arabic, Hindi, Japanese
    gpu=True,
)
ocr = EasyOCR(config=config)
result = ocr.extract(image)
# Group results by language
from collections import defaultdict
by_language = defaultdict(list)
for block in result.text_blocks:
    lang = getattr(block, 'language', 'unknown')
    by_language[lang].append(block)
for lang, blocks in by_language.items():
    print(f"\n{lang.upper()} ({len(blocks)} blocks):")
    for block in blocks[:3]:  # Show first 3
        print(f"  {block.text}")
Language-Specific Optimization¶
Different languages need different models.
# For English only (fastest)
config_en = EasyOCRConfig(languages=["en"], gpu=True)
# For Asian languages (use PaddleOCR)
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig
config_cn = PaddleOCRConfig(languages=["ch"], gpu=True) # Chinese
# For Arabic/Hebrew (right-to-left)
config_rtl = EasyOCRConfig(languages=["ar"], gpu=True)
# For handwriting
config_hw = EasyOCRConfig(languages=["en"], gpu=True)
# Note: Most OCR models struggle with handwriting
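One way to keep these choices in a single place is a small dispatch keyed by the document's primary language. The sketch below only reflects the recommendations above; the helper name and the language-code mapping are illustrative assumptions:
from omnidocs.tasks.ocr_extraction import (
    EasyOCR, EasyOCRConfig, PaddleOCR, PaddleOCRConfig,
)

def ocr_for_language(primary_lang):
    """Return an OCR instance suited to the document's primary language (hypothetical helper)."""
    if primary_lang in ("zh", "ja", "ko"):
        # Asian scripts: PaddleOCR is fast and well suited
        # (PaddleOCR uses "ch" for Chinese, unlike EasyOCR's "zh")
        paddle_lang = "ch" if primary_lang == "zh" else primary_lang
        return PaddleOCR(config=PaddleOCRConfig(languages=[paddle_lang], gpu=True))
    # Everything else: EasyOCR for broad language coverage
    return EasyOCR(config=EasyOCRConfig(languages=[primary_lang], gpu=True))

ocr = ocr_for_language("zh")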
Performance Comparison¶
Speed Benchmarks¶
Processing a typical page (300 DPI, ~2000x3000px):
| Model | CPU Time | GPU Time | Relative Speed | Memory |
|---|---|---|---|---|
| Tesseract | 0.5-1.0s | N/A | Very Fast | ~100MB |
| PaddleOCR | 1-2s | 0.3-0.5s | Fast | ~500MB |
| EasyOCR | 2-4s | 0.5-1.0s | Medium | ~1GB |
Choose by Speed Requirements¶
| Requirement | Model |
|---|---|
| <200ms per page | PaddleOCR (GPU) with reduced resolution |
| <500ms per page | PaddleOCR (GPU) |
| <1s per page | Tesseract (CPU) or EasyOCR (GPU) |
| 1-2s per page acceptable | PaddleOCR (CPU) or EasyOCR (GPU) |
| Accuracy paramount | EasyOCR |
Optimization for Speed¶
import time
# Fast configuration
config_fast = PaddleOCRConfig(
    languages=["en"],  # Single language
    gpu=True,
)
ocr = PaddleOCR(config=config_fast)
# Benchmark
images = [Image.open(f"doc{i}.png") for i in range(5)]
start = time.time()
for img in images:
    result = ocr.extract(img)
elapsed = time.time() - start
print(f"Processed {len(images)} images in {elapsed:.1f}s")
print(f"Average: {elapsed/len(images):.2f}s per image")
Troubleshooting¶
Low Accuracy¶
Problem: OCR results have many errors.
Solutions:
1. Try a different model (EasyOCR is typically more accurate)
2. Improve image quality
3. Use a single language (faster and more accurate)
# Solution 1: Use EasyOCR (more accurate)
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig
config = EasyOCRConfig(languages=["en"], gpu=True)
# Solution 2: Improve image quality
from PIL import Image, ImageEnhance
img = Image.open("noisy_scan.png")
# Increase contrast
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(1.5) # 50% more contrast
# Increase sharpness
enhancer = ImageEnhance.Sharpness(img)
img = enhancer.enhance(2.0) # 2x sharpness
# Resize if too small
if img.width < 1024:
    img = img.resize((img.width * 2, img.height * 2), Image.Resampling.LANCZOS)
result = ocr.extract(img)
# Solution 3: Single language
config = EasyOCRConfig(languages=["en"], gpu=True) # Just English
Missing Text¶
Problem: Some text is not detected.
Solutions:
1. Check image quality
2. Lower the confidence threshold
3. Try a different OCR model
# Solution 1: Check image
print(f"Image size: {image.size}")
print(f"Image mode: {image.mode}")
# Solution 2: Lower confidence
all_blocks = result.text_blocks # Includes low confidence
confidence_dist = [b.confidence for b in result.text_blocks]
print(f"Confidence range: {min(confidence_dist):.2f}-{max(confidence_dist):.2f}")
# Get even low-confidence blocks
low_conf_blocks = [b for b in result.text_blocks if b.confidence < 0.5]
print(f"Low confidence blocks: {len(low_conf_blocks)}")
False Detections¶
Problem: Non-text regions are detected as text.
Solutions:
1. Increase the confidence threshold
2. Filter by text length
3. Manual post-processing
# Solution 1: Increase confidence
high_conf = [b for b in result.text_blocks if b.confidence > 0.95]
# Solution 2: Filter short blocks (likely noise)
MIN_CHARS = 2
valid_blocks = [b for b in result.text_blocks if len(b.text) >= MIN_CHARS]
# Solution 3: Remove non-alphabetic text
import string
alpha_blocks = [
    b for b in result.text_blocks
    if any(c.isalpha() for c in b.text)
]
Slow Performance¶
Problem: OCR takes too long.
Solutions:
1. Use a faster model (PaddleOCR or Tesseract)
2. Reduce image resolution
3. Use a GPU
4. Use a single language
# Solution 1: Use PaddleOCR (faster)
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig
ocr = PaddleOCR(config=PaddleOCRConfig(languages=["en"], gpu=True))  # Fastest on GPU
# Solution 2: Reduce resolution
image = image.resize((image.width // 2, image.height // 2))
# Solution 3: Ensure GPU enabled
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
# Solution 4: Single language
config = PaddleOCRConfig(languages=["en"], gpu=True)
Next Steps:
- See the Text Extraction Guide for formatted document output
- See the Layout Analysis Guide for document structure
- See the Batch Processing Guide for processing many documents