# Text Extraction Guide
Extract formatted text content (Markdown/HTML) from document images using vision-language models. This guide covers when to use text extraction, available models, output formats, and practical examples.
## Table of Contents
- Quick Comparison: Text Extraction vs OCR vs Layout
- Available Models
- Basic Usage
- Output Formats
- Advanced Features
- Performance Optimization
- Troubleshooting
## Quick Comparison
| Feature | Text Extraction | OCR | Layout Detection |
|---|---|---|---|
| Output | Formatted text (MD/HTML) | Text + bounding boxes | Element bounding boxes |
| Use Case | Document parsing, markdown export | Word/character localization | Document structure analysis |
| Models | Qwen3-VL, DotsOCR, Nanonets | Tesseract, EasyOCR, PaddleOCR | DocLayoutYOLO, Qwen-Layout |
| Latency | ~2-5 sec per page | ~1-2 sec per page | ~0.5-1 sec per page |
| Output Type | Single string | List of text blocks | List of bounding boxes |
| Layout Info | Optional (DotsOCR only) | No | Yes (with labels) |
Choose Text Extraction when:

- Converting documents to Markdown/HTML
- Extracting complete page content as formatted text
- Working with complex documents (multi-column, figures, tables)
- You need readable output for downstream processing

Choose OCR when:

- You need precise character/word locations
- Building re-OCR pipelines (e.g., for correction)
- Requiring character-level accuracy metrics

Choose Layout Detection when:

- You need document structure without text content
- Building advanced pipelines (layout + text)
- Analyzing document semantics
## Available Models

### 1. Qwen3-VL (Recommended for most cases)
High-quality general-purpose vision-language model.
Strengths:

- Best output quality across diverse documents
- Multi-backend support (PyTorch, VLLM, MLX, API)
- Consistent Markdown/HTML output
- Good at handling complex layouts

Backends:

- PyTorch: Local GPU inference (single GPU)
- VLLM: High-throughput serving (multiple GPUs)
- MLX: Apple Silicon (local)
- API: Hosted models (cloud)

Model Variants:

- `Qwen/Qwen3-VL-8B-Instruct`: Recommended (8B parameters)
- `Qwen/Qwen3-VL-32B-Instruct`: Higher quality (32B, slower, more VRAM)
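Either variant drops into the same configuration; a minimal sketch mirroring the PyTorch setup used in the examples below, with the 32B id taken from the list above:

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Only the model id changes between variants; swap back to the 8B id to save VRAM
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-32B-Instruct",
    device="cuda",
)
extractor = QwenTextExtractor(backend=config)
```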
### 2. DotsOCR (Best for technical documents)
Optimized for complex technical documents with precise layout preservation.
Strengths:

- Layout-aware extraction with bounding boxes
- Specialized formatting for tables (HTML) and formulas (LaTeX)
- Reading order preservation
- 11-category layout detection

Weaknesses:

- Slower than Qwen (requires layout analysis)
- Higher VRAM requirements

Backends:

- PyTorch: Local GPU inference
- VLLM: High-throughput serving
- API: Hosted models

Output Types:

- Structured JSON with layout information
- Markdown with coordinate annotations
- HTML with bbox attributes
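As a rough sketch of how those output types are requested (this assumes DotsOCR's `extract` accepts the same `output_format` values as the Qwen extractor; the JSON form is shown in detail under Output Formats below):

```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
from PIL import Image

image = Image.open("technical_document.png")
extractor = DotsOCRTextExtractor(backend=DotsOCRPyTorchConfig(device="cuda"))

# Structured JSON with layout information
json_result = extractor.extract(image, output_format="json", include_layout=True)

# Markdown with coordinate annotations (output_format value assumed)
md_result = extractor.extract(image, output_format="markdown", include_layout=True)
```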
### 3. Nanonets (Coming soon)
Specialized for OCR-quality text extraction.
## Basic Usage

### Example 1: Simple Markdown Extraction
Extract a document page to Markdown using PyTorch backend.
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

# Load a single image
image = Image.open("document_page.png")

# Initialize extractor with PyTorch backend
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",        # or "cpu"
    torch_dtype="auto",   # Automatic dtype selection
)
extractor = QwenTextExtractor(backend=config)

# Extract text in Markdown format
result = extractor.extract(image, output_format="markdown")

# Access the extracted content
print(result.content)     # Formatted Markdown text
print(result.word_count)  # Number of words
print(f"Model: {result.model_name}")
```
### Example 2: Extract with Layout Information
Use DotsOCR to get text plus layout annotations.
```python
from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
from PIL import Image
import json

image = Image.open("complex_document.png")

# Initialize DotsOCR with layout detection
config = DotsOCRPyTorchConfig(
    device="cuda",
    max_new_tokens=8192,  # Higher for complex documents
)
extractor = DotsOCRTextExtractor(backend=config)

# Extract with layout information
result = extractor.extract(image, include_layout=True)

# Access layout elements
print(f"Found {result.num_layout_elements} layout elements")
print(f"Content length: {result.content_length} characters")

# Iterate through layout elements
for element in result.layout:
    print(f"[{element.category}] @{element.bbox}: {element.text[:50]}...")

# Save layout information to JSON
layout_json = [elem.model_dump() for elem in result.layout]
with open("layout.json", "w") as f:
    json.dump(layout_json, f, indent=2)
```
### Example 3: Extract PDF Document
Process multiple pages of a PDF document.
```python
from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Load PDF document
doc = Document.from_pdf("multi_page_document.pdf")
print(f"Loaded PDF with {doc.page_count} pages")

# Initialize extractor
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)
extractor = QwenTextExtractor(backend=config)

# Extract text from all pages
all_text = []
for page_idx in range(min(3, doc.page_count)):  # First 3 pages
    page_image = doc.get_page(page_idx)
    result = extractor.extract(page_image, output_format="markdown")
    all_text.append(result.content)
    print(f"Page {page_idx + 1}: {result.word_count} words")

# Combine results
full_document = "\n\n---\n\n".join(all_text)
print(f"\nTotal content: {len(full_document)} characters")

# Save to file
with open("extracted_document.md", "w") as f:
    f.write(full_document)
```
### Example 4: Batch Processing with Progress Tracking
Process multiple documents with progress reporting.
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from pathlib import Path
from PIL import Image
import time

# Find all image files
image_dir = Path("documents/")
image_files = list(image_dir.glob("*.png")) + list(image_dir.glob("*.jpg"))
print(f"Found {len(image_files)} images to process")

# Initialize extractor
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    max_new_tokens=4096,
)
extractor = QwenTextExtractor(backend=config)

# Process with progress tracking
results = {}
start_time = time.time()

for idx, image_path in enumerate(image_files, 1):
    print(f"[{idx}/{len(image_files)}] Processing {image_path.name}...", end=" ")
    try:
        image = Image.open(image_path)
        result = extractor.extract(image, output_format="markdown")
        results[str(image_path)] = {
            "content_length": result.content_length,
            "word_count": result.word_count,
        }
        print(f"✓ ({result.word_count} words)")
    except Exception as e:
        print(f"✗ Error: {e}")
        results[str(image_path)] = {"error": str(e)}

# Summary
elapsed = time.time() - start_time
print(f"\nCompleted in {elapsed:.1f}s ({elapsed/len(image_files):.2f}s per image)")
print(f"Successful: {sum(1 for r in results.values() if 'error' not in r)}")
```
## Output Formats

### Markdown Format
Human-readable format with standard Markdown syntax. Best for documentation and web publishing.
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Example output:
# # Document Title
#
# This is the main content with **bold** and *italic* text.
#
# ## Section 1
#
# - Bullet point 1
# - Bullet point 2
#
# | Column 1 | Column 2 |
# |----------|----------|
# | Cell 1 | Cell 2 |
Advantages:

- Human-readable
- Git-friendly (version control)
- Easy to edit
- Good for documentation

Limitations:

- Loses some layout information
- Tables converted to Markdown tables (may lose formatting)
- No bounding box information
### HTML Format
Structured HTML with semantic tags. Better for preserving layout in web contexts.
result = extractor.extract(image, output_format="html")
print(result.content)
# Example output:
# <div class="document">
# <h1>Document Title</h1>
# <p>This is the main content with <b>bold</b> and <i>italic</i> text.</p>
# <h2>Section 1</h2>
# <ul>
# <li>Bullet point 1</li>
# <li>Bullet point 2</li>
# </ul>
# <table>...</table>
# </div>
Advantages:

- Structured and semantic
- Better layout preservation
- Good for web rendering
- Supports nested elements

Limitations:

- More verbose
- Requires an HTML parser for processing (see the sketch below)
- Layout information may still be approximate
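When you do need to post-process the HTML, any HTML parser will work; here is a minimal sketch using BeautifulSoup (this assumes `beautifulsoup4` is installed and that `result.content` holds the HTML extracted above):

```python
from bs4 import BeautifulSoup

# Parse the extracted HTML and pull out headings and tables
soup = BeautifulSoup(result.content, "html.parser")

for heading in soup.find_all(["h1", "h2", "h3"]):
    print(heading.get_text(strip=True))

tables = soup.find_all("table")
print(f"Found {len(tables)} table(s)")
```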
### Plain Text (Fallback)
Extract plain text without any formatting.
```python
# Get the plain text version (available as a property on any extraction result)
plain_text = result.plain_text
print(plain_text)  # No formatting, just raw text
```
### DotsOCR JSON Format
Structured JSON with layout information (DotsOCR only).
result = extractor.extract(image, output_format="json", include_layout=True)
# Result includes:
# {
# "content": "Full text...",
# "layout": [
# {
# "bbox": [100, 50, 400, 80],
# "category": "Title",
# "text": "Document Title"
# },
# ...
# ]
# }
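The `layout` list can then be filtered or re-ordered downstream. A small sketch that sorts elements top-to-bottom by bounding box and keeps only tables (the `"Table"` label is an assumption about DotsOCR's category names, by analogy with the `"Title"` example above):

```python
# Sort layout elements top-to-bottom, then left-to-right (bbox = [x1, y1, x2, y2])
ordered = sorted(result.layout, key=lambda el: (el.bbox[1], el.bbox[0]))

# Keep only table elements; "Table" is an assumed category label
tables = [el for el in ordered if el.category == "Table"]
for table in tables:
    print(table.text)
```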
## Advanced Features

### Custom Prompts
Override the default extraction prompt for specialized use cases.
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

image = Image.open("document.png")

config = QwenTextPyTorchConfig(device="cuda")
extractor = QwenTextExtractor(backend=config)

# Custom prompt for extractive summarization
custom_prompt = """
Extract the most important information from this document image.
Focus on key facts, numbers, and action items.
Format as a concise Markdown list.
"""

result = extractor.extract(
    image,
    output_format="markdown",
    custom_prompt=custom_prompt,
)
print(result.content)
```
### Temperature Control (PyTorch only)

Adjust model creativity/determinism via the `temperature` parameter.
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Lower temperature = more deterministic (better for factual extraction)
config = QwenTextPyTorchConfig(
    device="cuda",
    temperature=0.1,  # Default: 0.1 (deterministic)
)

# Higher temperature = more creative (for summarization, etc.)
config_creative = QwenTextPyTorchConfig(
    device="cuda",
    temperature=0.7,
)
```
### Backend Switching
Easily switch between backends without changing extraction code.
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import (
    QwenTextPyTorchConfig,
    QwenTextVLLMConfig,
    QwenTextMLXConfig,
    QwenTextAPIConfig,
)
from PIL import Image

image = Image.open("document.png")

# Use PyTorch for single-GPU inference
pytorch_extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(device="cuda")
)
result1 = pytorch_extractor.extract(image, output_format="markdown")

# Use VLLM for high-throughput inference
vllm_extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        tensor_parallel_size=1,
    )
)
result2 = vllm_extractor.extract(image, output_format="markdown")

# Use MLX for Apple Silicon
mlx_extractor = QwenTextExtractor(
    backend=QwenTextMLXConfig(device="gpu")
)
result3 = mlx_extractor.extract(image, output_format="markdown")

# Use API for hosted models
api_extractor = QwenTextExtractor(
    backend=QwenTextAPIConfig(
        model="qwen3-vl-8b",
        api_key="your-api-key",
        base_url="https://api.example.com/v1",
    )
)
result4 = api_extractor.extract(image, output_format="markdown")

print(f"PyTorch: {result1.word_count} words")
print(f"VLLM: {result2.word_count} words")
print(f"MLX: {result3.word_count} words")
print(f"API: {result4.word_count} words")
```
## Performance Optimization

### Model Selection
| Model | Latency | Quality | VRAM | Speed |
|---|---|---|---|---|
| Qwen3-VL-8B | 2-3 sec | Excellent | 16GB | Fast |
| Qwen3-VL-32B | 5-8 sec | Outstanding | 32GB | Slow |
| DotsOCR | 3-5 sec | Very Good (technical) | 20GB | Medium |
Recommendation: Start with Qwen3-VL-8B (best quality/speed tradeoff).
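If you want to pick a variant programmatically, one approach is to check available VRAM against the figures in the table above; a rough sketch using PyTorch (the 32 GB threshold is an assumption taken from the VRAM column, not a hard requirement):

```python
import torch

from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Choose a model variant based on the GPU's total memory (rough heuristic)
vram_gb = (
    torch.cuda.get_device_properties(0).total_memory / 1e9
    if torch.cuda.is_available()
    else 0
)
model = "Qwen/Qwen3-VL-32B-Instruct" if vram_gb >= 32 else "Qwen/Qwen3-VL-8B-Instruct"

config = QwenTextPyTorchConfig(model=model, device="cuda" if vram_gb else "cpu")
print(f"Selected {model} (detected ~{vram_gb:.0f} GB of VRAM)")
```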
### Backend Optimization

PyTorch (Single GPU):

- Best for development and small batches
- Load time: ~2-3 seconds
- Per-image latency: ~2-3 seconds
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    torch_dtype="auto",   # Let PyTorch choose optimal dtype
    max_new_tokens=4096,  # Reduce for faster inference
)
```
VLLM (Multi-GPU):

- Best for batch processing / high throughput
- Load time: ~5-8 seconds (slower to start, but the cost amortizes over many requests)
- Throughput: 2-4x better than PyTorch for multiple requests
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,       # Use 2 GPUs
    gpu_memory_utilization=0.9,   # Use 90% of VRAM
    max_tokens=4096,
)
```
MLX (Apple Silicon):

- Best for MacBook development
- No CUDA/VRAM management needed (unified memory)
- Slightly slower than the dedicated-GPU backends
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig

config = QwenTextMLXConfig(
    model="Qwen/Qwen3-VL-8B-Instruct-MLX",
    device="gpu",
    quantization="4bit",  # Quantization reduces memory use
)
```
### Batch Processing Strategy

When processing many documents, initialize the extractor once and reuse it so the model-loading cost is amortized across all requests.
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
from pathlib import Path
from PIL import Image
import time

# Initialize once (expensive)
config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    max_tokens=4096,
)
extractor = QwenTextExtractor(backend=config)

# Process many documents (cheap)
image_paths = list(Path("documents/").glob("*.png"))
results = []

start = time.time()
for image_path in image_paths:
    image = Image.open(image_path)
    result = extractor.extract(image, output_format="markdown")
    results.append(result)
elapsed = time.time() - start

print(f"Processed {len(results)} images in {elapsed:.1f}s")
print(f"Average: {elapsed/len(results):.2f}s per image")
```
### Token Limit Tuning

Adjust `max_new_tokens` based on expected output length.
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# For short documents (< 1000 words)
config_short = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=2048,  # Faster
)

# For medium documents (1000-5000 words)
config_medium = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=4096,  # Default
)

# For long documents (> 5000 words)
config_long = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=8192,  # Slower but handles longer docs
)
```
## Troubleshooting

### Out of Memory (OOM) Errors
Problem: CUDA out of memory during inference.
Solutions:

1. Reduce `max_new_tokens`
2. Use a smaller model variant (8B instead of 32B)
3. Switch to VLLM with `tensor_parallel_size` > 1
4. Use quantization where the backend supports it (see the sketch after the options below)
```python
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig, QwenTextVLLMConfig

# Option 1: Reduce max_new_tokens
config = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=2048,  # Reduced from 4096
)

# Option 2: Smaller model
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",  # Instead of 32B
    device="cuda",
)

# Option 3: VLLM with tensor parallelism
config = QwenTextVLLMConfig(
    tensor_parallel_size=2,  # Distribute across 2 GPUs
    max_tokens=4096,
)
```
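For option 4, quantization support depends on the backend: of the configurations shown in this guide, only the MLX config exposes a `quantization` flag, so it mainly applies on Apple Silicon (a sketch under that assumption):

```python
# Option 4: Quantized MLX backend (Apple Silicon only; see Backend Optimization above)
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig

config = QwenTextMLXConfig(
    model="Qwen/Qwen3-VL-8B-Instruct-MLX",
    device="gpu",
    quantization="4bit",  # Reduces memory use
)
```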
### Slow Inference
Problem: Text extraction takes too long.
Solutions:

1. Check GPU utilization (should be >80%)
2. Reduce `max_new_tokens`
3. Use VLLM instead of PyTorch
4. Use VLLM tensor parallelism
```python
import subprocess

# Check GPU usage during extraction (nounits keeps the output numeric)
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True,
    text=True,
)
print(f"GPU Utilization: {result.stdout.strip()}%")

# If <50%, increase batch size or use VLLM
```
### Incorrect or Garbled Output
Problem: Extracted text is incomplete or corrupted.
Solutions:

1. Check image quality (min 1024px width recommended)
2. Verify the model downloaded correctly
3. Try an explicit output format
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

image = Image.open("document.png")

# Check image size
print(f"Image size: {image.size}")  # Should be at least (1024, 768)

# Resize if too small
if image.width < 1024:
    image = image.resize((image.width * 2, image.height * 2))

# Try extraction
config = QwenTextPyTorchConfig(device="cuda")
extractor = QwenTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")

# Check result
if len(result.content) < 10:
    print("Warning: Very short output, may indicate extraction failure")
    print(f"Raw output: {result.raw_output}")
```
### Model Download Issues
Problem: Model fails to download or load.
Solutions:

1. Check your internet connection
2. Verify your HuggingFace token
3. Set a custom cache directory
```python
import os

# Set HuggingFace token
os.environ["HF_TOKEN"] = "your-token-here"

# Set custom cache directory
os.environ["HF_HOME"] = "/large/disk/hf_cache"

# Verify access by loading the tokenizer explicitly
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-VL-8B-Instruct"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"✓ Tokenizer for {model_id} loaded successfully")
except Exception as e:
    print(f"✗ Failed to load model: {e}")
```
### API Backend Timeouts
Problem: API requests timeout or fail.
Solutions:

1. Increase the timeout value
2. Check API credentials
3. Reduce batch size
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig

config = QwenTextAPIConfig(
    model="qwen3-vl-8b",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    timeout=60,    # Increase timeout
    rate_limit=5,  # Reduce concurrent requests
)
extractor = QwenTextExtractor(backend=config)
```
Next Steps:

- See the Batch Processing Guide for processing many documents
- See the Deployment Guide for scaling on Modal
- See the Layout Analysis Guide for structure-aware extraction