Text Extraction Guide

Extract formatted text content (Markdown/HTML) from document images using vision-language models. This guide covers when to use text extraction, available models, output formats, and practical examples.

Table of Contents

- Quick Comparison
- Available Models
- Basic Usage
- Output Formats
- Advanced Features
- Performance Optimization
- Troubleshooting

Quick Comparison

| Feature | Text Extraction | OCR | Layout Detection |
|---------|-----------------|-----|------------------|
| Output | Formatted text (MD/HTML) | Text + bounding boxes | Element bounding boxes |
| Use Case | Document parsing, markdown export | Word/character localization | Document structure analysis |
| Models | Qwen3-VL, DotsOCR, Nanonets | Tesseract, EasyOCR, PaddleOCR | DocLayoutYOLO, Qwen-Layout |
| Latency | ~2-5 sec per page | ~1-2 sec per page | ~0.5-1 sec per page |
| Output Type | Single string | List of text blocks | List of bounding boxes |
| Layout Info | Optional (DotsOCR only) | No | Yes (with labels) |

Choose Text Extraction when:

- Converting documents to Markdown/HTML
- Extracting complete page content as formatted text
- Working with complex documents (multi-column, figures, tables)
- You need readable output for downstream processing

Choose OCR when:

- You need precise character/word locations
- Building re-OCR pipelines (e.g., for correction)
- Requiring character-level accuracy metrics

Choose Layout Detection when:

- You need document structure without text content
- Building advanced pipelines (layout + text)
- Analyzing document semantics

Available Models

1. Qwen3-VL (Recommended)

High-quality general-purpose vision-language model.

Strengths:

- Best output quality across diverse documents
- Multi-backend support (PyTorch, VLLM, MLX, API)
- Consistent Markdown/HTML output
- Good at handling complex layouts

Backends:

- PyTorch: Local GPU inference (single GPU)
- VLLM: High-throughput serving (multiple GPUs)
- MLX: Apple Silicon (local)
- API: Hosted models (cloud)

Model Variants:

- Qwen/Qwen3-VL-8B-Instruct: Recommended (8B parameters)
- Qwen/Qwen3-VL-32B-Instruct: Higher quality (32B, slower, more VRAM)
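
Switching variants is a one-line config change. A minimal sketch using the PyTorch backend config shown in the examples below (VRAM figures per the table in Performance Optimization):

from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Recommended default: best quality/speed tradeoff (~16GB VRAM)
config_8b = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)

# Higher quality, slower, roughly ~32GB VRAM
config_32b = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-32B-Instruct",
    device="cuda",
)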

2. DotsOCR (Best for technical documents)

Optimized for complex technical documents with precise layout preservation.

Strengths:

- Layout-aware extraction with bounding boxes
- Specialized formatting for tables (HTML) and formulas (LaTeX)
- Reading order preservation
- 11-category layout detection

Weaknesses:

- Slower than Qwen (requires layout analysis)
- Higher VRAM requirements

Backends:

- PyTorch: Local GPU inference
- VLLM: High-throughput serving
- API: Hosted models

Output Types:

- Structured JSON with layout information
- Markdown with coordinate annotations
- HTML with bbox attributes
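
All three output types come from the same extract call. A minimal sketch, assuming the DotsOCRTextExtractor setup from Example 2 below and that output_format accepts the same values used elsewhere in this guide:

# Reuses `extractor` and `image` from Example 2 below
md_result = extractor.extract(image, output_format="markdown", include_layout=True)
html_result = extractor.extract(image, output_format="html", include_layout=True)
json_result = extractor.extract(image, output_format="json", include_layout=True)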

3. Nanonets (Coming soon)

Specialized for OCR-quality text extraction.

Basic Usage

Example 1: Simple Markdown Extraction

Extract a document page to Markdown using PyTorch backend.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

# Load a single image
image = Image.open("document_page.png")

# Initialize extractor with PyTorch backend
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",  # or "cpu"
    torch_dtype="auto",  # Automatic dtype selection
)
extractor = QwenTextExtractor(backend=config)

# Extract text in Markdown format
result = extractor.extract(image, output_format="markdown")

# Access the extracted content
print(result.content)  # Formatted Markdown text
print(result.word_count)  # Number of words
print(f"Model: {result.model_name}")

Example 2: Extract with Layout Information

Use DotsOCR to get text plus layout annotations.

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
from omnidocs.tasks.text_extraction.dotsocr import DotsOCRPyTorchConfig
from PIL import Image
import json

image = Image.open("complex_document.png")

# Initialize DotsOCR with layout detection
config = DotsOCRPyTorchConfig(
    device="cuda",
    max_new_tokens=8192,  # Higher for complex documents
)
extractor = DotsOCRTextExtractor(backend=config)

# Extract with layout information
result = extractor.extract(image, include_layout=True)

# Access layout elements
print(f"Found {result.num_layout_elements} layout elements")
print(f"Content length: {result.content_length} characters")

# Iterate through layout elements
for element in result.layout:
    print(f"[{element.category}] @{element.bbox}: {element.text[:50]}...")

# Save layout information to JSON
layout_json = [elem.model_dump() for elem in result.layout]
with open("layout.json", "w") as f:
    json.dump(layout_json, f, indent=2)

Example 3: Extract PDF Document

Process multiple pages of a PDF document.

from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from pathlib import Path

# Load PDF document
doc = Document.from_pdf("multi_page_document.pdf")
print(f"Loaded PDF with {doc.page_count} pages")

# Initialize extractor
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)
extractor = QwenTextExtractor(backend=config)

# Extract text from all pages
all_text = []
for page_idx in range(min(3, doc.page_count)):  # First 3 pages
    page_image = doc.get_page(page_idx)
    result = extractor.extract(page_image, output_format="markdown")
    all_text.append(result.content)
    print(f"Page {page_idx + 1}: {result.word_count} words")

# Combine results
full_document = "\n\n---\n\n".join(all_text)
print(f"\nTotal content: {len(full_document)} characters")

# Save to file
with open("extracted_document.md", "w") as f:
    f.write(full_document)

Example 4: Batch Processing with Progress Tracking

Process multiple documents with progress reporting.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from pathlib import Path
from PIL import Image
import time

# Find all image files
image_dir = Path("documents/")
image_files = list(image_dir.glob("*.png")) + list(image_dir.glob("*.jpg"))
print(f"Found {len(image_files)} images to process")

# Initialize extractor
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    max_new_tokens=4096,
)
extractor = QwenTextExtractor(backend=config)

# Process with progress tracking
results = {}
start_time = time.time()

for idx, image_path in enumerate(image_files, 1):
    print(f"[{idx}/{len(image_files)}] Processing {image_path.name}...", end=" ")

    try:
        image = Image.open(image_path)
        result = extractor.extract(image, output_format="markdown")
        results[str(image_path)] = {
            "content_length": result.content_length,
            "word_count": result.word_count,
        }
        print(f"✓ ({result.word_count} words)")
    except Exception as e:
        print(f"✗ Error: {e}")
        results[str(image_path)] = {"error": str(e)}

# Summary
elapsed = time.time() - start_time
print(f"\nCompleted in {elapsed:.1f}s ({elapsed/len(image_files):.2f}s per image)")
print(f"Successful: {sum(1 for r in results.values() if 'error' not in r)}")

Output Formats

Markdown Format

Human-readable format with standard Markdown syntax. Best for documentation and web publishing.

result = extractor.extract(image, output_format="markdown")
print(result.content)

# Example output:
# # Document Title
#
# This is the main content with **bold** and *italic* text.
#
# ## Section 1
#
# - Bullet point 1
# - Bullet point 2
#
# | Column 1 | Column 2 |
# |----------|----------|
# | Cell 1   | Cell 2   |

Advantages:

- Human-readable
- Git-friendly (version control)
- Easy to edit
- Good for documentation

Limitations:

- Loses some layout information
- Tables converted to Markdown tables (may lose formatting)
- No bounding box information

HTML Format

Structured HTML with semantic tags. Better for preserving layout in web contexts.

result = extractor.extract(image, output_format="html")
print(result.content)

# Example output:
# <div class="document">
#   <h1>Document Title</h1>
#   <p>This is the main content with <b>bold</b> and <i>italic</i> text.</p>
#   <h2>Section 1</h2>
#   <ul>
#     <li>Bullet point 1</li>
#     <li>Bullet point 2</li>
#   </ul>
#   <table>...</table>
# </div>

Advantages:

- Structured and semantic
- Better layout preservation
- Good for web rendering
- Supports nested elements

Limitations:

- More verbose
- Requires an HTML parser for processing
- Layout information may still be approximate
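
The HTML output can be consumed with any HTML parser. A minimal sketch using the standard library to pull out headings (assumes the semantic tags shown in the example above):

from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text inside <h1>/<h2> tags of the extracted HTML."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings.append(data.strip())

parser = HeadingCollector()
parser.feed(result.content)
print(parser.headings)  # e.g. ['Document Title', 'Section 1']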

Plain Text (Fallback)

Extract plain text without any formatting.

# After any extraction, the plain_text property returns the content
# with all formatting stripped
plain_text = result.plain_text
print(plain_text)  # No formatting, just raw text

DotsOCR JSON Format

Structured JSON with layout information (DotsOCR only).

result = extractor.extract(image, output_format="json", include_layout=True)

# Result includes:
# {
#   "content": "Full text...",
#   "layout": [
#     {
#       "bbox": [100, 50, 400, 80],
#       "category": "Title",
#       "text": "Document Title"
#     },
#     ...
#   ]
# }
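
A common way to consume this structure is filtering elements by category. A short sketch (the "Title" category comes from the example above; the full 11-category set is not enumerated in this guide):

# Collect title elements and their positions
titles = [(el.bbox, el.text) for el in result.layout if el.category == "Title"]
for bbox, text in titles:
    print(f"Title at {bbox}: {text}")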

Advanced Features

Custom Prompts

Override the default extraction prompt for specialized use cases.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

image = Image.open("document.png")
config = QwenTextPyTorchConfig(device="cuda")
extractor = QwenTextExtractor(backend=config)

# Custom prompt for extractive summarization
custom_prompt = """
Extract the most important information from this document image.
Focus on key facts, numbers, and action items.
Format as a concise Markdown list.
"""

result = extractor.extract(
    image,
    output_format="markdown",
    custom_prompt=custom_prompt,
)

print(result.content)

Temperature Control (PyTorch only)

Adjust model creativity/determinism via the temperature parameter.

from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Lower temperature = more deterministic (better for factual extraction)
config = QwenTextPyTorchConfig(
    device="cuda",
    temperature=0.1,  # Default: 0.1 (deterministic)
)

# Higher temperature = more creative (for summarization, etc.)
config_creative = QwenTextPyTorchConfig(
    device="cuda",
    temperature=0.7,
)

Backend Switching

Easily switch between backends without changing extraction code.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import (
    QwenTextPyTorchConfig,
    QwenTextVLLMConfig,
    QwenTextMLXConfig,
    QwenTextAPIConfig,
)
from PIL import Image

image = Image.open("document.png")

# Use PyTorch for single-GPU inference
pytorch_extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(device="cuda")
)
result1 = pytorch_extractor.extract(image, output_format="markdown")

# Use VLLM for high-throughput inference
vllm_extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(
        model="Qwen/Qwen3-VL-8B-Instruct",
        tensor_parallel_size=1,
    )
)
result2 = vllm_extractor.extract(image, output_format="markdown")

# Use MLX for Apple Silicon
mlx_extractor = QwenTextExtractor(
    backend=QwenTextMLXConfig(device="gpu")
)
result3 = mlx_extractor.extract(image, output_format="markdown")

# Use API for hosted models
api_extractor = QwenTextExtractor(
    backend=QwenTextAPIConfig(
        model="qwen3-vl-8b",
        api_key="your-api-key",
        base_url="https://api.example.com/v1",
    )
)
result4 = api_extractor.extract(image, output_format="markdown")

print(f"PyTorch: {result1.word_count} words")
print(f"VLLM: {result2.word_count} words")
print(f"MLX: {result3.word_count} words")
print(f"API: {result4.word_count} words")

Performance Optimization

Model Selection

| Model | Latency | Quality | VRAM | Speed |
|-------|---------|---------|------|-------|
| Qwen3-VL-8B | 2-3 sec | Excellent | 16GB | Fast |
| Qwen3-VL-32B | 5-8 sec | Outstanding | 32GB | Slow |
| DotsOCR | 3-5 sec | Very Good (technical) | 20GB | Medium |

Recommendation: Start with Qwen3-VL-8B (best quality/speed tradeoff).
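
If you want to encode this recommendation in code, one option is to key the variant off available VRAM. A hypothetical helper (pick_model is not part of omnidocs; the thresholds follow the table above):

import torch

def pick_model() -> str:
    """Pick a Qwen3-VL variant based on available GPU memory."""
    if not torch.cuda.is_available():
        return "Qwen/Qwen3-VL-8B-Instruct"  # CPU / non-CUDA fallback
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 32:
        return "Qwen/Qwen3-VL-32B-Instruct"  # Outstanding quality, slower
    return "Qwen/Qwen3-VL-8B-Instruct"  # Best quality/speed tradeoff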

Backend Optimization

PyTorch (Single GPU):

- Best for development and small batches
- Load time: ~2-3 seconds
- Per-image latency: ~2-3 seconds

from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    torch_dtype="auto",  # Let PyTorch choose optimal dtype
    max_new_tokens=4096,  # Reduce for faster inference
)

VLLM (Multi-GPU):

- Best for batch processing / high throughput
- Load time: ~5-8 seconds (slower to start, but the cost amortizes across many requests)
- Throughput: 2-4x better than PyTorch for multiple requests

from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,  # Use 90% of VRAM
    max_tokens=4096,
)

MLX (Apple Silicon):

- Best for MacBook development
- No CUDA setup required
- Slower than the CUDA backends

from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig

config = QwenTextMLXConfig(
    model="Qwen/Qwen3-VL-8B-Instruct-MLX",
    device="gpu",
    quantization="4bit",  # 4-bit quantization reduces memory usage
)

Batch Processing Strategy

For processing many documents, initialize the extractor once and reuse it so the model-loading cost is amortized across the batch.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
from pathlib import Path
from PIL import Image
import time

# Initialize once (expensive)
config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    max_tokens=4096,
)
extractor = QwenTextExtractor(backend=config)

# Process many documents (cheap)
image_paths = list(Path("documents/").glob("*.png"))
results = []

start = time.time()
for image_path in image_paths:
    image = Image.open(image_path)
    result = extractor.extract(image, output_format="markdown")
    results.append(result)

elapsed = time.time() - start
print(f"Processed {len(results)} images in {elapsed:.1f}s")
print(f"Average: {elapsed/len(results):.2f}s per image")

Token Limit Tuning

Adjust max_new_tokens based on expected output length.

from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# For short documents (< 1000 words)
config_short = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=2048,  # Faster
)

# For medium documents (1000-5000 words)
config_medium = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=4096,  # Default
)

# For long documents (> 5000 words)
config_long = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=8192,  # Slower but handles longer docs
)
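
To pick a limit programmatically, a rough heuristic is ~1.3 tokens per English word plus headroom for Markdown syntax (an assumption, not an omnidocs API; tokens_for is a hypothetical helper):

def tokens_for(expected_words: int) -> int:
    """Estimate max_new_tokens from an expected word count."""
    estimate = int(expected_words * 1.3 * 1.2)  # ~1.3 tokens/word plus 20% headroom
    for limit in (2048, 4096, 8192):
        if estimate <= limit:
            return limit
    return 16384

config = QwenTextPyTorchConfig(device="cuda", max_new_tokens=tokens_for(3000))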

Troubleshooting

Out of Memory (OOM) Errors

Problem: CUDA out of memory during inference.

Solutions:

1. Reduce max_new_tokens
2. Use a smaller model variant (8B instead of 32B)
3. Switch to VLLM with tensor_parallel_size > 1
4. Use quantization (if available)

# Option 1: Reduce max_new_tokens
config = QwenTextPyTorchConfig(
    device="cuda",
    max_new_tokens=2048,  # Reduced from 4096
)

# Option 2: Smaller model
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",  # Instead of 32B
    device="cuda",
)

# Option 3: VLLM with tensor parallelism
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,  # Distribute across 2 GPUs
    max_tokens=4096,
)

Slow Inference

Problem: Text extraction takes too long.

Solutions:

1. Check GPU utilization (should be >80%)
2. Reduce max_new_tokens
3. Use VLLM instead of PyTorch
4. Enable VLLM tensor parallelism

import subprocess

# Check GPU usage during extraction
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
    capture_output=True,
    text=True
)
print(f"GPU Utilization: {result.stdout.strip()}%")

# If <50%, increase batch size or use VLLM

Incorrect or Garbled Output

Problem: Extracted text is incomplete or corrupted.

Solutions:

1. Check image quality (min 1024px width recommended)
2. Verify the model downloaded correctly
3. Try again with an explicit output format

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image

image = Image.open("document.png")

# Check image size
print(f"Image size: {image.size}")  # Should be at least (1024, 768)

# Upscale if too small (LANCZOS preserves text edges better than the default filter)
if image.width < 1024:
    image = image.resize((image.width * 2, image.height * 2), Image.LANCZOS)

# Try extraction
config = QwenTextPyTorchConfig(device="cuda")
extractor = QwenTextExtractor(backend=config)
result = extractor.extract(image, output_format="markdown")

# Check result
if len(result.content) < 10:
    print("Warning: Very short output, may indicate extraction failure")
    print(f"Raw output: {result.raw_output}")

Model Download Issues

Problem: Model fails to download or load.

Solutions:

1. Check your internet connection
2. Verify your HuggingFace token
3. Set a custom cache directory

import os

# Set HuggingFace token
os.environ["HF_TOKEN"] = "your-token-here"

# Set custom cache directory
os.environ["HF_HOME"] = "/large/disk/hf_cache"

# Verify download by loading model explicitly
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-VL-8B-Instruct"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"✓ Model {model_id} loaded successfully")
except Exception as e:
    print(f"✗ Failed to load model: {e}")

API Backend Timeouts

Problem: API requests timeout or fail.

Solutions:

1. Increase the timeout value
2. Check your API credentials
3. Reduce the batch size

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig

config = QwenTextAPIConfig(
    model="qwen3-vl-8b",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    timeout=60,  # Increase timeout
    rate_limit=5,  # Reduce concurrent requests
)
extractor = QwenTextExtractor(backend=config)

Next Steps:

- See Batch Processing Guide for processing many documents
- See Deployment Guide for scaling on Modal
- See Layout Analysis Guide for structure-aware extraction