Text Extraction

Convert document images to readable, formatted text.


Input / Output

Input: Document image (PNG, JPG) or PDF page

Output: Formatted text (Markdown or HTML)

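# Assumes `extractor` and `image` are already set up (see Quick Start)
result = extractor.extract(image, output_format="markdown")
print(result.content)
# Example output (Markdown):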
# Document Title

This is the first paragraph with **bold** and *italic* text.

## Section 1

- Bullet point 1
- Bullet point 2

| Column A | Column B |
|----------|----------|
| Data 1   | Data 2   |

Quick Start

from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenPyTorchConfig

# Load document
doc = Document.from_pdf("document.pdf")

# Initialize extractor
extractor = QwenTextExtractor(
    backend=QwenPyTorchConfig(device="cuda")
)

# Extract
result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)
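
The Quick Start handles a single page. For a multi-page document the same call can be repeated per page. The sketch below relies only on what the snippet above shows: an `extract(page, output_format=...)` method whose result has a `.content` attribute. The helper name `extract_pages` and the blank-line separator are illustrative assumptions, not part of omnidocs.

```python
from typing import Any, Iterable


def extract_pages(extractor: Any, pages: Iterable[Any],
                  output_format: str = "markdown",
                  separator: str = "\n\n") -> str:
    """Run extractor.extract() on each page and join the text in page order."""
    parts = []
    for page in pages:
        # Same call as in the Quick Start, repeated for every page.
        result = extractor.extract(page, output_format=output_format)
        parts.append(result.content.strip())
    return separator.join(parts)
```

With the Quick Start objects this could be called as `extract_pages(extractor, [doc.get_page(0), doc.get_page(1)])`, provided the document has at least two pages.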

Output Formats

# Markdown (default)
result = extractor.extract(image, output_format="markdown")

# HTML
result = extractor.extract(image, output_format="html")
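
Since `result.content` is a plain string in either format, it is often written straight to disk. The following is a minimal sketch using only the standard library; the helper `save_content` and the suffix mapping are assumptions made for illustration and are not provided by omnidocs.

```python
from pathlib import Path

# Assumed mapping from the two documented output_format values to file suffixes.
SUFFIXES = {"markdown": ".md", "html": ".html"}


def save_content(content: str, output_format: str, out_dir: Path,
                 stem: str = "page") -> Path:
    """Write extracted text to out_dir/<stem><suffix> and return the path."""
    if output_format not in SUFFIXES:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stem}{SUFFIXES[output_format]}"
    path.write_text(content, encoding="utf-8")
    return path
```

A call such as `save_content(result.content, "markdown", Path("out"))` would then produce `out/page.md`.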

Available Models

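| Model         | Speed      | Quality          | Best For                                   |
|---------------|------------|------------------|--------------------------------------------|
| VLM API       | Varies     | Depends on model | No GPU, any cloud provider, custom prompts |
| Qwen          | 2-3s/page  | Excellent        | General purpose                            |
| DotsOCR       | 3-5s/page  | Very Good        | Technical docs, layout-aware               |
| Nanonets OCR2 | 2-4s/page  | Excellent        | Document digitization, invoices            |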

When to Use

✅ Converting PDFs to Markdown
✅ Extracting article content
✅ Document parsing for RAG pipelines
✅ Content migration

❌ Need word coordinates → Use OCR
❌ Need structure only → Use Layout Analysis
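
The guidance above amounts to a small decision rule, sketched here as a plain function. The function name, its parameters, and the choice to check word coordinates first when both flags are set are illustrative assumptions. The returned strings are just the task names used on this page.

```python
def recommend_task(needs_word_coordinates: bool = False,
                   needs_structure_only: bool = False) -> str:
    """Map the two 'do not use' cases to their suggested alternative task."""
    if needs_word_coordinates:
        # Word-level positions are not part of formatted text output.
        return "OCR"
    if needs_structure_only:
        # Only the page structure is wanted, not the text itself.
        return "Layout Analysis"
    return "Text Extraction"
```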