Qwen3-VL Text Extraction¶
Model Overview¶
Qwen3-VL is an advanced Vision-Language Model optimized for document understanding and text extraction. It excels at producing high-quality markdown and HTML output while maintaining document layout and semantic structure.
Model Family: Qwen3-VL-2B, Qwen3-VL-4B, Qwen3-VL-8B, Qwen3-VL-32B
Repository: Qwen/Qwen3-VL
Recommended Variant: Qwen3-VL-8B-Instruct (best balance of quality and speed)
Key Capabilities¶
- Multi-format Output: Markdown, HTML, or custom formats
- Layout-Aware: Preserves document structure and semantic relationships
- Multilingual: Supports 25+ languages with native-level output quality
- Document Types: PDFs, academic papers, technical docs, web pages, presentations
- Scale Support: Handles everything from single-page images to documents requiring 16k+ output tokens
- Custom Prompts: Flexible prompt engineering for specialized extraction tasks
Limitations¶
- Requires GPU for inference (2B variant: 4GB VRAM, 8B: 16GB, 32B: 40GB+)
- Slower than single-task models (100-300 tokens/sec depending on backend)
- Can struggle with highly stylized or unusual layouts
- No built-in language detection (specify the language via config or prompt if needed; see the sketch below)
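Since there is no automatic language detection, one simple workaround is to state the document language in a custom prompt. A minimal sketch, assuming the default model; the prompt wording and file name are illustrative:
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from PIL import Image
extractor = QwenTextExtractor(backend=QwenTextPyTorchConfig(device="cuda"))
image = Image.open("german_invoice.png")  # hypothetical input file
# State the language explicitly so the model does not have to guess
result = extractor.extract(
    image,
    output_format="markdown",
    custom_prompt="The document is in German. Extract all text as markdown, keeping the original language.",
)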
Supported Backends¶
Qwen3-VL supports 4 inference backends, allowing you to choose the right deployment method:
| Backend | Use Case | Performance | Setup |
|---|---|---|---|
| PyTorch | Local GPU inference | 50-150 tokens/sec | Easy, single GPU |
| VLLM | High-throughput batching | 200-400 tokens/sec | Requires GPU cluster |
| MLX | Apple Silicon (native) | 20-50 tokens/sec | macOS M1/M2/M3+ only |
| API | Hosted inference | Variable | Cloud provider |
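All four backends drive the same QwenTextExtractor, so a deployment can pick a config at runtime. A minimal sketch of hardware-based selection (the selection logic is illustrative; the config classes themselves are documented below):
import platform
import torch
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig, QwenTextMLXConfig
# Prefer CUDA; fall back to MLX on Apple Silicon, then to CPU
if torch.cuda.is_available():
    config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct", device="cuda")
elif platform.system() == "Darwin" and platform.machine() == "arm64":
    config = QwenTextMLXConfig(model="Qwen/Qwen3-VL-8B-Instruct", quantization="4bit")
else:
    config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-2B-Instruct", device="cpu")
extractor = QwenTextExtractor(backend=config)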
Installation & Configuration¶
Basic Installation¶
# Install with PyTorch backend (most common)
pip install omnidocs[pytorch]
# Or install with VLLM for high throughput
pip install omnidocs[vllm]
# Or install with all backends
pip install omnidocs[all]
PyTorch Backend Configuration¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
device="cuda",
torch_dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
use_flash_attention=False, # Set to True if flash-attn installed
max_new_tokens=8192,
temperature=0.1,
)
extractor = QwenTextExtractor(backend=config)
PyTorch Config Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "Qwen/Qwen3-VL-8B-Instruct" | HuggingFace model ID |
| device | str | "cuda" | Device: "cuda", "mps", "cpu" |
| torch_dtype | str | "auto" | Data type: "float16", "bfloat16", "float32", "auto" |
| device_map | str | "auto" | Model parallelism: "auto", "balanced", "sequential", None |
| trust_remote_code | bool | True | Allow custom model code from HuggingFace |
| use_flash_attention | bool | False | Use Flash Attention 2 (faster; requires flash-attn) |
| max_new_tokens | int | 8192 | Max tokens to generate (256-32768) |
| temperature | float | 0.1 | Sampling temperature (0.0-2.0; lower = more deterministic) |
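On smaller GPUs the same parameters can be tuned down. A memory-lean configuration sketch (values are illustrative starting points, not benchmarked recommendations):
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-4B-Instruct",  # smaller variant (8 GB min VRAM)
    device="cuda",
    torch_dtype="float16",  # half precision cuts weight memory roughly in half
    max_new_tokens=4096,    # shorter outputs limit KV-cache growth
)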
VLLM Backend Configuration¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
tensor_parallel_size=1, # Use 2+ for large models
gpu_memory_utilization=0.9,
max_model_len=8192,
)
extractor = QwenTextExtractor(backend=config)
VLLM Config Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | Required | HuggingFace model ID |
| tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism |
| gpu_memory_utilization | float | 0.9 | Fraction of GPU memory to use (0.1-1.0) |
| max_model_len | int | None | Max context length in tokens |
MLX Backend Configuration (Apple Silicon)¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig
config = QwenTextMLXConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
quantization="4bit", # or "8bit", "none"
max_tokens=8192,
)
extractor = QwenTextExtractor(backend=config)
API Backend Configuration¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
config = QwenTextAPIConfig(
model="qwen3-vl-8b",
api_key="your-api-key",
base_url="https://api.provider.com/v1",
rate_limit=10, # Requests per second
)
extractor = QwenTextExtractor(backend=config)
Usage Examples¶
Basic Text Extraction (Markdown)¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from omnidocs import Document
from PIL import Image
# Initialize extractor
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
device="cuda",
)
extractor = QwenTextExtractor(backend=config)
# Load document
image = Image.open("document.png")
# Extract text in markdown
result = extractor.extract(
image,
output_format="markdown",
)
print(result.content) # Clean markdown
print(result.word_count) # Approximate word count
Multi-Format Extraction¶
# HTML output (preserves more layout semantics)
result_html = extractor.extract(
image,
output_format="html",
)
# Custom prompt for specialized extraction
custom_prompt = """Extract all text as JSON with structure:
{
"title": "...",
"sections": [{"heading": "...", "content": "..."}],
"tables": [...]
}
"""
result_custom = extractor.extract(
image,
output_format="markdown",
custom_prompt=custom_prompt,
)
Batch Processing with VLLM¶
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
from PIL import Image
import time
# Initialize with VLLM for high throughput
config = QwenTextVLLMConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.8,
)
extractor = QwenTextExtractor(backend=config)
# Load multiple documents
images = [
Image.open(f"doc_{i}.png") for i in range(10)
]
# Process sequentially, tracking running throughput
results = []
start = time.time()
for i, image in enumerate(images):
    result = extractor.extract(image, output_format="markdown")
    results.append(result)
    elapsed = time.time() - start
    throughput = sum(r.content_length for r in results) / elapsed  # chars/sec
    print(f"[{i+1}/10] {result.content_length} chars - {throughput:.0f} chars/sec")
print(f"\nTotal time: {time.time() - start:.1f}s")
print(f"Avg length: {sum(r.content_length for r in results) / len(results):.0f} chars")
Layout-Aware Extraction¶
# Include layout information
result = extractor.extract(
image,
output_format="markdown",
include_layout=True,
)
# Access raw output with bounding boxes
print(result.raw_output) # Contains bbox annotations
API-Based Extraction (Cloud)¶
import os
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
from PIL import Image
# Configure API backend
config = QwenTextAPIConfig(
model="qwen3-vl-8b",
api_key=os.getenv("QWEN_API_KEY"),
base_url="https://api.together.xyz/v1",
rate_limit=5,
)
extractor = QwenTextExtractor(backend=config)
# Extract from image
image = Image.open("document.png")
result = extractor.extract(
image,
output_format="markdown",
)
print(result.content)
Performance Characteristics¶
Memory Requirements by Variant¶
| Model | Min VRAM | Optimal VRAM | Batch Size (VLLM) |
|---|---|---|---|
| Qwen3-VL-2B | 4 GB | 8 GB | 8-16 |
| Qwen3-VL-4B | 8 GB | 12 GB | 4-8 |
| Qwen3-VL-8B | 16 GB | 24 GB | 2-4 |
| Qwen3-VL-32B | 40 GB | 80 GB | 1 |
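To pick a variant programmatically, compare the Min VRAM column against the GPU actually present. A minimal sketch using PyTorch (thresholds mirror the table above):
import torch
def pick_variant() -> str:
    """Return the largest Qwen3-VL variant whose minimum VRAM fits the GPU."""
    if not torch.cuda.is_available():
        return "Qwen/Qwen3-VL-2B-Instruct"  # no GPU: smallest model
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 40:
        return "Qwen/Qwen3-VL-32B-Instruct"
    if vram_gb >= 16:
        return "Qwen/Qwen3-VL-8B-Instruct"
    if vram_gb >= 8:
        return "Qwen/Qwen3-VL-4B-Instruct"
    return "Qwen/Qwen3-VL-2B-Instruct"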
Inference Speed (Single Document)¶
| Backend | Model | Speed | Device |
|---|---|---|---|
| PyTorch | 8B | 50-100 tok/s | Single A10 GPU |
| VLLM | 8B | 200-300 tok/s | 2x A10 GPU (tensor parallel) |
| MLX | 8B-quantized | 20-40 tok/s | M3 Max (48GB) |
| API | 8B | Variable | Cloud (depends on provider) |
Typical Output Sizes¶
| Document Type | Tokens | Characters |
|---|---|---|
| Single-page document | 500-2000 | 3-12 KB |
| Academic paper page | 1000-4000 | 6-24 KB |
| Multi-page scanned doc | 2000-8000 | 12-48 KB |
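Combining the speed and size tables gives a rough latency estimate: a 2,000-token academic-paper page at 75 tok/s on the PyTorch backend takes roughly 27 seconds of decode time. A back-of-envelope helper:
def estimate_latency_s(output_tokens: int, tokens_per_sec: float) -> float:
    """Rough decode time in seconds; ignores prefill and image preprocessing."""
    return output_tokens / tokens_per_sec
# Midpoints from the tables above: academic page, single-GPU PyTorch backend
print(f"{estimate_latency_s(2000, 75):.0f}s")  # -> 27s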
Troubleshooting¶
Out of Memory (OOM)¶
Symptom: RuntimeError: CUDA out of memory
Solutions:
# 1. Reduce max_new_tokens
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
max_new_tokens=4096, # Reduced from 8192
)
# 2. Use smaller model variant
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-4B-Instruct", # Smaller variant
)
# 3. Enable quantization
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
load_in_4bit=True, # Requires bitsandbytes
)
# 4. Use CPU
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
device="cpu", # Slower but works with limited VRAM
)
Slow Inference¶
Symptom: Processing takes 30+ seconds per document
Solutions:
# 1. Enable Flash Attention (requires flash-attn package)
config = QwenTextPyTorchConfig(
use_flash_attention=True,
)
# 2. Use VLLM for batching
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(
model="Qwen/Qwen3-VL-8B-Instruct",
tensor_parallel_size=2,
)
# 3. Use smaller model
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-4B-Instruct", # 2x faster
)
# 4. Reduce image size
from PIL import Image
image = Image.open("document.png")
image.thumbnail((1024, 1024))  # Downscale in place so the longest side is at most 1024 px (preserves aspect ratio)
Poor Quality Output¶
Symptom: Garbled or incomplete text extraction
Solutions:
# 1. Lower temperature for more deterministic output
config = QwenTextPyTorchConfig(
temperature=0.01, # Very low for consistency
)
# 2. Use larger model variant
config = QwenTextPyTorchConfig(
model="Qwen/Qwen3-VL-32B-Instruct", # Better quality
)
# 3. Pre-process image (enhance contrast, de-skew)
from PIL import ImageEnhance
image = Image.open("document.png")
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.5) # Increase contrast
# 4. Custom prompt for better guidance
custom_prompt = """Extract all text exactly as it appears.
Preserve formatting, structure, and special characters."""
result = extractor.extract(image, custom_prompt=custom_prompt)
API Rate Limiting¶
Symptom: 429 Too Many Requests errors
Solutions:
# Reduce rate limit
config = QwenTextAPIConfig(
model="qwen3-vl-8b",
api_key="...",
rate_limit=2, # Reduced from 10
)
# Implement retry logic
import time
max_retries = 3
for attempt in range(max_retries):
    try:
        result = extractor.extract(image)
        break
    except Exception:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt  # exponential backoff: 1s, then 2s
            print(f"Rate limited, waiting {wait_time}s...")
            time.sleep(wait_time)
        else:
            raise
Model Download Issues¶
Symptom: ConnectionError or timeout during model loading
Solutions:
# Set HuggingFace cache directory
import os
os.environ["HF_HOME"] = "/path/to/cache"
# Pre-download model
from huggingface_hub import snapshot_download
snapshot_download("Qwen/Qwen3-VL-8B-Instruct")
# Use local model path
config = QwenTextPyTorchConfig(
model="/local/path/to/model",
)
Model Selection Guide¶
When to Use Qwen3-VL¶
Best for:
- High-quality document extraction (academic papers, technical docs)
- Multilingual documents
- Complex layouts with mixed content types
- Production systems needing consistent quality
Not ideal for:
- Real-time processing (see: Nanonets OCR for speed)
- Handwritten documents (see: Surya OCR)
- Fixed-label layout detection (see: DocLayout-YOLO)
Qwen vs DotsOCR Comparison¶
| Feature | Qwen3-VL | DotsOCR |
|---|---|---|
| Output Quality | Excellent | Very Good |
| Layout Info | Basic | Detailed (11 categories) |
| Speed | Medium | Fast |
| Memory | High | Medium |
| Multilingual | Yes (25+ langs) | Limited |
| Model Size Options | 2B-32B | Single |
Choose Qwen3-VL if: You need high-quality text and multilingual support.
Choose DotsOCR if: You need detailed layout information with good performance.
API Reference¶
QwenTextExtractor.extract()¶
def extract(
image: Union[Image.Image, np.ndarray, str, Path],
output_format: str = "markdown",
include_layout: bool = False,
custom_prompt: Optional[str] = None,
) -> TextOutput:
"""
Extract text from image using Qwen3-VL.
Args:
image: Input image (PIL Image, numpy array, or path)
output_format: "markdown" or "html"
include_layout: Include layout information in raw output
custom_prompt: Override default extraction prompt
Returns:
TextOutput with extracted content
"""
TextOutput Properties¶
result = extractor.extract(image)
# Access extracted content
print(result.content) # Formatted text (markdown/html)
print(result.format) # Output format
print(result.plain_text) # Plain text without formatting
print(result.content_length) # Character count
print(result.word_count) # Approximate word count
print(result.image_width) # Source image width
print(result.image_height) # Source image height
print(result.model_name) # Model used
print(result.raw_output) # Raw model output (with artifacts)
Advanced Configuration¶
Device Map Strategies¶
# Auto device mapping (recommended)
device_map = "auto"
# Balanced distribution across GPUs
device_map = "balanced"
# Sequential loading (one GPU at a time)
device_map = "sequential"
# Manual: first layer on GPU 0, rest on CPU. The dict maps module names to
# devices; range shorthand like "model.layers.1-31" is not valid, so list
# layers individually (module names shown are typical for Qwen-style models)
device_map = {
    "model.layers.0": 0,     # first decoder layer on GPU 0
    "model.layers.1": "cpu",
    # ...one entry per remaining module
}
Data Type Selection¶
# float32: Full precision (slower, more VRAM)
torch_dtype = "float32"
# float16: Half precision (faster, less VRAM, less accurate)
torch_dtype = "float16"
# bfloat16: Brain float (recommended for stability)
torch_dtype = "bfloat16"
# auto: Let model choose based on hardware
torch_dtype = "auto"
See Also¶
- Qwen HuggingFace Model Card
- DotsOCR Documentation - For layout-aware extraction
- Qwen Layout Detection - For layout analysis
- Comparison Guide - Model selection matrix