
Qwen3-VL Text Extraction

Model Overview

Qwen3-VL is an advanced Vision-Language Model optimized for document understanding and text extraction. It excels at producing high-quality markdown and HTML output while maintaining document layout and semantic structure.

Model Family: Qwen3-VL-2B, Qwen3-VL-4B, Qwen3-VL-8B, Qwen3-VL-32B
Repository: Qwen/Qwen3-VL
Recommended Variant: Qwen3-VL-8B-Instruct (best balance of quality and speed)

Key Capabilities

  • Multi-format Output: Markdown, HTML, or custom formats
  • Layout-Aware: Preserves document structure and semantic relationships
  • Multilingual: Supports 25+ languages with native quality
  • Document Types: PDFs, academic papers, technical docs, web pages, presentations
  • Scale Support: Handles documents from single-page images to 16k+ token outputs
  • Custom Prompts: Flexible prompt engineering for specialized extraction tasks

Limitations

  • Requires GPU for inference (2B variant: 4GB VRAM, 8B: 16GB, 32B: 40GB+)
  • Slower than single-task models (roughly 20-400 tokens/sec depending on backend and hardware)
  • Can struggle with highly stylized or unusual layouts
  • No inherent language detection (specify language in config if needed)

Supported Backends

Qwen3-VL supports four inference backends, so you can choose the deployment method that fits your hardware:

| Backend | Use Case | Performance | Setup |
|---|---|---|---|
| PyTorch | Local GPU inference | 50-150 tokens/sec | Easy, single GPU |
| VLLM | High-throughput batching | 200-400 tokens/sec | One or more CUDA GPUs (multi-GPU via tensor parallelism) |
| MLX | Apple Silicon (native) | 20-50 tokens/sec | macOS M1/M2/M3+ only |
| API | Hosted inference | Variable | Cloud provider |
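
The table above can be read as a simple decision rule. The sketch below is one way to encode it; `choose_backend` and `is_apple_silicon` are hypothetical helpers, not part of OmniDocs, and they only mirror the "Use Case" column.

```python
import platform


def choose_backend(cuda_gpus: int, apple_silicon: bool, batch_workload: bool) -> str:
    """Map a hardware description to one of the four backend names above.

    Hypothetical helper: it encodes the "Use Case" column of the table,
    not any logic shipped with OmniDocs.
    """
    if cuda_gpus > 0:
        # VLLM is listed for high-throughput batching, PyTorch for local inference.
        return "vllm" if batch_workload else "pytorch"
    if apple_silicon:
        return "mlx"
    # No local accelerator: fall back to hosted inference.
    return "api"


def is_apple_silicon() -> bool:
    """True on macOS running on an arm64 (M-series) CPU."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```

The returned string is only a label; you would still construct the matching config class from the sections below.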

Installation & Configuration

Basic Installation

# Install with PyTorch backend (most common)
pip install omnidocs[pytorch]

# Or install with VLLM for high throughput
pip install omnidocs[vllm]

# Or install with all backends
pip install omnidocs[all]

PyTorch Backend Configuration

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attention=False,  # Set to True if flash-attn installed
    max_new_tokens=8192,
    temperature=0.1,
)

extractor = QwenTextExtractor(backend=config)

PyTorch Config Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "Qwen/Qwen3-VL-8B-Instruct" | HuggingFace model ID |
| device | str | "cuda" | Device: "cuda", "mps", "cpu" |
| torch_dtype | str | "auto" | Data type: "float16", "bfloat16", "float32", "auto" |
| device_map | str | "auto" | Model parallelism: "auto", "balanced", "sequential", None |
| trust_remote_code | bool | True | Allow custom model code from HuggingFace |
| use_flash_attention | bool | False | Use Flash Attention 2 (faster, requires flash-attn) |
| max_new_tokens | int | 8192 | Max tokens to generate (256-32768) |
| temperature | float | 0.1 | Sampling temperature (0.0-2.0, lower = more deterministic) |
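
The documented ranges for `max_new_tokens` and `temperature` can be checked before a config is built. This is a minimal sketch; `validate_generation_params` is a hypothetical helper, and whether the config class performs the same checks itself is not stated here.

```python
def validate_generation_params(max_new_tokens: int, temperature: float) -> None:
    """Raise ValueError if a value falls outside the range listed in the table.

    Hypothetical pre-check; the bounds are copied from the parameter table.
    """
    if not 256 <= max_new_tokens <= 32768:
        raise ValueError(f"max_new_tokens must be in 256-32768, got {max_new_tokens}")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in 0.0-2.0, got {temperature}")


# The defaults from the table pass the check.
validate_generation_params(max_new_tokens=8192, temperature=0.1)
```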

VLLM Backend Configuration

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=1,  # Use 2+ for large models
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

extractor = QwenTextExtractor(backend=config)

VLLM Config Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | Required | HuggingFace model ID |
| tensor_parallel_size | int | 1 | Number of GPUs for parallelism |
| gpu_memory_utilization | float | 0.9 | GPU memory usage (0.1-1.0) |
| max_model_len | int | None | Max context length in tokens |
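
As a rough guide to what `gpu_memory_utilization` and `tensor_parallel_size` imply, the arithmetic below estimates the total memory the engine may claim. It is a back-of-the-envelope sketch: `vllm_memory_budget_gb` is a hypothetical helper, and the budget has to cover weights, activations, and KV cache together.

```python
def vllm_memory_budget_gb(
    vram_per_gpu_gb: float,
    gpu_memory_utilization: float = 0.9,
    tensor_parallel_size: int = 1,
) -> float:
    """Total GPU memory (GB) available to the engine across all GPUs."""
    if not 0.1 <= gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in 0.1-1.0")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    # The fraction applies to each GPU, so the total scales with GPU count.
    return vram_per_gpu_gb * gpu_memory_utilization * tensor_parallel_size


# One 24 GB GPU at the default 0.9 leaves about 21.6 GB for the engine.
single = vllm_memory_budget_gb(24)
# Two 24 GB GPUs at 0.8 leave about 38.4 GB in total.
pair = vllm_memory_budget_gb(24, gpu_memory_utilization=0.8, tensor_parallel_size=2)
```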

MLX Backend Configuration (Apple Silicon)

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig

config = QwenTextMLXConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    quantization="4bit",  # or "8bit", "none"
    max_tokens=8192,
)

extractor = QwenTextExtractor(backend=config)

API Backend Configuration

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig

config = QwenTextAPIConfig(
    model="qwen3-vl-8b",
    api_key="your-api-key",
    base_url="https://api.provider.com/v1",
    rate_limit=10,  # Requests per second
)

extractor = QwenTextExtractor(backend=config)

Usage Examples

Basic Text Extraction (Markdown)

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from omnidocs import Document
from PIL import Image

# Initialize extractor
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)
extractor = QwenTextExtractor(backend=config)

# Load document
image = Image.open("document.png")

# Extract text in markdown
result = extractor.extract(
    image,
    output_format="markdown",
)

print(result.content)  # Clean markdown
print(result.word_count)  # Approximate word count

Multi-Format Extraction

# HTML output (preserves more layout semantics)
result_html = extractor.extract(
    image,
    output_format="html",
)

# Custom prompt for specialized extraction
custom_prompt = """Extract all text as JSON with structure:
{
    "title": "...",
    "sections": [{"heading": "...", "content": "..."}],
    "tables": [...]
}
"""

result_custom = extractor.extract(
    image,
    output_format="markdown",
    custom_prompt=custom_prompt,
)
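
When a custom prompt asks for JSON, the returned text still has to be parsed. The sketch below assumes the JSON arrives in `result_custom.content` and that the model may wrap it in a Markdown code fence; both are assumptions, and `parse_json_output` is a hypothetical helper rather than an OmniDocs function.

```python
import json
import re

# Markdown code-fence delimiter, built at runtime so this snippet contains no literal fence.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_json_output(text: str) -> dict:
    """Parse JSON from model output, tolerating a surrounding code fence.

    Raises json.JSONDecodeError if the payload is not valid JSON.
    """
    match = _FENCED.search(text)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)


# Hypothetical usage with the extraction result from the example above:
# data = parse_json_output(result_custom.content)
# print(data["title"])
```

Model output is not guaranteed to be valid JSON, so callers may want to catch `json.JSONDecodeError` and fall back to the raw text.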

Batch Processing with VLLM

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
from PIL import Image
import time

# Initialize with VLLM for high throughput
config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.8,
)
extractor = QwenTextExtractor(backend=config)

# Load multiple documents
images = [
    Image.open(f"doc_{i}.png") for i in range(10)
]

# Process documents one at a time, reporting running throughput
results = []
total_chars = 0
start = time.time()

for i, image in enumerate(images):
    result = extractor.extract(image, output_format="markdown")
    results.append(result)
    total_chars += result.content_length
    elapsed = time.time() - start
    throughput = total_chars / elapsed  # chars/sec across all documents so far
    print(f"[{i+1}/{len(images)}] {result.content_length} chars - {throughput:.0f} chars/sec")

print(f"\nTotal time: {time.time() - start:.1f}s")
print(f"Avg length: {sum(r.content_length for r in results) / len(results):.0f} chars")

Layout-Aware Extraction

# Include layout information
result = extractor.extract(
    image,
    output_format="markdown",
    include_layout=True,
)

# Access raw output with bounding boxes
print(result.raw_output)  # Contains bbox annotations

API-Based Extraction (Cloud)

import os
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
from PIL import Image

# Configure API backend
config = QwenTextAPIConfig(
    model="qwen3-vl-8b",
    api_key=os.getenv("QWEN_API_KEY"),
    base_url="https://api.together.xyz/v1",
    rate_limit=5,
)

extractor = QwenTextExtractor(backend=config)

# Extract from image
image = Image.open("document.png")
result = extractor.extract(
    image,
    output_format="markdown",
)

print(result.content)

Performance Characteristics

Memory Requirements by Variant

| Model | Min VRAM | Optimal VRAM | Batch Size (VLLM) |
|---|---|---|---|
| Qwen3-VL-2B | 4 GB | 8 GB | 8-16 |
| Qwen3-VL-4B | 8 GB | 12 GB | 4-8 |
| Qwen3-VL-8B | 16 GB | 24 GB | 2-4 |
| Qwen3-VL-32B | 40 GB | 80 GB | 1 |
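
The minimum-VRAM column can be turned into a quick lookup for choosing a variant. This is a sketch only: `largest_variant_that_fits` is a hypothetical helper, and the figures are copied from the table rather than measured.

```python
from typing import Optional

# Minimum VRAM per variant in GB, copied from the table above.
MIN_VRAM_GB = {
    "Qwen3-VL-2B": 4,
    "Qwen3-VL-4B": 8,
    "Qwen3-VL-8B": 16,
    "Qwen3-VL-32B": 40,
}


def largest_variant_that_fits(vram_gb: float) -> Optional[str]:
    """Return the largest variant whose minimum VRAM fits, or None if none does."""
    fitting = [(need, name) for name, need in MIN_VRAM_GB.items() if need <= vram_gb]
    # Tuples sort by required VRAM first, so max() picks the largest fitting model.
    return max(fitting)[1] if fitting else None
```

Meeting the minimum only means the model loads; the "Optimal VRAM" column is the safer target when outputs are long.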

Inference Speed (Single Document)

| Backend | Model | Speed | Device |
|---|---|---|---|
| PyTorch | 8B | 50-100 tok/s | Single A10 GPU |
| VLLM | 8B | 200-300 tok/s | 2x A10 GPU (tensor parallel) |
| MLX | 8B-quantized | 20-40 tok/s | M3 Max (48GB) |
| API | 8B | Variable | Cloud (depends on provider) |

Typical Output Sizes

| Document Type | Output Tokens | Output Size |
|---|---|---|
| Single-page document | 500-2000 | 3-12 KB |
| Academic paper page | 1000-4000 | 6-24 KB |
| Multi-page scanned doc | 2000-8000 | 12-48 KB |
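
Combining the output sizes above with the speeds in the previous table gives a rough generation time per document. The sketch below is plain arithmetic under the assumption that decoding dominates; it ignores model loading and image preprocessing, and `estimate_seconds` is a hypothetical helper.

```python
def estimate_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate decode time: tokens to generate divided by generation speed."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


# A 2000-token page on the PyTorch backend (50-100 tok/s in the table):
slow = estimate_seconds(2000, 50)    # 40.0 seconds at the low end
fast = estimate_seconds(2000, 100)   # 20.0 seconds at the high end
```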

Troubleshooting

Out of Memory (OOM)

Symptom: RuntimeError: CUDA out of memory

Solutions:

# 1. Reduce max_new_tokens
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    max_new_tokens=4096,  # Reduced from 8192
)

# 2. Use smaller model variant
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-4B-Instruct",  # Smaller variant
)

# 3. Enable quantization
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    load_in_4bit=True,  # Requires bitsandbytes
)

# 4. Use CPU
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cpu",  # Slower but works with limited VRAM
)

Slow Inference

Symptom: Processing takes 30+ seconds per document

Solutions:

# 1. Enable Flash Attention (requires flash-attn package)
config = QwenTextPyTorchConfig(
    use_flash_attention=True,
)

# 2. Use VLLM for batching
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,
)

# 3. Use smaller model
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-4B-Instruct",  # 2x faster
)

# 4. Reduce image size
from PIL import Image
image = Image.open("document.png")
image.thumbnail((1024, 1024))  # Downscale in place to fit within 1024x1024, keeping aspect ratio

Poor Quality Output

Symptom: Garbled or incomplete text extraction

Solutions:

# 1. Lower temperature for more deterministic output
config = QwenTextPyTorchConfig(
    temperature=0.01,  # Very low for consistency
)

# 2. Use larger model variant
config = QwenTextPyTorchConfig(
    model="Qwen/Qwen3-VL-32B-Instruct",  # Better quality
)

# 3. Pre-process image (contrast enhancement shown; de-skewing may also help)
from PIL import Image, ImageEnhance
image = Image.open("document.png")
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.5)  # Increase contrast

# 4. Custom prompt for better guidance
custom_prompt = """Extract all text exactly as it appears.
Preserve formatting, structure, and special characters."""
result = extractor.extract(image, custom_prompt=custom_prompt)

API Rate Limiting

Symptom: 429 Too Many Requests errors

Solutions:

# Reduce rate limit
config = QwenTextAPIConfig(
    model="qwen3-vl-8b",
    api_key="...",
    rate_limit=2,  # Reduced from 10
)

# Implement retry logic
import time
max_retries = 3
for attempt in range(max_retries):
    try:
        result = extractor.extract(image)
        break
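    except Exception:  # Ideally catch only the client's rate-limit error type
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt  # Exponential backoff: 1s, then 2s
            print(f"Request failed, waiting {wait_time}s before retrying...")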
            time.sleep(wait_time)
        else:
            raise
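
The retry loop above can be wrapped in a reusable function. The version below is a sketch that adds a small random jitter so parallel clients do not retry in lockstep; `retry_with_backoff` and its parameters are hypothetical names, and the exception type raised on a 429 depends on the API client, so `retry_on` defaults to `Exception`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call() up to max_retries times with exponential backoff between failures."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_retries must be at least 1")


# Hypothetical usage with the extractor configured above:
# result = retry_with_backoff(lambda: extractor.extract(image))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.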

Model Download Issues

Symptom: ConnectionError or timeout during model loading

Solutions:

# Set HuggingFace cache directory (do this before importing huggingface_hub or transformers)
import os
os.environ["HF_HOME"] = "/path/to/cache"

# Pre-download model
from huggingface_hub import snapshot_download
snapshot_download("Qwen/Qwen3-VL-8B-Instruct")

# Use local model path
config = QwenTextPyTorchConfig(
    model="/local/path/to/model",
)

Model Selection Guide

When to Use Qwen3-VL

Best for:

  • High-quality document extraction (academic papers, technical docs)
  • Multilingual documents
  • Complex layouts with mixed content types
  • Production systems needing consistent quality

Not ideal for:

  • Real-time processing (see: Nanonets OCR for speed)
  • Handwritten documents (see: Surya OCR)
  • Fixed-label layout detection (see: DocLayout-YOLO)

Qwen vs DotsOCR Comparison

| Feature | Qwen3-VL | DotsOCR |
|---|---|---|
| Output Quality | Excellent | Very Good |
| Layout Info | Basic | Detailed (11 categories) |
| Speed | Medium | Fast |
| Memory | High | Medium |
| Multilingual | Yes (25+ langs) | Limited |
| Model Size Options | 2B-32B | Single |

Choose Qwen3-VL if: You need high-quality text and multilingual support

Choose DotsOCR if: You need detailed layout information with good performance


API Reference

QwenTextExtractor.extract()

def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: str = "markdown",
    include_layout: bool = False,
    custom_prompt: Optional[str] = None,
) -> TextOutput:
    """
    Extract text from image using Qwen3-VL.

    Args:
        image: Input image (PIL Image, numpy array, or path)
        output_format: "markdown" or "html"
        include_layout: Include layout information in raw output
        custom_prompt: Override default extraction prompt

    Returns:
        TextOutput with extracted content
    """

TextOutput Properties

result = extractor.extract(image)

# Access extracted content
print(result.content)        # Formatted text (markdown/html)
print(result.format)         # Output format
print(result.plain_text)     # Plain text without formatting
print(result.content_length) # Character count
print(result.word_count)     # Approximate word count
print(result.image_width)    # Source image width
print(result.image_height)   # Source image height
print(result.model_name)     # Model used
print(result.raw_output)     # Raw model output (with artifacts)
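
A common next step is writing the extracted content to disk with an extension that matches the output format. The sketch below assumes `result.format` holds the string "markdown" or "html" (it may be a different type in practice); `save_output` and the `FakeOutput` stand-in are hypothetical and exist only so the example runs without a model.

```python
from dataclasses import dataclass
from pathlib import Path

# File extension per output format; anything else falls back to .txt.
EXTENSIONS = {"markdown": ".md", "html": ".html"}


@dataclass
class FakeOutput:
    """Stand-in exposing only the `content` and `format` attributes listed above."""
    content: str
    format: str


def save_output(result, stem: str, out_dir: Path) -> Path:
    """Write result.content to out_dir/<stem><ext> and return the path."""
    suffix = EXTENSIONS.get(result.format, ".txt")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stem}{suffix}"
    path.write_text(result.content, encoding="utf-8")
    return path


# Hypothetical usage with a real extraction result:
# save_output(result, "document", Path("outputs"))
```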

Advanced Configuration

Device Map Strategies

# Auto device mapping (recommended)
device_map = "auto"

# Balanced distribution across GPUs
device_map = "balanced"

# Sequential: fill GPU 0 first, then GPU 1, and so on
device_map = "sequential"

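# Manual: first decoder layer on GPU 0, remaining layers on CPU.
# Each module needs its own key; range keys such as "model.layers.1-31" are not supported.
# Module names below are illustrative and depend on the model architecture.
device_map = {"model.layers.0": 0}
device_map.update({f"model.layers.{i}": "cpu" for i in range(1, 32)})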

Data Type Selection

# float32: Full precision (slower, more VRAM)
torch_dtype = "float32"

# float16: Half precision (faster, less VRAM, less accurate)
torch_dtype = "float16"

# bfloat16: Brain float (recommended for stability)
torch_dtype = "bfloat16"

# auto: Let model choose based on hardware
torch_dtype = "auto"
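
The VRAM effect of each data type follows from bytes per parameter. The sketch below estimates weight memory only, using nominal parameter counts (the "8B" label is approximate); activations and KV cache add to this, and `weight_memory_gb` is a hypothetical helper.

```python
# Storage size of one parameter for each supported dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(params_billion: float, torch_dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    billions of params * 1e9 * bytes per param / 1e9 simplifies to the product below.
    """
    return params_billion * BYTES_PER_PARAM[torch_dtype]


# An 8B model needs about 16 GB of weights in bfloat16 and about 32 GB in float32.
half = weight_memory_gb(8, "bfloat16")
full = weight_memory_gb(8, "float32")
```

This is consistent with the 16 GB minimum listed for the 8B variant, and it shows why float32 roughly doubles the requirement.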

See Also