DeepSeek OCR¶

High-accuracy document OCR using DeepSeek-OCR and DeepSeek-OCR-2.

Overview¶

Property	Value
Model (default)	`deepseek-ai/DeepSeek-OCR-2`
Parameters	~3B
Task	Text Extraction
Backends	PyTorch, VLLM, MLX, API
License	Apache 2.0 (v2), MIT (v1)

Two generations of DeepSeek OCR are supported:

Version	Release	License	Architecture
DeepSeek-OCR-2 (default)	Jan 2026	Apache 2.0	Visual Causal Flow
DeepSeek-OCR	Oct 2024	MIT	Hybrid Vision + Causal LM

Both share the same inference interface — swap the model string to switch.

Installation¶

# PyTorch backend
pip install omnidocs[pytorch]

# VLLM backend (recommended for production — ~2500 tok/s on A100)
pip install omnidocs[vllm]

# MLX backend (Apple Silicon)
pip install omnidocs[mlx]

# API backend (no GPU — included in base install)
pip install omnidocs

Extra dependencies for PyTorch backend

DeepSeek-OCR requires transformers==4.46.3, einops, addict, and easydict. Optionally install flash-attn==2.7.3 with --no-build-isolation for faster inference.

pip install "transformers==4.46.3" einops addict easydict

Quick Start¶

PyTorch Backend¶

from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextPyTorchConfig
from PIL import Image

image = Image.open("document.png")

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextPyTorchConfig(device="cuda")
)
result = extractor.extract(image)
print(result.content)

VLLM Backend (Recommended for Production)¶

from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig(
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
    )
)
result = extractor.extract(image)
print(result.content)

MLX Backend (Apple Silicon)¶

from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextMLXConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextMLXConfig(
        model="mlx-community/DeepSeek-OCR-4bit"
    )
)
result = extractor.extract(image)

API Backend (Novita AI)¶

import os
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextAPIConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextAPIConfig(
        model="novita/deepseek/deepseek-ocr",
        api_key=os.getenv("NOVITA_API_KEY"),
    )
)
result = extractor.extract(image)

Configuration¶

PyTorch Config¶

from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextPyTorchConfig

config = DeepSeekOCRTextPyTorchConfig(
    model="deepseek-ai/DeepSeek-OCR-2",   # or "deepseek-ai/DeepSeek-OCR"
    device="cuda",                          # "cuda" or "cpu" (MPS not tested)
    torch_dtype="bfloat16",                # Required per official README
    use_flash_attention=False,             # True requires flash-attn==2.7.3
    trust_remote_code=True,               # Required — custom model code
    base_size=1024,                        # Visual encoder canvas size
    image_size=768,                        # Tile resize target
    crop_mode=True,                        # Adaptive tiling for dense pages
)

Parameter	Default	Description
`model`	`deepseek-ai/DeepSeek-OCR-2`	HuggingFace model ID
`device`	`cuda`	Inference device
`torch_dtype`	`bfloat16`	BF16 required per official README
`use_flash_attention`	`False`	Enable Flash Attention 2 (needs flash-attn==2.7.3)
`crop_mode`	`True`	Adaptive tiling for dense/small-font pages ("Gundam mode")
`base_size`	`1024`	Visual encoder canvas size (512–2048)
`image_size`	`768`	Tile resize target (256–1024)

VLLM Config¶

from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

config = DeepSeekOCRTextVLLMConfig(
    model="deepseek-ai/DeepSeek-OCR",       # v1 has official VLLM support
    tensor_parallel_size=1,                  # GPUs for parallelism
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_tokens=8192,
    temperature=0.0,                         # Greedy decoding recommended
    enable_prefix_caching=False,             # Must be False for v1
    mm_processor_cache_gb=0,                 # Must be 0 for v1
)

VLLM v1 constraints

For DeepSeek-OCR v1, enable_prefix_caching must be False and mm_processor_cache_gb must be 0. These are required for the NGram logits processor to work correctly.

DeepSeek-OCR-2 VLLM support: check the official repo for updated setup instructions — may require a nightly VLLM build.

Parameter	Default	Description
`model`	`deepseek-ai/DeepSeek-OCR`	HuggingFace model ID
`tensor_parallel_size`	`1`	Number of GPUs
`gpu_memory_utilization`	`0.9`	GPU memory fraction
`max_model_len`	`8192`	Max context length
`temperature`	`0.0`	0.0 = greedy decoding
`enable_prefix_caching`	`False`	Must be False for v1
`mm_processor_cache_gb`	`0`	Must be 0 for v1

MLX Config¶

from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextMLXConfig

config = DeepSeekOCRTextMLXConfig(
    model="mlx-community/DeepSeek-OCR-4bit",  # or "DeepSeek-OCR-8bit"
    max_tokens=8192,
    temperature=0.0,
)

MLX model availability

MLX quantized variants are currently available for DeepSeek-OCR v1 (mlx-community/DeepSeek-OCR-4bit, mlx-community/DeepSeek-OCR-8bit). Check mlx-community on HuggingFace for DeepSeek-OCR-2 variants as they are published.

API Config¶

from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextAPIConfig

config = DeepSeekOCRTextAPIConfig(
    model="novita/deepseek/deepseek-ocr",  # litellm format
    api_key=None,                           # reads NOVITA_API_KEY from env
    api_base=None,                          # override provider URL if needed
    max_tokens=8192,
    temperature=0.0,
    timeout=180,
)

Set the environment variable for Novita AI:

export NOVITA_API_KEY=your_key_here

Prompt Modes¶

DeepSeek-OCR supports four built-in prompt modes. The extractor uses markdown by default.

Mode	Prompt	Best For
`markdown`	`<\\|grounding\\|>Convert the document to markdown.`	Structured documents (default)
`ocr`	`<\\|grounding\\|>OCR this image.`	General image OCR
`free`	`Free OCR.`	Plain text, no layout
`figure`	`Parse the figure.`	Figures and diagrams

Output¶

result = extractor.extract(image)

print(result.content)        # Extracted Markdown text
print(result.plain_text)     # Plain text (same as content for DeepSeek)
print(result.model_name)     # "DeepSeek-OCR (model, backend)"
print(result.image_width)    # Source image width
print(result.image_height)   # Source image height
print(result.raw_output)     # Raw model output

DeepSeek-OCR always outputs Markdown. The output_format parameter is accepted for API compatibility but does not change the output type.

Performance¶

Backend	Device	Speed	Notes
PyTorch	A100-40G	~80–120 tok/s	BF16, `crop_mode=True`
VLLM	A100-40G	~2500 tok/s	Official upstream support for v1
MLX	M3 Max (48GB)	~20–40 tok/s	4-bit quantized
API	Novita AI	Variable	No GPU required

VRAM requirements:

Backend	Min VRAM
PyTorch	16 GB
VLLM (1 GPU)	20 GB
VLLM (2 GPU)	20 GB/GPU

Comparison with Other Models¶

Model	Speed	Layout Info	Multilingual	Backends
DeepSeek-OCR-2	Fast (VLLM)	No	Limited	PyTorch, VLLM, MLX, API
Qwen3-VL-8B	Medium	Basic	Yes (25+)	PyTorch, VLLM, MLX, API
DotsOCR	Medium	Yes (11 cats)	Limited	PyTorch, VLLM
Nanonets OCR2	Fast	No	Limited	PyTorch, VLLM, MLX

Choose DeepSeek-OCR if: - You need maximum throughput (VLLM ~2500 tok/s on A100) - You're processing dense, complex real-world documents - You need handwritten or noisy document support

Choose Qwen3-VL if: You need multilingual support or an API backend with broader provider coverage.

Choose DotsOCR if: You need bounding boxes and layout categories alongside the text.

Troubleshooting¶

Empty or incomplete output (PyTorch)

The PyTorch backend writes output to .mmd files on disk. If the output directory is empty, the model may have failed silently. Try:

# Ensure image is large enough
from PIL import Image
image = Image.open("document.png")
print(image.size)  # Should be > 256x256

# Disable crop_mode for small/simple images
config = DeepSeekOCRTextPyTorchConfig(crop_mode=False)

VLLM ImportError: NGramPerReqLogitsProcessor

This requires vllm>=0.11.1. Update with:

pip install "vllm>=0.11.1"

OOM on PyTorch

# Reduce tile size
config = DeepSeekOCRTextPyTorchConfig(
    base_size=768,   # down from 1024
    image_size=512,  # down from 768
)

Slow inference on PyTorch

Switch to VLLM, or install Flash Attention:

pip install flash-attn==2.7.3 --no-build-isolation

Then enable it:

config = DeepSeekOCRTextPyTorchConfig(use_flash_attention=True)

API 401 / authentication errors

# Verify your Novita API key is set
echo $NOVITA_API_KEY

# Or pass it directly
config = DeepSeekOCRTextAPIConfig(
    model="novita/deepseek/deepseek-ocr",
    api_key="your_key_here",
)