Overview

DeepSeek-OCR backend configurations and extractor for text extraction.

Two generations of DeepSeek OCR models from deepseek-ai:

  • DeepSeek-OCR (Oct 2025, arXiv:2510.18234) — v1, MIT license, 3B params, ~6.7 GB BF16
  • DeepSeek-OCR-2 (Jan 2026, arXiv:2601.20552) — v2, Apache 2.0, 3B params, improved "Visual Causal Flow"

Both share the same inference interface (AutoModel + AutoTokenizer with model.infer()). The default model is DeepSeek-OCR-2 (latest) for every backend except VLLM, which defaults to v1 until stable vLLM support for OCR-2 is released (see the VLLM config note below).
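Under that shared interface, a direct transformers call (outside omnidocs) looks roughly like the sketch below. The `model.infer(tokenizer, prompt=..., image_file=...)` entry point and the grounding prompt come from this page; the `output_path` keyword and the exact return shape are assumptions, not verified against the official README.

```python
def run_deepseek_ocr(image_path: str, output_dir: str) -> str:
    """Sketch of the shared AutoModel + AutoTokenizer interface.

    The custom infer() method ships with the model repo, hence
    trust_remote_code=True. Keyword arguments beyond tokenizer,
    prompt, and image_file are illustrative.
    """
    from transformers import AutoModel, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-OCR-2"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

    prompt = "\n<|grounding|>Convert the document to markdown."
    return model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_path,
        output_path=output_dir,
    )
```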

Supported prompts

"

<|grounding|>Convert the document to markdown." ← structured document " <|grounding|>OCR this image." ← general image " Free OCR." ← plain text, no layout " Parse the figure." ← figures in document
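A small helper (hypothetical, not part of omnidocs) that maps a task name to the matching prompt string from the list above:

```python
# Hypothetical convenience mapping; the prompt strings come from the
# supported-prompts list above, the task keys are illustrative.
DEEPSEEK_OCR_PROMPTS = {
    "markdown": "\n<|grounding|>Convert the document to markdown.",
    "ocr": "\n<|grounding|>OCR this image.",
    "free": "\nFree OCR.",
    "figure": "\nParse the figure.",
}


def prompt_for(task: str) -> str:
    """Return the DeepSeek-OCR prompt for a task, defaulting to markdown."""
    return DEEPSEEK_OCR_PROMPTS.get(task, DEEPSEEK_OCR_PROMPTS["markdown"])
```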

Available backends
  • DeepSeekOCRTextPyTorchConfig: PyTorch/HuggingFace backend
  • DeepSeekOCRTextVLLMConfig: VLLM high-throughput backend (recommended, ~2500 tok/s on A100)
  • DeepSeekOCRTextMLXConfig: MLX backend for Apple Silicon
  • DeepSeekOCRTextAPIConfig: API backend (Novita AI)
HuggingFace models

  • deepseek-ai/DeepSeek-OCR-2 (latest, Apache 2.0)
  • deepseek-ai/DeepSeek-OCR (v1, MIT)

GitHub: https://github.com/deepseek-ai/DeepSeek-OCR-2

Example
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig
config = DeepSeekOCRTextVLLMConfig()  # uses DeepSeek-OCR-2 by default

DeepSeekOCRTextAPIConfig

Bases: BaseModel

API backend configuration for DeepSeek-OCR / DeepSeek-OCR-2 text extraction.

Uses litellm for provider-agnostic API access. Primary provider: Novita AI (official hosting).

Example
# Novita AI (reads NOVITA_API_KEY from env)
config = DeepSeekOCRTextAPIConfig(
    model="novita/deepseek/deepseek-ocr",
)
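Since this backend goes through litellm, a direct call (outside omnidocs) might look like the sketch below. The model slug comes from the example above; the base64 data-URL message shape is the standard OpenAI-style vision payload that litellm accepts, not an omnidocs API.

```python
def ocr_via_novita(image_path: str) -> str:
    """Sketch of a direct litellm call to the Novita-hosted model.

    Assumes NOVITA_API_KEY is set in the environment. The message
    layout is the generic OpenAI-style vision format.
    """
    import base64

    import litellm

    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = litellm.completion(
        model="novita/deepseek/deepseek-ocr",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "<|grounding|>Convert the document to markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```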

DeepSeekOCRTextExtractor

DeepSeekOCRTextExtractor(
    backend: DeepSeekOCRTextBackendConfig,
)

Bases: BaseTextExtractor

DeepSeek-OCR / DeepSeek-OCR-2 text extractor.

High-accuracy OCR model that reads complex real-world documents (PDFs, forms, tables, handwritten/noisy text) and outputs clean Markdown. Uses a hybrid vision encoder + causal text decoder — output is structured by the model itself rather than post-processed from bounding boxes.

DeepSeek-OCR-2 ("Visual Causal Flow") is the default — released Jan 2026.

Supports PyTorch, VLLM (recommended), MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import (
    DeepSeekOCRTextPyTorchConfig,
    DeepSeekOCRTextVLLMConfig,
)

# VLLM — ~2500 tokens/s on A100 (recommended for production)
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()
)
result = extractor.extract(image)
print(result.content)

# PyTorch with crop_mode for dense pages
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextPyTorchConfig(crop_mode=True)
)

Initialize DeepSeek-OCR extractor.

PARAMETER DESCRIPTION
backend

Backend config. One of: - DeepSeekOCRTextPyTorchConfig (local GPU) - DeepSeekOCRTextVLLMConfig (recommended, high-throughput) - DeepSeekOCRTextMLXConfig (Apple Silicon) - DeepSeekOCRTextAPIConfig (Novita AI)

TYPE: DeepSeekOCRTextBackendConfig

Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
def __init__(self, backend: DeepSeekOCRTextBackendConfig):
    """
    Initialize DeepSeek-OCR extractor.

    Args:
        backend: Backend config. One of:
            - DeepSeekOCRTextPyTorchConfig (local GPU)
            - DeepSeekOCRTextVLLMConfig (recommended, high-throughput)
            - DeepSeekOCRTextMLXConfig (Apple Silicon)
            - DeepSeekOCRTextAPIConfig (Novita AI)
    """
    self.backend_config = backend
    self._backend: Any = None  # model
    self._processor: Any = None  # tokenizer
    self._loaded = False
    self._device: str = "cpu"

    # VLLM helpers
    self._sampling_params_class: Any = None

    # MLX helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from a document image.

DeepSeek-OCR always outputs Markdown-structured text. The output_format parameter is accepted for API compatibility.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted Markdown content

Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from a document image.

    DeepSeek-OCR always outputs Markdown-structured text.
    The output_format parameter is accepted for API compatibility.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput with extracted Markdown content
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    config_type = type(self.backend_config).__name__
    dispatch = {
        "DeepSeekOCRTextPyTorchConfig": self._infer_pytorch,
        "DeepSeekOCRTextVLLMConfig": self._infer_vllm,
        "DeepSeekOCRTextMLXConfig": self._infer_mlx,
        "DeepSeekOCRTextAPIConfig": self._infer_api,
    }
    raw_output = dispatch[config_type](pil_image)
    cleaned = raw_output.strip()

    return TextOutput(
        content=cleaned,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned,
        image_width=width,
        image_height=height,
        model_name=f"DeepSeek-OCR ({self.backend_config.model}, {config_type})",
    )

DeepSeekOCRTextMLXConfig

Bases: BaseModel

MLX backend configuration for DeepSeek-OCR text extraction.

Apple Silicon only (M1/M2/M3+). Do NOT deploy to Modal/cloud. Uses standard mlx-vlm generate interface.

Note: MLX variants currently available for DeepSeek-OCR v1. Check mlx-community for DeepSeek-OCR-2 variants as they are published.

Example
config = DeepSeekOCRTextMLXConfig(
    model="mlx-community/DeepSeek-OCR-4bit",
)
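For reference, running the 4-bit variant directly with mlx-vlm (outside omnidocs) would look roughly like this. The load / apply_chat_template / generate helpers are mlx-vlm's standard entry points, but keyword arguments have shifted between mlx-vlm releases, so treat the exact signatures as assumptions.

```python
def ocr_with_mlx(image_path: str) -> str:
    """Sketch using mlx-vlm's standard generate interface (Apple Silicon only)."""
    from mlx_vlm import generate, load
    from mlx_vlm.prompt_utils import apply_chat_template

    # Load the quantized community variant and its processor.
    model, processor = load("mlx-community/DeepSeek-OCR-4bit")

    prompt = apply_chat_template(
        processor,
        model.config,
        "<|grounding|>Convert the document to markdown.",
        num_images=1,
    )
    return generate(model, processor, prompt, image=[image_path])
```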

DeepSeekOCRTextPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for DeepSeek-OCR / DeepSeek-OCR-2.

Uses AutoModel + AutoTokenizer. Inference via model.infer() — the model handles tiling and multi-page PDF stitching internally.

Models

  • deepseek-ai/DeepSeek-OCR-2 (default, latest — Jan 2026, Apache 2.0)
  • deepseek-ai/DeepSeek-OCR (v1 — Oct 2025, MIT)

GPU requirements: L4 / A100 (≥16 GB VRAM recommended).

Example
config = DeepSeekOCRTextPyTorchConfig(
    model="deepseek-ai/DeepSeek-OCR-2",
    use_flash_attention=True,  # requires flash-attn==2.7.3
)

DeepSeekOCRTextVLLMConfig

Bases: BaseModel

VLLM backend configuration for DeepSeek-OCR / DeepSeek-OCR-2 text extraction.

DeepSeek-OCR has official upstream VLLM support (~2500 tokens/s on A100). Recommended for high-throughput batch document processing in production. Requires: vllm>=0.11.1 (or nightly for OCR-2), torch, transformers==4.46.3

Note: Default model is DeepSeek-OCR v1 (not v2) because DeepSeek-OCR-2 VLLM support requires a vllm nightly build. Use PyTorch backend for DeepSeek-OCR-2 until official vllm support is released.

Example
config = DeepSeekOCRTextVLLMConfig(
    model="deepseek-ai/DeepSeek-OCR-2",  # requires a vllm nightly build (see note above)
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)
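Bypassing omnidocs, the equivalent direct vLLM call looks roughly like the sketch below, using DeepSeek-OCR v1 (the stable-support default noted above). The `multi_modal_data` image input is vLLM's standard multimodal request shape; the `<image>` prompt placeholder is an assumption about the model's chat template.

```python
def ocr_with_vllm(image_paths: list[str]) -> list[str]:
    """Sketch of batched OCR with vLLM's offline LLM API."""
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        trust_remote_code=True,
        gpu_memory_utilization=0.9,
    )
    params = SamplingParams(temperature=0.0, max_tokens=4096)

    # One request dict per page; vLLM batches them internally.
    requests = [
        {
            "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
            "multi_modal_data": {"image": Image.open(path)},
        }
        for path in image_paths
    ]
    outputs = llm.generate(requests, params)
    return [out.outputs[0].text for out in outputs]
```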

api

API backend configuration for DeepSeek-OCR text extraction.

Novita AI

https://novita.ai/models/model-detail/deepseek-deepseek-ocr

Note: DeepSeek-OCR-2 API availability may vary by provider — check novita.ai for updated model slugs as providers onboard the new version.


extractor

DeepSeek-OCR / DeepSeek-OCR-2 text extractor.

  • DeepSeek-OCR (Oct 2025, arXiv:2510.18234) — v1, MIT, 3B params
  • DeepSeek-OCR-2 (Jan 2026, arXiv:2601.20552) — v2, Apache 2.0, 3B params, "Visual Causal Flow"

Both models share the same inference interface
  • AutoModel + AutoTokenizer (NOT AutoModelForCausalLM / AutoProcessor)
  • model.infer(tokenizer, prompt, image_file, output_path, ...) for PyTorch
  • Grounding prompt format: "\n<|grounding|>Convert the document to markdown."

Supported backends: PyTorch, VLLM (official upstream support), MLX, API.

GitHub

  • v1: https://github.com/deepseek-ai/DeepSeek-OCR
  • v2: https://github.com/deepseek-ai/DeepSeek-OCR-2

Example
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()  # defaults to DeepSeek-OCR v1 on VLLM
)
result = extractor.extract(image)
print(result.content)


mlx

MLX backend configuration for DeepSeek-OCR text extraction.

Available MLX quantized variants (mlx-community):
  • mlx-community/DeepSeek-OCR-4bit (4-bit, recommended)
  • mlx-community/DeepSeek-OCR-8bit (8-bit, higher fidelity)

Note: DeepSeek-OCR-2 MLX variants may not yet be available — check https://huggingface.co/mlx-community for latest uploads. Fall back to DeepSeek-OCR v1 4bit/8bit for Apple Silicon.


pytorch

PyTorch/HuggingFace backend configuration for DeepSeek-OCR text extraction.

Both DeepSeek-OCR and DeepSeek-OCR-2 use
  • AutoModel (not AutoModelForCausalLM)
  • AutoTokenizer (not AutoProcessor)
  • model.infer(tokenizer, prompt=..., image_file=...) for inference

Requirements (from official README):
  • python==3.12.9, CUDA==11.8
  • torch==2.6.0, transformers==4.46.3, tokenizers==0.20.3
  • einops, addict, easydict
  • flash-attn==2.7.3 (optional, install with --no-build-isolation)


vllm

VLLM backend configuration for DeepSeek-OCR text extraction.

DeepSeek-OCR has official upstream VLLM support (announced Oct 23 2025). Achieves ~2500 tokens/s on A100-40G — the recommended backend for production.

DeepSeek-OCR-2 VLLM support: refer to https://github.com/deepseek-ai/DeepSeek-OCR-2 for the latest vLLM setup instructions (may require nightly build).
