Extractor

DeepSeek-OCR / DeepSeek-OCR-2 text extractor.

  • DeepSeek-OCR (Oct 2025, arXiv:2510.18234) — v1, MIT, 3B params
  • DeepSeek-OCR-2 (Jan 2026, arXiv:2601.20552) — v2, Apache 2.0, 3B params, "Visual Causal Flow"

Both models share the same inference interface:
  • AutoModel + AutoTokenizer (NOT AutoModelForCausalLM / AutoProcessor)
  • model.infer(tokenizer, prompt, image_file, output_path, ...) for PyTorch
  • Grounding prompt format: "\n<|grounding|>Convert the document to markdown."

Supported backends: PyTorch, VLLM (official upstream support), MLX, API.
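As a sketch of that shared interface, assuming the upstream Hugging Face model id `deepseek-ai/DeepSeek-OCR` (v1) and a CUDA GPU. The `model.infer` call and the grounding prompt are the ones noted above; the function wrapper and model id choice are illustrative:

```python
from typing import Any

# Grounding prompt format noted above.
GROUNDING_PROMPT = "\n<|grounding|>Convert the document to markdown."

def run_deepseek_ocr(image_file: str, output_path: str) -> Any:
    # Heavy import kept inside the function; requires `transformers`.
    from transformers import AutoModel, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-OCR"  # v1; v2 is deepseek-ai/DeepSeek-OCR-2
    # AutoModel + AutoTokenizer -- NOT AutoModelForCausalLM / AutoProcessor.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
    # Results are written under output_path; the return value is passed through as-is.
    return model.infer(
        tokenizer,
        prompt=GROUNDING_PROMPT,
        image_file=image_file,
        output_path=output_path,
    )
```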

GitHub

  • v1: https://github.com/deepseek-ai/DeepSeek-OCR
  • v2: https://github.com/deepseek-ai/DeepSeek-OCR-2

Example
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import DeepSeekOCRTextVLLMConfig

extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()  # DeepSeek-OCR-2, VLLM
)
result = extractor.extract(image)
print(result.content)

DeepSeekOCRTextExtractor

DeepSeekOCRTextExtractor(
    backend: DeepSeekOCRTextBackendConfig,
)

Bases: BaseTextExtractor

DeepSeek-OCR / DeepSeek-OCR-2 text extractor.

High-accuracy OCR model that reads complex real-world documents (PDFs, forms, tables, handwritten/noisy text) and outputs clean Markdown. Uses a hybrid vision encoder + causal text decoder — output is structured by the model itself rather than post-processed from bounding boxes.

DeepSeek-OCR-2 ("Visual Causal Flow", released Jan 2026) is the default.

Supports PyTorch, VLLM (recommended), MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import DeepSeekOCRTextExtractor
from omnidocs.tasks.text_extraction.deepseek import (
    DeepSeekOCRTextPyTorchConfig,
    DeepSeekOCRTextVLLMConfig,
)

# VLLM — ~2500 tokens/s on A100 (recommended for production)
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextVLLMConfig()
)
result = extractor.extract(image)
print(result.content)

# PyTorch with crop_mode for dense pages
extractor = DeepSeekOCRTextExtractor(
    backend=DeepSeekOCRTextPyTorchConfig(crop_mode=True)
)

Initialize DeepSeek-OCR extractor.

PARAMETER DESCRIPTION
backend

Backend config. One of:
  • DeepSeekOCRTextPyTorchConfig (local GPU)
  • DeepSeekOCRTextVLLMConfig (recommended, high-throughput)
  • DeepSeekOCRTextMLXConfig (Apple Silicon)
  • DeepSeekOCRTextAPIConfig (Novita AI)

TYPE: DeepSeekOCRTextBackendConfig
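For illustration only, a hypothetical helper that picks one of the config class names by platform. The selection logic is an assumption, not part of omnidocs; real code would instantiate the chosen class and pass it as backend:

```python
import platform

def pick_backend_config_name() -> str:
    """Return the name of a plausible default backend config for this machine."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "DeepSeekOCRTextMLXConfig"  # Apple Silicon
    # VLLM is the recommended high-throughput backend elsewhere;
    # machines without a GPU could fall back to DeepSeekOCRTextAPIConfig.
    return "DeepSeekOCRTextVLLMConfig"
```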

Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
def __init__(self, backend: DeepSeekOCRTextBackendConfig):
    """
    Initialize DeepSeek-OCR extractor.

    Args:
        backend: Backend config. One of:
            - DeepSeekOCRTextPyTorchConfig (local GPU)
            - DeepSeekOCRTextVLLMConfig (recommended, high-throughput)
            - DeepSeekOCRTextMLXConfig (Apple Silicon)
            - DeepSeekOCRTextAPIConfig (Novita AI)
    """
    self.backend_config = backend
    self._backend: Any = None  # model
    self._processor: Any = None  # tokenizer
    self._loaded = False
    self._device: str = "cpu"

    # VLLM helpers
    self._sampling_params_class: Any = None

    # MLX helpers
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    self._load_model()
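The constructor seeds every backend handle with None and only flips _loaded once _load_model succeeds; extract later refuses to run until that flag is set. A stripped-down sketch of that guard pattern, with stand-in classes rather than the real extractor:

```python
class GuardedExtractor:
    def __init__(self) -> None:
        self._backend = None      # model handle, filled in by _load_model
        self._loaded = False
        self._load_model()

    def _load_model(self) -> None:
        self._backend = object()  # stand-in for the real model load
        self._loaded = True       # set only after loading succeeds

    def extract(self, image: str) -> str:
        if not self._loaded:
            raise RuntimeError("Model not loaded.")
        return f"extracted:{image}"
```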

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from a document image.

DeepSeek-OCR always outputs Markdown-structured text. The output_format parameter is accepted for API compatibility only and does not change the output.

PARAMETER DESCRIPTION
image

Input image (PIL Image, numpy array, or file path)

TYPE: Union[Image, ndarray, str, Path]

output_format

Accepted for API compatibility (default: "markdown")

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput with extracted Markdown content

Source code in omnidocs/tasks/text_extraction/deepseek/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from a document image.

    DeepSeek-OCR always outputs Markdown-structured text.
    The output_format parameter is accepted for API compatibility.

    Args:
        image: Input image (PIL Image, numpy array, or file path)
        output_format: Accepted for API compatibility (default: "markdown")

    Returns:
        TextOutput with extracted Markdown content
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded.")

    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    config_type = type(self.backend_config).__name__
    dispatch = {
        "DeepSeekOCRTextPyTorchConfig": self._infer_pytorch,
        "DeepSeekOCRTextVLLMConfig": self._infer_vllm,
        "DeepSeekOCRTextMLXConfig": self._infer_mlx,
        "DeepSeekOCRTextAPIConfig": self._infer_api,
    }
    raw_output = dispatch[config_type](pil_image)
    cleaned = raw_output.strip()

    return TextOutput(
        content=cleaned,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=cleaned,
        image_width=width,
        image_height=height,
        model_name=f"DeepSeek-OCR ({self.backend_config.model}, {config_type})",
    )
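extract routes to a backend-specific method keyed on the config's class name. The same dispatch-by-type-name pattern in isolation, with stand-in classes in place of the real configs:

```python
class PyTorchConfig:
    pass

class VLLMConfig:
    pass

class MiniExtractor:
    def __init__(self, backend) -> None:
        self.backend_config = backend

    def _infer_pytorch(self, image: str) -> str:
        return f"pytorch:{image}"

    def _infer_vllm(self, image: str) -> str:
        return f"vllm:{image}"

    def extract(self, image: str) -> str:
        # Keyed on the class name, exactly like the dispatch table above.
        dispatch = {
            "PyTorchConfig": self._infer_pytorch,
            "VLLMConfig": self._infer_vllm,
        }
        return dispatch[type(self.backend_config).__name__](image)
```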