Overview¶
Qwen3-VL backend configurations and extractor for text extraction.
Available backends
- QwenTextPyTorchConfig: PyTorch/HuggingFace backend
- QwenTextVLLMConfig: VLLM high-throughput backend
- QwenTextMLXConfig: MLX backend for Apple Silicon
- QwenTextAPIConfig: API backend (OpenRouter, etc.)
Example
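A minimal end-to-end sketch using the PyTorch backend, following the extractor documentation below (the image path is illustrative):

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Build an extractor with the PyTorch/HuggingFace backend
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

result = extractor.extract("invoice.png", output_format="markdown")  # illustrative path
print(result.content)
```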
QwenTextAPIConfig
¶
Bases: BaseModel
API backend configuration for Qwen text extraction.
Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.
API keys can be passed directly or read from environment variables.
Example
```python
import os

# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenTextAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
```
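The resulting config is passed to the extractor as its backend; a minimal sketch reusing the `config` built above (the image path is illustrative):

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor

# Any backend config, including QwenTextAPIConfig, is accepted as `backend`
extractor = QwenTextExtractor(backend=config)
result = extractor.extract("scan.png", output_format="markdown")  # illustrative path
print(result.content)
```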
QwenTextExtractor
¶
Bases: BaseTextExtractor
Qwen3-VL Vision-Language Model text extractor.
Extracts text from document images and returns it as structured HTML or Markdown, using Qwen3-VL's built-in document parsing prompts.
Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

image = "page.png"  # illustrative path; a PIL.Image or numpy array also works

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)
```
Initialize Qwen text extractor.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | Backend configuration. One of: `QwenTextPyTorchConfig` (PyTorch/HuggingFace backend), `QwenTextVLLMConfig` (VLLM high-throughput backend), `QwenTextMLXConfig` (MLX backend for Apple Silicon), or `QwenTextAPIConfig` (API backend, e.g. OpenRouter). |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
extract
¶
```python
extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput
```
Extract text from an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as a `PIL.Image.Image` (PIL image object), an `np.ndarray` (numpy array, HWC format, RGB), or a `str`/`Path` (path to an image file). TYPE: `Union[Image, ndarray, str, Path]` |
| `output_format` | Desired output format: `"html"` (structured HTML with div elements) or `"markdown"` (Markdown format). TYPE: `Literal["html", "markdown"]` DEFAULT: `"markdown"` |

| RETURNS | DESCRIPTION |
|---|---|
| `TextOutput` | `TextOutput` containing the extracted text content. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If the model is not loaded. |
| `ValueError` | If the image format or `output_format` is not supported. |
Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
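A short sketch of the accepted input types and the documented error cases, reusing the `extractor` from the example above (file names are illustrative):

```python
import numpy as np
from PIL import Image

# A file path (str or Path) works directly
result = extractor.extract("page_001.png", output_format="markdown")
print(result.content)

# A PIL image or an HWC RGB numpy array is also accepted
pil_page = Image.open("page_002.png").convert("RGB")
array_page = np.array(pil_page)

try:
    result = extractor.extract(array_page, output_format="html")
except RuntimeError as err:
    print(f"model not loaded: {err}")
except ValueError as err:
    print(f"unsupported image or output_format: {err}")
else:
    print(result.content)
```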
QwenTextMLXConfig
¶
Bases: BaseModel
MLX backend configuration for Qwen text extraction.
This backend uses MLX for Apple Silicon native inference. Best for local development and testing on macOS M1/M2/M3+. Requires: mlx, mlx-vlm
Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.
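Example
A minimal sketch, assuming `QwenTextMLXConfig` is importable from the same qwen package as the other configs and exposes a `model` field like them (other fields and their defaults are not documented here; the image path is illustrative):

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig  # assumed import path

# Assumption: `model` is the field name, as with the other backend configs
extractor = QwenTextExtractor(
    backend=QwenTextMLXConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract("receipt.png", output_format="markdown")  # illustrative path
print(result.content)
```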
QwenTextPyTorchConfig
¶
Bases: BaseModel
PyTorch/HuggingFace backend configuration for Qwen text extraction.
This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils
Example
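A minimal sketch, mirroring the extractor example above; fields beyond `model` (such as device or precision options) are not documented here:

```python
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# `model` takes a HuggingFace model id, as in the extractor example
config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
```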
QwenTextVLLMConfig
¶
Bases: BaseModel
VLLM backend configuration for Qwen text extraction.
This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils
Example
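A minimal sketch, assuming the config exposes a `model` field like the other backends (other fields, such as vLLM engine options, are not documented here; file names are illustrative):

```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

# Assumption: `model` is the field name, as with the other backend configs
extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Batch-style usage: call extract() per page (a dedicated batch API is not documented here)
for page in ["page_001.png", "page_002.png"]:
    print(extractor.extract(page, output_format="markdown").content)
```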
api
¶
API backend configuration for Qwen3-VL text extraction.
Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).
extractor
¶
Qwen3-VL text extractor.
A Vision-Language Model for extracting text from document images as structured HTML or Markdown.
Supports PyTorch, VLLM, MLX, and API backends.
Example
```python
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)
```
mlx
¶
MLX backend configuration for Qwen3-VL text extraction.
pytorch
¶
PyTorch/HuggingFace backend configuration for Qwen3-VL text extraction.
vllm
¶
VLLM backend configuration for Qwen3-VL text extraction.