Extractor

Qwen3-VL text extractor.

A Vision-Language Model based extractor that converts document images into structured HTML or Markdown.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(image, output_format="markdown")
print(result.content)

QwenTextExtractor

QwenTextExtractor(backend: QwenTextBackendConfig)

Bases: BaseTextExtractor

Qwen3-VL Vision-Language Model text extractor.

Extracts text from document images and outputs it as structured HTML or Markdown. Uses Qwen3-VL's built-in document parsing prompts.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Initialize with PyTorch backend
extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Extract as Markdown
result = extractor.extract(image, output_format="markdown")
print(result.content)

# Extract as HTML
result = extractor.extract(image, output_format="html")
print(result.content)

Initialize Qwen text extractor.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:

- QwenTextPyTorchConfig: PyTorch/HuggingFace backend
- QwenTextVLLMConfig: VLLM high-throughput backend
- QwenTextMLXConfig: MLX backend for Apple Silicon
- QwenTextAPIConfig: API backend (OpenRouter, etc.)

TYPE: QwenTextBackendConfig
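
The backend is chosen by the type of the config object passed in. The sketch below only relies on the model field already shown in the examples above; the fields of the other config classes are not assumed here.

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# The config class determines the runtime: swapping in QwenTextVLLMConfig,
# QwenTextMLXConfig, or QwenTextAPIConfig (each with its own fields) changes
# the backend without changing any extract() calls.
backend = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
extractor = QwenTextExtractor(backend=backend)

Note that the model is loaded eagerly at construction time via _load_model(), as shown in the source below.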

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def __init__(self, backend: QwenTextBackendConfig):
    """
    Initialize Qwen text extractor.

    Args:
        backend: Backend configuration. One of:
            - QwenTextPyTorchConfig: PyTorch/HuggingFace backend
            - QwenTextVLLMConfig: VLLM high-throughput backend
            - QwenTextMLXConfig: MLX backend for Apple Silicon
            - QwenTextAPIConfig: API backend (OpenRouter, etc.)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput

Extract text from an image.

PARAMETER DESCRIPTION
image

Input image as:

- PIL.Image.Image: PIL image object
- np.ndarray: Numpy array (HWC format, RGB)
- str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

output_format

Desired output format:

- "html": Structured HTML with div elements
- "markdown": Markdown format

TYPE: Literal['html', 'markdown'] DEFAULT: 'markdown'

RETURNS DESCRIPTION
TextOutput

TextOutput containing extracted text content

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format or output_format is not supported
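
A minimal sketch of the three accepted image inputs described above, assuming an extractor constructed as in the earlier example; the file name is a placeholder.

from pathlib import Path

import numpy as np
from PIL import Image

# Any of the three input forms listed above is accepted; "page_01.png" is a placeholder.
result_from_path = extractor.extract(Path("page_01.png"), output_format="html")
result_from_pil = extractor.extract(Image.open("page_01.png"))  # output_format defaults to "markdown"

rgb_array = np.asarray(Image.open("page_01.png").convert("RGB"))  # HWC, RGB
result_from_array = extractor.extract(rgb_array)

# Anything other than "html" or "markdown" raises ValueError, per the Raises section above.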

Source code in omnidocs/tasks/text_extraction/qwen/extractor.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    output_format: Literal["html", "markdown"] = "markdown",
) -> TextOutput:
    """
    Extract text from an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        output_format: Desired output format:
            - "html": Structured HTML with div elements
            - "markdown": Markdown format

    Returns:
        TextOutput containing extracted text content

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format or output_format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    if output_format not in ("html", "markdown"):
        raise ValueError(f"Invalid output_format: {output_format}. Expected 'html' or 'markdown'.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Get prompt for output format
    prompt = QWEN_PROMPTS[output_format]

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "QwenTextPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image, prompt)
    elif config_type == "QwenTextVLLMConfig":
        raw_output = self._infer_vllm(pil_image, prompt)
    elif config_type == "QwenTextMLXConfig":
        raw_output = self._infer_mlx(pil_image, prompt)
    elif config_type == "QwenTextAPIConfig":
        raw_output = self._infer_api(pil_image, prompt)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Clean output
    if output_format == "html":
        cleaned_output = _clean_html_output(raw_output)
    else:
        cleaned_output = _clean_markdown_output(raw_output)

    # Extract plain text
    plain_text = _extract_plain_text(raw_output, output_format)

    return TextOutput(
        content=cleaned_output,
        format=OutputFormat(output_format),
        raw_output=raw_output,
        plain_text=plain_text,
        image_width=width,
        image_height=height,
        model_name=f"Qwen3-VL ({type(self.backend_config).__name__})",
    )
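
As a brief usage note, the fields on the returned TextOutput correspond to the construction at the end of extract() above; the file name is a placeholder.

result = extractor.extract("invoice.png", output_format="html")

# Fields populated by extract(), per the TextOutput construction above
print(result.content)                           # cleaned HTML (or Markdown)
print(result.plain_text)                        # markup stripped to plain text
print(result.format)                            # OutputFormat("html")
print(result.image_width, result.image_height)  # size of the prepared input image
print(result.model_name)                        # e.g. "Qwen3-VL (QwenTextPyTorchConfig)"
print(result.raw_output)                        # unprocessed model output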