Overview

Qwen3-VL backend configurations and detector for layout detection.

Available backends
  • QwenLayoutPyTorchConfig: PyTorch/HuggingFace backend
  • QwenLayoutVLLMConfig: VLLM high-throughput backend
  • QwenLayoutMLXConfig: MLX backend for Apple Silicon
  • QwenLayoutAPIConfig: API backend (OpenRouter, etc.)
Example
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig
config = QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
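
The same detector accepts any of these backend configs. A minimal sketch using the API backend instead (the import path for QwenLayoutAPIConfig is assumed to mirror the PyTorch config above):

from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutAPIConfig

# Swap in any other backend config here; the rest of the code stays the same
detector = QwenLayoutDetector(
    backend=QwenLayoutAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct"),
)
result = detector.extract("page.png")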

QwenLayoutAPIConfig

Bases: BaseModel

API backend configuration for Qwen layout detection.

Uses litellm for provider-agnostic API access. Supports OpenRouter, Gemini, Azure, OpenAI, and any other litellm-compatible provider.

API keys can be passed directly or read from environment variables.

Example
import os

# OpenRouter (reads OPENROUTER_API_KEY from env)
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
)

# With explicit key
config = QwenLayoutAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key=os.environ["OPENROUTER_API_KEY"],
    api_base="https://openrouter.ai/api/v1",
)
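
Because litellm routes on the model string, the same config can also point at any OpenAI-compatible endpoint, such as a self-hosted vLLM server. A sketch only; the endpoint URL and served model name below are assumptions:

# Self-hosted OpenAI-compatible server via litellm's "openai/" prefix
config = QwenLayoutAPIConfig(
    model="openai/Qwen/Qwen3-VL-8B-Instruct",
    api_key="EMPTY",  # placeholder key commonly accepted by local servers
    api_base="http://localhost:8000/v1",
)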

QwenLayoutDetector

QwenLayoutDetector(backend: QwenLayoutBackendConfig)

Bases: BaseLayoutExtractor

Qwen3-VL Vision-Language Model layout detector.

A flexible VLM-based layout detector that supports custom labels. Unlike fixed-label models (DocLayoutYOLO, RT-DETR), Qwen can detect any document elements specified at runtime.

Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector, CustomLabel
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Initialize with PyTorch backend
detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

# Basic extraction with default labels
result = detector.extract(image)

# With custom labels (strings)
result = detector.extract(image, custom_labels=["code_block", "sidebar"])

# With typed custom labels
labels = [
    CustomLabel(name="code_block", color="#E74C3C"),
    CustomLabel(name="sidebar", description="Side panel content"),
]
result = detector.extract(image, custom_labels=labels)

Initialize Qwen layout detector.

PARAMETER DESCRIPTION
backend

Backend configuration. One of:
  • QwenLayoutPyTorchConfig: PyTorch/HuggingFace backend
  • QwenLayoutVLLMConfig: VLLM high-throughput backend
  • QwenLayoutMLXConfig: MLX backend for Apple Silicon
  • QwenLayoutAPIConfig: API backend (OpenRouter, etc.)

TYPE: QwenLayoutBackendConfig

Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
def __init__(self, backend: QwenLayoutBackendConfig):
    """
    Initialize Qwen layout detector.

    Args:
        backend: Backend configuration. One of:
            - QwenLayoutPyTorchConfig: PyTorch/HuggingFace backend
            - QwenLayoutVLLMConfig: VLLM high-throughput backend
            - QwenLayoutMLXConfig: MLX backend for Apple Silicon
            - QwenLayoutAPIConfig: API backend (OpenRouter, etc.)
    """
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False

    # Backend-specific helpers
    self._process_vision_info: Any = None
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None

    # Load model
    self._load_model()

extract

extract(
    image: Union[Image, ndarray, str, Path],
    custom_labels: Optional[
        List[Union[str, CustomLabel]]
    ] = None,
) -> LayoutOutput

Run layout detection on an image.

PARAMETER DESCRIPTION
image

Input image as:
  • PIL.Image.Image: PIL image object
  • np.ndarray: Numpy array (HWC format, RGB)
  • str or Path: Path to image file

TYPE: Union[Image, ndarray, str, Path]

custom_labels

Optional custom labels to detect. Can be:
  • None: Use default labels (title, text, table, figure, etc.)
  • List[str]: Simple label names ["code_block", "sidebar"]
  • List[CustomLabel]: Typed labels with metadata

TYPE: Optional[List[Union[str, CustomLabel]]] DEFAULT: None

RETURNS DESCRIPTION
LayoutOutput

LayoutOutput with detected layout boxes

RAISES DESCRIPTION
RuntimeError

If model is not loaded

ValueError

If image format is not supported
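
A minimal sketch of consuming the returned LayoutOutput (field names follow the source listing below; the x2/y2 attributes of each bbox are assumed by analogy with x1/y1):

result = detector.extract("page.png")
print(result.model_name, result.image_width, result.image_height)
for box in result.bboxes:
    # Boxes are sorted in reading order (top-to-bottom, then left-to-right)
    print(box.bbox.x1, box.bbox.y1, box.bbox.x2, box.bbox.y2)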

Source code in omnidocs/tasks/layout_extraction/qwen/detector.py
def extract(
    self,
    image: Union[Image.Image, np.ndarray, str, Path],
    custom_labels: Optional[List[Union[str, CustomLabel]]] = None,
) -> LayoutOutput:
    """
    Run layout detection on an image.

    Args:
        image: Input image as:
            - PIL.Image.Image: PIL image object
            - np.ndarray: Numpy array (HWC format, RGB)
            - str or Path: Path to image file
        custom_labels: Optional custom labels to detect. Can be:
            - None: Use default labels (title, text, table, figure, etc.)
            - List[str]: Simple label names ["code_block", "sidebar"]
            - List[CustomLabel]: Typed labels with metadata

    Returns:
        LayoutOutput with detected layout boxes

    Raises:
        RuntimeError: If model is not loaded
        ValueError: If image format is not supported
    """
    if not self._loaded:
        raise RuntimeError("Model not loaded. Call _load_model() first.")

    # Prepare image
    pil_image = self._prepare_image(image)
    width, height = pil_image.size

    # Normalize labels
    label_names = self._normalize_labels(custom_labels)

    # Build prompt
    prompt = self._build_detection_prompt(label_names)

    # Run inference based on backend
    config_type = type(self.backend_config).__name__
    if config_type == "QwenLayoutPyTorchConfig":
        raw_output = self._infer_pytorch(pil_image, prompt)
    elif config_type == "QwenLayoutVLLMConfig":
        raw_output = self._infer_vllm(pil_image, prompt)
    elif config_type == "QwenLayoutMLXConfig":
        raw_output = self._infer_mlx(pil_image, prompt)
    elif config_type == "QwenLayoutAPIConfig":
        raw_output = self._infer_api(pil_image, prompt)
    else:
        raise RuntimeError(f"Unknown backend: {config_type}")

    # Parse detections
    detections = self._parse_json_output(raw_output)

    # Convert to LayoutOutput
    layout_boxes = self._build_layout_boxes(detections, width, height)

    # Sort by position (reading order)
    layout_boxes.sort(key=lambda b: (b.bbox.y1, b.bbox.x1))

    return LayoutOutput(
        bboxes=layout_boxes,
        image_width=width,
        image_height=height,
        model_name=f"Qwen3-VL ({type(self.backend_config).__name__})",
    )

QwenLayoutMLXConfig

Bases: BaseModel

MLX backend configuration for Qwen layout detection.

This backend uses MLX for native inference on Apple Silicon. Best for local development and testing on M1/M2/M3+ Macs. Requires: mlx, mlx-vlm

Note: This backend only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Example
config = QwenLayoutMLXConfig(
    model="mlx-community/Qwen3-VL-8B-Instruct-4bit",
)

QwenLayoutPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for Qwen layout detection.

This backend uses the transformers library with PyTorch for local GPU inference. Requires: torch, transformers, accelerate, qwen-vl-utils

Example
config = QwenLayoutPyTorchConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    torch_dtype="bfloat16",
)

QwenLayoutVLLMConfig

Bases: BaseModel

VLLM backend configuration for Qwen layout detection.

This backend uses VLLM for high-throughput inference. Best for batch processing and production deployments. Requires: vllm, torch, transformers, qwen-vl-utils

Example
config = QwenLayoutVLLMConfig(
    model="Qwen/Qwen3-VL-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)

api

API backend configuration for Qwen3-VL layout detection.

Uses litellm for provider-agnostic inference (OpenRouter, Gemini, Azure, etc.).

detector

Qwen3-VL layout detector.

A Vision-Language Model for flexible layout detection with custom label support. Supports PyTorch, VLLM, MLX, and API backends.

Example
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = detector.extract(image)

# With custom labels
result = detector.extract(image, custom_labels=["code_block", "sidebar"])

mlx

MLX backend configuration for Qwen3-VL layout detection.

pytorch

PyTorch/HuggingFace backend configuration for Qwen3-VL layout detection.

vllm

VLLM backend configuration for Qwen3-VL layout detection.
