Overview

VLM - Shared Vision-Language Model infrastructure.

Provides provider-agnostic VLM inference via litellm.

Example
from PIL import Image

from omnidocs.vlm import VLMAPIConfig, vlm_completion

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
image = Image.open("page.png")  # any PIL image; path is a placeholder
result = vlm_completion(config, "Extract text from this image", image)

VLMAPIConfig

Bases: BaseModel

Provider-agnostic VLM API configuration using litellm.

Supports any provider litellm supports: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. The model string uses litellm format with provider prefix (e.g., "gemini/gemini-2.5-flash").

API keys can be passed directly or read from environment variables (GOOGLE_API_KEY, OPENROUTER_API_KEY, AZURE_API_KEY, OPENAI_API_KEY, etc.).

Example
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# OpenRouter with explicit key
config = VLMAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key="sk-...",
)

# Azure OpenAI
config = VLMAPIConfig(
    model="azure/gpt-4o",
    api_base="https://my-deployment.openai.azure.com/",
)

vlm_completion

vlm_completion(
    config: VLMAPIConfig, prompt: str, image: Image
) -> str

Send image + prompt to any VLM via litellm. Returns raw text.

PARAMETERS

config (VLMAPIConfig): VLM API configuration.
prompt (str): Text prompt to send with the image.
image (Image): PIL Image to send.

RETURNS

str: Raw text response from the model.
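
Example

A minimal usage sketch, assuming GOOGLE_API_KEY is set in the environment; the image path and prompt are placeholders.

from PIL import Image

from omnidocs.vlm import VLMAPIConfig, vlm_completion

# Gemini config; litellm reads GOOGLE_API_KEY from the environment.
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# Any PIL image works; "invoice.png" is a placeholder path.
image = Image.open("invoice.png")

text = vlm_completion(config, "Transcribe all text in this image.", image)
print(text)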

Source code in omnidocs/vlm/client.py
def vlm_completion(config: VLMAPIConfig, prompt: str, image: Image.Image) -> str:
    """
    Send image + prompt to any VLM via litellm. Returns raw text.

    Args:
        config: VLM API configuration.
        prompt: Text prompt to send with the image.
        image: PIL Image to send.

    Returns:
        Raw text response from the model.
    """
    import litellm

    b64 = _encode_image(image)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]

    kwargs = _build_kwargs(config, messages)
    response = litellm.completion(**kwargs)
    return response.choices[0].message.content

vlm_structured_completion

vlm_structured_completion(
    config: VLMAPIConfig,
    prompt: str,
    image: Image,
    response_schema: type[BaseModel],
) -> BaseModel

Send image + prompt, get structured Pydantic output.

Tries two strategies:

1. litellm's native response_format (works with OpenAI, Gemini, etc.)
2. Fallback: prompt-based JSON extraction for providers that don't support response_format (OpenRouter, some open-source models)

PARAMETERS

config (VLMAPIConfig): VLM API configuration.
prompt (str): Text prompt to send with the image.
image (Image): PIL Image to send.
response_schema (type[BaseModel]): Pydantic model class for structured output.

RETURNS

BaseModel: Validated instance of response_schema.
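
Example

A short sketch of structured extraction. The Invoice schema and file name are hypothetical, and the function is imported from its source module omnidocs/vlm/client.py shown below.

from PIL import Image
from pydantic import BaseModel

from omnidocs.vlm import VLMAPIConfig
from omnidocs.vlm.client import vlm_structured_completion


class Invoice(BaseModel):
    # Hypothetical schema for illustration only.
    vendor: str
    total: float


config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
image = Image.open("invoice.png")  # placeholder path

invoice = vlm_structured_completion(
    config, "Extract the vendor name and total amount.", image, Invoice
)
print(invoice.vendor, invoice.total)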

Source code in omnidocs/vlm/client.py
def vlm_structured_completion(
    config: VLMAPIConfig,
    prompt: str,
    image: Image.Image,
    response_schema: type[BaseModel],
) -> BaseModel:
    """
    Send image + prompt, get structured Pydantic output.

    Tries two strategies:
    1. litellm's native response_format (works with OpenAI, Gemini, etc.)
    2. Fallback: prompt-based JSON extraction for providers that don't
       support response_format (OpenRouter, some open-source models)

    Args:
        config: VLM API configuration.
        prompt: Text prompt to send with the image.
        image: PIL Image to send.
        response_schema: Pydantic model class for structured output.

    Returns:
        Validated instance of response_schema.
    """
    import json

    import litellm

    b64 = _encode_image(image)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]

    # Strategy 1: Try response_format (native structured output)
    kwargs = _build_kwargs(config, messages)
    kwargs["response_format"] = response_schema
    try:
        response = litellm.completion(**kwargs)
        raw = response.choices[0].message.content
        return response_schema.model_validate_json(raw)
    except Exception:
        pass

    # Strategy 2: Fallback — prompt for JSON, parse manually
    schema_json = json.dumps(response_schema.model_json_schema(), indent=2)
    json_prompt = (
        f"{prompt}\n\n"
        f"Respond with ONLY valid JSON matching this schema (no markdown fencing, no extra text):\n"
        f"{schema_json}"
    )
    fallback_messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": json_prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]
    kwargs = _build_kwargs(config, fallback_messages)
    response = litellm.completion(**kwargs)
    raw = response.choices[0].message.content

    # Strip markdown fencing if present
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Remove first line (```json) and last line (```)
        lines = [line for line in lines if not line.strip().startswith("```")]
        text = "\n".join(lines)

    return response_schema.model_validate_json(text)

client

VLM completion utilities using litellm for provider-agnostic inference. Defines vlm_completion and vlm_structured_completion, documented above.

config

VLM API configuration for provider-agnostic VLM inference via litellm. Defines VLMAPIConfig, documented above.
