Overview¶
VLM - Shared Vision-Language Model infrastructure.
Provides provider-agnostic VLM inference via litellm.
Example
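A minimal sketch of the typical flow, assuming `VLMAPIConfig` and `vlm_completion` are importable from `omnidocs.vlm` (otherwise import them from `omnidocs.vlm.config` and `omnidocs.vlm.client`, documented below) and that `GOOGLE_API_KEY` is set in the environment; the model name, image path, and prompt are illustrative:

```python
from PIL import Image

from omnidocs.vlm import VLMAPIConfig, vlm_completion

# litellm-style "provider/model" string; GOOGLE_API_KEY is read from the env.
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# Send a page image plus a text prompt and get the raw text response back.
page = Image.open("page.png")
text = vlm_completion(config, prompt="Transcribe this page.", image=page)
print(text)
```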
VLMAPIConfig¶
Bases: BaseModel
Provider-agnostic VLM API configuration using litellm.
Supports any provider litellm supports: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. The model string uses litellm format with provider prefix (e.g., "gemini/gemini-2.5-flash").
API keys can be passed directly or read from environment variables (GOOGLE_API_KEY, OPENROUTER_API_KEY, AZURE_API_KEY, OPENAI_API_KEY, etc.).
Example
```python
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# OpenRouter with explicit key
config = VLMAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key="sk-...",
)

# Azure OpenAI
config = VLMAPIConfig(
    model="azure/gpt-4o",
    api_base="https://my-deployment.openai.azure.com/",
)
```
vlm_completion¶
Send image + prompt to any VLM via litellm. Returns raw text.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration. TYPE: `VLMAPIConfig` |
| `prompt` | Text prompt to send with the image. TYPE: `str` |
| `image` | PIL Image to send. TYPE: `Image` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Raw text response from the model. |
Source code in omnidocs/vlm/client.py
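A hedged usage sketch, assuming the parameter order matches the table above (`config`, `prompt`, `image`); the OpenRouter model name and image path are illustrative:

```python
from PIL import Image

from omnidocs.vlm.client import vlm_completion
from omnidocs.vlm.config import VLMAPIConfig

# Explicit key; alternatively rely on OPENROUTER_API_KEY in the environment.
config = VLMAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key="sk-...",
)

figure = Image.open("figure.png")
caption = vlm_completion(
    config,
    prompt="Describe this figure in one sentence.",
    image=figure,
)
print(caption)
```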
vlm_structured_completion¶
```python
vlm_structured_completion(
    config: VLMAPIConfig,
    prompt: str,
    image: Image,
    response_schema: type[BaseModel],
) -> BaseModel
```
Send image + prompt, get structured Pydantic output.
Tries two strategies:

1. litellm's native `response_format` (works with OpenAI, Gemini, etc.)
2. Fallback: prompt-based JSON extraction for providers that don't support `response_format` (OpenRouter, some open-source models)
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration. TYPE: `VLMAPIConfig` |
| `prompt` | Text prompt to send with the image. TYPE: `str` |
| `image` | PIL Image to send. TYPE: `Image` |
| `response_schema` | Pydantic model class for structured output. TYPE: `type[BaseModel]` |

| RETURNS | DESCRIPTION |
|---|---|
| `BaseModel` | Validated instance of `response_schema`. |
Source code in omnidocs/vlm/client.py
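A sketch of structured extraction, assuming the module import paths shown; the `PageSummary` schema is a hypothetical example defined here for illustration:

```python
from PIL import Image
from pydantic import BaseModel

from omnidocs.vlm.client import vlm_structured_completion
from omnidocs.vlm.config import VLMAPIConfig


class PageSummary(BaseModel):
    """Hypothetical schema describing the fields we want back."""

    title: str
    num_tables: int


config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
page = Image.open("page.png")

# Returns a validated PageSummary instance; litellm's native response_format
# is used where supported, otherwise the prompt-based JSON fallback kicks in.
summary = vlm_structured_completion(
    config,
    prompt="Summarize this page as structured data.",
    image=page,
    response_schema=PageSummary,
)
print(summary.title, summary.num_tables)
```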
client¶
VLM completion utilities using litellm for provider-agnostic inference.
vlm_completion¶
Send image + prompt to any VLM via litellm. Returns raw text.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration. TYPE: `VLMAPIConfig` |
| `prompt` | Text prompt to send with the image. TYPE: `str` |
| `image` | PIL Image to send. TYPE: `Image` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Raw text response from the model. |
Source code in omnidocs/vlm/client.py
vlm_structured_completion¶
```python
vlm_structured_completion(
    config: VLMAPIConfig,
    prompt: str,
    image: Image,
    response_schema: type[BaseModel],
) -> BaseModel
```
Send image + prompt, get structured Pydantic output.
Tries two strategies:

1. litellm's native `response_format` (works with OpenAI, Gemini, etc.)
2. Fallback: prompt-based JSON extraction for providers that don't support `response_format` (OpenRouter, some open-source models)
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | VLM API configuration. TYPE: `VLMAPIConfig` |
| `prompt` | Text prompt to send with the image. TYPE: `str` |
| `image` | PIL Image to send. TYPE: `Image` |
| `response_schema` | Pydantic model class for structured output. TYPE: `type[BaseModel]` |

| RETURNS | DESCRIPTION |
|---|---|
| `BaseModel` | Validated instance of `response_schema`. |
Source code in omnidocs/vlm/client.py
config¶
VLM API configuration for provider-agnostic VLM inference via litellm.
VLMAPIConfig¶
Bases: BaseModel
Provider-agnostic VLM API configuration using litellm.
Supports any provider litellm supports: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. The model string uses litellm format with provider prefix (e.g., "gemini/gemini-2.5-flash").
API keys can be passed directly or read from environment variables (GOOGLE_API_KEY, OPENROUTER_API_KEY, AZURE_API_KEY, OPENAI_API_KEY, etc.).
Example
```python
# Gemini (reads GOOGLE_API_KEY from env)
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# OpenRouter with explicit key
config = VLMAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    api_key="sk-...",
)

# Azure OpenAI
config = VLMAPIConfig(
    model="azure/gpt-4o",
    api_base="https://my-deployment.openai.azure.com/",
)
```