
VLM API (Any Cloud Provider)

Use any vision-language model through a single, provider-agnostic API. Powered by litellm, VLM API works with any provider that supports the OpenAI chat completions spec. This includes OpenRouter, ANANNAS AI, Google Gemini, Azure OpenAI, OpenAI, and any self-hosted VLLM server.


Why VLM API?

  • No GPU required -- use cloud models directly
  • Provider-agnostic -- switch between providers by changing one string
  • Custom prompts -- tailor extraction to your domain
  • Structured output -- extract data into Pydantic schemas
  • Works with any VLM -- Gemini, GPT, Qwen, Claude, Llama, Grok, and more
  • Any litellm-compatible or OpenAI-spec provider -- if it speaks the OpenAI API, it works

Quick Start

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Just set your env var: OPENROUTER_API_KEY, GOOGLE_API_KEY, etc.
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")

extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)

Supported Providers

OmniDocs works with any provider that is either:

  1. Natively supported by litellm (use the litellm model prefix)
  2. OpenAI API-compatible (use openai/ prefix + api_base)

| Provider | Model Format | Env Variable | Notes |
|---|---|---|---|
| OpenRouter | openrouter/org/model | OPENROUTER_API_KEY | 100+ vision models, pay-per-token |
| ANANNAS AI | openai/model-name | ANANNAS_API_KEY | OpenAI-compatible, wide model selection |
| Google Gemini | gemini/model-name | GOOGLE_API_KEY | Native litellm support |
| Azure OpenAI | azure/deployment-name | AZURE_API_KEY | Requires api_version |
| OpenAI | openai/model-name | OPENAI_API_KEY | Native litellm support |
| Self-hosted VLLM | openai/model-name | -- | Use api_base to point to your server |

Provider Setup Examples

OpenRouter

Access 100+ vision models through a single API key.

export OPENROUTER_API_KEY=sk-or-v1-...
from omnidocs.vlm import VLMAPIConfig

# Qwen models (great for document extraction)
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-32b-instruct")

# Google via OpenRouter
config = VLMAPIConfig(model="openrouter/google/gemini-2.5-flash-image")

# Anthropic via OpenRouter
config = VLMAPIConfig(model="openrouter/anthropic/claude-opus-4.6")

# OpenAI via OpenRouter
config = VLMAPIConfig(model="openrouter/openai/gpt-5.2")

Available vision models on OpenRouter

| Provider | Models |
|---|---|
| Qwen | qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-32b-instruct, qwen/qwen3-vl-30b-a3b-instruct, qwen/qwen3-vl-8b-thinking, qwen/qwen3-vl-30b-a3b-thinking |
| Google | google/gemini-3-flash-preview, google/gemini-3-pro-preview, google/gemini-2.5-flash-image |
| OpenAI | openai/gpt-5.2, openai/gpt-5.1, openai/gpt-5-image-mini |
| Anthropic | anthropic/claude-opus-4.6, anthropic/claude-opus-4.5, anthropic/claude-haiku-4.5 |
| Mistral | mistralai/mistral-large-3-2512, mistralai/ministral-3-14b-2512, mistralai/ministral-3-8b-2512, mistralai/ministral-3-3b-2512 |
| NVIDIA | nvidia/nemotron-nano-12b-2-vl |
| AllenAI | allenai/molmo2-8b |
| ByteDance | bytedance-seed/seed-1.6-flash, bytedance-seed/seed-1.6 |
| xAI | x-ai/grok-4-1-fast |
| Z.AI | z-ai/glm-4.6v |
| Amazon | amazon/nova-2-lite |

All model names above should be prefixed with openrouter/ in OmniDocs.

ANANNAS AI

OpenAI-compatible API with access to models from multiple providers.

export ANANNAS_API_KEY=...
export ANANNAS_BASE_URL=https://api.anannas.ai/v1  # or your ANANNAS endpoint
import os

from omnidocs.vlm import VLMAPIConfig

# ANANNAS uses OpenAI-compatible API, so use openai/ prefix + api_base
config = VLMAPIConfig(
    model="openai/qwen3-vl-8b-instruct",
    api_key=os.environ["ANANNAS_API_KEY"],
    api_base=os.environ["ANANNAS_BASE_URL"],
)

# Claude on ANANNAS
config = VLMAPIConfig(
    model="openai/claude-opus-4.6",
    api_key=os.environ["ANANNAS_API_KEY"],
    api_base=os.environ["ANANNAS_BASE_URL"],
)

# GPT-5 on ANANNAS
config = VLMAPIConfig(
    model="openai/gpt-5-mini",
    api_key=os.environ["ANANNAS_API_KEY"],
    api_base=os.environ["ANANNAS_BASE_URL"],
)

Available vision models on ANANNAS AI

| Provider | Models |
|---|---|
| Anthropic | claude-3-haiku, claude-haiku-4-5, claude-opus-4, claude-opus-4-1, claude-opus-4-6, claude-sonnet-4, claude-sonnet-4-5 |
| OpenAI | gpt-5.2, gpt-5.1, gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-pro, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini |
| OpenAI o-series | o1, o1-pro, o3, o3-pro, o4-mini |
| Google | gemini-2.5-flash-image, gemini-3-pro-image-preview, google-gemma-3-27b-it |
| Qwen | qwen-qwen3-vl-235b-a22b, qwen2.5-vl-72b-instruct |
| Meta | meta-llama4-maverick-17b-instruct-v1-0, meta-llama4-scout-17b-instruct-v1-0, meta-llama3-2-90b-instruct-v1-0 |
| Amazon | amazon-nova-lite-v1-0, amazon-nova-premier-v1-0, amazon-nova-pro-v1-0 |
| Mistral | mistral-voxtral-small-24b-2507 |
| xAI | grok-2-vision, grok-4-1-fast |
| Z.AI | glm-4.5v, glm-4.6v |
| MoonshotAI | kimi-k2-5 |

All model names above should be used with the openai/ prefix and api_base set to your ANANNAS endpoint.

Google Gemini (Direct)

export GOOGLE_API_KEY=...
from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
config = VLMAPIConfig(model="gemini/gemini-2.5-pro")

Azure OpenAI

export AZURE_API_KEY=...
export AZURE_API_BASE=https://your-resource.openai.azure.com/
from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(
    model="azure/gpt-5-mini",
    api_version="2024-12-01-preview",
)

OpenAI (Direct)

export OPENAI_API_KEY=sk-...
from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(model="openai/gpt-4o")
config = VLMAPIConfig(model="openai/gpt-5-mini")

Self-Hosted VLLM

Any VLLM server exposes an OpenAI-compatible endpoint. Use openai/ prefix with api_base:

from omnidocs.vlm import VLMAPIConfig

# Local VLLM server
config = VLMAPIConfig(
    model="openai/Qwen/Qwen3-VL-8B-Instruct",
    api_base="http://localhost:8000/v1",
    temperature=0.0,
)

# Modal-deployed VLLM server
config = VLMAPIConfig(
    model="openai/mineru-vl",
    api_base="https://your-app--server-serve.modal.run/v1",
    temperature=0.0,
)
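
Before pointing OmniDocs at a self-hosted server, you can confirm it is reachable and see which model IDs it serves. A minimal sketch using the GET /v1/models listing that OpenAI-compatible servers (including VLLM) expose; the URL matches the local example above.

import requests

# List the models served by the OpenAI-compatible endpoint.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # use one of these IDs after the openai/ prefix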

Any OpenAI-Compatible Provider

Any API that follows the OpenAI chat completions spec works. Use openai/ prefix and set api_base:

from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(
    model="openai/model-name",
    api_key="your-api-key",
    api_base="https://your-provider.com/v1",
)

VLMAPIConfig

from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(
    model="gemini/gemini-2.5-flash",  # Required: litellm model string
    api_key=None,          # Optional: auto-reads from env
    api_base=None,         # Optional: override endpoint URL
    max_tokens=8192,       # Max tokens to generate
    temperature=0.1,       # Sampling temperature
    timeout=180,           # Request timeout (seconds)
    api_version=None,      # Required for Azure
    extra_headers=None,    # Additional HTTP headers
)
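
If your provider expects additional HTTP headers (for example, OpenRouter's optional attribution headers), pass them via extra_headers. A minimal sketch; the header values are placeholders.

from omnidocs.vlm import VLMAPIConfig

config = VLMAPIConfig(
    model="openrouter/qwen/qwen3-vl-8b-instruct",
    temperature=0.0,
    extra_headers={
        # Optional OpenRouter attribution headers; values are placeholders.
        "HTTP-Referer": "https://your-app.example",
        "X-Title": "Your App",
    },
)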

Model String Format

The model parameter follows litellm conventions:

| Pattern | Example | When to Use |
|---|---|---|
| provider/model | gemini/gemini-2.5-flash | Litellm-native providers |
| openrouter/org/model | openrouter/qwen/qwen3-vl-32b-instruct | OpenRouter |
| azure/deployment | azure/gpt-5-mini | Azure OpenAI |
| openai/model + api_base | openai/qwen3-vl-8b-instruct | OpenAI-compatible APIs (ANANNAS, VLLM, etc.) |

Tasks

Text Extraction

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
extractor = VLMTextExtractor(config=config)

# Default prompt
result = extractor.extract("document.png", output_format="markdown")

# Custom prompt
result = extractor.extract(
    "document.png",
    prompt="Extract only the table data as a markdown table",
)
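
The extract call above works one image at a time, so batch processing can be a simple loop. A minimal sketch that converts every page image in a hypothetical pages/ directory to markdown, using only the extract call shown above.

from pathlib import Path

for page in sorted(Path("pages").glob("*.png")):
    result = extractor.extract(str(page), output_format="markdown")
    page.with_suffix(".md").write_text(result.content)  # one .md file per page image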

Layout Detection

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)

# Default labels
result = detector.extract("document.png")
for elem in result.elements:
    print(f"{elem.label}: {elem.bbox}")

# Custom labels
result = detector.extract(
    "document.png",
    custom_labels=["code_block", "sidebar", "diagram"],
)
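
A common follow-up is cropping each detected region for downstream processing. A minimal sketch with Pillow, assuming elem.bbox is an (x1, y1, x2, y2) box in pixel coordinates matching the page image; adjust if your version returns normalized coordinates.

from pathlib import Path
from PIL import Image

page = Image.open("document.png")
Path("crops").mkdir(exist_ok=True)

for i, elem in enumerate(result.elements):
    # Assumption: bbox is (x1, y1, x2, y2) in pixels.
    crop = page.crop(tuple(elem.bbox))
    crop.save(f"crops/{i:03d}_{elem.label}.png")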

Structured Extraction

Extract structured data into Pydantic schemas. The extractor returns validated, typed objects.

from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[str]
    date: str

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)

result = extractor.extract(
    image="invoice.png",
    schema=Invoice,
    prompt="Extract invoice details from this document.",
)

# result.data is a validated Invoice instance
print(result.data.vendor)
print(result.data.total)
print(result.data.items)
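
Pydantic schemas can also nest. A sketch with an invoice whose line items are themselves models; DetailedInvoice and LineItem are illustrative names, and deeply nested schemas may need one of the more capable models noted under Troubleshooting.

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class DetailedInvoice(BaseModel):
    vendor: str
    date: str
    items: list[LineItem]
    total: float

result = extractor.extract(
    image="invoice.png",
    schema=DetailedInvoice,
    prompt="Extract the vendor, date, line items, and total from this invoice.",
)
for item in result.data.items:
    print(item.description, item.quantity, item.unit_price)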

Switching Providers

One of the key benefits is being able to swap providers without changing your extraction code:

import os
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

configs = {
    # Native litellm providers
    "gemini": VLMAPIConfig(model="gemini/gemini-2.5-flash"),
    "openrouter": VLMAPIConfig(model="openrouter/qwen/qwen3-vl-32b-instruct"),
    "azure": VLMAPIConfig(model="azure/gpt-5-mini", api_version="2024-12-01-preview"),

    # OpenAI-compatible providers
    "anannas": VLMAPIConfig(
        model="openai/claude-opus-4.6",
        api_key=os.environ.get("ANANNAS_API_KEY"),
        api_base=os.environ.get("ANANNAS_BASE_URL"),
    ),
    "vllm": VLMAPIConfig(
        model="openai/mineru-vl",
        api_base="https://my-server.modal.run/v1",
    ),
}

for name, config in configs.items():
    extractor = VLMTextExtractor(config=config)
    result = extractor.extract("document.png")
    print(f"{name}: {len(result.content)} chars")

| Use Case | Recommended Model | Provider |
|---|---|---|
| General text extraction | qwen/qwen3-vl-32b-instruct | OpenRouter |
| Fast + cheap extraction | qwen/qwen3-vl-8b-instruct | OpenRouter |
| Best quality | gemini/gemini-2.5-pro | Google |
| Structured output | gemini/gemini-2.5-flash | Google |
| Layout detection | qwen/qwen3-vl-32b-instruct | OpenRouter |
| Self-hosted | Any Qwen3-VL or MinerU VL | VLLM |

Troubleshooting

Authentication error

Set the correct environment variable for your provider:

export GOOGLE_API_KEY=...       # Gemini
export OPENROUTER_API_KEY=...   # OpenRouter
export AZURE_API_KEY=...        # Azure
export OPENAI_API_KEY=...       # OpenAI
export ANANNAS_API_KEY=...      # ANANNAS AI

Azure errors

Azure requires api_version:

config = VLMAPIConfig(
    model="azure/gpt-5-mini",
    api_version="2024-12-01-preview",
)

OpenAI-compatible provider not working

Make sure you use openai/ prefix and set api_base:

config = VLMAPIConfig(
    model="openai/model-name",      # Must have openai/ prefix
    api_base="https://provider.com/v1",  # Must set api_base
    api_key="your-key",
)

Structured output fails

Some providers don't support native JSON schema output. The extractor automatically falls back to prompt-based extraction. If results are still poor, try a more capable model (Gemini 2.5 Flash/Pro work well).

Using a provider not listed here

OmniDocs works with any provider that is either litellm-supported or follows the OpenAI API spec. Check the litellm providers list for native support, or use openai/ prefix with api_base for any OpenAI-compatible endpoint.