Vlm¶
VLM layout detector.
A provider-agnostic Vision-Language Model layout detector using litellm. Works with any cloud API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
result = detector.extract("document.png")
for box in result.bboxes:
print(f"{box.label.value}: {box.bbox}")
VLMLayoutDetector
¶
Bases: BaseLayoutExtractor
Provider-agnostic VLM layout detector using litellm.
Works with any cloud VLM API: Gemini, OpenRouter, Azure, OpenAI, Anthropic, etc. Supports custom labels for flexible detection.
Example
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.layout_extraction import VLMLayoutDetector
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
detector = VLMLayoutDetector(config=config)
# Default labels
result = detector.extract("document.png")
# Custom labels
result = detector.extract("document.png", custom_labels=["code_block", "sidebar"])
Initialize VLM layout detector.
| PARAMETER | DESCRIPTION |
|---|---|
config
|
VLM API configuration with model and provider details.
TYPE:
|
Source code in omnidocs/tasks/layout_extraction/vlm.py
extract
¶
extract(
image: Union[Image, ndarray, str, Path],
custom_labels: Optional[
List[Union[str, CustomLabel]]
] = None,
prompt: Optional[str] = None,
) -> LayoutOutput
Run layout detection on an image.
| PARAMETER | DESCRIPTION |
|---|---|
image
|
Input image (PIL Image, numpy array, or file path).
TYPE:
|
custom_labels
|
Optional custom labels to detect. Can be: - None: Use default labels (title, text, table, figure, etc.) - List[str]: Simple label names ["code_block", "sidebar"] - List[CustomLabel]: Typed labels with metadata
TYPE:
|
prompt
|
Custom prompt. If None, builds a default detection prompt.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
LayoutOutput
|
LayoutOutput with detected layout boxes. |