
Model Cache

OmniDocs automatically shares loaded models across extractors using a unified cache. This avoids loading the same multi-GB model twice when you use it for different tasks.


Why Cache?

Vision-language models like Qwen3-VL and MinerU VL support both text extraction and layout detection. Without caching, each extractor loads its own copy of the model, doubling GPU memory usage and load time.

| Scenario            | Without cache           | With cache                |
|----------------------|-------------------------|---------------------------|
| Qwen text + layout   | 2 models loaded (~16GB) | 1 model shared (~8GB)     |
| Load time            | ~60s total              | ~30s (second is instant)  |

Basic Usage

The cache works automatically. Just create extractors normally:

from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Loads model (~30s)
text_extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(device="cuda")
)

# Reuses cached model (instant)
layout_detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(device="cuda")
)

# Both work independently
text_result = text_extractor.extract(image, output_format="markdown")
layout_result = layout_detector.extract(image)

How Sharing Works

The cache normalizes config class names to detect when two extractors use the same model:

QwenTextPyTorchConfig   → Qwen:PyTorchConfig  (model_family:backend)
QwenLayoutPyTorchConfig → Qwen:PyTorchConfig  (same key = shared)

Task markers (Text, Layout, OCR, Table, ReadingOrder, Formula) are stripped from config names. If the remaining model family, backend type, and loading parameters match, the model is shared.
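
The sketch below illustrates this normalization in plain Python. It is not the library's internal code: the task-marker list comes from the paragraph above, while the backend suffixes (PyTorchConfig, VLLMConfig, APIConfig) are assumed from config names used elsewhere in these docs. The real cache key also folds in the loading parameters listed further down.

# Illustrative sketch of the normalization idea, not OmniDocs internals
TASK_MARKERS = ("Text", "Layout", "OCR", "Table", "ReadingOrder", "Formula")
BACKEND_SUFFIXES = ("PyTorchConfig", "VLLMConfig", "APIConfig")  # assumed naming

def normalize(config_cls_name: str) -> str:
    # Strip the task marker, then split the model family from the backend suffix
    name = config_cls_name
    for marker in TASK_MARKERS:
        name = name.replace(marker, "", 1)
    for suffix in BACKEND_SUFFIXES:
        if name.endswith(suffix):
            return f"{name[:-len(suffix)]}:{suffix}"
    return name

normalize("QwenTextPyTorchConfig")    # 'Qwen:PyTorchConfig'
normalize("QwenLayoutPyTorchConfig")  # 'Qwen:PyTorchConfig' (same key = shared)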

Runtime Parameters Are Excluded

Parameters that only affect inference are excluded from the cache key:

  • max_tokens / max_new_tokens
  • temperature
  • do_sample
  • timeout / max_retries

This means a text extractor with max_tokens=8192 and a layout detector with max_tokens=4096 still share the same model, since max_tokens is a generation parameter, not a model loading parameter.
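
For example, the following two extractors end up sharing a single cached model even though their generation settings differ. This is a minimal sketch that assumes max_tokens is accepted as a config constructor parameter, as the paragraph above implies:

from omnidocs import list_cached_keys
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
from omnidocs.tasks.layout_extraction import QwenLayoutDetector
from omnidocs.tasks.layout_extraction.qwen import QwenLayoutPyTorchConfig

# Different generation settings, identical loading parameters
text_extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(device="cuda", max_tokens=8192)
)
layout_detector = QwenLayoutDetector(
    backend=QwenLayoutPyTorchConfig(device="cuda", max_tokens=4096)
)

print(len(list_cached_keys()))  # 1 - max_tokens is not part of the cache key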

Parameters That Affect the Cache Key

Parameters that change how the model is loaded produce different cache keys:

  • model (model name/path)
  • device / device_map
  • torch_dtype
  • gpu_memory_utilization
  • max_model_len
  • tensor_parallel_size
  • enforce_eager
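
Conversely, changing any of these parameters produces a separate cache entry. A short sketch using the same config class as above (device="cpu" is used purely for illustration):

from omnidocs import list_cached_keys
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

# Same model family and backend, but different device values,
# so each load gets its own cache entry (and its own copy in memory)
gpu_extractor = QwenTextExtractor(backend=QwenTextPyTorchConfig(device="cuda"))
cpu_extractor = QwenTextExtractor(backend=QwenTextPyTorchConfig(device="cpu"))

print(len(list_cached_keys()))  # 2 - the keys differ on device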

Cache Management

Inspecting the Cache

from omnidocs import get_cache_info, list_cached_keys

# List all cached model keys
keys = list_cached_keys()
print(f"Cached models: {len(keys)}")
for key in keys:
    print(f"  {key}")

# Detailed info with reference counts
info = get_cache_info()
for key, entry in info["entries"].items():
    print(f"  {key}: refs={entry['ref_count']}, accesses={entry['access_count']}")

Clearing the Cache

from omnidocs import clear_cache, remove_cached

# Clear everything
clear_cache()

# Remove a specific entry
remove_cached("Qwen:PyTorchConfig:device=cuda:model=Qwen/Qwen3-VL-8B-Instruct:...")

Setting Cache Size

from omnidocs import set_cache_config

# Allow up to 5 models (default: 10)
set_cache_config(max_entries=5)

# Unlimited cache (watch memory usage)
set_cache_config(max_entries=0)

LRU Eviction

When the cache is full, the least recently used model is evicted. Models with active references (extractors still using them) are evicted last.

set_cache_config(max_entries=3)

# Load 3 models - cache full
ext1 = QwenTextExtractor(backend=QwenTextPyTorchConfig())      # slot 1
ext2 = MinerUVLTextExtractor(backend=MinerUVLTextPyTorchConfig())  # slot 2
ext3 = NanonetsTextExtractor(backend=NanonetsTextPyTorchConfig())  # slot 3

# Loading a 4th evicts the least recently used
del ext1  # Qwen now has ref_count=0 and is eviction candidate
ext4 = GraniteDoclingTextExtractor(backend=...)  # Qwen gets evicted

Reference Counting

Each extractor registers as a reference to its cached model. When an extractor is deleted (garbage collected), its reference is removed. Models with zero references are eligible for LRU eviction but are not immediately removed.

# ref_count = 1
text_ext = QwenTextExtractor(backend=QwenTextPyTorchConfig(device="cuda"))

# ref_count = 2 (same underlying model, shared)
layout_det = QwenLayoutDetector(backend=QwenLayoutPyTorchConfig(device="cuda"))

del text_ext   # ref_count = 1
del layout_det # ref_count = 0 (eligible for eviction)

API Backends

API backends (e.g., QwenTextAPIConfig) are not cached because they don't load a local model. They just create a lightweight HTTP client.
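
A quick way to see this is shown below. The QwenTextAPIConfig constructor arguments and import path are illustrative assumptions, not the documented signature:

from omnidocs import list_cached_keys
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig  # assumed import path

# API backends only wrap an HTTP client, so nothing is added to the model cache
api_extractor = QwenTextExtractor(
    backend=QwenTextAPIConfig(base_url="http://localhost:8000/v1", api_key="...")
)
print(list_cached_keys())  # [] - no local model was loaded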


Supported Models

| Model           | Type           | What's cached                                  |
|-----------------|----------------|------------------------------------------------|
| Qwen3-VL        | Multi-backend  | (model, processor)                             |
| MinerU VL       | Multi-backend  | (client, layout_size)                          |
| Nanonets OCR2   | Multi-backend  | (model, processor)                             |
| Granite Docling | Multi-backend  | (model, processor)                             |
| DotsOCR         | Multi-backend  | PyTorch: (model, processor), VLLM: (backend,)  |
| RT-DETR         | Single-backend | (model, processor)                             |
| DocLayout-YOLO  | Single-backend | (model,)                                       |
| PaddleOCR       | Single-backend | (ocr_engine,)                                  |
| EasyOCR         | Single-backend | (reader,)                                      |
| TableFormer     | Single-backend | (predictor, config)                            |
| TesseractOCR    | Single-backend | Not cached (module import only)                |