
Overview

GLM-OCR backend configurations and extractor for text extraction.

GLM-OCR from zai-org (Feb 2026) is a 0.9B-parameter OCR-specialist model. Architecture: CogViT visual encoder (0.4B) + GLM decoder (0.5B). It scores #1 on OmniDocBench V1.5 (94.62), beating models 10x its size.

Unlike GLM-V (a general-purpose VLM), GLM-OCR is purpose-built for document OCR. It loads with AutoModelForImageTextToText + AutoProcessor (NOT Glm4vForConditionalGeneration) and requires transformers>=5.3.0.

Available backends
  • GLMOCRPyTorchConfig: PyTorch/HuggingFace backend
  • GLMOCRVLLMConfig: VLLM high-throughput backend (with MTP speculative decoding)
  • GLMOCRMLXConfig: MLX backend for Apple Silicon
  • GLMOCRAPIConfig: API backend
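Which backend to pick usually follows from the hardware. A minimal sketch of that decision, assuming the config classes documented on this page (the `pick_glmocr_backend` rule itself is illustrative, not part of omnidocs):

```python
# Illustrative rule of thumb for choosing a GLM-OCR backend config.
# The mapping below is an assumption, not an omnidocs API.
def pick_glmocr_backend(has_cuda: bool, is_apple_silicon: bool) -> str:
    """Return the name of the backend config class to instantiate."""
    if has_cuda:
        return "GLMOCRVLLMConfig"       # highest throughput on NVIDIA GPUs
    if is_apple_silicon:
        return "GLMOCRMLXConfig"        # native Apple Silicon inference
    return "GLMOCRPyTorchConfig"        # portable CPU/GPU fallback
```

The API backend is the remaining option when no local hardware is suitable.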

HuggingFace: zai-org/GLM-OCR License: Apache 2.0

GLMOCRAPIConfig

Bases: BaseModel

API backend configuration for GLM-OCR.

Primary provider: ZhipuAI / BigModel (official) — get key at open.bigmodel.cn.
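The `/v1` api_base in the example below suggests an OpenAI-compatible endpoint. A hedged sketch of the chat-completions payload such a server typically accepts (field names follow the OpenAI protocol; the exact request shape your deployment accepts may differ):

```python
import json

# Hedged: generic OpenAI-style multimodal payload, not an omnidocs API.
payload = {
    "model": "zai-org/GLM-OCR",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},
                {"type": "text", "text": "Extract all text from this page."},
            ],
        }
    ],
}
body = json.dumps(payload)  # POST this to {api_base}/chat/completions
```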

Example:

```python
# Self-hosted vLLM server
config = GLMOCRAPIConfig(
    model="zai-org/GLM-OCR",
    api_base="http://localhost:8000/v1",
    api_key="token-abc",
)
```

GLMOCRTextExtractor

GLMOCRTextExtractor(backend: GLMOCRBackendConfig)

Bases: BaseTextExtractor

GLM-OCR text extractor (zai-org/GLM-OCR, 0.9B, Feb 2026).

Purpose-built OCR model, #1 on OmniDocBench V1.5.
Faster and cheaper than GLM-V for pure document OCR tasks.

Example:

```python
from omnidocs.tasks.text_extraction import GLMOCRTextExtractor
from omnidocs.tasks.text_extraction.glmocr import GLMOCRPyTorchConfig

extractor = GLMOCRTextExtractor(backend=GLMOCRPyTorchConfig())
result = extractor.extract(image)
print(result.content)
```

Source code in omnidocs/tasks/text_extraction/glmocr/extractor.py
```python
def __init__(self, backend: GLMOCRBackendConfig):
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None
    self._load_model()
```

GLMOCRMLXConfig

Bases: BaseModel

MLX backend configuration for GLM-OCR.

Uses mlx-vlm for Apple Silicon native inference.
GLM-OCR at 0.9B runs comfortably on any M-series Mac with 8GB+ unified memory.
Requires: mlx, mlx-vlm>=0.3.11

Note: Only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Available models:
    mlx-community/GLM-OCR-bf16   (default — full precision, 2.21 GB)
    mlx-community/GLM-OCR-6bit   (quantized, smaller)
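Since the MLX backend is Apple Silicon only, a runtime guard can keep cloud or Linux deployments from selecting it. A minimal stdlib sketch (the `can_use_mlx` helper is illustrative, not part of omnidocs):

```python
import platform

def can_use_mlx() -> bool:
    # MLX requires arm64 macOS; check before choosing GLMOCRMLXConfig
    # so other environments fall back to PyTorch or VLLM backends.
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```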

Example:

```python
config = GLMOCRMLXConfig()                                    # bf16, default
config = GLMOCRMLXConfig(model="mlx-community/GLM-OCR-6bit")  # quantized
```

GLMOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for GLM-OCR.

GLM-OCR uses AutoModelForImageTextToText + AutoProcessor.
Requires transformers>=5.3.0.

Example:

```python
config = GLMOCRPyTorchConfig()              # zai-org/GLM-OCR, default
config = GLMOCRPyTorchConfig(device="mps")  # Apple Silicon
```

GLMOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for GLM-OCR.

GLM-OCR supports VLLM with MTP (Multi-Token Prediction) speculative decoding
for significantly higher throughput. Requires vllm>=0.17.0 and transformers>=5.3.0.
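A self-hosted server for the API backend can be launched with vLLM's standard CLI; the flags below are vLLM's generic serve options (enabling MTP speculative decoding requires additional speculative-decoding configuration not shown here):

```shell
# Serve GLM-OCR behind an OpenAI-compatible endpoint on port 8000
vllm serve zai-org/GLM-OCR --gpu-memory-utilization 0.85 --port 8000
```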

Example:

```python
config = GLMOCRVLLMConfig(gpu_memory_utilization=0.85)
```

api

API backend configuration for GLM-OCR text extraction.

GLMOCRAPIConfig

Bases: BaseModel

API backend configuration for GLM-OCR.

Primary provider: ZhipuAI / BigModel (official) — get key at open.bigmodel.cn.

Example:

```python
# Self-hosted vLLM server
config = GLMOCRAPIConfig(
    model="zai-org/GLM-OCR",
    api_base="http://localhost:8000/v1",
    api_key="token-abc",
)
```

extractor

GLM-OCR text extractor.

GLM-OCR from zai-org (Feb 2026) is a 0.9B-parameter OCR-specialist model. Architecture: CogViT visual encoder (0.4B) + GLM decoder (0.5B). It scores #1 on OmniDocBench V1.5 (94.62).

Key differences from GLM-V
  • Uses AutoModelForImageTextToText (NOT Glm4vForConditionalGeneration)
  • Uses AutoProcessor with direct image input (no chat template URL trick)
  • Much smaller (0.9B vs 9B) — faster, lower VRAM
  • Requires transformers>=5.3.0
  • No thinking tokens, no <|begin_of_box|> markers — clean output
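The "direct image input" difference shows up in the message structure handed to the processor. A hedged sketch of the chat-style message list typically used with Hugging Face image-text-to-text processors (the exact keys GLM-OCR's AutoProcessor accepts may differ):

```python
# Hedged: follows the generic HF multimodal chat convention; this is an
# illustration, not GLM-OCR's documented input format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page.png"},  # path or PIL image
            {"type": "text", "text": "Extract the text from this document."},
        ],
    }
]
# processor.apply_chat_template(messages, ...) would then produce the
# model inputs — no URL trick needed, the image is passed directly.
```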

GLMOCRTextExtractor

GLMOCRTextExtractor(backend: GLMOCRBackendConfig)

Bases: BaseTextExtractor

GLM-OCR text extractor (zai-org/GLM-OCR, 0.9B, Feb 2026).

Purpose-built OCR model, #1 on OmniDocBench V1.5.
Faster and cheaper than GLM-V for pure document OCR tasks.

Example:

```python
from omnidocs.tasks.text_extraction import GLMOCRTextExtractor
from omnidocs.tasks.text_extraction.glmocr import GLMOCRPyTorchConfig

extractor = GLMOCRTextExtractor(backend=GLMOCRPyTorchConfig())
result = extractor.extract(image)
print(result.content)
```

Source code in omnidocs/tasks/text_extraction/glmocr/extractor.py
```python
def __init__(self, backend: GLMOCRBackendConfig):
    self.backend_config = backend
    self._backend: Any = None
    self._processor: Any = None
    self._loaded = False
    self._sampling_params_class: Any = None
    self._mlx_config: Any = None
    self._apply_chat_template: Any = None
    self._generate: Any = None
    self._load_model()
```

mlx

MLX backend configuration for GLM-OCR text extraction.

GLMOCRMLXConfig

Bases: BaseModel

MLX backend configuration for GLM-OCR.

Uses mlx-vlm for Apple Silicon native inference.
GLM-OCR at 0.9B runs comfortably on any M-series Mac with 8GB+ unified memory.
Requires: mlx, mlx-vlm>=0.3.11

Note: Only works on Apple Silicon Macs. Do NOT use for Modal/cloud deployments.

Available models:
    mlx-community/GLM-OCR-bf16   (default — full precision, 2.21 GB)
    mlx-community/GLM-OCR-6bit   (quantized, smaller)

Example:

```python
config = GLMOCRMLXConfig()                                    # bf16, default
config = GLMOCRMLXConfig(model="mlx-community/GLM-OCR-6bit")  # quantized
```

pytorch

PyTorch backend configuration for GLM-OCR text extraction.

GLMOCRPyTorchConfig

Bases: BaseModel

PyTorch/HuggingFace backend configuration for GLM-OCR.

GLM-OCR uses AutoModelForImageTextToText + AutoProcessor.
Requires transformers>=5.3.0.

Example:

```python
config = GLMOCRPyTorchConfig()              # zai-org/GLM-OCR, default
config = GLMOCRPyTorchConfig(device="mps")  # Apple Silicon
```

vllm

VLLM backend configuration for GLM-OCR text extraction.

GLMOCRVLLMConfig

Bases: BaseModel

VLLM backend configuration for GLM-OCR.

GLM-OCR supports VLLM with MTP (Multi-Token Prediction) speculative decoding
for significantly higher throughput. Requires vllm>=0.17.0 and transformers>=5.3.0.

Example:

```python
config = GLMOCRVLLMConfig(gpu_memory_utilization=0.85)
```