Overview
OmniDocs Utilities.
Provides utility functions for result aggregation, visualization, export, and cache management.
BatchResult
Container for results from processing multiple documents.
Examples:
batch_result = BatchResult()
batch_result.add_document_result("doc1", doc_result1)
batch_result.add_document_result("doc2", doc_result2)
# Access results
doc1_result = batch_result.get_document_result("doc1")
all_ids = batch_result.document_ids
# Save all results
batch_result.save_json("all_results.json")
Initialize empty BatchResult.
Source code in omnidocs/utils/aggregation.py
add_document_result
Add result for a document.

| PARAMETER | DESCRIPTION |
|---|---|
| doc_id | Document identifier (usually filename without extension) |
| result | DocumentResult instance |

Source code in omnidocs/utils/aggregation.py
get_document_result
Get result for a specific document.

| PARAMETER | DESCRIPTION |
|---|---|
| doc_id | Document identifier |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[DocumentResult] | DocumentResult, or None if not found |

Source code in omnidocs/utils/aggregation.py
to_dict
Convert to dictionary.

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dictionary representation |

Source code in omnidocs/utils/aggregation.py
save_json
Save all results to JSON file.

| PARAMETER | DESCRIPTION |
|---|---|
| path | Output file path |

Source code in omnidocs/utils/aggregation.py
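The container behaviour described above (results keyed by document ID, None for unknown IDs, JSON export) can be sketched with a small stand-in class. This is an illustration under stated assumptions, not the OmniDocs implementation: BatchSketch is a hypothetical name, per-document results are plain dicts, and the layout returned by its to_dict() is assumed.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


class BatchSketch:
    """Hypothetical stand-in for a BatchResult-like container (not OmniDocs code)."""

    def __init__(self) -> None:
        # Maps document id -> per-document result (a plain dict in this sketch).
        self._documents: Dict[str, dict] = {}

    def add_document_result(self, doc_id: str, result: dict) -> None:
        self._documents[doc_id] = result

    def get_document_result(self, doc_id: str) -> Optional[dict]:
        # dict.get returns None for ids that were never added.
        return self._documents.get(doc_id)

    @property
    def document_ids(self) -> List[str]:
        return list(self._documents)

    def to_dict(self) -> dict:
        # Assumed layout; the real to_dict() output may differ.
        return {"documents": dict(self._documents)}

    def save_json(self, path: str) -> None:
        Path(path).write_text(json.dumps(self.to_dict(), indent=2), encoding="utf-8")


batch = BatchSketch()
batch.add_document_result("doc1", {"pages": 3})
batch.add_document_result("doc2", {"pages": 7})

missing = batch.get_document_result("doc3")  # id was never added
ids = batch.document_ids                     # insertion order in this sketch

# Round-trip through a temporary file to show the export step.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "all_results.json"
    batch.save_json(str(out_path))
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Returning None instead of raising lets callers check for a missing document with a plain `is None` test.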
DocumentResult
Container for results from processing a single document.
Stores results by page for easy access and serialization.
Examples:
doc_result = DocumentResult(source_path="paper.pdf", page_count=10)
doc_result.add_page_result(0, text_output)
doc_result.add_page_result(1, text_output)
# Access results
all_results = doc_result.all_results
page_0_result = doc_result.get_page_result(0)
# Save to file
doc_result.save_json("paper_result.json")
Initialize DocumentResult.

| PARAMETER | DESCRIPTION |
|---|---|
| source_path | Path to source document |
| page_count | Total number of pages |

Source code in omnidocs/utils/aggregation.py
all_results (property)
Get all results in page order.

| RETURNS | DESCRIPTION |
|---|---|
| List[Any] | List of results sorted by page number |

add_page_result
Add result for a specific page.

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Page number (0-indexed) |
| result | Extraction result (TextOutput, LayoutOutput, etc.) |

Source code in omnidocs/utils/aggregation.py
get_page_result
Get result for a specific page.

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Page number (0-indexed) |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Any] | Result for the page, or None if not found |

Source code in omnidocs/utils/aggregation.py
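The per-page storage described above (0-indexed pages, results returned in page order, None for pages without a result) can be sketched as follows. PageResultsSketch is a hypothetical stand-in written for illustration, and the plain strings stand in for real extraction outputs such as TextOutput.

```python
from typing import Any, Dict, List, Optional


class PageResultsSketch:
    """Hypothetical stand-in for DocumentResult's per-page storage (not OmniDocs code)."""

    def __init__(self, source_path: str, page_count: int) -> None:
        self.source_path = source_path
        self.page_count = page_count
        self._pages: Dict[int, Any] = {}  # 0-indexed page number -> result

    def add_page_result(self, page_num: int, result: Any) -> None:
        self._pages[page_num] = result

    def get_page_result(self, page_num: int) -> Optional[Any]:
        return self._pages.get(page_num)

    @property
    def all_results(self) -> List[Any]:
        # Sort by page number so insertion order does not matter.
        return [self._pages[n] for n in sorted(self._pages)]


doc = PageResultsSketch(source_path="paper.pdf", page_count=3)
doc.add_page_result(2, "third page")  # added out of order on purpose
doc.add_page_result(0, "first page")

ordered = doc.all_results        # page 0 first, then page 2
absent = doc.get_page_result(1)  # page 1 has no result yet
```

Keying by page number rather than appending to a list means pages processed out of order, for example by parallel workers, still come back sorted.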
to_dict
Convert to dictionary for serialization.

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dictionary representation |

Source code in omnidocs/utils/aggregation.py
save_json
Save results to JSON file.

| PARAMETER | DESCRIPTION |
|---|---|
| path | Output file path |

Source code in omnidocs/utils/aggregation.py
merge_text_results
Merge multiple TextOutput results into a single string.

| PARAMETER | DESCRIPTION |
|---|---|
| results | List of TextOutput (or objects with a .content attribute) |
| separator | String used to join pages (default: double newline) |

| RETURNS | DESCRIPTION |
|---|---|
| str | Combined content string |

Examples:
all_results = doc_result.all_results
full_text = merge_text_results(all_results)
full_text_with_dividers = merge_text_results(all_results, separator="\n\n---\n\n")
Source code in omnidocs/utils/aggregation.py
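To make the joining behaviour concrete, here is a minimal sketch that mirrors the documented contract: any object with a .content attribute, joined by a separator that defaults to a double newline. PageText and merge_text_sketch are hypothetical names used only for this illustration; the library function may handle edge cases differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PageText:
    """Hypothetical stand-in for TextOutput; only the .content attribute matters."""

    content: str


def merge_text_sketch(results: Iterable[PageText], separator: str = "\n\n") -> str:
    # Join each page's content with the separator (double newline by default).
    return separator.join(r.content for r in results)


pages = [PageText("Introduction"), PageText("Methods")]
plain = merge_text_sketch(pages)
divided = merge_text_sketch(pages, separator="\n\n---\n\n")
```

A visible divider such as `---` keeps page boundaries recoverable after merging.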
configure_backend_cache
Configure cache directories for all backends.
When OMNIDOCS_MODELS_DIR is set (or cache_dir is passed), this overwrites HF_HOME and TRANSFORMERS_CACHE so that every backend downloads to the same place.
This is called automatically when omnidocs is imported.

| PARAMETER | DESCRIPTION |
|---|---|
| cache_dir | Optional cache directory path. If None, uses get_model_cache_dir(). |

Source code in omnidocs/utils/cache.py
get_model_cache_dir
Get unified model cache directory.
Priority order:
1. custom_dir parameter (if provided)
2. OMNIDOCS_MODELS_DIR environment variable
3. HF_HOME environment variable
4. Default: ~/.cache/huggingface

| PARAMETER | DESCRIPTION |
|---|---|
| custom_dir | Optional custom cache directory path. Overrides environment variables if provided. |

| RETURNS | DESCRIPTION |
|---|---|
| Path | Path object pointing to the cache directory. The directory is created if it doesn't exist. |

Source code in omnidocs/utils/cache.py
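The priority order above can be sketched as a small resolver. This is an illustration under stated assumptions, not the library code: resolve_cache_dir_sketch and its env parameter are hypothetical (env exists only so the lookup can be exercised without touching the real environment), and unlike the documented function this sketch does not create the directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir_sketch(
    custom_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Illustrative resolver; `env` is a hypothetical parameter added for testing."""
    env = os.environ if env is None else env
    if custom_dir:                                  # 1. explicit argument wins
        return Path(custom_dir).expanduser()
    if env.get("OMNIDOCS_MODELS_DIR"):              # 2. OmniDocs-specific variable
        return Path(env["OMNIDOCS_MODELS_DIR"]).expanduser()
    if env.get("HF_HOME"):                          # 3. HuggingFace fallback
        return Path(env["HF_HOME"]).expanduser()
    return Path.home() / ".cache" / "huggingface"   # 4. default location


both = {"OMNIDOCS_MODELS_DIR": "/data/models", "HF_HOME": "/data/hf"}
from_arg = resolve_cache_dir_sketch("/tmp/custom", env=both)
from_omnidocs = resolve_cache_dir_sketch(env=both)
from_hf = resolve_cache_dir_sketch(env={"HF_HOME": "/data/hf"})
from_default = resolve_cache_dir_sketch(env={})
```

Each case falls through to the next source only when the higher-priority one is unset or empty.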
get_storage_info
Get current cache directory configuration information.

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dictionary with cache paths and environment variable values. |

Source code in omnidocs/utils/cache.py
aggregation
Result aggregation utilities for batch processing.
Provides containers and utilities for storing, aggregating, and exporting results from batch document processing.
cache
Unified model cache directory management for OmniDocs.
When OMNIDOCS_MODELS_DIR is set, ALL model downloads (PyTorch, VLLM, MLX, snapshot_download) go into that directory. It overwrites HF_HOME so every backend respects the same path.
Environment Variables
OMNIDOCS_MODELS_DIR: Primary cache directory for all OmniDocs models. Overwrites HF_HOME when set.
HF_HOME: HuggingFace cache directory (used as fallback).
Example
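A minimal sketch of the override behaviour described above, written against a plain dict rather than os.environ so it has no side effects. The function name apply_cache_override_sketch is hypothetical, and the assumption that TRANSFORMERS_CACHE receives exactly the same path as HF_HOME is ours; the library may use a different layout.

```python
from typing import Dict, Optional


def apply_cache_override_sketch(
    env: Dict[str, str], cache_dir: Optional[str] = None
) -> Dict[str, str]:
    """Return a copy of `env` with the HuggingFace cache variables redirected.

    Illustrative only: operates on a plain dict instead of os.environ.
    """
    updated = dict(env)
    target = cache_dir or env.get("OMNIDOCS_MODELS_DIR")
    if target:
        # Existing values are overwritten, not merged.
        updated["HF_HOME"] = target
        updated["TRANSFORMERS_CACHE"] = target  # assumed to receive the same path
    return updated


before = {
    "OMNIDOCS_MODELS_DIR": "/data/models",
    "HF_HOME": "/home/user/.cache/huggingface",
}
after = apply_cache_override_sketch(before)

# Without OMNIDOCS_MODELS_DIR or an explicit cache_dir, nothing changes.
untouched = apply_cache_override_sketch({"HF_HOME": "/data/hf"})
```

In practice the variable would be set before omnidocs is imported, because the configuration step is described as running at import time.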