Batch
OmniDocs Batch Processing Utilities.
Provides utilities for processing multiple documents efficiently:
- DocumentBatch: Load and iterate over multiple PDFs
- process_directory: Convenience function for batch processing
- process_document: Process all pages of a single document
DocumentBatch
Batch document loader for processing multiple PDFs.
Features:
- Lazy loading (documents loaded on iteration)
- Memory efficient (processes one document at a time)
- Glob pattern support
- Progress callbacks
Examples:
# Load from directory
batch = DocumentBatch.from_directory("pdfs/")

# Load from list
batch = DocumentBatch.from_paths(["doc1.pdf", "doc2.pdf"])

# Iterate
for doc in batch:
    for page in doc.iter_pages():
        result = extractor.extract(page)
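The lazy-loading behaviour listed under Features can be pictured with a small stand-in. The sketch below is not the OmniDocs implementation: `LazyBatch` and `load_document` are hypothetical names, used only to illustrate a batch that stores paths up front and creates each document when iteration reaches it.

```python
from typing import Callable, Iterator, List


class LazyBatch:
    """Hypothetical stand-in for a lazily loaded batch (not the OmniDocs class)."""

    def __init__(self, paths: List[str], loader: Callable[[str], object]):
        self.paths = list(paths)   # only the paths are held in memory
        self.loader = loader       # called once per document, on demand

    def __len__(self) -> int:
        return len(self.paths)

    def __iter__(self) -> Iterator[object]:
        for path in self.paths:
            # The document is created here, when iteration reaches it.
            yield self.loader(path)


loaded: List[str] = []


def load_document(path: str) -> str:
    """Hypothetical loader that records when it is called."""
    loaded.append(path)
    return f"<document {path}>"


batch = LazyBatch(["doc1.pdf", "doc2.pdf"], load_document)
loaded_before_iteration = list(loaded)   # snapshot taken before any iteration
first = next(iter(batch))                # triggers a single load
```

Because `__iter__` is a generator, stopping the loop early means the remaining documents are never loaded.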
Initialize DocumentBatch.
| PARAMETER | DESCRIPTION |
|---|---|
| paths | List of PDF file paths |
| dpi | Resolution for page rendering (default: 150) |
| page_range | Optional (start, end) tuple for page range (applied to all docs) |
Source code in omnidocs/batch.py
from_directory
classmethod
from_directory(
    directory: str,
    pattern: str = "*.pdf",
    recursive: bool = False,
    dpi: int = 150,
    page_range: Optional[tuple] = None,
) -> DocumentBatch
Load all PDFs from a directory.
| PARAMETER | DESCRIPTION |
|---|---|
| directory | Path to directory. TYPE: str |
| pattern | Glob pattern (default: "*.pdf"). TYPE: str |
| recursive | Search subdirectories (default: False). TYPE: bool |
| dpi | Resolution for rendering (default: 150). TYPE: int |
| page_range | Optional page range for all documents (default: None). TYPE: Optional[tuple] |
| RETURNS | DESCRIPTION |
|---|---|
| DocumentBatch | DocumentBatch instance |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If directory doesn't exist |
Examples:
batch = DocumentBatch.from_directory("pdfs/")
batch = DocumentBatch.from_directory("docs/", pattern="*.pdf", recursive=True)
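How `pattern` and `recursive` interact can be sketched with the standard library. The helper below is an assumption about how such a lookup could work, not the code in omnidocs/batch.py. `find_files` is a hypothetical name, and the sorted ordering is a choice made for this example only.

```python
import tempfile
from pathlib import Path
from typing import List


def find_files(directory: str, pattern: str = "*.pdf", recursive: bool = False) -> List[Path]:
    """Hypothetical helper: collect files matching a glob pattern."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    # Sort for a stable processing order and keep regular files only.
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").touch()
    (root / "notes.txt").touch()
    (root / "sub").mkdir()
    (root / "sub" / "b.pdf").touch()

    top_level = [p.name for p in find_files(tmp)]
    everything = [p.name for p in find_files(tmp, recursive=True)]
```

With `recursive=False` only files directly inside the directory match. With `recursive=True` the same pattern is also applied in every subdirectory.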
from_paths
classmethod
from_paths(
    paths: List[str],
    dpi: int = 150,
    page_range: Optional[tuple] = None,
) -> DocumentBatch
Load documents from an explicit list of paths.
| PARAMETER | DESCRIPTION |
|---|---|
| paths | List of PDF paths. TYPE: List[str] |
| dpi | Resolution for rendering (default: 150). TYPE: int |
| page_range | Optional page range for all documents (default: None). TYPE: Optional[tuple] |
| RETURNS | DESCRIPTION |
|---|---|
| DocumentBatch | DocumentBatch instance |
Examples:
batch = DocumentBatch.from_paths(["doc1.pdf", "doc2.pdf"])
iter_with_progress
Iterate with progress callback.
| PARAMETER | DESCRIPTION |
|---|---|
| callback | Function(current, total, filename) called for each document |
| YIELDS | DESCRIPTION |
|---|---|
| Document | Document instances |
Examples:
def progress(current, total, filename):
    print(f"[{current}/{total}] {filename}")

for doc in batch.iter_with_progress(progress):
    pass  # Process document...
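The callback contract can be tried out without any PDFs. In the sketch below, `iter_with_progress` is a hypothetical free function standing in for the method, and two details are assumptions rather than documented behaviour: the counter is 1-based, and the callback fires before each item is yielded.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_with_progress(
    paths: Iterable[str],
    callback: Callable[[int, int, str], None],
) -> Iterator[str]:
    """Hypothetical stand-in: report progress, then yield each item."""
    items = list(paths)
    total = len(items)
    for current, path in enumerate(items, start=1):
        callback(current, total, path)   # (current, total, filename)
        yield path


events: List[Tuple[int, int, str]] = []


def record(current: int, total: int, filename: str) -> None:
    """Collect progress events instead of printing them."""
    events.append((current, total, filename))


processed = list(iter_with_progress(["a.pdf", "b.pdf"], record))
```

Recording events in a list, as `record` does, is a convenient way to check a callback in tests before wiring it to a progress bar or logger.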
iter_all_pages
Iterate over all pages from all documents.
Memory efficient - loads one document at a time.
| YIELDS | DESCRIPTION |
|---|---|
| tuple | Tuples of (doc_index, page_index, page_image, doc_path) |
Examples:
for doc_idx, page_idx, page_img, doc_path in batch.iter_all_pages():
    result = extractor.extract(page_img)
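Since the pages arrive as one flat stream, results often need regrouping by document. The sketch below shows one way to do that. `fake_iter_all_pages` is a hypothetical generator that mimics the documented tuple shape with placeholder strings, and the "extraction" is a stand-in for a real extractor call.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def fake_iter_all_pages() -> Iterator[Tuple[int, int, str, str]]:
    """Hypothetical stand-in yielding (doc_index, page_index, page_image, doc_path)."""
    yield 0, 0, "img-a0", "a.pdf"
    yield 0, 1, "img-a1", "a.pdf"
    yield 1, 0, "img-b0", "b.pdf"


pages_by_document: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
for doc_idx, page_idx, page_img, doc_path in fake_iter_all_pages():
    # Placeholder for a real call such as extractor.extract(page_img).
    result = f"extracted:{page_img}"
    pages_by_document[doc_path].append((page_idx, result))
```

Keying on `doc_path` keeps each document's pages together even if the stream is later filtered or processed out of order.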
process_document
process_document(
    document: Document,
    extractor: Any,
    progress_callback: Optional[
        Callable[[int, int], None]
    ] = None,
    **extract_kwargs,
) -> DocumentResult
Process all pages of a single document.
| PARAMETER | DESCRIPTION |
|---|---|
| document | Document instance. TYPE: Document |
| extractor | Initialized extractor (any type). TYPE: Any |
| progress_callback | Optional function(current, total) for progress (default: None). TYPE: Optional[Callable[[int, int], None]] |
| **extract_kwargs | Passed to extractor.extract() |
| RETURNS | DESCRIPTION |
|---|---|
| DocumentResult | DocumentResult with page results |
Examples:
from omnidocs import Document
from omnidocs.batch import process_document
doc = Document.from_pdf("paper.pdf")
result = process_document(doc, extractor, output_format="markdown")
result.save_json("output.json")
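The way `**extract_kwargs` reaches the extractor, and when `progress_callback` fires, can be sketched with stand-ins. Everything named below (`EchoExtractor`, `process_pages`) is hypothetical, and the loop is an assumption about the general shape of per-page processing, not the OmniDocs source.

```python
from typing import Any, Callable, List, Optional


class EchoExtractor:
    """Hypothetical extractor that records the keyword arguments it receives."""

    def extract(self, page: Any, **kwargs: Any) -> dict:
        return {"page": page, "options": kwargs}


def process_pages(
    pages: List[Any],
    extractor: Any,
    progress_callback: Optional[Callable[[int, int], None]] = None,
    **extract_kwargs: Any,
) -> List[dict]:
    """Hypothetical stand-in for the per-page loop (not the OmniDocs code)."""
    results = []
    total = len(pages)
    for index, page in enumerate(pages, start=1):
        # Extra keyword arguments are forwarded unchanged to extract().
        results.append(extractor.extract(page, **extract_kwargs))
        if progress_callback is not None:
            progress_callback(index, total)   # (current, total)
    return results


progress: List[str] = []
results = process_pages(
    ["page-1", "page-2"],
    EchoExtractor(),
    progress_callback=lambda current, total: progress.append(f"{current}/{total}"),
    output_format="markdown",
)
```

In this sketch `output_format="markdown"` is not interpreted by the loop at all. It is simply handed to the extractor, which matches the "Passed to extractor.extract()" description in the parameter table.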
process_directory
process_directory(
    directory: str,
    extractor: Any,
    output_dir: Optional[str] = None,
    pattern: str = "*.pdf",
    recursive: bool = False,
    dpi: int = 150,
    progress_callback: Optional[
        Callable[[str, int, int], None]
    ] = None,
    **extract_kwargs,
) -> BatchResult
Process all PDFs in a directory.
Convenience function for a common batch-processing pattern.
| PARAMETER | DESCRIPTION |
|---|---|
| directory | Path to directory with PDFs. TYPE: str |
| extractor | Initialized extractor instance. TYPE: Any |
| output_dir | Optional directory to save results as JSON (default: None). TYPE: Optional[str] |
| pattern | Glob pattern for files (default: "*.pdf"). TYPE: str |
| recursive | Search subdirectories (default: False). TYPE: bool |
| dpi | Resolution for page rendering (default: 150). TYPE: int |
| progress_callback | Function(filename, current, total) for progress (default: None). TYPE: Optional[Callable[[str, int, int], None]] |
| **extract_kwargs | Passed to extractor.extract() |
| RETURNS | DESCRIPTION |
|---|---|
| BatchResult | BatchResult with all document results |
Examples:
from omnidocs.batch import process_directory
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig

extractor = QwenTextExtractor(
    backend=QwenTextPyTorchConfig(model="Qwen/Qwen2-VL-7B")
)
results = process_directory(
    "pdfs/",
    extractor,
    output_dir="results/",
    output_format="markdown",
)
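Note that the progress callback here is documented as Function(filename, current, total), whereas iter_with_progress documents Function(current, total, filename). A small adapter lets one callback serve both. The sketch below is self-contained, and `to_directory_callback` is a hypothetical helper, not part of OmniDocs.

```python
from typing import Callable, List, Tuple


def to_directory_callback(
    callback: Callable[[int, int, str], None],
) -> Callable[[str, int, int], None]:
    """Adapt a (current, total, filename) callback to (filename, current, total)."""

    def adapted(filename: str, current: int, total: int) -> None:
        callback(current, total, filename)

    return adapted


seen: List[Tuple[int, int, str]] = []


def progress(current: int, total: int, filename: str) -> None:
    """Callback written in the iter_with_progress argument order."""
    seen.append((current, total, filename))


directory_callback = to_directory_callback(progress)
# Invoked in the argument order documented for process_directory.
directory_callback("report.pdf", 1, 3)
```

Under these assumptions, the adapted function could be supplied as `progress_callback=to_directory_callback(progress)` when calling process_directory.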