# Batch Processing
Process multiple documents efficiently.
## Quick Start
```python
from pathlib import Path

from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenPyTorchConfig

# Initialize once (model loading is expensive)
extractor = QwenTextExtractor(backend=QwenPyTorchConfig(device="cuda"))

# Process all PDFs, writing one Markdown file per page
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

for pdf_path in Path("documents/").glob("*.pdf"):
    doc = Document.from_pdf(pdf_path)
    for i, page in enumerate(doc.iter_pages()):
        result = extractor.extract(page, output_format="markdown")
        out_file = output_dir / f"{pdf_path.stem}_page_{i + 1}.md"
        out_file.write_text(result.content)
```
## With Progress Tracking
```python
import time
from pathlib import Path

pdf_files = list(Path("documents/").glob("*.pdf"))
start = time.time()

for idx, pdf_path in enumerate(pdf_files, 1):
    doc = Document.from_pdf(pdf_path)
    for page in doc.iter_pages():
        result = extractor.extract(page)

    # Estimate remaining time from the average time per document so far
    elapsed = time.time() - start
    remaining = (len(pdf_files) - idx) * (elapsed / idx)
    print(f"[{idx}/{len(pdf_files)}] {pdf_path.name} - ETA: {remaining / 60:.1f} min")
```
## Memory Management
Clear cache periodically for large batches:
```python
for i, page in enumerate(doc.iter_pages()):
    result = extractor.extract(page)
    save_result(result)  # your own persistence logic

    # Free cached page data every 10 pages
    if (i + 1) % 10 == 0:
        doc.clear_cache()
```
## Stream to Disk
Don't accumulate results in memory:
```python
import json

with open("results.jsonl", "w") as f:
    for pdf_path in pdf_files:
        doc = Document.from_pdf(pdf_path)
        result = extractor.extract(doc.get_page(0))

        # Append one JSON record per line instead of collecting results in a list
        record = {"path": str(pdf_path), "word_count": result.word_count}
        f.write(json.dumps(record) + "\n")
```
## Error Handling
```python
results = []
errors = []

for pdf_path in pdf_files:
    try:
        doc = Document.from_pdf(pdf_path)
        result = extractor.extract(doc.get_page(0))
        results.append({"path": str(pdf_path), "success": True})
    except Exception as e:
        errors.append({"path": str(pdf_path), "error": str(e)})

print(f"Succeeded: {len(results)}, Failed: {len(errors)}")
```
## Performance Tips
| Tip | Why |
|---|---|
| Initialize extractor once | Model loading takes 2-3s |
| Use VLLM for large batches | 2-4x better throughput (see the sketch below) |
| Stream results to disk | Constant memory usage |
| Clear cache periodically | Prevents OOM |
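To illustrate the VLLM tip, here is a minimal sketch of swapping the backend config. The class name `QwenVLLMConfig` and its import path are assumptions by analogy with `QwenPyTorchConfig`; check the text extraction API reference for the actual names.

```python
# Hypothetical sketch: the VLLM backend class name and import path below are
# assumed by analogy with QwenPyTorchConfig; verify against the omnidocs API.
from pathlib import Path

from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenVLLMConfig  # assumed name

extractor = QwenTextExtractor(backend=QwenVLLMConfig(device="cuda"))

# The batch loop itself is unchanged; only the backend differs.
for pdf_path in Path("documents/").glob("*.pdf"):
    doc = Document.from_pdf(pdf_path)
    for page in doc.iter_pages():
        result = extractor.extract(page, output_format="markdown")
```

Everything else on this page (streaming to disk, cache clearing, error handling) applies unchanged.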
## For Cloud Scale
See Deployment for processing on Modal GPUs.
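As a rough illustration of the pattern that page covers in full, the sketch below fans first-page extraction out across Modal GPU containers. The app name, GPU type, and the `omnidocs` pip package name are assumptions; a real deployment would also keep the model loaded between calls rather than reloading it per request.

```python
# Illustrative sketch only: app name, GPU type, and the "omnidocs" pip package
# name are assumptions; see the Deployment guide for the supported setup.
import modal

app = modal.App("omnidocs-batch-sketch")
image = modal.Image.debian_slim().pip_install("omnidocs")


@app.function(gpu="A10G", image=image)
def extract_first_page(pdf_bytes: bytes) -> str:
    import tempfile
    from pathlib import Path

    from omnidocs import Document
    from omnidocs.tasks.text_extraction import QwenTextExtractor
    from omnidocs.tasks.text_extraction.qwen import QwenPyTorchConfig

    # Loaded per call here for simplicity; a real deployment keeps the model warm
    extractor = QwenTextExtractor(backend=QwenPyTorchConfig(device="cuda"))

    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        doc = Document.from_pdf(Path(tmp.name))
        return extractor.extract(doc.get_page(0)).content


@app.local_entrypoint()
def main():
    from pathlib import Path

    pdfs = [p.read_bytes() for p in Path("documents/").glob("*.pdf")]
    # .map() runs the PDFs in parallel across containers
    for content in extract_first_page.map(pdfs):
        print(len(content))
```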