Document
OmniDocs Document Loader
Stateless document container for loading and accessing PDF/image data. Uses pypdfium2 (Apache 2.0) for PDF rendering and pdfplumber (MIT) for text extraction.
DocumentLoadError
Bases: Exception
Failed to load document.
URLDownloadError
Bases: Exception
Failed to download from URL.
PageRangeError
Bases: Exception
Invalid page range.
UnsupportedFormatError
Bases: Exception
Unsupported file format.
DocumentMetadata
Bases: BaseModel
Metadata container for documents.
LazyPage
Lazy page wrapper - renders only when accessed.
This avoids loading all pages into memory upfront for large PDFs.
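The render-on-first-access idea can be sketched in a few lines. This is a minimal illustration, not the LazyPage implementation: the class name `LazyValue`, its `image` property, its `clear` method, and the `fake_render` callable are all hypothetical stand-ins, and a string takes the place of a rendered image.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Hypothetical lazy wrapper: calls `render` only on first access."""

    def __init__(self, render: Callable[[], T]) -> None:
        self._render = render
        self._cached: Optional[T] = None
        self._rendered = False

    @property
    def image(self) -> T:
        # Render once, then serve the cached result on later accesses.
        if not self._rendered:
            self._cached = self._render()
            self._rendered = True
        return self._cached  # type: ignore[return-value]

    def clear(self) -> None:
        # Drop the cached result so the next access renders again.
        self._cached = None
        self._rendered = False


render_calls: List[int] = []


def fake_render() -> str:
    # Stand-in for an expensive page render; records each call.
    render_calls.append(1)
    return "rendered-page-0"


# Creating the wrapper does not trigger any rendering.
page = LazyValue(fake_render)
```

Nothing is rendered until `page.image` is read, and repeated reads reuse the cached value until `clear()` is called.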
Source code in omnidocs/document.py
Document
Document(
    pdf_doc: Optional[PdfDocument],
    pdf_bytes: Optional[bytes],
    metadata: DocumentMetadata,
    dpi: int = 150,
    page_range: Optional[tuple] = None,
    preloaded_images: Optional[List[Image]] = None,
)
Stateless document container for OmniDocs.
Features:
- Lazy page rendering (pages only rendered when accessed)
- Page caching (rendered pages cached to avoid re-rendering)
- Multiple source support (PDF file, URL, bytes, images)
- Text extraction with pypdfium2 first, pdfplumber fallback
- Memory efficient for large documents
Design:
- Document is SOURCE DATA only - does NOT store task results
- Users manage their own analysis results and caching strategy
Examples:
# Load from file
doc = Document.from_pdf("paper.pdf")
# Access pages
page = doc.get_page(0)  # 0-indexed
text = doc.get_page_text(1)  # 1-indexed for compatibility
# Iterate efficiently
for page in doc.iter_pages():
    result = layout.extract(page)
pages
property
List of all page images.
Note: This renders ALL pages. For large documents, use get_page() or iter_pages() instead.

| RETURNS | DESCRIPTION |
|---|---|
| List[Image] | List of PIL Images |
text
property
Full document text (lazy, cached).
Uses pypdfium2 first (fast), falls back to pdfplumber if needed.

| RETURNS | DESCRIPTION |
|---|---|
| str | Full document text |
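The primary-then-fallback pattern described for text extraction might look like the sketch below. The function name `extract_with_fallback` and both extractor callables are hypothetical. The page does not say what "if needed" means, so the sketch assumes the fallback runs when the primary extractor raises or returns only whitespace.

```python
from typing import Callable


def extract_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> str:
    """Return text from `primary`, or from `fallback` if primary yields nothing."""
    try:
        text = primary()
    except Exception:
        # Assumption: any failure in the fast path triggers the fallback.
        text = ""
    if text.strip():
        return text
    # Assumption: empty or whitespace-only output also triggers the fallback.
    return fallback()


def failing_extractor() -> str:
    # Stand-in for a fast extractor that cannot handle this file.
    raise RuntimeError("fast extractor failed")


recovered = extract_with_fallback(failing_extractor, lambda: "text from slow path")
```

Passing the extractors as callables keeps the slower one from running unless it is actually required.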
from_pdf
classmethod
Load document from PDF file (lazy - pages not rendered yet).

| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to PDF file |
| page_range | Optional (start, end) tuple for page range (0-indexed, inclusive) |
| dpi | Resolution for page rendering (default: 150) |

| RETURNS | DESCRIPTION |
|---|---|
| Document | Document instance |

| RAISES | DESCRIPTION |
|---|---|
| DocumentLoadError | If file not found |
| UnsupportedFormatError | If not a PDF file |
| PageRangeError | If page range is invalid |
Examples:
doc = Document.from_pdf("paper.pdf")
doc = Document.from_pdf("paper.pdf", page_range=(0, 4))
doc = Document.from_pdf("paper.pdf", dpi=300)
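Because page_range is 0-indexed and inclusive, `(0, 4)` selects five pages. The sketch below shows one way such a range could be validated and expanded. The helper `resolve_page_range` is hypothetical, the local `PageRangeError` class only mirrors the documented exception name, and the exact validity rules are assumptions.

```python
from typing import Optional, Tuple


class PageRangeError(Exception):
    """Local stand-in mirroring the documented exception name."""


def resolve_page_range(
    page_range: Optional[Tuple[int, int]],
    page_count: int,
) -> range:
    """Expand a 0-indexed, inclusive (start, end) tuple into page indices."""
    if page_range is None:
        # No range given: use every page.
        return range(page_count)
    start, end = page_range
    # Assumed rules: start >= 0, start <= end, end inside the document.
    if start < 0 or end < start or end >= page_count:
        raise PageRangeError(
            f"invalid page range {page_range} for a {page_count}-page document"
        )
    # The end index is inclusive, so add one for Python's half-open range.
    return range(start, end + 1)


first_five = list(resolve_page_range((0, 4), page_count=10))  # [0, 1, 2, 3, 4]
```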
from_url
classmethod
from_url(
    url: str,
    page_range: Optional[tuple] = None,
    dpi: int = 150,
    timeout: int = 30,
) -> Document
Download and load document from URL (lazy).

| PARAMETER | DESCRIPTION |
|---|---|
| url | URL to PDF file |
| page_range | Optional (start, end) tuple for page range |
| dpi | Resolution for page rendering |
| timeout | Download timeout in seconds |

| RETURNS | DESCRIPTION |
|---|---|
| Document | Document instance |

| RAISES | DESCRIPTION |
|---|---|
| URLDownloadError | If download fails |
| PageRangeError | If page range is invalid |
Examples:
doc = Document.from_url("https://example.com/doc.pdf")
doc = Document.from_url("https://example.com/doc.pdf", timeout=60)
from_bytes
classmethod
from_bytes(
    data: bytes,
    filename: Optional[str] = None,
    page_range: Optional[tuple] = None,
    dpi: int = 150,
) -> Document
Load document from PDF bytes (lazy).

| PARAMETER | DESCRIPTION |
|---|---|
| data | PDF file bytes |
| filename | Optional filename for metadata |
| page_range | Optional (start, end) tuple for page range |
| dpi | Resolution for page rendering |

| RETURNS | DESCRIPTION |
|---|---|
| Document | Document instance |

| RAISES | DESCRIPTION |
|---|---|
| PageRangeError | If page range is invalid |
from_image
classmethod
Load document from single image file.

| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to image file |

| RETURNS | DESCRIPTION |
|---|---|
| Document | Document instance |

| RAISES | DESCRIPTION |
|---|---|
| DocumentLoadError | If file not found |
from_images
classmethod
Load document from multiple images (multi-page).

| PARAMETER | DESCRIPTION |
|---|---|
| paths | List of paths to image files |

| RETURNS | DESCRIPTION |
|---|---|
| Document | Document instance |

| RAISES | DESCRIPTION |
|---|---|
| DocumentLoadError | If any file not found |
get_page
Get single page image (0-indexed).
More memory efficient than accessing .pages for large documents.

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Page number (0-indexed) |

| RETURNS | DESCRIPTION |
|---|---|
| Image | PIL Image |

| RAISES | DESCRIPTION |
|---|---|
| PageRangeError | If page number out of range |
get_page_text
Get text for specific page (1-indexed for compatibility with PDF page numbers).

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Page number (1-indexed, like PDF viewers) |

| RETURNS | DESCRIPTION |
|---|---|
| str | Page text |

| RAISES | DESCRIPTION |
|---|---|
| PageRangeError | If page number out of range |
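Since get_page is 0-indexed while get_page_text is 1-indexed, mixing the two conventions is an easy source of off-by-one errors. The two helpers below are hypothetical and not part of the library. They only make the conversion explicit.

```python
def text_page_number(image_index: int) -> int:
    """Convert a 0-indexed page index to the 1-indexed page number."""
    if image_index < 0:
        raise ValueError("page index must be non-negative")
    return image_index + 1


def image_page_index(page_number: int) -> int:
    """Convert a 1-indexed page number to the 0-indexed page index."""
    if page_number < 1:
        raise ValueError("page number must be at least 1")
    return page_number - 1


# The first page is index 0 for images and number 1 for text.
first_page_number = text_page_number(0)  # 1
```

With such helpers, the image and the text of the same page would be fetched as `doc.get_page(i)` and `doc.get_page_text(text_page_number(i))`.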
get_page_size
Get page dimensions without rendering (fast).

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Page number (0-indexed) |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | Tuple of (width, height) in pixels |
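A pixel size can be computed without rendering when the page size is known in PDF points, because one point is 1/72 inch. The sketch below assumes that relationship and a simple rounding rule. The helper name `points_to_pixels` is hypothetical, and the library's own rounding may differ.

```python
from typing import Tuple


def points_to_pixels(
    width_pt: float,
    height_pt: float,
    dpi: int = 150,
) -> Tuple[int, int]:
    """Scale a page size in PDF points (1/72 inch) to pixels at `dpi`."""
    # pixels = points * dpi / 72; rounding to whole pixels is an assumption.
    return (round(width_pt * dpi / 72), round(height_pt * dpi / 72))


# US Letter is 612 x 792 points.
letter_at_150 = points_to_pixels(612, 792)           # (1275, 1650)
letter_at_300 = points_to_pixels(612, 792, dpi=300)  # (2550, 3300)
```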
iter_pages
Iterate over pages one at a time (memory efficient).
Use this for large documents instead of .pages property.

| YIELDS | DESCRIPTION |
|---|---|
| Image | PIL Images |
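The memory benefit of one-at-a-time iteration comes from generator semantics: a page is produced only when the loop asks for it. The sketch below illustrates this with hypothetical names (`iter_rendered`, `fake_render`) and strings in place of images. It is not the library's code.

```python
from typing import Callable, Iterator, List


def iter_rendered(page_count: int, render: Callable[[int], str]) -> Iterator[str]:
    """Yield pages one at a time, rendering each only when requested."""
    for index in range(page_count):
        yield render(index)


rendered_indices: List[int] = []


def fake_render(index: int) -> str:
    # Stand-in for an expensive render; records which pages were produced.
    rendered_indices.append(index)
    return f"page-{index}"


first_two: List[str] = []
for page in iter_rendered(1000, fake_render):
    first_two.append(page)
    if len(first_two) == 2:
        break  # Stopping early means the remaining 998 pages are never rendered.
```

Building a full list first, as the .pages property is described as doing, would render every page before the loop body runs even once.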
clear_cache
Clear cached page images to free memory.

| PARAMETER | DESCRIPTION |
|---|---|
| page_num | Specific page to clear, or None for all pages |
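The "one page or all pages" behaviour can be sketched with a plain dictionary keyed by page index. The function `clear_page_cache` and the dictionary layout are hypothetical, as is the choice to ignore pages that are not cached.

```python
from typing import Dict, Optional


def clear_page_cache(cache: Dict[int, object], page_num: Optional[int] = None) -> None:
    """Remove one cached page, or every cached page when page_num is None."""
    if page_num is None:
        cache.clear()
    else:
        # Assumption: clearing a page that was never cached is a no-op.
        cache.pop(page_num, None)


cache: Dict[int, object] = {0: "img-0", 1: "img-1", 2: "img-2"}
clear_page_cache(cache, 1)  # drops only page 1
```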
save_images
Save all pages as individual image files.

| PARAMETER | DESCRIPTION |
|---|---|
| output_dir | Output directory path |
| prefix | Filename prefix (default: "page") |
| format | Image format (default: "PNG") |

| RETURNS | DESCRIPTION |
|---|---|
| List[Path] | List of saved file paths |
to_dict
Convert document metadata to dictionary.

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dictionary of metadata |
close
Close PDF document and free resources.