Overview
OCR Extraction Module.
Provides extractors for detecting text with bounding boxes from document images. Returns text content along with spatial coordinates (unlike Text Extraction which returns formatted Markdown/HTML without coordinates).
Available Extractors
- TesseractOCR: Open-source OCR (CPU, requires system Tesseract)
- EasyOCR: PyTorch-based OCR (CPU/GPU, 80+ languages)
- PaddleOCR: PaddlePaddle-based OCR (CPU/GPU, excellent CJK support)
Key Difference from Text Extraction
- OCR Extraction: Text + Bounding Boxes (spatial location)
- Text Extraction: Markdown/HTML (formatted document export)
Example

```python
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))
result = ocr.extract(image)
for block in result.text_blocks:
    print(f"'{block.text}' @ {block.bbox.to_list()} (conf: {block.confidence:.2f})")

# With EasyOCR
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig

ocr = EasyOCR(config=EasyOCRConfig(languages=["en", "ch_sim"], gpu=True))
result = ocr.extract(image)

# With PaddleOCR
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig

ocr = PaddleOCR(config=PaddleOCRConfig(lang="ch", device="cpu"))
result = ocr.extract(image)
```
BaseOCRExtractor
Bases: ABC
Abstract base class for OCR extractors.
All OCR extraction models must inherit from this class and implement the required methods.
Example
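A minimal subclass sketch, assuming `extract` is the only abstract method and that `OCROutput`, `TextBlock`, and `BoundingBox` accept the field names documented further down this page (those constructor arguments are assumptions, not verified against the source):

```python
# Illustrative only: a toy extractor that satisfies the BaseOCRExtractor contract.
from pathlib import Path
from typing import Union

import numpy as np
from PIL import Image

from omnidocs.tasks.ocr_extraction.base import BaseOCRExtractor
from omnidocs.tasks.ocr_extraction.models import BoundingBox, OCROutput, TextBlock


class DummyOCR(BaseOCRExtractor):
    """Pretends every image contains a single word, to show the required output shape."""

    def extract(self, image: Union[Image.Image, np.ndarray, str, Path]) -> OCROutput:
        block = TextBlock(
            text="hello",                                   # assumed field name
            bbox=BoundingBox.from_list([10, 10, 120, 40]),  # pixels: x1, y1, x2, y2
            confidence=1.0,                                 # assumed field name
        )
        return OCROutput(text_blocks=[block])               # assumed field name
```

Because `batch_extract` and `extract_document` have default implementations that loop over `extract`, a subclass like this inherits both for free.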
extract (abstractmethod)
Run OCR extraction on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image as: `PIL.Image.Image` (PIL image object), `np.ndarray` (numpy array, HWC format, RGB), or `str`/`Path` (path to image file) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput containing detected text blocks with bounding boxes |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If image format is not supported |
| `RuntimeError` | If OCR engine is not initialized or extraction fails |
Source code in omnidocs/tasks/ocr_extraction/base.py
batch_extract

```python
batch_extract(
    images: List[Union[Image, ndarray, str, Path]],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]
```
Run OCR extraction on multiple images.
Default implementation loops over extract(). Subclasses can override for optimized batching.
| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of images in any supported format |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput in same order as input |
Examples:
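A usage sketch (file names are placeholders; any concrete extractor works, since the default implementation simply loops over `extract`):

```python
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))

def on_progress(current: int, total: int) -> None:
    print(f"OCR progress: {current}/{total}")

# Paths are placeholders; PIL images or numpy arrays would also work.
results = ocr.batch_extract(["page_001.png", "page_002.png"], progress_callback=on_progress)
for i, output in enumerate(results):
    print(f"image {i}: {len(output.text_blocks)} text blocks")
```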
Source code in omnidocs/tasks/ocr_extraction/base.py
extract_document

```python
extract_document(
    document: Document,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[OCROutput]
```
Run OCR extraction on all pages of a document.
| PARAMETER | DESCRIPTION |
|---|---|
| `document` | Document instance |
| `progress_callback` | Optional function(current, total) for progress |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput, one per page |
Examples:
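A sketch, assuming `document` is an omnidocs `Document` instance loaded elsewhere (the loader API is not shown on this page):

```python
# `document` is assumed to be a multi-page omnidocs Document created elsewhere.
outputs = ocr.extract_document(
    document,
    progress_callback=lambda current, total: print(f"page {current}/{total}"),
)
for page_number, page_output in enumerate(outputs, start=1):
    print(f"page {page_number}: {len(page_output.text_blocks)} text blocks")
```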
Source code in omnidocs/tasks/ocr_extraction/base.py
EasyOCR
Bases: BaseOCRExtractor
EasyOCR text extractor.
Single-backend model (PyTorch - CPU/GPU).
Example
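A usage sketch based on the configuration fields shown in the Overview (the image path is a placeholder):

```python
from omnidocs.tasks.ocr_extraction import EasyOCR, EasyOCRConfig

# gpu=False keeps this runnable on CPU-only machines; set gpu=True when CUDA is available.
ocr = EasyOCR(config=EasyOCRConfig(languages=["en"], gpu=False))
result = ocr.extract("invoice.png")
for block in result.text_blocks:
    print(block.text, block.bbox.to_list(), f"{block.confidence:.2f}")
```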
Initialize EasyOCR extractor.

| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If easyocr is not installed |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
extract

```python
extract(
    image: Union[Image, ndarray, str, Path],
    detail: int = 1,
    paragraph: bool = False,
    min_size: int = 10,
    text_threshold: float = 0.7,
    low_text: float = 0.4,
    link_threshold: float = 0.4,
    canvas_size: int = 2560,
    mag_ratio: float = 1.0,
) -> OCROutput
```
Run OCR on an image.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |
| `detail` | 0 = simple output, 1 = detailed with boxes |
| `paragraph` | Combine results into paragraphs |
| `min_size` | Minimum text box size |
| `text_threshold` | Text confidence threshold |
| `low_text` | Low text bound |
| `link_threshold` | Link threshold for text joining |
| `canvas_size` | Max image dimension for processing |
| `mag_ratio` | Magnification ratio |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If detail is not 0 or 1 |
| `RuntimeError` | If EasyOCR is not initialized |
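The detection thresholds above can be tuned per call; a sketch with illustrative (not recommended) values, reusing the `ocr` instance from the class example above:

```python
result = ocr.extract(
    "scanned_page.png",   # placeholder path
    detail=1,             # keep bounding boxes in the output
    paragraph=True,       # merge nearby detections into paragraph-level blocks
    text_threshold=0.6,   # accept slightly lower-confidence regions
    canvas_size=3000,     # allow larger images before downscaling
)
print(f"{len(result.text_blocks)} blocks detected")
```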
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
extract_batch
Run OCR on multiple images.
| PARAMETER | DESCRIPTION |
|---|---|
| `images` | List of input images |
| `**kwargs` | Arguments passed to extract() |

| RETURNS | DESCRIPTION |
|---|---|
| `List[OCROutput]` | List of OCROutput objects |
Source code in omnidocs/tasks/ocr_extraction/easyocr.py
EasyOCRConfig
BoundingBox
Bases: BaseModel
Bounding box coordinates in pixel space.
Coordinates follow the convention: (x1, y1) is top-left, (x2, y2) is bottom-right. For rotated text, use the polygon field in TextBlock instead.
Example
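A sketch of the conversion helpers (the exact return containers of `to_xyxy`/`to_xywh` are not shown here, so they are only printed):

```python
from omnidocs.tasks.ocr_extraction.models import BoundingBox

box = BoundingBox.from_list([50, 100, 250, 140])   # x1, y1, x2, y2 in pixels
print(box.to_list(), box.to_xyxy(), box.to_xywh())

# Axis-aligned box enclosing a (possibly rotated) quadrilateral
quad = [[60, 95], [240, 100], [238, 150], [58, 145]]
print(BoundingBox.from_polygon(quad).to_list())

# Round-trip between pixel space and the virtual 1024x1024 canvas (2000x1500 source image)
norm = box.to_normalized(image_width=2000, image_height=1500)
print(norm.to_list(), norm.to_absolute(2000, 1500).to_list())
```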
to_list
Convert to [x1, y1, x2, y2] list.

to_xyxy
Return coordinates in (x1, y1, x2, y2) form.

to_xywh
Return coordinates in (x, y, width, height) form.

from_list (classmethod)
Create from [x1, y1, x2, y2] list.
Source code in omnidocs/tasks/ocr_extraction/models.py
from_polygon (classmethod)
Create axis-aligned bounding box from polygon points.

| PARAMETER | DESCRIPTION |
|---|---|
| `polygon` | List of [x, y] points (usually 4 for a quadrilateral) |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox that encloses all polygon points |
Source code in omnidocs/tasks/ocr_extraction/models.py
to_normalized
Convert to normalized coordinates (0-1024 range).
Scales coordinates from absolute pixel values to a virtual 1024x1024 canvas. This provides consistent coordinates regardless of original image size.

| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width in pixels |
| `image_height` | Original image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with coordinates in 0-1024 range |
Source code in omnidocs/tasks/ocr_extraction/models.py
to_absolute
Convert from normalized (0-1024) to absolute pixel coordinates.

| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Target image width in pixels |
| `image_height` | Target image height in pixels |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | New BoundingBox with absolute pixel coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
OCRGranularity
Bases: str, Enum
OCR detection granularity levels.
Different OCR engines return results at different granularity levels. This enum standardizes the options across all extractors.
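The member names are not listed on this page; a safe way to inspect them (it is a `str`-valued enum, so members compare equal to their string values):

```python
from omnidocs.tasks.ocr_extraction.models import OCRGranularity

for level in OCRGranularity:
    print(level.name, "=", level.value)

# Pick one of the printed members to filter an OCROutput, e.g.:
# filtered = result.filter_by_granularity(list(OCRGranularity)[0])
```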
OCROutput
Bases: BaseModel
Complete OCR extraction results for a single image.
Contains all detected text blocks with their bounding boxes, plus metadata about the extraction.
Example
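A sketch chaining the helpers below on a `result` returned by any extractor (the confidence threshold is passed positionally because its keyword name is not shown here; file paths are placeholders):

```python
from PIL import Image

from omnidocs.tasks.ocr_extraction.models import OCROutput

confident = result.filter_by_confidence(0.8)             # drop low-confidence blocks
ordered = confident.sort_by_position(top_to_bottom=True)
print("\n".join(block.text for block in ordered.text_blocks))

# Persist and reload
ordered.save_json("page_01_ocr.json")
reloaded = OCROutput.load_json("page_01_ocr.json")

# Draw the detected boxes onto the source image and save the overlay
page = Image.open("page_01.png")
ordered.visualize(page, output_path="page_01_overlay.png", show_confidence=True)
```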
filter_by_confidence
Filter text blocks by minimum confidence.

filter_by_granularity
Filter text blocks by granularity level.

to_dict
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
sort_by_position
Return a new OCROutput with blocks sorted by position.

| PARAMETER | DESCRIPTION |
|---|---|
| `top_to_bottom` | If True, sort by y-coordinate (reading order) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | New OCROutput with sorted text blocks |
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_blocks
Get all text blocks with normalized (0-1024) coordinates.

| RETURNS | DESCRIPTION |
|---|---|
| `List[Dict]` | List of dicts with normalized bbox coordinates and metadata |
Source code in omnidocs/tasks/ocr_extraction/models.py
visualize

```python
visualize(
    image: Image,
    output_path: Optional[Union[str, Path]] = None,
    show_text: bool = True,
    show_confidence: bool = False,
    line_width: int = 2,
    box_color: str = "#2ECC71",
    text_color: str = "#000000",
) -> Image.Image
```
Visualize OCR results on the image.
Draws bounding boxes around detected text with optional labels.
| PARAMETER | DESCRIPTION |
|---|---|
| `image` | PIL Image to draw on (will be copied, not modified) |
| `output_path` | Optional path to save the visualization |
| `show_text` | Whether to show detected text |
| `show_confidence` | Whether to show confidence scores |
| `line_width` | Width of bounding box lines |
| `box_color` | Color for bounding boxes (hex) |
| `text_color` | Color for text labels (hex) |

| RETURNS | DESCRIPTION |
|---|---|
| `Image` | PIL Image with visualizations drawn |
Source code in omnidocs/tasks/ocr_extraction/models.py
load_json (classmethod)
Load an OCROutput instance from a JSON file.

| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to JSON file |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput instance |
Source code in omnidocs/tasks/ocr_extraction/models.py
save_json
Save an OCROutput instance to a JSON file.

| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path where the JSON file should be saved |
Source code in omnidocs/tasks/ocr_extraction/models.py
TextBlock
Bases: BaseModel
Single detected text element with text, bounding box, and confidence.
This is the fundamental unit of OCR output - can represent a character, word, line, or block depending on the OCR model and configuration.
Example
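A sketch inspecting one block of an extraction result (the image dimensions are illustrative):

```python
block = result.text_blocks[0]
print(block.text, f"{block.confidence:.2f}")
print(block.bbox.to_list())                              # absolute pixel coordinates
print(block.get_normalized_bbox(1654, 2339).to_list())   # mapped to the 0-1024 canvas
print(block.to_dict())
```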
to_dict
Convert to dictionary representation.
Source code in omnidocs/tasks/ocr_extraction/models.py
get_normalized_bbox
Get bounding box in normalized (0-1024) coordinates.

| PARAMETER | DESCRIPTION |
|---|---|
| `image_width` | Original image width |
| `image_height` | Original image height |

| RETURNS | DESCRIPTION |
|---|---|
| `BoundingBox` | BoundingBox with normalized coordinates |
Source code in omnidocs/tasks/ocr_extraction/models.py
PaddleOCR
Bases: BaseOCRExtractor
PaddleOCR text extractor.
Single-backend model (PaddlePaddle - CPU/GPU).
Example
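A usage sketch mirroring the Overview (the image path is a placeholder):

```python
from omnidocs.tasks.ocr_extraction import PaddleOCR, PaddleOCRConfig

# lang="ch" covers Chinese plus basic Latin text; use device="gpu" with paddlepaddle-gpu installed.
ocr = PaddleOCR(config=PaddleOCRConfig(lang="ch", device="cpu"))
result = ocr.extract("receipt_cn.jpg")
for block in result.text_blocks:
    print(block.text, block.bbox.to_list(), f"{block.confidence:.2f}")
```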
Initialize PaddleOCR extractor.

| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If paddleocr or paddlepaddle is not installed |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
extract
Run OCR on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks |
Source code in omnidocs/tasks/ocr_extraction/paddleocr.py
PaddleOCRConfig
TesseractOCR
Bases: BaseOCRExtractor
Tesseract OCR extractor.
Single-backend model (CPU only). Requires system Tesseract installation.
Example
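A usage sketch contrasting word-level `extract` with `extract_lines` (the image path is a placeholder):

```python
from omnidocs.tasks.ocr_extraction import TesseractOCR, TesseractOCRConfig

ocr = TesseractOCR(config=TesseractOCRConfig(languages=["eng"]))

words = ocr.extract("letter.png")          # word-level blocks
lines = ocr.extract_lines("letter.png")    # line-level blocks
print(len(words.text_blocks), "words,", len(lines.text_blocks), "lines")
```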
Initialize Tesseract OCR extractor.

| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Configuration object |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If Tesseract is not installed |
| `ImportError` | If pytesseract is not installed |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract
Run OCR on an image.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with detected text blocks at word level |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
extract_lines
Run OCR and return line-level blocks.
Groups words into lines based on Tesseract's line detection.

| PARAMETER | DESCRIPTION |
|---|---|
| `image` | Input image (PIL Image, numpy array, or path) |

| RETURNS | DESCRIPTION |
|---|---|
| `OCROutput` | OCROutput with line-level text blocks |
Source code in omnidocs/tasks/ocr_extraction/tesseract.py
TesseractOCRConfig
base
Base class for OCR extractors.
Defines the abstract interface that all OCR extractors must implement.
easyocr
EasyOCR extractor.
EasyOCR is a PyTorch-based OCR engine with excellent multi-language support.
- GPU accelerated (optional)
- Supports 80+ languages
- Good for scene text and printed documents
Python Package
pip install easyocr
Model Download Location
By default, EasyOCR downloads models to ~/.EasyOCR/. This can be overridden with the model_storage_directory parameter.
models
Pydantic models for OCR extraction outputs.
Defines standardized output types for OCR detection including text blocks with bounding boxes, confidence scores, and granularity levels.
Key difference from Text Extraction:
- OCR returns text WITH bounding boxes (word/line/character level)
- Text Extraction returns formatted text (MD/HTML) WITHOUT bboxes
Coordinate Systems
- Absolute (default): Coordinates in pixels relative to original image size
- Normalized (0-1024): Coordinates scaled to 0-1024 range (virtual 1024x1024 canvas)
Use `bbox.to_normalized(width, height)` or `output.get_normalized_blocks()` to convert to normalized coordinates.
Example
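A sketch of the coordinate-system helpers on an existing `result` (the dict keys returned by `get_normalized_blocks` are not listed here, so entries are printed as-is; the image dimensions are illustrative):

```python
# Convert one block explicitly, then all blocks at once.
first = result.text_blocks[0]
print(first.bbox.to_list())                            # absolute pixels
print(first.bbox.to_normalized(2480, 3508).to_list())  # 0-1024 canvas (A4 scan at 300 dpi)

for entry in result.get_normalized_blocks():
    print(entry)
```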
paddleocr
PaddleOCR extractor.
PaddleOCR is an OCR toolkit developed by Baidu/PaddlePaddle.
- Excellent for CJK languages (Chinese, Japanese, Korean)
- GPU accelerated
- Supports layout analysis + OCR
Python Package
pip install paddleocr paddlepaddle      # CPU version
pip install paddleocr paddlepaddle-gpu  # GPU version
Model Download Location
By default, PaddleOCR downloads models to ~/.paddleocr/
tesseract
Tesseract OCR extractor.
Tesseract is an open-source OCR engine maintained by Google.
- CPU-based (no GPU required)
- Requires system installation of Tesseract
- Good for printed text, supports 100+ languages
System Requirements
macOS: brew install tesseract
Ubuntu: sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
Python Package
pip install pytesseract