OmniDocs
OmniDocs is a powerful framework that simplifies document understanding and analysis. It provides a unified, production-ready interface for essential document processing tasks, including:
- Layout Analysis & Detection
- OCR & Text Extraction
- Table Detection & Extraction
- Reading Order Detection
- Document Understanding & Analysis
By abstracting away the complexities of integrating multiple libraries and models, OmniDocs enables developers to build robust document processing workflows with minimal effort. Whether you're working with academic papers, business documents, or complex technical materials, OmniDocs provides the tools you need to extract, analyze, and understand document content efficiently.
Installation
Prerequisites
- Python 3.11 or higher
- pip package manager
- Optional (for GPU support): a compatible NVIDIA GPU with CUDA 12.1
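A short Python script can check the prerequisites above before you install anything. This is an illustrative sketch, not part of OmniDocs: the function name check_prerequisites is hypothetical, and finding nvidia-smi on the PATH only hints that an NVIDIA driver is installed; it does not confirm that CUDA 12.1 is usable.

```python
import importlib.util
import shutil
import sys

# Minimum interpreter version from the Prerequisites list.
MIN_PYTHON = (3, 11)


def check_prerequisites(min_python=MIN_PYTHON):
    """Report whether this machine meets the listed prerequisites."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Compare (major, minor) tuples, e.g. (3, 12) >= (3, 11).
        "python_ok": tuple(sys.version_info[:2]) >= tuple(min_python),
        # find_spec locates the pip module without importing it;
        # `python -m pip` relies on this module being present.
        "pip_available": importlib.util.find_spec("pip") is not None,
        # nvidia-smi ships with the NVIDIA driver; finding it is only a hint
        # that a GPU is present, not proof that CUDA 12.1 is usable.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```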
Setting Up Your Environment
To set up your environment, choose one of the following methods:
- Using conda:
- Using venv:
- Using poetry:
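For the venv option, the environment can also be created from Python itself with the standard-library venv module. This is a minimal sketch: the helper name create_env and the .venv directory in the usage note are illustrative choices, not names the project prescribes.

```python
import os
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its interpreter path."""
    env_path = Path(env_dir)
    # Mirror the `python -m venv` CLI defaults: symlink the interpreter on
    # POSIX and bootstrap pip via ensurepip when with_pip is True.
    venv.create(env_path, with_pip=with_pip, symlinks=os.name != "nt")
    # venv places executables in Scripts\ on Windows and bin/ elsewhere.
    if sys.platform == "win32":
        return env_path / "Scripts" / "python.exe"
    return env_path / "bin" / "python"
```

For example, create_env(".venv") returns the new interpreter's path; running that interpreter with -m pip install installs packages into the environment without activating it first.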
Installing PyTorch
To install PyTorch, choose one of the following options based on whether you want GPU support:
- With GPU support (CUDA 12.1):
- Without GPU support:
Installing OmniDocs
Once your environment is set up and PyTorch is installed, you can install OmniDocs:
- From PyPI:
- From source: if you prefer to install directly from the source, use the following command:
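After installing PyTorch and OmniDocs, a short script can confirm that both are visible to the interpreter and whether PyTorch can see a CUDA device. This sketch assumes the installed distribution is named omnidocs (adjust the argument if it differs); installed_version and environment_report are hypothetical helper names.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def environment_report(omnidocs_dist="omnidocs"):
    """Summarise the OmniDocs and PyTorch installs (dist name is assumed)."""
    report = {
        "omnidocs_version": installed_version(omnidocs_dist),
        "torch_version": None,
        "cuda_available": False,
        "torch_cuda_build": None,
    }
    # Import torch only if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_version"] = str(torch.__version__)
        report["cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None on CPU-only builds.
        report["torch_cuda_build"] = torch.version.cuda
    return report


if __name__ == "__main__":
    print(environment_report())
```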
Getting Started
Here's a quick example to demonstrate the power of OmniDocs:
Supported Models and Libraries
OmniDocs integrates with a variety of popular tools, listed below by task. Status legend:
- ✅ : working and tested
- ⏳ : planned/in-progress support
- ❌ : no support
Layout Analysis
Detection Model | Source | License | CPU | GPU | Info |
---|---|---|---|---|---|
✅ DocLayout YOLO | GitHub - DocLayout-YOLO | AGPL-3.0 | ⏳ | ✅ | A robust layout detection model based on YOLOv10, designed for diverse document types. |
✅ PPStructure (Paddle OCR) | GitHub - PaddleOCR | Apache 2.0 | ✅ | ✅ | An OCR tool that supports multiple languages and provides layout detection capabilities. |
✅ RT DETR (Docling) | GitHub - RT-DETR | MIT | ⏳ | ✅ | Implementation of RT-DETR, a real-time detection transformer focusing on object detection tasks. |
✅ Florence-2-DocLayNet-Fixed | Hugging Face - Florence-2-DocLayNet-Fixed | MIT | ✅ | ✅ | Fine-tuned model for document layout analysis, improving bounding box accuracy in document images. |
✅ Surya Layout | GitHub - Surya | GPL-3.0-or-later | ✅ | ✅ | OCR and layout analysis tool supporting 90+ languages, including reading order and table recognition. |
⏳ LayoutLMv3 | Hugging Face - LayoutLMv3 | CC BY-NC-SA 4.0 | ⏳ | ⏳ | A pre-trained multimodal Transformer for Document AI, effective for various document understanding tasks. |
⏳ Fast / Faster R-CNN / MR CNN | GitHub - Faster R-CNN | MIT | ⏳ | ⏳ | A library implementing the Faster R-CNN architecture for object detection, widely used in layout tasks. |
Text Extraction
Extraction Libraries | Source | License | CPU | GPU | Info |
---|---|---|---|---|---|
PyPDF2 | GitHub - PyPDF2 | BSD-3-Clause | ✅ | ❌ | A pure-Python library for reading PDFs and extracting their text. |
PyMuPDF | GitHub - PyMuPDF | AGPL-3.0 | ✅ | ❌ | Python bindings for the MuPDF engine, offering fast text extraction from PDFs. |
pdfplumber | GitHub - pdfplumber | MIT | ✅ | ❌ | Extracts text and tables from PDFs, with access to character-level positions. |
Docling Parse | GitHub - Docling | MIT | ⏳ | ⏳ | Docling's PDF parser for extracting text along with its coordinates. |
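The three working libraries above expose different APIs for the same job. The sketch below wraps each library's own public API (PyPDF2's PdfReader, PyMuPDF's fitz.open, and pdfplumber.open) behind one function to illustrate the kind of backend swapping OmniDocs aims for. It is not OmniDocs' API: extract_text, EXTRACTORS, and the backend names are hypothetical.

```python
from typing import Callable, Dict


def _pypdf2_text(path: str) -> str:
    from PyPDF2 import PdfReader  # imported lazily; pip install PyPDF2

    reader = PdfReader(path)
    # extract_text() may yield an empty string for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def _pymupdf_text(path: str) -> str:
    import fitz  # PyMuPDF's import name; pip install PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _pdfplumber_text(path: str) -> str:
    import pdfplumber  # pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() returns None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Hypothetical registry mapping a backend name to its extraction function.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "pypdf2": _pypdf2_text,
    "pymupdf": _pymupdf_text,
    "pdfplumber": _pdfplumber_text,
}


def extract_text(path: str, backend: str = "pdfplumber") -> str:
    """Extract PDF text with the chosen backend.

    Imports happen inside each backend function, so only the library
    you actually select needs to be installed.
    """
    try:
        extractor = EXTRACTORS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(EXTRACTORS)}"
        ) from None
    return extractor(path)
```

Switching libraries then means changing only the backend argument, for example extract_text("paper.pdf", backend="pymupdf"), while callers keep receiving a plain string.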
OCR
OCR Library | Source | License | CPU | GPU | Info |
---|---|---|---|---|---|
Paddle OCR | GitHub - PaddleOCR | Apache 2.0 | ✅ | ✅ | An OCR tool that supports multiple languages and provides layout detection capabilities. |
Tesseract | GitHub - Tesseract | Apache 2.0 | ✅ | ❌ | An open-source OCR engine that supports multiple languages and is widely used for text extraction from images. |
EasyOCR | GitHub - EasyOCR | Apache 2.0 | ✅ | ✅ | A simple, easy-to-use OCR library that supports multiple languages and is built on PyTorch. |
Table Extraction
Extraction Libraries/Models | Source | License | CPU | GPU | Info |
---|---|---|---|---|---|
PPStructure (Paddle OCR) | GitHub - PaddleOCR | Apache 2.0 | ✅ | ✅ | PaddleOCR's document analysis module, supporting layout detection and table recognition. |
Camelot | GitHub - Camelot | MIT | ✅ | ❌ | A Python library for extracting tables from PDFs. |
Tabula | GitHub - Tabula | MIT | ✅ | ❌ | A tool for extracting tables from PDFs. |
Table Transformer | GitHub - Table Transformer | MIT | ⏳ | ⏳ | A transformer model for table extraction. |
TableFormer (Docling) | GitHub - Docling | MIT | ⏳ | ⏳ | A transformer model for table extraction. |
How It Works
OmniDocs organizes document processing tasks into modular components. Each component corresponds to a specific task and offers:
- A Unified Interface: Consistent input and output formats.
- Model Independence: Switch between libraries or models effortlessly.
- Pipeline Flexibility: Combine components to create custom workflows.
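The three ideas above can be illustrated with the standard library alone. Everything in this sketch is hypothetical and is not OmniDocs' actual class layout: LayoutBox stands in for a common output format, the two stub detectors mimic backends that report boxes in different conventions (normalised centre coordinates, as in YOLO-style labels, versus absolute corners), and reading_order is a naive top-to-bottom, left-to-right heuristic used as an example pipeline step.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass(frozen=True)
class Page:
    width: int
    height: int


@dataclass(frozen=True)
class LayoutBox:
    """Hypothetical unified output: absolute pixel corners plus a label."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float = 1.0


class LayoutDetector(Protocol):
    """Hypothetical interface that every layout backend would implement."""
    def detect(self, page: Page) -> List[LayoutBox]: ...


class CenterFormatStub:
    """Stands in for a backend emitting normalised (cx, cy, w, h) boxes."""
    def __init__(self, raw):
        self.raw = raw  # list of (label, cx, cy, w, h), each in [0, 1]

    def detect(self, page: Page) -> List[LayoutBox]:
        # Convert centre/size fractions into absolute corner coordinates.
        return [
            LayoutBox(label,
                      (cx - w / 2) * page.width, (cy - h / 2) * page.height,
                      (cx + w / 2) * page.width, (cy + h / 2) * page.height)
            for label, cx, cy, w, h in self.raw
        ]


class CornerFormatStub:
    """Stands in for a backend that already emits absolute (x0, y0, x1, y1)."""
    def __init__(self, raw):
        self.raw = raw

    def detect(self, page: Page) -> List[LayoutBox]:
        return [LayoutBox(label, x0, y0, x1, y1) for label, x0, y0, x1, y1 in self.raw]


Step = Callable[[List[LayoutBox]], List[LayoutBox]]


def reading_order(boxes: List[LayoutBox]) -> List[LayoutBox]:
    """Naive single-column heuristic: sort by top edge, then left edge."""
    return sorted(boxes, key=lambda b: (b.y0, b.x0))


def run_pipeline(detector: LayoutDetector, page: Page,
                 steps: Sequence[Step] = ()) -> List[LayoutBox]:
    """Run any detector, then apply each post-processing step in turn."""
    boxes = detector.detect(page)
    for step in steps:
        boxes = step(boxes)
    return boxes


if __name__ == "__main__":
    page = Page(width=1000, height=1000)
    yolo_like = CenterFormatStub([("text", 0.5, 0.625, 0.75, 0.25),
                                  ("title", 0.5, 0.125, 0.5, 0.25)])
    corner_like = CornerFormatStub([("text", 125, 500, 875, 750),
                                    ("title", 250, 0, 750, 250)])
    for detector in (yolo_like, corner_like):
        print([b.label for b in run_pipeline(detector, page, [reading_order])])
    # prints ['title', 'text'] for both detectors
```

Because both stubs normalise to LayoutBox, swapping one for the other leaves the downstream steps untouched; with the sample values above, the two detectors produce identical boxes.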
Roadmap
- Add support for semantic understanding tasks (e.g., entity extraction).
- Integrate pre-trained transformer models for context-aware document analysis.
- Expand pipelines for multilingual document processing.
- Add CLI support for batch processing.
Contributing
We welcome contributions to OmniDocs! Here's how you can help:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Commit your changes and open a pull request.
For more details, refer to our CONTRIBUTING.md.
License
This project is licensed under multiple licenses, depending on the models and libraries you use in your pipeline. Please refer to the individual licenses of each component for specific terms and conditions.
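Because the applicable terms depend on which components you combine, it can help to list them explicitly for a given pipeline. This sketch records a few licenses as they appear in the tables above and flags copyleft or non-commercial ones with a simple prefix check; COMPONENT_LICENSES and pipeline_licenses are illustrative names, and the check does not replace reading each license.

```python
# A few component licenses as listed in the Supported Models tables.
COMPONENT_LICENSES = {
    "DocLayout YOLO": "AGPL-3.0",
    "Surya Layout": "GPL-3.0-or-later",
    "LayoutLMv3": "CC BY-NC-SA 4.0",
    "PPStructure (Paddle OCR)": "Apache 2.0",
    "Florence-2-DocLayNet-Fixed": "MIT",
    "pdfplumber": "MIT",
    "Camelot": "MIT",
}

# Crude flag for copyleft (AGPL/GPL) or non-commercial (CC BY-NC) terms.
RESTRICTIVE_PREFIXES = ("AGPL", "GPL", "CC BY-NC")


def pipeline_licenses(components):
    """Return (component -> license, sorted names with restrictive licenses)."""
    licenses = {}
    for name in components:
        if name not in COMPONENT_LICENSES:
            raise KeyError(f"no license recorded for {name!r}")
        licenses[name] = COMPONENT_LICENSES[name]
    flagged = sorted(
        name for name, lic in licenses.items() if lic.startswith(RESTRICTIVE_PREFIXES)
    )
    return licenses, flagged
```

For example, pipeline_licenses(["DocLayout YOLO", "pdfplumber"]) returns both licenses and flags only DocLayout YOLO, whose AGPL-3.0 terms then govern how that pipeline can be distributed or served.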
Support the Project
If you find OmniDocs helpful, please give us a ⭐ on GitHub and share it with others in the community.
Join the Community
For discussions, questions, or feedback:
- Issues: report bugs or suggest features on the GitHub issue tracker.
- Email: Reach out at adithyaskolavi@gmail.com