ViViD: Vision Language model for Unified Visual Understanding of Documents

Abstract

Documents are increasingly complex, spanning intricate layouts, mathematical expressions, structured tables, and complex diagrams; the need for a unified approach to document understanding has never been greater. Current approaches to document analysis rely on multiple specialized models, each dedicated to a single task such as layout detection, OCR, table parsing, math expression recognition, or image captioning. This fragmentation results in resource-intensive, heavyweight pipelines that are difficult to deploy efficiently.

We introduce ViViD (Visual Interpretation using Vision Language Models for Documents), a framework that consolidates diverse document analysis tasks into a single, unified model capable of performing essential tasks such as layout detection, OCR, table parsing, math expression recognition, and image captioning. Its flexible and customizable design enables a more integrated understanding of documents, as the model can interpret complex structures and relationships across tasks. ViViD is optimized to adapt to various document types and analysis requirements, making it both powerful and versatile.

At the core of ViViD is a Vision Language Model (VLM) fine-tuned through a novel multi-task strategy, optimizing its ability to handle diverse document structures in a streamlined pipeline. A key contribution of ViViD is identifying the most effective multi-task fine-tuning strategy, striking a balance between high performance and comprehensive task coverage. Our experiments demonstrate that ViViD significantly enhances document understanding, offering a reliable, efficient solution for industries such as finance, legal, and research, where analyzing complex documents is critical. The model resulting from this framework, ViViD-0.7B, provides a robust, all-in-one tool for handling a wide range of document analysis tasks, with the added benefit of being deployable in resource-constrained environments.

Method

ViViD leverages Vision Language Models (VLMs) with parameter-efficient fine-tuning techniques to achieve multi-task performance in a single framework. We experimented with both full fine-tuning and LoRA (Low-Rank Adaptation) fine-tuning to handle diverse document understanding tasks.
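
To make the LoRA setup concrete, here is a minimal sketch using the Hugging Face transformers and peft libraries. The base checkpoint, target modules, and hyperparameters below are illustrative assumptions; this section does not specify the exact values used for ViViD.

# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# All names and hyperparameters here are illustrative assumptions.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen2-VL-2B-Instruct"  # hypothetical backbone, not confirmed by the paper
model = AutoModelForVision2Seq.from_pretrained(BASE_MODEL)
processor = AutoProcessor.from_pretrained(BASE_MODEL)

# Inject low-rank adapters into the attention projections; the frozen base
# weights are shared, so one adapter set serves every document task.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a small fraction is trainable

Training then proceeds as usual on a mixture of task-specific examples; only the adapter weights are updated, which is what keeps the multi-task model lightweight enough for constrained deployments.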

Results

Multi-task Document Understanding

ViViD demonstrates strong performance across various document understanding tasks, all served by a single checkpoint (see the usage sketch after this list):

  • Layout Detection
  • OCR
  • Table Extraction
  • Math Expression Recognition
  • Image Captioning
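
The sketch below illustrates how one checkpoint can cover all five tasks by conditioning generation on a per-task instruction. The checkpoint id, prompt wording, and output formats are assumptions for illustration, not the released interface.

# Hypothetical single-model inference loop: one set of weights, one
# instruction per task. Checkpoint id and prompts are assumptions.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

CHECKPOINT = "adithya-s-k/ViViD-0.7B"  # hypothetical repo id
model = AutoModelForVision2Seq.from_pretrained(CHECKPOINT)
processor = AutoProcessor.from_pretrained(CHECKPOINT)

TASK_PROMPTS = {
    "layout":  "Detect and label the layout regions on this page.",
    "ocr":     "Transcribe all text in this image.",
    "table":   "Extract the table in this image as HTML.",
    "math":    "Convert the mathematical expression in this image to LaTeX.",
    "caption": "Write a short caption describing this figure.",
}

def run_task(image_path: str, task: str) -> str:
    image = Image.open(image_path)
    inputs = processor(images=image, text=TASK_PROMPTS[task], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

print(run_task("page.png", "table"))  # e.g. an HTML table string

Because every task shares the same weights and differs only in its instruction, adding a new document task amounts to defining a new prompt and fine-tuning on matching examples, rather than training and serving another specialized model.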

BibTeX

@article{AdithyaSKolavi2024ViViD,
  title    = {ViViD: Vision Language Model for Unified Visual Understanding of Documents},
  author   = {Adithya S Kolavi},
  year     = {2024},
  url      = {https://github.com/adithya-s-k/ViViD},
  abstract = {ViViD integrates multiple document analysis tasks into a single model, reducing the need for multiple specialized models and enhancing efficiency for complex document understanding.}
}