Document Understanding

Document understanding aims to enable computers to comprehend the content and structure of documents, including text, images, and layout, so that they can extract key information and answer questions. Current research focuses on making multimodal large language models (MLLMs) more efficient and accurate at this task, often through knowledge distillation, synthetic data generation, and efficient visual processing for high-resolution and long-context documents. These advances matter because they improve information retrieval, automate document processing, and address privacy concerns through techniques such as machine unlearning, with applications in fields from healthcare to finance.

Papers

November 22, 2023

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li
Multimodal Large Language Model Environment Exploration Document Understanding Text Grounding

November 20, 2023

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, Can Huang
Large Language Model Real Power Large Multimodal Model High Resolution Document Understanding Frequency Domain Visual Token OCR Free

November 16, 2023

Efficient End-to-End Visual Document Understanding with Rationale Distillation
Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova
Large Multimodal Model Optical Character Recognition Document Understanding Character Recognition Image to Text

November 9, 2023

DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing Learning Efficiency
Azhar Shaikh, Michael Cochez, Denis Diachkov, Michiel de Rijcke, Sahar Yousefi
Optical Character Recognition Efficient Learning Document Understanding Character Recognition Model Pruning Harnessing Data

September 22, 2023

Document Understanding for Healthcare Referrals
Jimit Mistry, Natalia M. Arzeno
Open Domain Document Understanding Exam Document Partial AUC

September 21, 2023

SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap
Daehee Kim, Yoonsik Kim, DongHyun Kim, Yumin Lim, Geewook Kim, Taeho Kil
Natural Language Pre Training Scene Text Document Understanding Domain Gap Language Model Pre Training Text Rendering

September 11, 2023

September 3, 2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu, Yinsong Liu, Deqiang Jiang, Xing Sun
Human Attention Document Understanding Document Understanding Task

August 15, 2023

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao
Document Understanding Multi Head Self Attention Visually Rich Document

July 31, 2023

Workshop on Document Intelligence Understanding
Soyeon Caren Han, Yihao Ding, Siwen Luo, Josiah Poon, HeeGuen Yoon, Zhe Huang, Paul Duuring, Eun Jung Holden
Document Understanding Document Intelligence

July 24, 2023

MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary
Beiya Dai, Xing Li, Qunyi Xie, Yulin Li, Xiameng Qin, Chengquan Zhang, Kun Yao, Junyu Han
Document Understanding Document Boundary Margin Maximization Gallery Style OCR Document Dewarping

July 4, 2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
Multimodal Large Language Model Document Understanding OCR Free

June 5, 2023

Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models
Jiabang He, Yi Hu, Lei Wang, Xing Xu, Ning Liu, Hui Liu, Heng Tao Shen
Pre Trained Model Distribution Generalization Document Understanding Global Descriptor Distribution Shift Detection

June 2, 2023

DocFormerv2: Local Features for Document Understanding
Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
Document Understanding Local Feature Multi Modal Transformer Novel Task Encoder Decoder Transformer

May 30, 2023

May 24, 2023

May 23, 2023

DUBLIN -- Document Understanding By Language-Image Network
Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary
Visual Question Answering Document Understanding Question Answering Task Document Classification Language Image

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Long-Range Transformer Architectures for Document Understanding

Table Detection for Visually Rich Document Images

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content