Document Understanding
Document understanding aims to enable computers to comprehend both the content and the structure of documents, including text, images, and layout, in order to extract key information and answer questions. Current research focuses on improving the efficiency and accuracy of multimodal large language models (MLLMs) for this task, often through knowledge distillation, synthetic data generation, and efficient visual processing for high-resolution and long-context documents. These advances matter because they improve information retrieval, automate document processing, and address privacy concerns through techniques such as machine unlearning, with applications in fields ranging from healthcare to finance.
Papers
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu
M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding
Yihao Ding, Lorenzo Vaiani, Caren Han, Jean Lee, Paolo Garza, Josiah Poon, Luca Cagliero