Text-Rich Images
Text-rich images, which carry substantial textual information alongside visual content, pose a distinct challenge for artificial intelligence: models must integrate both modalities to achieve comprehensive understanding. Current research centers on multimodal large language models (MLLMs) built on transformer architectures, using techniques such as instruction tuning and data-centric training to improve text recognition, layout understanding, and complex reasoning over these images. Accurate interpretation of text within images is essential to applications across diverse domains, including document processing, medical image analysis, and web content understanding. The development of robust benchmarks, such as MMR and SEED-Bench-2-Plus, is another key focus, driving the creation of more capable and reliable models.
Papers
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li