Multi-Modal Benchmark
Multi-modal benchmarks are crucial for evaluating models that process and integrate information from multiple data types (e.g., text, images, audio). Current research focuses on building comprehensive benchmarks that address limitations of existing datasets, such as insufficient diversity, lack of long-context understanding, and potential data leakage, often employing large language models (LLMs) for data generation and annotation. These benchmarks are vital for developing robust multi-modal models and for improving applications across diverse fields, including video understanding, document analysis, and e-commerce.
Papers
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation
Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang