Question Answering Benchmark
Question answering (QA) benchmarks are crucial for evaluating the capabilities of large language models (LLMs) across diverse domains and levels of complexity. Current research focuses on developing benchmarks that assess LLMs' abilities to handle long-context inputs, reason across multiple documents and modalities (e.g., video and text), and answer questions accurately in low-resource languages and in specialized fields such as economics and healthcare. Such benchmarks help identify strengths and weaknesses in LLMs, guide model improvements, and ultimately advance the development of more reliable and robust AI systems for a wide range of applications.
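To make the evaluation side concrete, the sketch below shows the exact-match and token-level F1 scoring commonly used for extractive and short-answer QA benchmarks (SQuAD-style normalization). It is a minimal illustration, not the scoring protocol of any specific paper listed here; the example predictions and gold answers are placeholders.

```python
# Minimal sketch of common QA benchmark scoring: exact match (EM) and token-level F1,
# with SQuAD-style answer normalization. Data below is illustrative only.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative (prediction, gold answer) pairs standing in for benchmark outputs.
examples = [
    ("The Eiffel Tower", "Eiffel Tower"),
    ("in 1889", "1889"),
]
em = sum(exact_match(p, r) for p, r in examples) / len(examples)
f1 = sum(token_f1(p, r) for p, r in examples) / len(examples)
print(f"EM: {em:.2f}  F1: {f1:.2f}")
```

Benchmarks that target long-context, multimodal, or multi-hop settings typically layer task-specific protocols (e.g., multiple-choice accuracy or LLM-based judging) on top of, or in place of, these string-matching metrics.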
Papers
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang
SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types
Xuanliang Zhang, Dingzirui Wang, Baoxin Wang, Longxu Dou, Xinyuan Lu, Keyan Xu, Dayong Wu, Qingfu Zhu, Wanxiang Che