New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in the performance of models built on architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets enable more rigorous comparisons between models and highlight areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
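As a rough illustration of the "quantify uncertainty" point above, the sketch below scores a model on a benchmark and attaches a bootstrap confidence interval to its accuracy. It is a minimal example under stated assumptions, not the method of any listed paper: the `evaluate_with_ci` helper, the `predict` callable interface, and the toy data are all hypothetical.

```python
# Minimal sketch: benchmark accuracy with a 95% bootstrap confidence interval.
# The model interface and data here are illustrative placeholders.
import random
from typing import Callable, Sequence, Tuple

def evaluate_with_ci(
    predict: Callable[[str], str],        # hypothetical model interface
    examples: Sequence[Tuple[str, str]],  # (input, gold answer) pairs
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Return accuracy and a 95% bootstrap confidence interval."""
    correct = [1.0 if predict(x) == y else 0.0 for x, y in examples]
    acc = sum(correct) / len(correct)

    # Resample per-example scores with replacement to estimate uncertainty.
    rng = random.Random(seed)
    boot_accs = []
    for _ in range(n_boot):
        resample = [rng.choice(correct) for _ in correct]
        boot_accs.append(sum(resample) / len(resample))
    boot_accs.sort()
    lo = boot_accs[int(0.025 * n_boot)]
    hi = boot_accs[int(0.975 * n_boot)]
    return acc, (lo, hi)

if __name__ == "__main__":
    # Toy "benchmark" and a trivial stand-in model, for illustration only.
    data = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
    acc, (lo, hi) = evaluate_with_ci(lambda x: "4" if "2+2" in x else "Paris", data)
    print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting an interval rather than a single number is one simple way benchmarks can make model comparisons more robust to the particular sample of test items.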
Papers
Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark
Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig
Emergent inabilities? Inverse scaling over the course of pretraining
James A. Michaelov, Benjamin K. Bergen
Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media
Kung-Hsiang Huang, Hou Pong Chan, Kathleen McKeown, Heng Ji
Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation
Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Atul Kumar Singh, Ganesh Ramakrishnan, G. Anil Kumar, Jitin Singla
DAPR: A Benchmark on Document-Aware Passage Retrieval
Kexin Wang, Nils Reimers, Iryna Gurevych
TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation
Yiming Ai, Zhiwei He, Kai Yu, Rui Wang
RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars
Dongwei Pan, Long Zhuo, Jingtan Piao, Huiwen Luo, Wei Cheng, Yuxin Wang, Siming Fan, Shengqi Liu, Lei Yang, Bo Dai, Ziwei Liu, Chen Change Loy, Chen Qian, Wayne Wu, Dahua Lin, Kwan-Yee Lin
Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study
Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, Dongmei Zhang
DUMB: A Benchmark for Smart Evaluation of Dutch Models
Wietse de Vries, Martijn Wieling, Malvina Nissim
FACTIFY3M: A Benchmark for Multimodal Fact Verification with Explainability through 5W Question-Answering
Megha Chakraborty, Khushbu Pahwa, Anku Rani, Shreyas Chatterjee, Dwip Dalal, Harshit Dave, Ritvik G, Preethi Gurumurthy, Adarsh Mahor, Samahriti Mukherjee, Aditya Pakala, Ishan Paul, Janvita Reddy, Arghya Sarkar, Kinjal Sensharma, Aman Chadha, Amit P. Sheth, Amitava Das
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches
Zihan Wang, Tianle Wang, Dheeraj Mekala, Jingbo Shang
MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup
Hua Shen, Vicky Zayats, Johann C. Rocholl, Daniel D. Walker, Dirk Padfield
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning
Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe, Brian MacWhinney
A benchmark for computational analysis of animal behavior, using animal-borne tags
Benjamin Hoffman, Maddie Cusimano, Vittorio Baglione, Daniela Canestrari, Damien Chevallier, Dominic L. DeSantis, Lorène Jeantet, Monique A. Ladds, Takuya Maekawa, Vicente Mata-Silva, Víctor Moreno-González, Eva Trapote, Outi Vainio, Antti Vehkaoja, Ken Yoda, Katherine Zacarian, Ari Friedlaender
NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing
Tingting Wu, Xiao Ding, Minji Tang, Hao Zhang, Bing Qin, Ting Liu