Multiple Choice
Multiple-choice question answering (MCQA) is a central benchmark for evaluating large language models (LLMs), testing their knowledge, reasoning, and instruction-following across diverse domains. Current research focuses on improving LLM performance on MCQA by addressing limitations such as format bias and by developing more robust evaluation metrics, often using parameter-efficient fine-tuning (e.g., LoRA) and analysis of attention mechanisms within transformer architectures. These advances matter because reliable MCQA benchmarks are essential for driving LLM development and for responsible deployment in applications ranging from education and healthcare to specialized fields such as materials science and cybersecurity.
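As a concrete illustration of how MCQA evaluation is commonly carried out, the minimal sketch below scores each answer option by the length-normalized log-likelihood a causal LM assigns to it and picks the highest-scoring option. It assumes a Hugging Face causal language model; the model name "gpt2", the helper `option_logprob`, and the toy question are illustrative choices, not taken from any of the papers listed below.

```python
# Minimal sketch: log-likelihood scoring for multiple-choice QA
# with a Hugging Face causal LM ("gpt2" is only an example model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last position.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    # Keep only the positions that predict the option tokens.
    option_positions = log_probs[:, prompt_ids.shape[1] - 1 :, :]
    token_scores = option_positions.gather(2, option_ids.unsqueeze(-1)).squeeze(-1)
    # Length-normalize so longer options are not penalized.
    return token_scores.mean().item()

question = "Q: Which planet is known as the Red Planet?\nA:"
options = ["Mars", "Venus", "Jupiter", "Saturn"]
scores = {o: option_logprob(question, o) for o in options}
print(max(scores, key=scores.get))  # predicted answer
```

Scoring all options with the same prompt, rather than asking the model to emit a letter, sidesteps some of the format and position biases that several of the papers below examine, though it does not eliminate them.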
Papers
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework
Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models
Yuxuan Zhang, Ruizhe Li
On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models
Pedro Cisneros-Velarde
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Xunzhi Wang, Zhuowei Zhang, Qiongyu Li, Gaonan Chen, Mengting Hu, Zhiyu Li, Bitong Luo, Hang Gao, Zhixin Han, Haotian Wang
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
Uncovering Limitations of Large Language Models in Information Seeking from Tables
Chaoxu Pang, Yixuan Cao, Chunhao Yang, Ping Luo