Multiple Choice
Multiple-choice question answering (MCQA) serves as a crucial benchmark for evaluating large language models (LLMs), assessing their knowledge, reasoning, and instruction-following ability across diverse domains. Current research focuses on addressing limitations such as format bias, developing more robust evaluation metrics, and probing model behavior through techniques like parameter-efficient fine-tuning (e.g., LoRA) and analysis of attention mechanisms within transformer architectures. These advances matter because reliable MCQA benchmarks are essential for tracking LLM progress and ensuring responsible deployment across applications ranging from education and healthcare to specialized fields such as materials science and cybersecurity.
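When MCQA benchmarks are run in practice, one common scoring rule is to compare the model's log-likelihood of each answer option given the question and pick the highest-scoring one. The snippet below is a minimal sketch of that approach, assuming a Hugging Face causal LM ("gpt2" is used purely as a placeholder model); the prompt format and scoring rule are illustrative and not taken from any specific benchmark or paper listed here.

```python
# Minimal sketch: log-likelihood scoring of multiple-choice options.
# Assumptions: transformers + torch installed; "gpt2" stands in for any causal LM;
# tokenization of the prompt is a prefix of the prompt+option tokenization
# (approximately true for GPT-2's BPE with a leading space on the option).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum the model's log-probabilities over the answer-option tokens only."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full_ids[0, 1:]
    start = prompt_ids.shape[1] - 1                # first position predicting an option token
    return log_probs[start:].gather(1, targets[start:].unsqueeze(1)).sum().item()

question = "Q: Which metal is liquid at room temperature? A:"
options = ["Mercury", "Iron", "Copper", "Aluminium"]
scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get))  # the model's chosen answer
```

Length-normalizing the summed log-probabilities (dividing by the number of option tokens) is a common variant when options differ substantially in length.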
Papers
A review of faithfulness metrics for hallucination assessment in Large Language Models
Ben Malin, Tatiana Kalganova, Nikolaos Boulgouris
EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta
Raymond Bernard, Shaina Raza (PhD), Subhabrata Das (PhD), Rahul Murugan
Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT
Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger
Benchmarking large language models for materials synthesis: the case of atomic layer deposition
Angel Yanguas-Gil, Matthew T. Dearing, Jeffrey W. Elam, Jessica C. Jones, Sungjoon Kim, Adnan Mohammad, Chi Thang Nguyen, Bratin Sengupta