Evaluation Datasets
Evaluation datasets are crucial for benchmarking the performance of artificial intelligence models, particularly large language models (LLMs) and related systems such as retrieval-augmented generation (RAG) pipelines and multimodal LLMs. Current research emphasizes building more robust and representative datasets that address the limitations of existing benchmarks, with particular attention to dynamic interactions, factual accuracy, reasoning capabilities, and ethical considerations in data sourcing and bias mitigation. These efforts are vital for reliable model comparisons, responsible AI development, and ultimately for improving the performance and trustworthiness of AI systems across diverse applications.
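For context, the sketch below illustrates the basic pattern that most evaluation datasets share: a fixed set of inputs paired with reference answers, a model under test, and a scoring function. The toy dataset, the dummy_model stand-in, and the exact-match metric are illustrative placeholders only and are not drawn from any of the papers listed here.

```python
# Minimal sketch of benchmarking a model against an evaluation dataset.
# All names and data here are hypothetical placeholders for illustration.

from typing import Callable, Dict, List


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    dataset: List[Dict[str, str]],
    model_answer: Callable[[str], str],
) -> float:
    """Fraction of questions the model answers exactly correctly."""
    correct = 0
    for example in dataset:
        prediction = model_answer(example["question"])
        if normalise(prediction) == normalise(example["answer"]):
            correct += 1
    return correct / len(dataset) if dataset else 0.0


if __name__ == "__main__":
    # A toy evaluation set; real benchmarks contain thousands of curated examples.
    toy_dataset = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "How many continents are there?", "answer": "seven"},
    ]

    # Stand-in for a call to an actual LLM.
    def dummy_model(question: str) -> str:
        return "Paris" if "France" in question else "six"

    print(f"Exact-match accuracy: {exact_match_accuracy(toy_dataset, dummy_model):.2f}")
```

In practice the same loop is wrapped around a real model API and a metric suited to the task (e.g., F1 for extractive QA or faithfulness checks for RAG), but the dataset-as-fixed-test-set structure stays the same.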
Papers
CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews
Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury
Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?
Tim Hartill, Joshua Bensemann, Michael Witbrock, Patricia J. Riddle