Large Scale Evaluation
Large-scale evaluation aims to rigorously assess the performance of machine learning models and algorithms across diverse datasets and tasks, providing objective benchmarks for comparison and advancement. Current research focuses on developing standardized evaluation frameworks and metrics for various modalities, including images, text, speech, and even gestures, often employing transformer-based models and Bayesian deep learning techniques. These comprehensive evaluations are crucial for identifying strengths and weaknesses of existing methods, guiding future research directions, and ultimately improving the reliability and effectiveness of AI systems in real-world applications.
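The core pattern behind such evaluation frameworks is scoring every model against every dataset with a shared metric, producing a comparable benchmark table. Below is a minimal sketch of that idea; the names (`evaluate_suite`, `accuracy`, the toy models and datasets) are illustrative assumptions, not an API from any particular benchmark.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed simplification: a "model" is any callable mapping an input to a
# prediction, and a "dataset" is a sequence of (input, label) pairs.
Model = Callable[[object], object]
Dataset = Sequence[Tuple[object, object]]

def accuracy(model: Model, dataset: Dataset) -> float:
    """Fraction of examples where the model's prediction equals the label."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)

def evaluate_suite(models: Dict[str, Model],
                   datasets: Dict[str, Dataset],
                   metric: Callable[[Model, Dataset], float] = accuracy
                   ) -> Dict[str, Dict[str, float]]:
    """Apply one shared metric to every (model, dataset) pair,
    yielding a model-by-dataset benchmark table."""
    return {m_name: {d_name: metric(model, data)
                     for d_name, data in datasets.items()}
            for m_name, model in models.items()}

# Toy usage: two trivial "models" scored on two tiny "datasets".
models = {"always_one": lambda x: 1, "identity": lambda x: x}
datasets = {
    "ones": [(1, 1), (0, 1), (1, 1)],
    "echo": [(0, 0), (1, 1), (2, 2)],
}
table = evaluate_suite(models, datasets)
```

Keeping the metric as a parameter is what makes the harness "standardized": swapping in a different metric (e.g. F1 or word error rate) changes the scores but not the cross-model, cross-dataset comparison structure.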