Novel Benchmark
Novel benchmarks are being developed to rigorously evaluate the performance of large language models (LLMs) and other AI models across diverse tasks, addressing limitations in existing evaluation methods. Current research focuses on creating benchmarks that assess capabilities in areas such as code generation, multimodal reasoning, and handling complex real-world scenarios, often incorporating diverse data sources and evaluating robustness to various factors like language variations and data distribution shifts. These improved benchmarks are crucial for advancing the field by providing more accurate and comprehensive assessments of model performance, ultimately leading to the development of more reliable and effective AI systems for various applications.
Papers
A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
Elena Kardanova, Alina Ivanova, Ksenia Tarasova, Taras Pashchenko, Aleksei Tikhoniuk, Elen Yusupova, Anatoly Kasprzhak, Yaroslav Kuzminov, Ekaterina Kruchinskaia, Irina Brun (National Research University Higher School of Economics, Moscow, Russia)
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou, Shikun Zhang, Wei Ye
Benchmarking Mobile Device Control Agents across Diverse Configurations
Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, Kimin Lee
Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, Chengjian Zheng, Diankai Zhang, Ning Wang, Xintao Qiu, Yuanbo Zhou, Kongxian Wu, Xinwei Dai, Hui Tang, Wei Deng, Qingquan Gao, Tong Tong, Jae-Hyeon Lee, Ui-Jin Choi, Min Yan, Xin Liu, Qian Wang, Xiaoqian Ye, Zhan Du, Tiansen Zhang, Long Peng, Jiaming Guo, Xin Di, Bohao Liao, Zhibo Du, Peize Xia, Renjing Pei, Yang Wang, Yang Cao, Zhengjun Zha, Bingnan Han, Hongyuan Yu, Zhuoyuan Wu, Cheng Wan, Yuqing Liu, Haodong Yu, Jizhe Li, Zhijuan Huang, Yuan Huang, Yajun Zou, Xianyu Guan, Qi Jia, Heng Zhang, Xuanwu Yin, Kunlong Zuo, Hyeon-Cheol Moon, Tae-hyun Jeong, Yoonmo Yang, Jae-Gon Kim, Jinwoo Jeong, Sunjei Kim