Massive Multitask Language Understanding
Massive Multitask Language Understanding (MMLU) research aims to evaluate the breadth and depth of knowledge and reasoning in large language models (LLMs) across diverse domains. Current work focuses on building more robust and challenging benchmarks, such as MMLU-Pro and its variants, and on addressing issues such as shortcut learning, answer-order bias, and data contamination so that reported performance metrics are more reliable. These efforts inform both the scientific understanding of LLM capabilities and the responsible deployment of LLMs across application domains.
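One of the issues named above, answer-order bias, can be probed by permuting the answer choices of a multiple-choice question and checking whether the model keeps selecting the same underlying option. The sketch below is a minimal illustration of that idea, not any benchmark's official harness; the function names and the always-pick-"B" toy model are assumptions introduced here for demonstration.

```python
from itertools import permutations

def permute_choices(question, choices):
    """Yield every ordering of the answer choices, relabeled A-D.

    A model free of answer-order bias should select the same underlying
    choice text regardless of which letter it appears under.
    """
    letters = "ABCD"
    for order in permutations(range(len(choices))):
        relabeled = {letters[i]: choices[j] for i, j in enumerate(order)}
        yield question, relabeled

def consistency_rate(answers):
    """Fraction of permutations on which the modal choice text was picked.

    `answers` holds the choice *texts* (not letters) the model returned,
    one per permutation; 1.0 means perfectly order-invariant behavior.
    """
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

# Toy illustration: a hypothetical degenerate model that always answers "B",
# the pathological case answer-order probes are designed to catch.
question = "Which gas is most abundant in Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]
picked = [relabeled["B"] for _, relabeled in permute_choices(question, choices)]
print(consistency_rate(picked))  # 0.25: each text lands under "B" equally often
```

An order-invariant model would score 1.0 here; the degenerate letter-picker scores only 1/4 because each of the four choice texts appears under "B" in an equal share of the 24 orderings.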
17 papers