NLP Benchmark
NLP benchmarks are standardized evaluation sets used to assess the performance of natural language processing (NLP) models across a range of tasks, with the goal of comparing and improving model capabilities objectively. Current research focuses on building more comprehensive benchmarks that address the limitations of existing datasets, including biases, the need for more diverse question types and languages, and the evaluation of reasoning abilities beyond simple memorization; related work also explores efficiency techniques such as knowledge distillation and multi-layer key-value caching. These advances are crucial for driving progress in NLP, enabling the development of more robust and reliable models for real-world applications.
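To make the idea of a standardized multi-task benchmark concrete, here is a minimal sketch of an evaluation loop. The toy task data, the `dummy_model` placeholder, and the `evaluate` helper are all hypothetical and not taken from any of the papers above; they only illustrate how per-task accuracy and a macro average, the kind of headline score many benchmark suites report, can be computed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: each task maps to (input, expected_output) pairs.
# Real benchmarks ship far larger, curated test sets; this is only illustrative.
BENCHMARK: Dict[str, List[Tuple[str, str]]] = {
    "sentiment": [("the film was wonderful", "positive"),
                  ("a dull, lifeless plot", "negative")],
    "nli": [("A man is sleeping. | A man is awake.", "contradiction"),
            ("A dog runs. | An animal moves.", "entailment")],
}

def dummy_model(task: str, text: str) -> str:
    """Placeholder for a real NLP model; always predicts a fixed label per task."""
    return {"sentiment": "positive", "nli": "entailment"}[task]

def evaluate(model: Callable[[str, str], str]) -> Dict[str, float]:
    """Compute per-task accuracy and a macro average across tasks."""
    scores: Dict[str, float] = {}
    for task, examples in BENCHMARK.items():
        correct = sum(model(task, x) == y for x, y in examples)
        scores[task] = correct / len(examples)
    scores["macro_avg"] = sum(scores.values()) / len(BENCHMARK)
    return scores

if __name__ == "__main__":
    for name, score in evaluate(dummy_model).items():
        print(f"{name}: {score:.2f}")
```

Macro averaging weights every task equally regardless of its test-set size, which is a common design choice when a benchmark is meant to reward broad capability rather than performance on its largest task.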