Reasoning Benchmark
Reasoning benchmarks are standardized tests designed to evaluate the logical reasoning capabilities of large language models (LLMs). Current research focuses on developing more challenging benchmarks that go beyond simple question answering, including tasks that require multi-step reasoning, long-context handling, and diverse reasoning types (deductive, inductive, abductive, analogical). Work in this area also studies techniques for improving LLM performance on these benchmarks, such as chain-of-thought prompting, in-context learning, and model architectures that incorporate generator-discriminator networks or hybrid thinking frameworks. Robust and comprehensive reasoning benchmarks are crucial for advancing artificial intelligence because they provide objective measures of progress and identify areas that need further research.
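To make the idea of an "objective measure" concrete, the following is a minimal sketch of how a reasoning benchmark might be scored, assuming exact-match grading and a chain-of-thought style prompt. The names `Item`, `cot_prompt`, `extract_answer`, and `evaluate`, as well as the prompt wording and the "Answer:" convention, are hypothetical choices made for illustration and are not taken from any of the papers listed here. The model is represented as a plain callable from prompt text to completion text.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Item(NamedTuple):
    """One benchmark example (hypothetical schema)."""
    question: str
    answer: str
    reasoning_type: str  # e.g. "deductive", "inductive"


def cot_prompt(question: str) -> str:
    # Assumed prompt format: ask for step-by-step reasoning, then a marked answer.
    return (
        f"Q: {question}\n"
        "Let's think step by step, then give the final answer after 'Answer:'."
    )


def extract_answer(completion: str) -> str:
    # Take whatever follows the last 'Answer:' marker; fall back to the full text.
    marker = "Answer:"
    idx = completion.rfind(marker)
    tail = completion[idx + len(marker):] if idx != -1 else completion
    return tail.strip().lower()


def evaluate(items: List[Item], model: Callable[[str], str]) -> Dict[str, float]:
    """Return exact-match accuracy broken down by reasoning type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        prediction = extract_answer(model(cot_prompt(item.question)))
        total[item.reasoning_type] += 1
        correct[item.reasoning_type] += int(prediction == item.answer.strip().lower())
    return {rtype: correct[rtype] / total[rtype] for rtype in total}
```

Reporting accuracy per reasoning type, rather than a single aggregate number, is one simple way to see which kinds of reasoning a model handles poorly. Real benchmarks often use more forgiving answer matching than the strict string comparison assumed here.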
Papers
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu
LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
Odysseas S. Chlapanis, Dimitrios Galanis, Ion Androutsopoulos
Mars: Situated Inductive Reasoning in an Open-World Environment
Xiaojuan Tang, Jiaqi Li, Yitao Liang, Song-chun Zhu, Muhan Zhang, Zilong Zheng
Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning
Hyun Ryu, Gyeongman Kim, Hyemin S. Lee, Eunho Yang
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
Stephen Miner, Yoshiki Takashima, Simeng Han, Ferhat Erata, Timos Antonopoulos, Ruzica Piskac, Scott J Shapiro
A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
Laurène Vaugrante, Mathias Niepert, Thilo Hagendorff