Reasoning Benchmark
Reasoning benchmarks are standardized tests designed to evaluate the logical reasoning capabilities of large language models (LLMs). Current research focuses on developing more challenging benchmarks that go beyond simple question answering, including tasks that require multi-step reasoning, long-context handling, and diverse reasoning types (deductive, inductive, abductive, and analogical). To improve performance on these benchmarks, models draw on techniques such as chain-of-thought prompting, in-context learning, and architectures that incorporate generator-discriminator networks or hybrid thinking frameworks. Robust, comprehensive reasoning benchmarks are crucial for advancing artificial intelligence: they provide objective measures of progress and identify areas that need further research.
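To make the evaluation loop concrete, the following is a minimal sketch of chain-of-thought prompting scored by exact-match accuracy. The model call is a stub standing in for a real LLM API, and the exemplar, marker string, and helper names are illustrative assumptions, not part of any specific benchmark.

```python
# Sketch of chain-of-thought (CoT) evaluation on a reasoning benchmark.
# The "model" here is a stub; in practice it would be an LLM API call.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model reasons step by step."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def extract_answer(completion: str) -> str:
    """Pull the final answer after the 'The answer is' marker."""
    marker = "The answer is"
    idx = completion.rfind(marker)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(marker):].strip(" .\n")

def evaluate(model, benchmark) -> float:
    """Exact-match accuracy over (question, gold_answer) pairs."""
    correct = 0
    for question, gold in benchmark:
        completion = model(build_cot_prompt(question))
        correct += extract_answer(completion) == gold
    return correct / len(benchmark)

# Stub model emitting a fixed chain of thought, purely for illustration.
def stub_model(prompt: str) -> str:
    return "There are 3 + 4 = 7 apples. The answer is 7."

benchmark = [("If you have 3 apples and get 4 more, how many?", "7")]
print(evaluate(stub_model, benchmark))  # → 1.0
```

The answer-extraction step is what makes chain-of-thought benchmarks scoreable: the model is free to produce intermediate reasoning, but only the final answer after the marker is compared against the gold label.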
Papers
Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement
Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, Xuanjing Huang
Exploring Self-supervised Logic-enhanced Training for Large Language Models
Fangkai Jiao, Zhiyang Teng, Bosheng Ding, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty