Execution-Based
Execution-based evaluation is reshaping how large language models (LLMs) are assessed for code generation and related tasks, shifting the focus from superficial code-similarity metrics (e.g., BLEU or exact-match scores) to actual program execution and output verification. Current research emphasizes building robust benchmarks with comprehensive test suites and exploring reinforcement learning techniques that train LLMs to use execution feedback for iterative refinement and self-debugging. This rigorous evaluation approach is crucial for improving the reliability and practical applicability of LLMs in software engineering, data science, and other fields where correct code execution is paramount.
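As a rough illustration of what execution-based evaluation involves, the sketch below runs a model-generated candidate solution against assert-style unit tests in a subprocess with a timeout and reports the pass rate. The candidate program, the tests, and the helper names (`evaluate`, `_run_candidate`) are hypothetical and not taken from any specific benchmark or paper listed here.

```python
# Minimal sketch of an execution-based evaluation harness (illustrative only).
# All names and test cases below are hypothetical.
import multiprocessing


def _run_candidate(code: str, test: str, queue: multiprocessing.Queue) -> None:
    """Execute a candidate solution followed by a single assert-based test."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the model-generated function
        exec(test, namespace)   # run one test; raises AssertionError on failure
        queue.put(True)
    except Exception:
        queue.put(False)


def evaluate(code: str, tests: list[str], timeout: float = 3.0) -> float:
    """Return the fraction of test cases the candidate passes (pass rate)."""
    passed = 0
    for test in tests:
        queue: multiprocessing.Queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_run_candidate, args=(code, test, queue))
        proc.start()
        proc.join(timeout)      # guard against non-terminating generated code
        if proc.is_alive():
            proc.terminate()
            proc.join()
            result = False
        else:
            result = queue.get() if not queue.empty() else False
        passed += int(result)
    return passed / len(tests)


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    print(f"pass rate: {evaluate(candidate, tests):.2f}")  # expected: 1.00
```

Running each test in a separate process with a timeout is a common safeguard against non-terminating or crashing generated code; production benchmarks typically add stronger sandboxing and report aggregate metrics such as pass@k over many sampled candidates.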
Papers
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang