Language Reward
Language reward research focuses on training and aligning large language models (LLMs) with reward signals derived from language itself, rather than relying solely on human-labeled data. Current work emphasizes self-supervised techniques, including contrastive learning and iterative preference optimization, and often employs LLMs as meta-judges that assess and refine their own responses. The goal is to improve alignment, efficiency, and generalization across diverse tasks such as instruction following, robotic control, and multilingual applications. These advances have significant implications for building more robust, efficient, and human-aligned AI systems.
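To make the two ideas above concrete, the sketch below shows (1) an LLM-as-judge step that turns language feedback into a preference label, and (2) a DPO-style preference-optimization loss driven by that label. It is a minimal illustration, not any paper's implementation: the `judge` callable is a hypothetical stand-in for whatever scoring model or rubric prompt is available, and the log-probability tensors stand in for per-sequence values computed from a policy and a frozen reference model.

```python
# Minimal sketch: language-derived reward (LLM-as-judge) feeding a
# DPO-style preference-optimization loss. `judge` is a hypothetical
# callable returning a scalar quality score; log-probs are assumed
# to be precomputed per response sequence.
import torch
import torch.nn.functional as F


def judge_preference(judge, prompt: str, resp_a: str, resp_b: str) -> int:
    """Use an LLM-as-judge to pick the preferred response (0 -> A, 1 -> B)."""
    score_a = judge(f"Rate 0-10 how well this answers '{prompt}':\n{resp_a}")
    score_b = judge(f"Rate 0-10 how well this answers '{prompt}':\n{resp_b}")
    return 0 if score_a >= score_b else 1


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model (shape: [batch]).
    """
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


if __name__ == "__main__":
    # Toy tensors standing in for real per-sequence log-probabilities.
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.5]))
    print(f"DPO loss: {loss.item():.4f}")
```

In an iterative setup, the judge's preferences would be collected over fresh model samples each round, with the optimized policy becoming the next round's reference.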
Papers
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following
Brian Yang, Huangyuan Su, Nikolaos Gkanatsios, Tsung-Wei Ke, Ayush Jain, Jeff Schneider, Katerina Fragkiadaki