Reward Model
Reward models are central to aligning large language models (LLMs) and other AI systems with human preferences, enabling more helpful and harmless behavior. Current research focuses on improving reward model accuracy and robustness through preference optimization, multimodal approaches that incorporate both text and image data, and methods that mitigate bias and noise in reward signals. This work often builds on transformer-based architectures and reinforcement learning algorithms. These advances matter for building more reliable and trustworthy AI systems, both in the development of safer LLMs and in the broader field of human-centered AI.
Papers
Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Miguel Moura Ramos, Patrick Fernandes, António Farinhas, André F. T. Martins
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou
SALMON: Self-Alignment with Instructable Reward Models
Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii Kuchaiev