Large Language Model Alignment
Large language model (LLM) alignment focuses on ensuring that LLMs generate outputs consistent with human values and preferences, mitigating the risks posed by harmful or misleading content. Current research emphasizes methods for aligning LLMs with both general and individual preferences, exploring techniques such as reinforcement learning from human feedback (RLHF), preference optimization, and instruction tuning, often across diverse datasets and model architectures (e.g., 7B-parameter models). These efforts are crucial for responsible LLM development, affecting not only the safety and trustworthiness of these powerful tools but also their broader adoption across research and industry applications.
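To make the preference-optimization family mentioned above concrete, here is a minimal sketch of a DPO-style objective: given log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the current policy and a frozen reference model, the loss pushes the policy to rank the chosen response higher. The function, tensor names, and the beta value are illustrative assumptions, not taken from the listed papers.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Preference-optimization loss over per-response log-probabilities.

    Each input is a 1-D tensor of summed token log-probs for a batch of
    (chosen, rejected) response pairs; `beta` scales the implicit reward.
    """
    # Log-ratio of chosen vs. rejected under the policy and the reference.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps

    # Encourage the policy to prefer the chosen response more strongly
    # than the reference does.
    logits = beta * (policy_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()
```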
Papers
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot