Alignment Performance
Alignment performance in large language models (LLMs) and other AI systems concerns how closely model outputs match human intentions and values, encompassing safety, fairness, and adherence to social norms. Current research emphasizes improving alignment through techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and in-context learning (ICL), often introducing new model architectures and algorithms to improve efficiency and robustness. These advances are crucial for responsible AI development: they mitigate the risks of harmful outputs and enable safer, more beneficial deployment of LLMs across a range of applications.
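As a rough illustration of one of these preference-optimization techniques (not taken from any of the papers listed below), the sketch below computes the standard per-example DPO loss from sequence-level log-probabilities under the policy and a frozen reference model. The tensor names, batch values, and the beta setting are illustrative assumptions, not details from the listed work.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Toy sequence-level log-probabilities for a batch of two preference pairs.
    policy_chosen = torch.tensor([-12.0, -15.0])
    policy_rejected = torch.tensor([-14.0, -13.0])
    ref_chosen = torch.tensor([-13.0, -14.5])
    ref_rejected = torch.tensor([-13.5, -13.2])
    loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected).mean()
    print(f"mean DPO loss: {loss.item():.4f}")
```

The loss pushes the policy to assign a larger log-probability margin (relative to the reference model) to the preferred response than to the rejected one, which is why the choice of reference model matters for alignment quality, as studied in the third paper below.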
Papers
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwani, Yoshua Bengio, Danqi Chen, Philip H.S. Torr, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
Impact of Preference Noise on the Alignment Performance of Generative Language Models
Yang Gao, Dana Alon, Donald Metzler
Learn Your Reference Model for Real Good Alignment
Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov