Alignment Approach
Alignment approaches aim to ensure that artificial intelligence models, particularly large language models (LLMs), behave in ways consistent with human values and intentions. Current research focuses on developing and evaluating alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and methods that leverage in-context learning and prompt engineering, sometimes implemented within specific model architectures such as mixture-of-experts. These efforts are crucial for mitigating the risks posed by misaligned AI and for building trustworthy, beneficial AI systems across diverse applications, from healthcare to conversational agents.
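To make one of the techniques named above concrete, the sketch below shows the standard DPO objective as introduced by Rafailov et al. (2023): the policy is trained to increase the log-probability ratio of a preferred completion over a rejected one, relative to a frozen reference model. This is a minimal illustrative example, not an implementation from either paper listed below; the function name, tensor shapes, and dummy values are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of per-example sequence log-probabilities
    (log pi(y | x) summed over tokens) for the chosen and rejected
    completions under the trainable policy and a frozen reference model.
    """
    # Log-probability ratios of policy vs. reference for each completion
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin scaled by beta
    logits = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid of the margin, averaged over the batch
    return -F.logsigmoid(logits).mean()

# Usage with dummy log-probabilities for a batch of 4 preference pairs
policy_chosen = torch.tensor([-12.3, -8.1, -15.0, -9.4])
policy_rejected = torch.tensor([-13.0, -9.2, -14.1, -11.0])
ref_chosen = torch.tensor([-12.5, -8.4, -14.8, -9.9])
ref_rejected = torch.tensor([-12.9, -9.0, -14.5, -10.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```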
Papers
Aligning Model Properties via Conformal Risk Control
William Overman, Jacqueline Jil Vallon, Mohsen Bayati
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker