Alignment Approach
Alignment approaches aim to ensure that AI models, particularly large language models, behave in ways consistent with human values and intentions. Current research focuses on developing and evaluating alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and methods that leverage in-context learning and prompt engineering, often implemented within specific model architectures like mixture-of-experts. These efforts are crucial for mitigating the risks of misaligned AI and for building trustworthy, beneficial systems across diverse applications, from healthcare to conversational agents.
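As a concrete illustration of one of these techniques (not taken from any of the papers listed below), the sketch that follows shows the standard DPO objective in PyTorch: the policy is trained to prefer the chosen response over the rejected one, measured against a frozen reference model. The function name dpo_loss and the beta value are illustrative choices, not an API from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal DPO objective over a batch of preference pairs.

    Inputs are summed log-probabilities of whole responses, shape (batch,);
    beta controls how far the policy may drift from the reference model.
    """
    # Implicit reward: scaled log-ratio of the policy vs. the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's implicit reward above the dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -11.0]),
                torch.tensor([-12.0, -10.2]), torch.tensor([-13.9, -10.7]))
```

Unlike RLHF, this objective needs no separately trained reward model or on-policy sampling, which is why it appears frequently as a baseline in the alignment literature surveyed here.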
Papers
Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
Yu Zhu, Chuxiong Sun, Wenfei Yang, Wenqiang Wei, Bo Tang, Tianzhu Zhang, Zhiyu Li, Shifeng Zhang, Feiyu Xiong, Jie Hu, Mingchuan Yang
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie
Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment
Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen