Safety Alignment
Safety alignment in large language models (LLMs) focuses on ensuring that these systems generate helpful, harmless outputs, mitigating risks from malicious prompts and the unintended consequences of fine-tuning. Current research emphasizes robust data curation, better-designed safety mechanisms (including those operating at the decoding stage), and a clearer understanding of how factors such as model architecture, fine-tuning technique, and even model personality influence safety. This work directly shapes the responsible development and deployment of LLMs, affecting their trustworthiness and societal impact across diverse applications.
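As a rough illustration of what a decoding-stage safety mechanism can look like, the sketch below wraps a token generator with a guard that aborts generation and returns a refusal when the partial output matches an unsafe pattern. It is not drawn from any of the papers listed here; `guarded_decode`, `generate_tokens`, and `UNSAFE_PATTERNS` are hypothetical names, and real systems typically rely on a learned safety classifier rather than regular expressions.

```python
# Minimal sketch of a decoding-stage safety guard (illustrative only).
# `generate_tokens` and `UNSAFE_PATTERNS` are hypothetical stand-ins,
# not components of any paper listed on this page.

import re
from typing import Callable, Iterable

UNSAFE_PATTERNS = [r"\bhow to build a bomb\b", r"\bcredit card numbers\b"]
REFUSAL = "I can't help with that request."

def guarded_decode(prompt: str,
                   generate_tokens: Callable[[str], Iterable[str]],
                   check_every: int = 8) -> str:
    """Stream tokens from a generator and abort if the partial output
    matches an unsafe pattern, returning a refusal instead."""
    output = []
    for i, token in enumerate(generate_tokens(prompt), start=1):
        output.append(token)
        if i % check_every == 0:
            text = "".join(output)
            if any(re.search(p, text, re.IGNORECASE) for p in UNSAFE_PATTERNS):
                return REFUSAL
    return "".join(output)

# Example with a toy generator standing in for an LLM's token stream.
if __name__ == "__main__":
    def toy_generator(prompt: str):
        yield from ["Sure", ", ", "here ", "is ", "a ", "recipe", "."]
    print(guarded_decode("Share a pancake recipe.", toy_generator))
```

Checking the partial output every few tokens, rather than only once at the end, keeps the guard cheap while still allowing unsafe continuations to be cut off early.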
Papers
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
Josef Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, Yaodong Yang
Finding Safety Neurons in Large Language Models
Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching
Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Yanyan Zhao, Bing Qin, Tat-Seng Chua
Safety Alignment for Vision Language Models
Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng