Safety Alignment
Safety alignment in large language models (LLMs) aims to ensure that these systems generate helpful and harmless outputs, mitigating risks from malicious prompts and from unintended consequences of fine-tuning. Current research emphasizes robust data curation, better-designed safety mechanisms (including those that operate at the decoding stage), and an understanding of how factors such as model architecture, fine-tuning technique, and even model personality influence safety. This work bears directly on the responsible development and deployment of LLMs, and on their trustworthiness and societal impact across applications.
Papers
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching
Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Safety Alignment for Vision Language Models
Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria