Paper ID: 2405.13820 • Published May 22, 2024
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching
Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Safety alignment of large language models (LLMs) has been gaining increasing
attention. However, current safety-aligned LLMs suffer from fragile and
imbalanced safety mechanisms: they can still be induced to generate unsafe
responses, exhibit over-safety by rejecting safe user inputs, and fail to
preserve general utility after safety alignment. To this end, we propose a
novel post safety alignment (PSA) method to address these inherent and emerging
safety challenges, including safety enhancement, over-safety mitigation, and
utility preservation. Specifically, we introduce SafePatching, a novel
framework for comprehensive PSA, where two distinct safety patches are
developed on the harmful data to enhance safety and mitigate over-safety
concerns, and then seamlessly integrated into the target LLM backbone without
compromising its utility. Extensive experiments on four representative aligned
LLMs, including LLaMA-2/3, Gemma and Mistral, show that SafePatching
achieves a more comprehensive PSA than baseline methods, further optimizing the
balance between being helpful and harmless in current aligned LLMs.
SafePatching also demonstrates its superiority in continual PSA scenarios.