Safety Alignment
Safety alignment in large language models (LLMs) focuses on ensuring that these systems generate helpful and harmless outputs, mitigating risks from malicious prompts and from the unintended consequences of fine-tuning. Current research emphasizes robust data curation, better-designed safety mechanisms (including ones that operate at the decoding stage), and a clearer understanding of how factors such as model architecture, fine-tuning techniques, and even model personality influence safety. This work directly shapes the responsible development and deployment of LLMs, affecting their trustworthiness and societal impact across diverse applications.
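As a concrete illustration of what a decoding-stage safety mechanism can look like, the sketch below shows a toy sampler that vetoes candidate tokens whose inclusion would complete a disallowed phrase. It is a minimal, hypothetical example: the blocklist, the `violates_policy` check, and the `safe_decode` loop are illustrative assumptions, not the method of any paper listed here.

```python
import random

# Purely illustrative blocklist; real systems use learned classifiers or policies.
BLOCKLIST = {"how to build a bomb"}

def violates_policy(text: str) -> bool:
    """Return True if the partial output contains a disallowed phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def safe_decode(candidates_per_step, seed=0):
    """Toy decoder: at each step, sample one of the model's candidate tokens,
    skipping any candidate whose inclusion would violate the policy."""
    rng = random.Random(seed)
    output = []
    for candidates in candidates_per_step:
        allowed = [tok for tok in candidates
                   if not violates_policy(" ".join(output + [tok]))]
        if not allowed:            # every candidate is unsafe: refuse and stop
            output.append("[refused]")
            break
        output.append(rng.choice(allowed))
    return " ".join(output)

if __name__ == "__main__":
    # Each inner list stands in for the model's top candidate tokens at a step.
    steps = [["here"], ["is"], ["how"], ["to"], ["build"], ["a"],
             ["bomb", "birdhouse"]]
    print(safe_decode(steps))   # -> "here is how to build a birdhouse"
```

Real decoding-stage approaches replace the blocklist with learned safety classifiers or constrained-decoding objectives, but the control point is the same: the check runs inside the token-generation loop rather than on the finished output.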
Papers
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme
Risk Alignment in Agentic AI Systems
Hayley Clatterbuck, Clinton Castro, Arvo Muñoz Morán
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang