LLM Safety
Large language model (LLM) safety research focuses on mitigating the risks these powerful models pose, primarily by preventing the generation of harmful or biased content and improving robustness against adversarial attacks. Current work emphasizes developing and evaluating defense mechanisms, including alignment techniques such as preference optimization and constrained direct preference optimization, as well as analyzing attack methods such as jailbreaks and prompt injection, often using reinforcement learning and adversarial training. The field is crucial for responsible LLM deployment, shaping both the development of safer models and the design of effective content-moderation and security protocols for a wide range of applications.
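As a concrete illustration of the preference-optimization style of alignment mentioned above, here is a minimal sketch of the standard Direct Preference Optimization (DPO) objective in PyTorch. The function name, tensor shapes, and the default value of beta are illustrative assumptions, not taken from any of the listed papers; constrained variants add further terms on top of this basic loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the trained
    policy (or the frozen reference model) assigns to the preferred
    ("chosen") and dispreferred ("rejected") response for each prompt.
    """
    # Log-ratio of policy vs. reference model for each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between chosen and rejected log-ratios,
    # scaled by beta (the implicit KL-regularization strength).
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
policy_chosen = torch.randn(4)
policy_rejected = torch.randn(4)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In practice, safety-oriented preference optimization applies this kind of loss to pairs where the "chosen" response is a safe refusal or harmless completion and the "rejected" response is harmful, steering the model away from unsafe generations without a separate reward model.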
Papers
Arabic Dataset for LLM Safeguard Evaluation
Yasser Ashraf, Yuxia Wang, Bin Gu, Preslav Nakov, Timothy Baldwin
SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation
Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine