LLM Safety
Large language model (LLM) safety research focuses on mitigating the risks posed by these powerful models, primarily by preventing the generation of harmful or biased content and by improving robustness against adversarial attacks. Current work emphasizes developing and evaluating defense mechanisms, including alignment techniques such as preference optimization and constrained direct preference optimization, and analyzing attack methods such as jailbreaks and prompt injection, often with the aid of reinforcement learning and adversarial training. This research is crucial for responsible LLM deployment, informing both the development of safer models and the design of effective content moderation and security protocols across a wide range of applications.
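As a concrete illustration of one alignment technique mentioned above, the sketch below shows the standard direct preference optimization (DPO) objective in PyTorch. The function name dpo_loss and its arguments are illustrative, and the per-sequence log-probabilities are assumed to be computed elsewhere from a policy model and a frozen reference model; constrained variants add a safety constraint on top of this basic objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the DPO objective.

    Each argument is a tensor of per-sequence log-probabilities
    (summed over response tokens) for the preferred ("chosen") and
    dispreferred ("rejected") responses under the policy being
    trained and under a frozen reference model.
    """
    # Implicit reward: scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice this loss is applied to batches of (prompt, preferred response, dispreferred response) triples, with beta controlling how far the trained policy is allowed to drift from the reference model.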
Papers
Current state of LLM Risks and AI Guardrails
Suriya Ganesh Ayyamperumal, Limin Ge
garak: A Framework for Security Probing Large Language Models
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie
TorchOpera: A Compound AI System for LLM Safety
Shanshan Han, Zijian Hu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He