LLM Safety
Large language model (LLM) safety research focuses on mitigating the risks these powerful models pose, primarily by preventing the generation of harmful or biased content and by hardening models against adversarial attacks. Current work emphasizes developing and evaluating defense mechanisms, including alignment techniques such as preference optimization and constrained direct preference optimization, as well as analyzing attack methods such as jailbreaks and prompt injection, often with the help of reinforcement learning and adversarial training. A concrete sketch of the preference-optimization objective follows below. This field is crucial for responsible LLM deployment, shaping both the development of safer models and the design of effective content-moderation and security protocols across a wide range of applications.
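To make "preference optimization" concrete, here is a minimal sketch of the standard Direct Preference Optimization (DPO) loss in PyTorch, not the constrained variant mentioned above. The tensor names, the beta value, and the random dummy inputs are illustrative assumptions; real alignment training would compute per-response log-probabilities from a policy model and a frozen reference model over human preference pairs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: each argument is a (batch,) tensor of summed
    per-response log-probabilities; beta scales the implicit KL-style
    penalty toward the frozen reference model (value here is illustrative)."""
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Encourage a positive margin between chosen and rejected responses.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()

# Dummy example with a batch of 4 preference pairs (random stand-ins
# for log-probabilities produced by real models).
b = 4
policy_chosen = torch.randn(b, requires_grad=True)
policy_rejected = torch.randn(b, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected, torch.randn(b), torch.randn(b))
loss.backward()  # in real training this would update the policy model
```

Constrained variants add explicit safety constraints on top of this objective; the sketch only illustrates the basic preference-learning signal.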
Papers
Ink and Individuality: Crafting a Personalised Narrative in the Age of LLMs
Azmine Toushik Wasi, Raima Islam, Mst Rafia Islam
Polaris: A Safety-focused LLM Constellation Architecture for Healthcare
Subhabrata Mukherjee, Paul Gamble, Markel Sanz Ausin, Neel Kant, Kriti Aggarwal, Neha Manjunath, Debajyoti Datta, Zhengliang Liu, Jiayuan Ding, Sophia Busacca, Cezanne Bianco, Swapnil Sharma, Rae Lasko, Michelle Voisard, Sanchay Harneja, Darya Filippova, Gerry Meixiong, Kevin Cha, Amir Youssefi, Meyhaa Buvanesh, Howard Weingram, Sebastian Bierman-Lytle, Harpreet Singh Mangat, Kim Parikh, Saad Godil, Alex Miller
Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal
Rahul Pankajakshan, Sumitra Biswal, Yuvaraj Govindarajulu, Gilad Gressel