Model Safety
Model safety research focuses on ensuring that artificial intelligence systems, particularly large language models (LLMs) and reinforcement learning agents, behave reliably and avoid harmful outputs or actions. Current efforts concentrate on improving training data curation, preventing catastrophic forgetting during model updates, and reducing susceptibility to adversarial attacks and harmful biases through techniques such as decoupled refusal training and inference-time alignment. The field is central to responsible AI development, shaping the safety and trustworthiness of AI systems in applications ranging from autonomous driving to medical diagnosis.
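Inference-time alignment methods of the kind referenced above (and in the "Safety Arithmetic" paper below) generally adjust a model's behavior at generation time, for example by steering hidden activations, rather than by retraining. The snippet below is a minimal, illustrative sketch of activation steering, not the procedure from any specific paper: a toy linear layer stands in for a transformer block, and the steering vector, its strength, and all names are hypothetical placeholders.

```python
# Illustrative sketch of inference-time activation steering (hypothetical
# names and values; a real steering vector would be derived from the model).
import torch
import torch.nn as nn

hidden_dim = 16

# Toy stand-in for one transformer block's output projection.
layer = nn.Linear(hidden_dim, hidden_dim)

# A steering direction; in practice it might be estimated from contrasts
# between activations on safe vs. unsafe prompts.
steering_vector = torch.randn(hidden_dim)
steering_strength = 0.5

def add_steering(module, inputs, output):
    # Shift the layer's activations along the steering direction.
    return output + steering_strength * steering_vector

handle = layer.register_forward_hook(add_steering)

x = torch.randn(1, hidden_dim)   # stand-in for a hidden state
steered = layer(x)               # hook shifts the activations
handle.remove()                  # restore the unmodified layer
unsteered = layer(x)

print("shift norm:", (steered - unsteered).norm().item())
```

Because the hook can be attached and removed without touching the model's weights, this style of intervention is attractive for test-time safety adjustments; the open question the listed papers study is how to choose the steering direction and strength so that safety improves without degrading general capability.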
Papers
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria