Model Safety

Model safety research focuses on ensuring that artificial intelligence systems, particularly large language models (LLMs) and reinforcement learning agents, behave reliably and avoid harmful outputs or actions. Current efforts concentrate on improving training data curation, developing methods to prevent catastrophic forgetting during model updates, and mitigating adversarial vulnerabilities and biases through techniques such as decoupled refusal training and inference-time alignment (a rough sketch follows below). This work is central to responsible AI development, shaping the safety and trustworthiness of AI systems across applications ranging from autonomous vehicles to medical diagnosis.
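
As a rough illustration of the inference-time alignment idea mentioned above, the sketch below reranks several candidate completions with a safety scorer and returns the highest-scoring one (best-of-N selection). This is a minimal sketch, not any specific paper's method: the names generate_candidates and safety_score are hypothetical placeholders, and a real system would sample from a language model and score with a trained safety reward model or classifier.

```python
# Minimal sketch of inference-time alignment via best-of-N reranking.
# generate_candidates and safety_score are hypothetical placeholders,
# not any specific paper's method.

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    # Placeholder: stand-in for sampling n completions from a language model.
    return [f"[candidate {i}] response to: {prompt}" for i in range(n)]

def safety_score(text: str) -> float:
    # Placeholder heuristic; a trained safety reward model would go here.
    flagged = ("exploit", "weapon", "self-harm")
    return -sum(text.lower().count(word) for word in flagged)

def aligned_generate(prompt: str, n: int = 4) -> str:
    """Return the candidate completion judged safest (best-of-N selection)."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=safety_score)

if __name__ == "__main__":
    print(aligned_generate("How do I secure my home network?"))
```

Because the reranking happens entirely at decoding time, this kind of approach can adjust a model's behavior without retraining its weights, which is the appeal of inference-time alignment relative to fine-tuning-based methods.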

Papers