Model Safety
Model safety research aims to ensure that artificial intelligence systems, particularly large language models (LLMs) and reinforcement learning agents, behave reliably and avoid harmful outputs or actions. Current efforts concentrate on curating training data, preventing catastrophic forgetting during model updates, and mitigating vulnerabilities to adversarial attacks and biases through techniques such as decoupled refusal training and inference-time alignment. This work underpins responsible AI development, shaping the safety and trustworthiness of systems deployed in applications ranging from autonomous vehicles to medical diagnosis.
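
To make the idea of inference-time alignment concrete, the sketch below shows one common pattern: sample several candidate responses, score them with a safety model, and return the best one (or a refusal if none is acceptable). This is a minimal illustration, not a method from any specific paper; generate_candidates and safety_score are hypothetical stand-ins for a base LLM sampler and a learned safety classifier.

from typing import List

def generate_candidates(prompt: str, n: int) -> List[str]:
    # Placeholder: in practice, sample n completions from a base LLM.
    return [f"candidate response {i} to: {prompt}" for i in range(n)]

def safety_score(prompt: str, response: str) -> float:
    # Placeholder: in practice, a learned safety classifier or reward model
    # would score how safe and helpful the response is.
    return float(len(response) % 7)  # dummy score for illustration only

def align_at_inference(prompt: str, n: int = 8, threshold: float = 0.0) -> str:
    # Best-of-n selection: keep the highest-scoring candidate, or refuse
    # outright if even the best candidate falls below the safety threshold.
    candidates = generate_candidates(prompt, n)
    best = max(candidates, key=lambda r: safety_score(prompt, r))
    if safety_score(prompt, best) < threshold:
        return "I can't help with that request."
    return best

if __name__ == "__main__":
    print(align_at_inference("Explain how model safety filters work."))

The key design point is that the base model is left untouched; safety is enforced by a separate scorer at decoding time, which is why such approaches are grouped under inference-time alignment rather than fine-tuning.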