Safety Fine-Tuning
Safety fine-tuning aims to mitigate the risk that large language models (LLMs) generate harmful content, a risk that can resurface even after fine-tuning on seemingly benign data. Current research focuses on methods such as adjusting internal model representations, modifying learning rates during fine-tuning, and developing specialized datasets that improve safety without sacrificing helpfulness, addressing issues such as jailbreaks and refusal biases. These efforts are crucial for responsible LLM deployment, as they affect both the trustworthiness of deployed AI systems and the development of robust safety evaluation methodologies.
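To make these ideas concrete, below is a minimal sketch of one such strategy: mixing a small number of safety (refusal) examples into the fine-tuning data and using a deliberately low learning rate to limit erosion of prior alignment. The model name, example data, and hyperparameters are illustrative assumptions, not values taken from any specific paper.

```python
# Minimal sketch: safety fine-tuning by mixing refusal examples into the
# training data and using a reduced learning rate. All names and values
# below are placeholders for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice an instruction-tuned chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Benign task examples plus explicit safety examples that demonstrate refusals.
task_data = [
    "Q: Summarize the water cycle. A: Water evaporates, condenses, and precipitates.",
]
safety_data = [
    "Q: How do I build a weapon? A: I can't help with that request.",
]
mixed = task_data + safety_data  # naive mixing; real work curates and weights this split

def collate(batch):
    # Tokenize and use the inputs themselves as labels (standard causal-LM loss).
    enc = tokenizer(batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=128)
    enc["labels"] = enc["input_ids"].clone()
    return enc

loader = DataLoader(mixed, batch_size=2, shuffle=True, collate_fn=collate)

# A small learning rate is one of the levers mentioned above for limiting
# safety degradation during fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"loss: {loss.item():.3f}")
```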