Activation Steering
Activation steering is a technique for modifying the behavior of large language models (LLMs) by adding or editing vectors in their internal activations at inference time, without retraining the model. Current research focuses on improving the precision and effectiveness of steering, addressing issues such as exaggerated safety refusals and control over multiple properties at once, often using information-theoretic objectives and mean-centering techniques. This work matters for LLM safety and reliability, for mitigating biases, and for enabling more nuanced control over model outputs in applications such as content moderation and agent-based systems.
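As a rough illustration of the mechanism, the sketch below builds a steering vector from the difference of mean hidden states on two contrastive prompts and adds it to an intermediate transformer block's output through a PyTorch forward hook. The model (GPT-2), layer index, prompts, and scaling coefficient are illustrative assumptions, not settings taken from any particular paper.

```python
# Minimal activation-steering sketch: contrastive difference vector + forward hook.
# All specific choices (model, layer, prompts, coefficient) are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6    # which transformer block to steer (assumed)
COEFF = 4.0  # steering strength (assumed)

def mean_hidden(text, layer):
    """Mean hidden state of `text` at the given layer."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Steering vector: difference of activations on two contrastive prompts.
steer = mean_hidden("I love this", LAYER) - mean_hidden("I hate this", LAYER)
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; add the vector at every token position.
    hidden = output[0] + COEFF * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tokenizer("The movie was", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(gen[0]))
handle.remove()  # restore the unmodified model
```

The same pattern extends to mean-centering variants, where the steering vector is shifted by the average activation over a reference corpus before being applied; only the construction of `steer` changes, not the hook.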
Papers
Paper entries dated between December 6, 2023 and November 4, 2024.