Activation Steering

Activation steering is a technique for modifying the behavior of large language models (LLMs) by altering their internal activations at inference time, without retraining the model. Current research focuses on making steering more precise and effective, addressing issues such as exaggerated safety responses and the need to control multiple properties simultaneously, often through information-theoretic approaches and mean-centering techniques. This work is important for enhancing LLM safety and reliability, mitigating biases, and enabling more nuanced control over model outputs in applications such as content moderation and agent-based systems.
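
The sketch below illustrates the basic idea in Python with PyTorch forward hooks: a steering vector is computed as the difference of mean activations over contrastive prompts, then added to a layer's hidden states during generation. It assumes a GPT-2-style Hugging Face model; the layer index, prompts, and scaling factor are illustrative assumptions, not the recipe of any particular paper.

```python
# Minimal activation-steering sketch (assumes GPT-2 from `transformers`).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # transformer block to steer (assumption)
ALPHA = 4.0  # steering strength (assumption)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Capture the residual-stream activation of the final token at LAYER."""
    captured = {}

    def hook(_module, _inputs, output):
        # GPT-2 blocks return a tuple; output[0] holds the hidden states.
        captured["h"] = output[0][:, -1, :].detach()

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"].squeeze(0)

# Contrastive prompt sets: the steering vector is the difference of their means.
positive = ["I love this movie, it was wonderful.", "What a delightful day."]
negative = ["I hate this movie, it was awful.", "What a miserable day."]
steering_vector = (
    torch.stack([last_token_activation(p) for p in positive]).mean(0)
    - torch.stack([last_token_activation(p) for p in negative]).mean(0)
)

def steer(_module, _inputs, output):
    # Add the scaled steering vector to every token position's hidden state.
    hidden = output[0] + ALPHA * steering_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In practice, published methods differ in how the vector is derived (e.g., mean-centering against a dataset-wide average activation) and where it is injected, but the core mechanism of adding a direction to intermediate activations at inference time is the same.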

Papers