Paper ID: 2410.12299 • Published Oct 16, 2024
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang, Jingyuan Yang, Wei Peng
Large language models (LLMs) have achieved remarkable performance across many
tasks, yet aligning them with desired behaviors remains challenging. Activation
intervention has emerged as an effective and economical method to modify the
behavior of LLMs. Despite considerable interest in this area, current
intervention methods exclusively employ a fixed steering vector to modify model
activations, lacking adaptability to diverse input semantics. To address this
limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel
method that constructs a dynamic steering vector to intervene on model activations
at inference time. More specifically, SADI utilizes activation differences in
contrastive pairs to precisely identify critical elements of an LLM (i.e.,
attention heads, hidden states, and neurons) for targeted intervention. During
inference, SADI dynamically steers model behavior by scaling element-wise
activations based on the directions of input semantics. Experimental results
show that SADI outperforms established baselines by substantial margins,
improving task performance without training. SADI's cost-effectiveness and
generalizability across various LLM backbones and tasks highlight its potential
as a versatile alignment technique.
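
The two stages described in the abstract (identifying critical elements from contrastive pairs, then scaling activations element-wise at inference) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the top-k selection by absolute mean difference, the update rule a' = a + delta * (a * mask), and the toy vectors standing in for real model activations.

```python
from typing import List, Tuple

Vector = List[float]


def mean_difference(pairs: List[Tuple[Vector, Vector]]) -> Vector:
    """Average element-wise difference (positive - negative) over contrastive pairs."""
    dim = len(pairs[0][0])
    totals = [0.0] * dim
    for pos, neg in pairs:
        for i in range(dim):
            totals[i] += pos[i] - neg[i]
    return [t / len(pairs) for t in totals]


def top_k_mask(diff: Vector, k: int) -> List[int]:
    """Binary mask keeping the k elements with the largest absolute mean difference.

    Assumption: 'critical elements' are chosen by top-k magnitude.
    """
    ranked = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(diff))]


def steer(activation: Vector, mask: List[int], delta: float) -> Vector:
    """Scale the masked elements of the input's own activation.

    Assumed update rule: a' = a + delta * (a * mask). Because the added term
    is built from the current input's activation, the effective steering
    vector changes with the input instead of being fixed.
    """
    return [a + delta * a * m for a, m in zip(activation, mask)]


# Toy contrastive pairs: (activation for desired output, activation for undesired output).
pairs = [
    ([1.0, 0.0, 2.0, 0.5], [0.0, 0.0, 0.0, 0.5]),
    ([3.0, 1.0, 2.0, 0.5], [0.0, 1.0, 1.0, 0.5]),
]
diff = mean_difference(pairs)
mask = top_k_mask(diff, k=2)
steered = steer([1.0, -2.0, 0.5, 4.0], mask, delta=0.5)
```

In this sketch only the two elements with the largest mean difference are modified, and each is scaled in the direction of the input's own activation sign, which is one way to read "based on the directions of input semantics". The values of `k` and `delta` are hypothetical hyperparameters.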