Targeted Activation Penalty
Targeted activation penalty (TAP) research aims to improve the robustness and interpretability of neural networks by directly manipulating neuron activations. Current work investigates how activation scaling, dropout, and related regularization techniques can mitigate reliance on spurious signals, massive activations (excessively large values concentrated in a few dimensions), and task drift in large language models (LLMs), as well as in other architectures such as convolutional neural networks (CNNs) and graph neural networks (GNNs). By better understanding and controlling these internal activations, this line of work seeks to improve model generalization, safety, and explainability, yielding more reliable and trustworthy AI systems across applications.
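To make the core idea concrete, below is a minimal sketch of a targeted activation penalty as an auxiliary training loss: activations in a chosen set of hidden dimensions are penalized only when their magnitude exceeds a threshold, discouraging "massive activations" without suppressing ordinary ones. The function and parameter names (tap_loss, target_dims, threshold, lambda_tap) and the thresholded squared-hinge form are illustrative assumptions, not the method of any specific paper listed here.

```python
import torch


def tap_loss(hidden: torch.Tensor, target_dims: torch.Tensor,
             threshold: float = 10.0) -> torch.Tensor:
    """Penalize 'massive' activations in selected hidden dimensions.

    hidden:      (batch, seq_len, d_model) activations from one layer.
    target_dims: 1-D LongTensor of dimension indices to regularize.
    threshold:   magnitudes below this value are left unpenalized.
    """
    acts = hidden[..., target_dims]              # (batch, seq_len, |target_dims|)
    excess = torch.relu(acts.abs() - threshold)  # keep only the overshoot
    return (excess ** 2).mean()                  # squared hinge penalty


# Hypothetical usage inside a training step, assuming a model that
# exposes its hidden states (e.g., a Hugging Face-style transformer):
#   hidden = model(inputs, output_hidden_states=True).hidden_states[-1]
#   loss = task_loss + lambda_tap * tap_loss(hidden, torch.tensor([412, 893]))
```

Restricting the penalty to specific dimensions and to the above-threshold excess is what makes it "targeted": the regularizer leaves well-behaved activations untouched rather than shrinking the representation uniformly.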
Papers
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
On GNN explainability with activation rules
Luca Veyrin-Forrer, Ataollah Kamal, Stefan Duffner, Marc Plantevit, Céline Robardet