Targeted Activation Penalty

Targeted activation penalty (TAP) research aims to improve the robustness and interpretability of neural networks by penalizing or otherwise manipulating neuron activations. Current work investigates how activation scaling, dropout, and related techniques can mitigate issues such as reliance on spurious signals, massive activations (excessively large values in a few dimensions), and task drift in large language models (LLMs), as well as in other architectures including convolutional neural networks (CNNs) and graph neural networks (GNNs). By better understanding and controlling these internal activations, this line of work seeks to improve generalization, safety, and explainability, yielding more reliable and trustworthy models across applications.
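
As a rough illustration of the penalty-based side of this work, the sketch below adds an auxiliary L2 penalty on a chosen subset of hidden units to the ordinary task loss during training. It is a minimal PyTorch example under generic assumptions; the class and parameter names (TargetedActivationPenalty, target_indices, penalty_weight) are illustrative and do not correspond to any particular paper's implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a targeted activation penalty (TAP) term. All names here
# (TargetedActivationPenalty, target_indices, penalty_weight) are illustrative
# assumptions, not an API from any specific paper or library.
class TargetedActivationPenalty:
    """Penalize the activations of selected units in one layer during training."""

    def __init__(self, layer: nn.Module, target_indices, penalty_weight: float = 0.01):
        self.target_indices = torch.as_tensor(target_indices)
        self.penalty_weight = penalty_weight
        self._activations = None
        # Capture the layer's output on every forward pass.
        layer.register_forward_hook(self._capture)

    def _capture(self, module, inputs, output):
        self._activations = output

    def __call__(self) -> torch.Tensor:
        # L2 penalty on the targeted dimensions of the most recent activations.
        acts = self._activations[..., self.target_indices]
        return self.penalty_weight * acts.pow(2).mean()


# Example: penalize three hidden units of a small MLP alongside the task loss.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
tap = TargetedActivationPenalty(model[1], target_indices=[0, 5, 7])

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
logits = model(x)
loss = nn.functional.cross_entropy(logits, y) + tap()
loss.backward()
```

The penalty weight controls the trade-off between task performance and suppressing the targeted activations; which units to target (e.g., those implicated in spurious signals or massive activations) is the part that varies most across the papers below.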

Papers