Polysemantic Neuron

Polysemantic neurons, which respond to multiple unrelated features, pose a significant challenge to neural-network interpretability. Current research focuses on the origins and consequences of polysemanticity, using techniques such as sparse autoencoders and analyses of second-order effects to disentangle these mixed responses in models such as CLIP and ResNet. This work aims to improve interpretability, and potentially model performance, by either promoting monosemanticity (one neuron, one feature) or leveraging the information encoded in polysemantic neurons. The insights gained could lead to more transparent and controllable AI systems.
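To make the sparse-autoencoder idea concrete, here is a minimal NumPy sketch: an overcomplete autoencoder with a ReLU encoder and an L1 penalty on the codes is trained on synthetic "neuron activations" in which many sparse ground-truth features are mixed into fewer neurons (the setting that produces polysemanticity). All names, dimensions, and hyperparameters are illustrative, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_neurons, n_features = 512, 16, 64  # overcomplete: 64 > 16
l1_coef, lr, n_steps = 1e-3, 1e-2, 200

# Synthetic stand-in for real activations: sparse ground-truth features
# mixed into fewer neurons, which makes individual neurons "polysemantic".
true_feats = (rng.random((n_samples, n_features)) < 0.05).astype(float)
mix = rng.normal(size=(n_features, n_neurons))
acts = true_feats @ mix

# Encoder/decoder weights of the sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(n_neurons, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, n_neurons))

for step in range(n_steps):
    # Forward pass: ReLU encoder produces sparse codes, linear decoder reconstructs.
    z = np.maximum(acts @ W_enc + b_enc, 0.0)
    recon = z @ W_dec
    err = recon - acts

    # Loss = mean squared reconstruction error + L1 sparsity penalty on codes.
    loss = (err ** 2).mean() + l1_coef * np.abs(z).mean()

    # Manual gradients (backprop through two linear layers and the ReLU).
    g_recon = 2 * err / err.size
    g_Wdec = z.T @ g_recon
    g_z = g_recon @ W_dec.T + l1_coef * np.sign(z) / z.size
    g_z *= (z > 0)  # ReLU gate
    g_Wenc = acts.T @ g_z
    g_benc = g_z.sum(axis=0)

    W_dec -= lr * g_Wdec
    W_enc -= lr * g_Wenc
    b_enc -= lr * g_benc

sparsity = (z > 0).mean()
print(f"final loss {loss:.4f}, fraction of active codes {sparsity:.2f}")
```

In practice the codes `z` replace the raw neurons as the unit of analysis: each code dimension is pushed toward firing for a single feature, so the learned dictionary disentangles what the polysemantic neurons mixed together.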

Papers