Representation Engineering

Representation engineering modifies the internal representations of machine learning models, particularly large language models (LLMs), to improve their safety, alignment with human values, and performance on specific tasks. Current research emphasizes manipulating activation vectors to steer model outputs, detecting and mitigating harmful outputs (e.g., through "circuit breakers" that interrupt harmful internal states), and building more interpretable representations for better understanding and control. By operating directly on a model's internal states rather than only on its inputs and outputs, the field offers a route to more controllable and transparent AI systems, helping address safety and ethical concerns.
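
As a concrete illustration of the activation-steering idea mentioned above, the following is a minimal sketch using a PyTorch forward hook to add a steering vector to a layer's hidden states. The toy block, the random steering vector, and the strength `alpha` are illustrative placeholders, not the method of any specific paper; in practice the vector is typically derived from the model's own activations (e.g., a difference of mean activations on contrasting prompt sets).

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; any module whose output is a
# hidden-state tensor of shape (batch, seq, d_model) works the same way.
d_model = 16
block = nn.Linear(d_model, d_model)

# Placeholder steering vector; a real one would be computed from model
# activations (e.g., mean activations on "refusal" minus "compliance" prompts).
steering_vector = torch.randn(d_model)
alpha = 4.0  # steering strength (illustrative value)

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # shifting the activations along the steering direction.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steer)

hidden = torch.randn(2, 5, d_model)   # (batch, seq, d_model)
steered = block(hidden)               # activations now carry the offset
handle.remove()                       # detach the hook when done
```

The same hook pattern applies to a real LLM by registering it on a chosen transformer layer, which steers generation without any fine-tuning of the model's weights.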

Papers