Representation Engineering
Representation engineering modifies the internal representations of machine learning models, particularly large language models (LLMs), to improve their safety, alignment with human values, and performance on specific tasks. Current research emphasizes techniques such as manipulating activation vectors to steer model outputs, detecting and mitigating harmful outputs (e.g., through "circuit breakers"), and building more interpretable representations for better understanding and control. This work is central to addressing safety and ethical concerns in AI, enabling more reliable and beneficial applications while improving transparency and explainability.
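The core idea behind activation steering can be sketched in a few lines. The following is a toy illustration, not any paper's implementation: a stand-in linear layer plays the role of a transformer block, and a hypothetical steering vector (in practice often derived from contrastive prompt pairs) is added to its output at inference time. All names (`layer`, `steer`, `alpha`) are illustrative.

```python
import numpy as np

# Toy sketch of activation steering. In real LLM work the intervention
# targets the residual stream of a transformer layer; here a fixed
# linear map stands in for the layer.
rng = np.random.default_rng(0)
d_model = 8

W = rng.normal(size=(d_model, d_model))

def layer(h):
    """A stand-in hidden layer: one fixed nonlinear map."""
    return np.tanh(h @ W)

# Hypothetical steering vector; in practice it might be the difference
# of mean activations between contrastive prompts. Here it is random.
steer = rng.normal(size=d_model)
steer /= np.linalg.norm(steer)

def layer_steered(h, alpha):
    # Intervene on the layer output: nudge it along the steering
    # direction with strength alpha.
    return layer(h) + alpha * steer

h = rng.normal(size=d_model)
out_plain = layer_steered(h, alpha=0.0)
out_steered = layer_steered(h, alpha=2.0)

# The intervention shifts the activation by exactly alpha * steer.
print(np.allclose(out_steered - out_plain, 2.0 * steer))  # True
```

In actual LLM pipelines the same additive intervention is typically installed as a forward hook on a chosen layer, with `alpha` tuned to trade off steering strength against fluency.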
Papers
PaCE: Parsimonious Concept Engineering for Large Language Models
Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
Improving Alignment and Robustness with Circuit Breakers
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks