Hidden Knowledge
Hidden knowledge research explores the latent information and capabilities embedded within complex systems, particularly machine learning models, aiming to understand, extract, and mitigate them. Current research focuses on detecting hidden biases and vulnerabilities in large language models and other neural networks, employing techniques such as steganalysis, quiver representation theory, and contrastive learning to analyze hidden activations and emergent behaviors. This work is crucial for enhancing model safety, improving interpretability, and addressing concerns about fairness and security in applications ranging from medical diagnosis to autonomous systems.
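One common way to analyze hidden activations, as mentioned above, is to train a linear "probe" that tests whether some property is linearly decodable from a model's internal states. The sketch below is a minimal, self-contained illustration using synthetic data in place of real LLM activations (the dimension, class labels, and separation are all invented for the example; in practice the vectors would be extracted from a model's residual stream):

```python
import numpy as np

# Synthetic stand-in for hidden activations: two classes of vectors
# separated along a random direction, mimicking e.g. activations from
# inputs that do vs. do not trigger a hidden behavior.
rng = np.random.default_rng(0)
d, n = 32, 400  # hidden dimension and samples per class (assumed values)

direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X0 = rng.normal(size=(n, d))
X1 = rng.normal(size=(n, d)) + 1.5 * direction
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

If the probe classifies well above chance, the property is linearly represented in the activations; probe directions like `w` are also a starting point for interventions that suppress or amplify the detected feature.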
Papers
Obfuscated Activations Bypass LLM Latent-Space Defenses
Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons
All You Need in Knowledge Distillation Is a Tailored Coordinate System
Junjie Zhou, Ke Zhu, Jianxin Wu