Dangerous Capability
Dangerous capability in AI refers to the potential for advanced models, particularly large language models (LLMs), to perform actions with harmful consequences, such as generating deceptive content, facilitating cyberattacks, or aiding in the development of weapons. Current research focuses on evaluating these capabilities with dedicated benchmarks and datasets, probing model vulnerabilities to attacks such as malicious fine-tuning and adversarial prompting, and exploring mitigation methods, including "unlearning" of harmful knowledge. Understanding and mitigating dangerous capabilities is crucial for responsible AI development and deployment, informing policy decisions and ensuring that powerful AI systems are integrated into society safely.
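The evaluations mentioned above typically score a model against a curated set of harmful requests. The following is a minimal, hypothetical sketch of that idea in Python: the refusal markers, placeholder prompts, and `dummy_model` are illustrative assumptions, not any specific benchmark or API from the papers listed below.

```python
# Minimal sketch of a benchmark-style dangerous-capability evaluation:
# measure how often a model refuses a set of harmful requests.
# The markers, prompts, and dummy_model below are all hypothetical placeholders.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")


def is_refusal(response: str) -> bool:
    """Crude keyword check for whether the model declined the request."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(prompts, query_model) -> float:
    """Fraction of harmful prompts the model refuses to answer."""
    refusals = sum(is_refusal(query_model(p)) for p in prompts)
    return refusals / len(prompts) if prompts else 1.0


if __name__ == "__main__":
    # Placeholder prompts; a real evaluation would draw on a curated
    # dataset probing, e.g., cyberattack or weapons-development uplift.
    prompts = ["<redacted harmful prompt 1>", "<redacted harmful prompt 2>"]

    # Stand-in for whatever model is under evaluation.
    def dummy_model(prompt: str) -> str:
        return "I can't help with that request."

    print(f"Refusal rate: {refusal_rate(prompts, dummy_model):.2%}")
```

A real harness would replace the keyword check with a stronger refusal classifier and report per-category results (cyber, biological, deception) rather than a single aggregate rate.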
Papers
Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models
Jaspreet Pannu, Doni Bloomfield, Alex Zhu, Robert MacKnight, Gabe Gomes, Anita Cicero, Thomas V. Inglesby
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li