Adversarial Misuse
Adversarial misuse of large language models (LLMs) involves exploiting vulnerabilities to circumvent safety protocols and elicit harmful or otherwise undesirable outputs, most often through "jailbreaking" techniques. Current research investigates the mechanisms behind successful attacks, developing both novel attack methods (such as personalized encryption) and defensive strategies (including self-refinement and improved safety training), as sketched below. This area is crucial because the widespread deployment of LLMs demands robust safeguards against malicious use, affecting both the trustworthiness of AI systems and the integrity of downstream applications such as automated grading.
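Self-refinement defenses of the kind mentioned above typically have the model critique and revise its own draft before it is returned to the user. The Python sketch below illustrates one such loop under stated assumptions: `query_model`, the prompts, and the two-round limit are illustrative placeholders, not the method of any specific paper listed here.

```python
# Minimal sketch of a self-refinement defense loop: the model critiques its own
# draft response for safety-policy violations and revises it before replying.
# `query_model` is a hypothetical stand-in for any chat-completion call; swap in
# your own API or local model.

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted API or a local model)."""
    raise NotImplementedError("Wire this to your model of choice.")


def self_refine(user_prompt: str, max_rounds: int = 2) -> str:
    """Generate a response, then iteratively critique and revise it for safety."""
    draft = query_model(user_prompt)
    for _ in range(max_rounds):
        critique = query_model(
            "Review the following response for content that violates safety "
            f"policy (e.g., instructions that enable harm).\n\nResponse:\n{draft}\n\n"
            "Reply with 'SAFE' if acceptable; otherwise list the problems."
        )
        if critique.strip().upper().startswith("SAFE"):
            break  # no violations found; return the current draft
        draft = query_model(
            "Rewrite the response below so it addresses the critique and "
            f"refuses any unsafe request.\n\nCritique:\n{critique}\n\n"
            f"Response:\n{draft}"
        )
    return draft
```

The loop structure is the essential idea: jailbreak attempts that slip past the initial generation get a second chance to be caught by the model's own critique pass, at the cost of extra inference calls per request.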
Papers
April 6, 2024
February 26, 2024
February 23, 2024
December 22, 2023
July 5, 2023