Token-Level Attacks

Token-level attacks aim to manipulate large language models (LLMs) by subtly altering individual words or sub-word units (tokens) within prompts, causing the model to generate undesirable or harmful outputs. Current research focuses on increasingly sophisticated attack methods, for example framing the search for adversarial perturbations as a Markov decision process or using gradient-based optimization over token substitutions, often under black-box constraints where the attacker can only query the model. These attacks expose vulnerabilities in LLMs and drive the development of more robust models and defenses, with direct consequences for the safety and reliability of AI systems across applications. Their effectiveness varies considerably across LLMs, indicating that robustness differs between models and that further hardening is needed.
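
As a concrete illustration of the gradient-based branch, the sketch below performs greedy, gradient-guided token substitution in the spirit of HotFlip/GCG-style attacks. It is a minimal toy: the model, loss, and every name in it (`VOCAB_SIZE`, `loss_fn`, and so on) are illustrative placeholders rather than any surveyed paper's actual setup, and a real attack would instead minimize an actual LLM's negative log-likelihood of a target completion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 100, 32, 8

# Toy stand-in for a language model: embed tokens, mean-pool,
# and emit a scalar loss. A real attack would use an actual
# LLM's loss on a target (harmful) output instead.
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
scorer = nn.Linear(EMBED_DIM, 1)

# Freeze model parameters; only the token choice is optimized.
for p in list(embedding.parameters()) + list(scorer.parameters()):
    p.requires_grad_(False)

def loss_fn(one_hot):
    # one_hot: (SEQ_LEN, VOCAB_SIZE). Multiplying by the embedding
    # matrix makes the discrete token choice differentiable.
    embeds = one_hot @ embedding.weight           # (SEQ_LEN, EMBED_DIM)
    return scorer(embeds.mean(dim=0)).squeeze()   # scalar to minimize

tokens = torch.randint(0, VOCAB_SIZE, (SEQ_LEN,))  # initial prompt ids

for step in range(20):
    # Relax the current tokens to one-hot rows so we can take
    # gradients with respect to the token choice itself.
    one_hot = torch.zeros(SEQ_LEN, VOCAB_SIZE)
    one_hot.scatter_(1, tokens.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    loss = loss_fn(one_hot)
    loss.backward()
    grad = one_hot.grad                           # (SEQ_LEN, VOCAB_SIZE)

    with torch.no_grad():
        # First-order estimate of the loss change from swapping
        # position i to token v: grad[i, v] - grad[i, tokens[i]].
        current = grad.gather(1, tokens.unsqueeze(1))
        delta = grad - current
        delta.scatter_(1, tokens.unsqueeze(1), float("inf"))  # forbid no-op swaps
        pos = delta.min(dim=1).values.argmin()
        new_tok = delta[pos].argmin()

        # Verify the swap with a real forward pass and accept it
        # only if the loss actually drops (greedy coordinate step).
        candidate = tokens.clone()
        candidate[pos] = new_tok
        cand = torch.zeros(SEQ_LEN, VOCAB_SIZE)
        cand.scatter_(1, candidate.unsqueeze(1), 1.0)
        if loss_fn(cand) < loss:
            tokens = candidate
            print(f"step {step}: pos {pos.item()} -> token {new_tok.item()}, "
                  f"loss {loss_fn(cand).item():.4f}")
```

The one-hot relaxation is the key design choice: it lets a discrete token position receive a gradient that ranks all possible substitutions at once. Black-box variants replace this gradient signal with query-based search, for instance by treating successive token edits as actions in a Markov decision process.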

Papers