Paper ID: 2411.18699 • Published Nov 27, 2024
An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)
The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and
Abbasi [2024], is an innovative method designed to bypass the ethical
safeguards of text-to-text AI models, compelling them to generate harmful
content. This technique leverages a strategic escalation of context within a
single prompt, combined with trust-building mechanisms, to subtly deceive the
model into producing unintended outputs. Extending STCA to text-to-image
models, we demonstrate its efficacy by compromising the guardrails of a widely
used model, DALL-E 3, and obtaining outputs comparable to those of the
uncensored model Flux Schnell, which served as a baseline control. This study
provides a framework for researchers to rigorously evaluate
the robustness of guardrails in text-to-image models and benchmark their
resilience against adversarial attacks.