Paper ID: 2406.07778

On Trojans in Refined Language Models

Jayaram Raghuram, George Kesidis, David J. Miller

A Trojan in a language model can be inserted when the model is refined for a particular application, such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses, each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.

Submitted: Jun 12, 2024