Trojaned Model

Trojaned models are machine learning models implanted with hidden triggers: they behave normally on ordinary inputs but produce attacker-chosen outputs when a specific trigger pattern appears, which is precisely what makes them a serious security risk and hard to detect. Current research focuses on detecting these attacks in large language models (LLMs) and convolutional neural networks (CNNs), investigating methods such as analyzing attention mechanisms, identifying weight-based signatures, and leveraging explainability techniques. The difficulty of reliably detecting trojans, particularly in LLMs, together with the emergence of adaptive adversarial attacks, highlights the need for robust defense mechanisms and improved model interpretability to ensure the trustworthiness of AI systems. This research is crucial for safeguarding the integrity and reliability of AI in high-stakes domains.
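To make the threat model concrete, the sketch below shows the classic data-poisoning route to a trojaned CNN. It is a minimal illustration under stated assumptions, not the method of any particular paper: the trigger is assumed to be a small bright patch in the image corner, a fixed fraction of training samples is poisoned and relabeled to an attacker-chosen target class, and the data and network are synthetic toys so the script runs without downloads.

```python
# A minimal sketch of a BadNets-style data-poisoning trojan on a toy CNN.
# Everything here (patch trigger, poison fraction, tiny network, synthetic
# data) is an illustrative assumption, not taken from any particular paper.

import torch
import torch.nn as nn

TARGET_CLASS = 0  # attacker-chosen label the trigger should force

def add_trigger(images: torch.Tensor, patch: int = 3) -> torch.Tensor:
    """Stamp a bright square patch into the bottom-right corner."""
    out = images.clone()
    out[:, :, -patch:, -patch:] = 1.0  # max pixel intensity
    return out

def poison(images, labels, frac=0.1):
    """Replace a fraction of samples with triggered, relabeled copies."""
    images, labels = images.clone(), labels.clone()
    idx = torch.randperm(len(images))[: int(frac * len(images))]
    images[idx] = add_trigger(images[idx])
    labels[idx] = TARGET_CLASS
    return images, labels

# Synthetic stand-in data so the sketch runs without any downloads.
x = torch.rand(512, 1, 28, 28)
y = torch.randint(0, 10, (512,))
px, py = poison(x, y)

# A tiny CNN trained on the poisoned set learns the clean task plus the
# trigger -> TARGET_CLASS shortcut.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):  # full-batch steps on the toy data
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(px), py)
    loss.backward()
    opt.step()

# On a real dataset with enough training, the model stays accurate on clean
# inputs while triggered inputs flip to TARGET_CLASS; this print estimates
# how strongly the backdoor fires in this toy setup.
hit_rate = (model(add_trigger(x)).argmax(1) == TARGET_CLASS).float().mean()
print(f"fraction of triggered inputs sent to target class: {hit_rate:.2f}")
```

The detection approaches surveyed above target the traces this shortcut leaves behind, for example a weight or neuron that responds disproportionately to the trigger pattern, or an attention head in an LLM that fixates on a trigger token.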

Papers