Harmful Content
Research on harmful content in large language models (LLMs) and text-to-image diffusion models covers both its generation and its detection, with the goal of mitigating risks such as bias, toxicity, and misinformation. Current work emphasizes preventing harmful outputs through techniques like attention re-weighting, prompt engineering, and unlearning of harmful knowledge, often within multimodal or continual-learning frameworks. These efforts are crucial for the responsible development and deployment of AI systems, affecting both the safety of online environments and the broader ethics of the field.
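To make the steering idea behind several concept-erasure and guidance-based approaches concrete, the sketch below is a minimal, hypothetical illustration (not the method of any paper listed here): it combines standard classifier-free guidance toward the user prompt with an extra term that pushes the diffusion noise prediction away from a concept to be suppressed. The arrays `eps_uncond`, `eps_prompt`, and `eps_harmful`, along with both scale values, are assumed placeholders rather than real model outputs.

```python
import numpy as np

# Illustrative sketch only: classifier-free guidance extended with a
# "safety" term that steers the denoising update away from an unwanted
# concept. The eps_* arrays stand in for the noise predictions a
# diffusion U-Net would return under different conditioning inputs.

rng = np.random.default_rng(0)
dim = 8  # toy latent dimension

eps_uncond = rng.normal(size=dim)    # prediction with an empty prompt
eps_prompt = rng.normal(size=dim)    # prediction with the user prompt
eps_harmful = rng.normal(size=dim)   # prediction with the concept to suppress (hypothetical)

guidance_scale = 7.5   # standard CFG strength (assumed value)
safety_scale = 3.0     # assumed strength of the steering-away term

# Standard classifier-free guidance toward the prompt ...
eps = eps_uncond + guidance_scale * (eps_prompt - eps_uncond)
# ... plus a negative-guidance term away from the unwanted concept.
eps = eps - safety_scale * (eps_harmful - eps_uncond)

print(eps)
```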
Papers
TraSCE: Trajectory Steering for Concept Erasure
Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM
Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Rongxiang Weng, Muyun Yang, Tiejun Zhao, Min Zhang
Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation
Xin Zhao, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao, Xiaojun Jia, Xinfeng Li, Xiaofeng Wang