Risky Prompt Rejection

Risky prompt rejection in machine learning focuses on enabling models to reliably abstain from answering questions or making predictions when faced with uncertain, ambiguous, or potentially harmful inputs. Current research explores several approaches to training models to identify and reject such prompts, including modified loss functions, density-ratio estimation, and reinforcement learning, often combined with techniques such as counterfactual analysis and conditional evidence decoupling. This line of work is central to improving the safety and reliability of AI systems, particularly large language models, and to mitigating bias and vulnerability to adversarial attacks across diverse applications.
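The simplest form of abstention described above can be illustrated with a confidence threshold on a classifier's output: the model answers only when its top-class probability is high enough, and otherwise rejects the input. This is a minimal sketch of the general idea, not any specific method from the papers below; the threshold value and function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(logits, threshold=0.7):
    """Return (label_index, confidence), or (None, confidence) to abstain.

    Abstains whenever the top-class probability falls below `threshold`,
    treating low confidence as a proxy for an uncertain or risky input.
    The threshold 0.7 is an arbitrary illustrative choice.
    """
    probs = softmax(logits)
    conf = max(probs)
    if conf < threshold:
        return None, conf  # abstain: model is not confident enough
    return probs.index(conf), conf

# A confidently separated score vector is answered...
print(predict_or_reject([4.0, 0.5, 0.1]))   # label 0, high confidence
# ...while near-tied scores trigger abstention.
print(predict_or_reject([1.0, 0.9, 0.8]))   # (None, ~0.37)
```

Research approaches such as modified loss functions or reinforcement learning aim to learn when to abstain rather than relying on a fixed hand-picked threshold like this one.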

Papers