Risky Prompt Rejection
Risky prompt rejection in machine learning aims to train models to reliably abstain from answering questions or making predictions when faced with uncertainty, ambiguity, or potentially harmful inputs. Current research explores several approaches, including modified loss functions with an explicit rejection option, density-ratio estimates of input risk, and reinforcement learning that rewards appropriate refusals, often combined with techniques such as counterfactual analysis and conditional evidence decoupling. This area is crucial for improving the safety and reliability of AI systems, particularly large language models, and for mitigating biases and vulnerabilities to adversarial attacks across diverse applications.
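To make the abstention idea concrete, here is a minimal sketch of the simplest rejection baseline: a classifier abstains whenever its top softmax confidence falls below a threshold. This is an illustrative example only, not the method of any particular paper surveyed here; the threshold value 0.7 is an arbitrary assumption.

```python
import numpy as np

def predict_or_reject(logits, threshold=0.7):
    """Return the predicted class index, or None to abstain.

    The model abstains (rejects the input) when its top softmax
    probability is below `threshold` — a stand-in for the more
    sophisticated rejection criteria discussed above.
    """
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax over class logits
    top = int(probs.argmax())
    if probs[top] < threshold:
        return None                        # uncertain: refuse to answer
    return top

# One logit dominates, so the model answers confidently.
print(predict_or_reject(np.array([4.0, 0.5, 0.2])))    # → 0
# Near-uniform logits signal ambiguity, so the model abstains.
print(predict_or_reject(np.array([1.0, 1.01, 0.99])))  # → None
```

Research methods in this area go beyond such a fixed global threshold, for instance by learning the rejection rule jointly with the classifier, but the interface — predict or abstain — is the same.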