Risky Prompt Rejection
Risky prompt rejection in machine learning focuses on enabling models to reliably abstain from answering questions or making predictions when faced with uncertain, ambiguous, or potentially harmful inputs. Current research explores several approaches, including modified loss functions, density-ratio estimation, and reinforcement learning that trains models to identify and reject such prompts, often combined with techniques such as counterfactual analysis and conditional evidence decoupling. This line of work is important for improving the safety and reliability of AI systems, particularly large language models, and for mitigating biases and vulnerabilities to adversarial attacks across diverse applications.
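One of the simplest abstention mechanisms underlying this line of work is confidence thresholding: the model answers only when its top predicted class is sufficiently probable, and otherwise rejects the input. The sketch below illustrates the idea with plain softmax probabilities; the threshold value and function names are illustrative, not taken from any specific paper above.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, threshold=0.7):
    """Return the predicted class index, or None to abstain
    when the top-class confidence falls below `threshold`."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None  # abstain: prediction too uncertain to act on
    return best

# One class clearly dominates, so the model answers.
print(predict_or_abstain([0.1, 0.2, 3.0]))   # → 2
# Near-uniform logits, so the model abstains.
print(predict_or_abstain([1.0, 1.1, 0.9]))   # → None
```

Research approaches mentioned above (modified loss functions, density ratios, RL-based rejection) can be seen as learning better-calibrated versions of this confidence signal, rather than relying on raw softmax probabilities, which are often overconfident on out-of-distribution or adversarial prompts.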