Reward Misspecification

Reward misspecification, a critical challenge in artificial intelligence, arises when a system's reward function inadequately captures desired behavior, leading to unintended or harmful outcomes. Current research focuses on detecting and mitigating this problem through various approaches, including information-theoretic reward modeling, iterative reward shaping with human feedback, and methods that leverage causal inference to identify spurious correlations in reward signals. Addressing reward misspecification is crucial for ensuring the safety and reliability of AI systems, particularly in complex applications like large language models and reinforcement learning agents, and is driving the development of more robust and aligned AI.

Papers

June 20, 2024

Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
Large Language Model Adversarial Attack Jailbreak Attack Adversarial Prompt Malicious Prompt Reward Misspecification

April 12, 2024

Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch
Malek Mechergui, Sarath Sreedharan
Speech Presence Artificial Intelligence Agent Expectation Alignment Reward Misspecification Addressing Misspecification

February 14, 2024

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Reward Model Reinforcement Learning From Human Feedback Reward Hacking Reward Overoptimization Reward Misspecification Information Theoretic Reward

August 30, 2023

Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification
Jasmina Gajcin, James McCarthy, Rahul Nair, Radu Marinescu, Elizabeth Daly, Ivana Dusparic
Reinforcement Learning Human Feedback Reward Shaping Reward Misspecification Complex Reward Function

February 9, 2023

CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning
Sheng Yue, Guanbo Wang, Wei Shao, Zhaofeng Zhang, Sen Lin, Ju Ren, Junshan Zhang
Reward Function Inverse Reinforcement Learning Exploration Exploitation Reward Misspecification Conservative Reward

January 3, 2023

Genetic Imitation Learning by Reward Extrapolation
Boyuan Zheng, Jianlong Zhou, Fang Chen
Imitation Learning Genetic Algorithm Suboptimal Demonstration Stochastic Policy Evolutionary Learning Reward Misspecification Reward Function Parameter

May 19, 2022

Reinforcement Learning with Brain-Inspired Modulation can Improve Adaptation to Environmental Changes
Eric Chalmers, Artur Luczak
Reinforcement Learning Adaptation Concern Digital Modulation Deep Q Biological Learning Environmental Change Reward Misspecification

April 13, 2022

Causal Confusion and Reward Misidentification in Preference-Based Reward Learning
Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D. Dragan, Daniel S. Brown
Reinforcement Learning Preference Based Reward Learning Policy Reward Misspecification

January 10, 2022

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush Bhatia, Jacob Steinhardt
Mixed Effect Reward Function Area MaPPing Agent Capability RL Agent Reward Hacking Reward Misspecification Model Misalignment

Reward Misspecification

Papers

Jailbreaking as a Reward Misspecification Problem

Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification

CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning

Genetic Imitation Learning by Reward Extrapolation

Reinforcement Learning with Brain-Inspired Modulation can Improve Adaptation to Environmental Changes

Causal Confusion and Reward Misidentification in Preference-Based Reward Learning

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models