Model Misalignment

Model misalignment refers to the discrepancy between a model's intended behavior and its actual behavior, arising from sources such as incomplete or inaccurate training data, flawed reward functions, and limitations in model architecture. Current research focuses on identifying and mitigating these misalignments across diverse applications, examining their impact on vision-language models, reinforcement learning agents, and privacy-preserving machine learning. Understanding and addressing model misalignment is crucial for the reliable, safe, and ethical deployment of increasingly sophisticated AI systems, with implications for fields ranging from robotics to data security.
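
One of these sources, a flawed reward function, is easy to see in a toy setting. The sketch below is purely illustrative (the objective, proxy reward, and hill-climbing loop are all hypothetical, not drawn from any particular paper): an optimizer maximizes a proxy reward that can be inflated by "gaming" effort as well as by genuine quality, and ends up driving the proxy up while the true objective collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_objective(effort):
    # What we actually care about: only genuine task quality counts.
    quality, gaming = effort
    return quality

def proxy_reward(effort):
    # A flawed reward function: the measured metric can also be
    # inflated by "gaming" behavior, and gaming pays more per unit.
    quality, gaming = effort
    return quality + 3.0 * gaming

def project(effort, budget=1.0):
    # Keep the effort allocation non-negative and within budget.
    effort = np.clip(effort, 0.0, None)
    total = effort.sum()
    return effort if total <= budget else effort * (budget / total)

effort = np.array([0.5, 0.0])  # start with honest effort only
for _ in range(500):
    candidate = project(effort + rng.normal(scale=0.05, size=2))
    # Greedily hill-climb on the proxy, as a trained policy
    # effectively does during optimization.
    if proxy_reward(candidate) > proxy_reward(effort):
        effort = candidate

print(f"allocation (quality, gaming) = {np.round(effort, 2)}")
print(f"proxy reward   = {proxy_reward(effort):.2f}")    # high
print(f"true objective = {true_objective(effort):.2f}")  # near zero
```

Because gaming yields more proxy reward per unit of budget, the optimizer allocates the entire budget to it; the proxy and the true objective were aligned only while honest effort dominated, which is the gap that reward-misspecification research targets.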

Papers