AI Alignment
AI alignment focuses on ensuring that artificial intelligence systems act in accordance with human values and intentions, addressing the risks posed by misaligned goals. Current research centers on techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), typically applied to large language models (LLMs), alongside related methods like reward shaping and preference aggregation. The field is central to responsible AI development, shaping both the safety and the ethical implications of increasingly capable AI systems across a wide range of applications.
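To make the DPO technique mentioned above concrete, below is a minimal sketch of its preference loss in Python with PyTorch. The function name, argument names, and the beta value are illustrative assumptions, not taken from any specific paper listed here; it assumes per-sequence log-probabilities have already been computed for the preferred and dispreferred responses under the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective on a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    (summed over tokens) for the preferred ("chosen") and
    dispreferred ("rejected") responses. beta is illustrative.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push the chosen log-ratio above the rejected one, scaled by beta,
    # through a logistic (sigmoid) loss.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


# Toy usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

Unlike RLHF, this formulation needs no separately trained reward model or reinforcement-learning loop: the preference data enters the loss directly.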
Papers