AI Alignment
AI alignment focuses on ensuring that artificial intelligence systems act in accordance with human values and intentions, addressing the risks posed by misaligned goals. Current research emphasizes techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), typically applied to large language models (LLMs), alongside complementary methods such as reward shaping and preference aggregation. The field is central to responsible AI development, shaping both the safety and the ethical implications of increasingly capable AI systems across a wide range of applications.
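To make the preference-based methods mentioned above concrete, the following is a minimal sketch of the DPO objective, assuming per-sequence log-probabilities for the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the trained policy and a frozen reference model; the tensor names and the `beta` value are illustrative, not taken from any specific paper's code.

```python
# A minimal sketch of the Direct Preference Optimization (DPO) loss.
# Assumes summed per-sequence log-probabilities are precomputed; all
# variable names and the beta temperature below are illustrative.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Negative log-sigmoid of the scaled margin between the policy/reference
    log-ratios of the chosen and rejected responses."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    torch.manual_seed(0)
    pc, pr = torch.randn(4), torch.randn(4)
    rc, rr = torch.randn(4), torch.randn(4)
    print(dpo_loss(pc, pr, rc, rr))
```

Compared with RLHF, which first fits an explicit reward model on preference data and then optimizes the policy with reinforcement learning, DPO folds the preference signal directly into a supervised loss on the policy itself.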
Papers