Value Alignment

Value alignment in artificial intelligence focuses on ensuring that AI systems, particularly large language models (LLMs), behave in accordance with human values and ethical principles. Current research emphasizes robust methods for measuring and improving alignment, drawing on techniques such as reinforcement learning from human feedback (RLHF), inverse reinforcement learning (IRL), and parameter-efficient fine-tuning to close the gap between model behavior and human preferences. This work aims to mitigate the risks posed by increasingly autonomous AI systems and is driving new evaluation benchmarks and frameworks for assessing alignment across diverse cultural and ethical contexts. The ultimate goal is trustworthy, beneficial AI that reliably reflects and respects human values.
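
As a concrete illustration of the preference-modeling step that underlies RLHF, the sketch below shows the pairwise Bradley-Terry loss commonly used to train reward models from human comparison data. The toy linear reward head, embedding dimension, and tensor shapes are illustrative assumptions for this sketch, not the setup of any particular paper listed here.

```python
# Minimal sketch of the pairwise preference loss used when training reward
# models for RLHF, under the Bradley-Terry assumption. The reward head and
# random "embeddings" below are placeholders, not a real model.
import torch
import torch.nn.functional as F


def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the loss falls when the
    human-preferred ("chosen") response scores higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy usage: a linear reward head over pooled response embeddings (assumed shapes).
torch.manual_seed(0)
embed_dim = 16
reward_head = torch.nn.Linear(embed_dim, 1)

chosen_emb = torch.randn(4, embed_dim)    # embeddings of preferred responses
rejected_emb = torch.randn(4, embed_dim)  # embeddings of dispreferred responses

loss = preference_loss(reward_head(chosen_emb).squeeze(-1),
                       reward_head(rejected_emb).squeeze(-1))
loss.backward()  # gradients flow into the reward head, as in standard training
print(f"pairwise preference loss: {loss.item():.4f}")
```

In a full RLHF pipeline this learned reward signal would then guide policy optimization of the language model; the sketch isolates only the preference-learning objective.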

Papers