LLM Alignment
LLM alignment is concerned with steering the behavior of large language models toward human values and preferences, with the goal of mitigating harmful outputs such as bias, misinformation, and compliance with unsafe instructions. Current research emphasizes more efficient and robust alignment techniques, including Direct Preference Optimization (DPO) and reinforcement-learning approaches such as Proximal Policy Optimization (PPO), often incorporating personalized preferences and accounting for the unreliability of human feedback. The field is central to the safe and beneficial deployment of LLMs, shaping both the development of more trustworthy AI systems and the broader societal impact of advanced language technologies.
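To make the contrast between preference-based and RL-based methods concrete, the sketch below shows the standard DPO objective in PyTorch: instead of training a separate reward model and optimizing it with PPO, DPO directly increases the log-probability margin between a preferred and a dispreferred response relative to a frozen reference model. The function signature, tensor names, and the choice of beta are illustrative assumptions, not taken from any specific paper listed on this page.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the Direct Preference Optimization loss.

    Each argument is a batch of summed log-probabilities that the trainable
    policy (or the frozen reference model) assigns to the chosen / rejected
    response of a preference pair. `beta` scales the implicit KL penalty
    toward the reference model.
    """
    # Log-ratio of policy to reference for each response in the pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # The policy is rewarded for widening the margin between the
    # preferred and dispreferred responses, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)

    # Negative log-sigmoid of the margin: a Bradley-Terry preference loss.
    return -F.logsigmoid(logits).mean()
```

In practice the per-response log-probabilities would come from a forward pass of the policy and reference models over the same preference dataset used for PPO-style RLHF, which is why DPO is often framed as a simpler, reward-model-free alternative.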
Papers
July 8, 2024
July 3, 2024
June 21, 2024
June 17, 2024
June 16, 2024
June 9, 2024
June 7, 2024
June 3, 2024
May 30, 2024
May 28, 2024
May 24, 2024
May 8, 2024
April 16, 2024
April 3, 2024
March 27, 2024
March 18, 2024
March 14, 2024
February 27, 2024
February 22, 2024