LLM Alignment
LLM alignment aims to bring the behavior of large language models in line with human values and preferences, mitigating harmful outputs such as biased content, misinformation, and unsafe instructions. Current research emphasizes more efficient and robust alignment techniques, including Direct Preference Optimization (DPO) and reinforcement learning from human feedback with Proximal Policy Optimization (PPO), often incorporating personalized preferences and addressing the unreliability of human feedback. This work is crucial for the safe and beneficial deployment of LLMs, shaping both the development of more trustworthy AI systems and the broader societal impact of advanced language technologies.
Papers
AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach
Maryam Amirizaniani, Elias Martin, Tanya Roosta, Aman Chadha, Chirag Shah
Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann