Model Alignment
Model alignment focuses on ensuring that large language models (LLMs) behave as intended, keeping their outputs consistent with human values and preferences. Current research emphasizes improving reward models through techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), often incorporating richer feedback signals than simple binary preferences. These advances aim to mitigate format bias, safety vulnerabilities introduced by fine-tuning, and the propagation of errors from unreliable training data, yielding more reliable and trustworthy AI systems. This work improves the safety and usability of LLMs across diverse applications, from medical decision support to code generation.
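To make the contrast with binary-preference training concrete, the sketch below shows the standard DPO objective over pairs of chosen and rejected responses. It is a minimal illustration, not the method of any paper listed here; the function name, tensor arguments, and beta value are assumptions for the example, and it presumes summed per-response log-probabilities have already been computed under the trained policy and a frozen reference model.

```python
# Minimal DPO loss sketch (illustrative; names and beta are assumptions).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Log-ratio of the trained policy vs. the frozen reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implied reward margin between chosen and rejected responses.
    logits = beta * (chosen_logratio - rejected_logratio)
    # Maximize the log-probability that the chosen response is preferred.
    return -F.logsigmoid(logits).mean()
```

In practice, richer feedback than a binary preference (rankings, ratings, or fine-grained annotations) changes how these pairs are constructed or weighted, which is one direction the papers below explore.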
Papers
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang
Anchored Alignment for Self-Explanations Enhancement
Luis Felipe Villa-Arenas, Ata Nizamoglu, Qianli Wang, Sebastian Möller, Vera Schmitt