Alignment Tuning

Alignment tuning modifies large language models (LLMs) so that their outputs better reflect human preferences and avoid harmful content. Current research focuses on making alignment techniques more robust and efficient, exploring Direct Preference Optimization (DPO) and its variants, as well as approaches such as subspace projection and in-context learning that reduce reliance on extensive fine-tuning. These efforts are crucial for mitigating the risks of deploying LLMs and for improving their trustworthiness and usability in sensitive applications such as medical diagnosis.
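
As a rough illustration of one of these methods, the sketch below shows the core DPO preference loss in PyTorch. It assumes that per-sequence log-probabilities for the chosen and rejected responses have already been computed under both the policy being tuned and a frozen reference model; the function name dpo_loss, the tensor arguments, and the beta value are illustrative assumptions, not drawn from any particular paper listed below.

```python
# Minimal sketch of the DPO objective, assuming precomputed per-sequence
# log-probabilities (summed over tokens) for each preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor with one log-probability per pair;
    beta controls the strength of the implicit KL penalty that keeps
    the policy close to the reference model.
    """
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the chosen log-ratio above the rejected one via a sigmoid margin.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with random values standing in for real model log-probabilities.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

Because the reference model enters only through these log-ratios, no separate reward model or on-policy sampling is required, which is the main efficiency argument for DPO-style methods over classic RLHF pipelines.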

Papers