Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm that trains agents by limiting how far each update can move the policy away from the one that collected the data. Current research focuses on improving its efficiency and robustness. Recent work explores refined credit assignment (e.g., VinePPO), the incorporation of human feedback and safety mechanisms (e.g., HI-PPO, PRPO), and better sample efficiency and scaling to high-dimensional spaces, for example by integrating diffusion models. These advances matter for robotics, autonomous systems, and large language model alignment, where learning effective policies from interaction with the environment is crucial.
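For readers new to the algorithm, the core of PPO is usually described as a clipped surrogate objective. The following is a minimal NumPy sketch of that objective only, not a full training loop and not the method of any paper listed here. The function name `ppo_clip_objective`, its parameter names, and the default `clip_eps=0.2` are illustrative assumptions.

```python
import numpy as np


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp, old_logp: log-probabilities of the taken actions under the
        current policy and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates for the same actions.
    clip_eps: clipping range; 0.2 is an assumed, commonly used default.
    """
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the current and the old policy.
    ratio = np.exp(new_logp - old_logp)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # The elementwise minimum removes the incentive to move the ratio
    # outside the clipping range in the direction that raises the objective.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the two policies agree, the ratio is 1 and the objective reduces to the mean advantage. When the ratio drifts outside `[1 - clip_eps, 1 + clip_eps]` in the direction that would increase the objective, the clipped term caps the gain, which is what keeps each update "proximal".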
Papers
Is poisoning a real threat to LLM alignment? Maybe more so than you think
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang
P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models
Shuo Yang, Chenchen Yuan, Yao Rong, Felix Steinbauer, Gjergji Kasneci
SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation
Minda Hu, Licheng Zong, Hongru Wang, Jingyan Zhou, Jingjing Li, Yichen Gao, Kam-Fai Wong, Yu Li, Irwin King