Contrastive Reward

Contrastive reward learning enhances reinforcement learning (RL) by training reward models to distinguish between desirable and undesirable outputs, yielding more reliable reward signals than standard scalar reward modeling. Current research applies this technique to tasks such as large language model alignment, image captioning, and abstractive summarization, typically with transformer-based architectures and policy-optimization methods like Proximal Policy Optimization (PPO). The approach addresses challenges such as reward model fragility and hallucination in generated content, producing more robust and accurate models with improved human evaluation scores across diverse applications.
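The contrastive objective described above is often instantiated as a pairwise (Bradley-Terry-style) loss: the reward model is pushed to score a preferred output above a dispreferred one. A minimal sketch, assuming the reward model has already produced scalar scores for each output (the function name and inputs here are illustrative, not from any specific paper):

```python
import math

def contrastive_reward_loss(chosen_scores, rejected_scores):
    """Pairwise contrastive reward loss.

    For each (chosen, rejected) pair, the loss -log(sigmoid(r_c - r_r))
    is minimized when the reward model scores the desirable output
    well above the undesirable one.
    """
    losses = []
    for r_c, r_r in zip(chosen_scores, rejected_scores):
        margin = r_c - r_r
        # -log(sigmoid(margin)) written in a numerically direct form
        losses.append(math.log(1.0 + math.exp(-margin)))
    return sum(losses) / len(losses)

# A pair scored correctly (chosen above rejected) incurs a small loss;
# the reversed ordering incurs a large one.
good = contrastive_reward_loss([2.0], [0.0])
bad = contrastive_reward_loss([0.0], [2.0])
```

In practice the scores come from a learned reward head over a transformer, and this loss trains that head before (or alongside) RL fine-tuning with PPO.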

Papers