Dueling Bandit
Dueling bandits is a machine learning framework for learning the best option from pairwise comparisons rather than from direct reward signals: in each round the learner selects two options and observes only which of the two was preferred. Current research extends this basic setting to contextual information, delayed or adversarial feedback, and non-stationary preferences. The algorithms are often based on upper confidence bounds, Thompson sampling, or EXP3 variants, and are sometimes augmented with neural networks or large language models. These extensions matter for applications such as recommendation systems, online advertising, and human-in-the-loop optimization, where direct reward signals are unavailable or costly to obtain. Ongoing work aims for algorithms that are both theoretically sound, typically with regret guarantees, and computationally efficient under these real-world complications.
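To make the pairwise-feedback setting concrete, the following is a minimal sketch of a stationary, non-contextual dueling bandit loop with an optimistic, upper-confidence-bound style selection rule. It is an illustration only, not a faithful implementation of any specific published algorithm. The preference matrix `PREF`, the function names, and the exploration parameter `alpha` are all hypothetical choices made for this example.

```python
import math
import random

# Hypothetical preference matrix: PREF[i][j] is the probability that
# arm i is preferred over arm j. Arm 1 beats every other arm here.
PREF = [
    [0.5, 0.3, 0.6],
    [0.7, 0.5, 0.8],
    [0.4, 0.2, 0.5],
]


def duel(pref, i, j, rng):
    """Simulate one noisy comparison; True means arm i was preferred."""
    return rng.random() < pref[i][j]


def optimistic_dueling_sketch(pref, rounds, alpha=0.51, seed=0):
    """Pick pairs optimistically and return the arm that wins most matchups.

    alpha is an assumed exploration constant, not a tuned value.
    """
    rng = random.Random(seed)
    k = len(pref)
    wins = [[0] * k for _ in range(k)]  # wins[i][j]: times i beat j

    def upper(i, j, t):
        # Optimistic estimate of P(i preferred over j) at round t.
        if i == j:
            return 0.5
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0
        return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)

    for t in range(1, rounds + 1):
        # Arms that could still plausibly beat every other arm.
        candidates = [
            i for i in range(k)
            if all(upper(i, j, t) >= 0.5 for j in range(k))
        ]
        first = rng.choice(candidates) if candidates else rng.randrange(k)
        # Challenger: the arm with the best optimistic chance against first.
        second = max(
            (j for j in range(k) if j != first),
            key=lambda j: upper(j, first, t),
        )
        # The learner only sees which arm won, never a numeric reward.
        if duel(pref, first, second, rng):
            wins[first][second] += 1
        else:
            wins[second][first] += 1

    def matchups_won(i):
        return sum(1 for j in range(k) if j != i and wins[i][j] > wins[j][i])

    return max(range(k), key=matchups_won)


if __name__ == "__main__":
    print(optimistic_dueling_sketch(PREF, rounds=3000))
```

The only feedback used is the binary outcome of each duel. Contextual, adversarial, or non-stationary variants would need a different estimator and selection rule than the simple win counts assumed here.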