Standard Bandit

The standard (multi-armed) bandit problem asks how to choose sequentially among several options whose rewards are uncertain so as to maximize cumulative reward, which means balancing exploration of poorly understood options against exploitation of the best one found so far. Current research aims to make bandit algorithms more efficient in several settings: incorporating contextual information (e.g., Bayesian updates from online reviews), handling batched and delayed feedback in real-world applications, and adapting to non-stationary reward distributions through techniques such as learning-rate-free reinforcement learning and hierarchical expert systems. These advances broaden the applicability of bandit algorithms to fields ranging from dynamic pricing and robotic exploration to large-scale adaptive experimentation.
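To make the basic setting concrete, the following is a minimal sketch of one classical strategy, UCB1, run on simulated Bernoulli arms. It is an illustration only: the summary above does not single out this algorithm, and the function name `ucb1`, the arm success probabilities, the horizon, and the seed are all assumptions chosen for the example.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return (pull counts, total reward).

    arm_means are hypothetical success probabilities used only to simulate
    rewards; the algorithm itself never reads them directly.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total reward observed per arm
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each has an initial estimate.
            arm = t - 1
        else:
            # Empirical mean plus an exploration bonus that shrinks
            # as an arm accumulates pulls.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Simulated Bernoulli reward for the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


if __name__ == "__main__":
    counts, total = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("cumulative reward:", total)
```

With well-separated arms like these, most pulls should concentrate on the arm with the highest success probability, while the weaker arms keep receiving occasional exploratory pulls. Contextual, batched, delayed-feedback, and non-stationary variants change what the learner observes and when, and are not covered by this sketch.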

Papers