Multi-Armed Bandit

The multi-armed bandit problem models sequential decision-making under uncertainty: a learner repeatedly selects among actions (arms) with unknown payoff distributions, aiming to maximize cumulative reward by balancing exploration of poorly understood arms against exploitation of arms that currently look best. Current research emphasizes refining algorithms such as Thompson Sampling, Upper Confidence Bound (UCB), and Follow-the-Perturbed-Leader (FTPL), often incorporating techniques such as distribution matching, and addressing complications like delayed feedback, asynchronous agents, and non-linear reward structures. These advances are crucial for applications across diverse fields, including online advertising, clinical trials, recommendation systems, and materials discovery, where efficient exploration and exploitation are paramount.
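To make the exploration-exploitation trade-off concrete, here is a minimal sketch of two of the algorithms mentioned above, UCB1 and Thompson Sampling, on a Bernoulli bandit. The arm payoff probabilities, horizon, and Beta(1, 1) priors are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

def ucb1(means, horizon, rng):
    """UCB1: pull the arm maximizing empirical mean + confidence radius."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialize counts
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < means[arm])  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward

def thompson_sampling(means, horizon, rng):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms."""
    k = len(means)
    successes = np.ones(k)  # Beta posterior alpha per arm
    failures = np.ones(k)   # Beta posterior beta per arm
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(successes, failures)  # one posterior draw per arm
        arm = int(np.argmax(samples))            # act greedily on the draw
        reward = float(rng.random() < means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = [0.3, 0.5, 0.7]  # hypothetical arm payoff probabilities
    horizon = 10_000
    best = max(means) * horizon  # expected reward of always playing the best arm
    for name, algo in [("UCB1", ucb1), ("Thompson", thompson_sampling)]:
        reward = algo(means, horizon, rng)
        print(f"{name}: reward={reward:.0f}, regret={best - reward:.0f}")
```

Both methods spend early rounds sampling every arm but concentrate pulls on the best arm as evidence accumulates, so cumulative regret grows sublinearly with the horizon rather than linearly as it would under uniform random play.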

Papers