Bandit Policy

Bandit policies are algorithms for sequential decision-making under uncertainty: at each step the learner must balance exploring under-tried actions to learn their rewards against exploiting the actions that currently look best. Current research focuses on improving efficiency and robustness across settings such as contextual bandits (where each decision can depend on observed side information), partially observable contexts, and high-dimensional action spaces such as the slates ranked by recommendation systems. Prominent approaches include Thompson sampling, inverse contextual bandit methods, and algorithms that model reward functions with neural networks or Gaussian processes. These advances matter for applications such as recommender systems, online advertising, and resource allocation, where they can improve both performance and fairness in dynamic environments.
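
As a concrete illustration of the exploration/exploitation trade-off, below is a minimal sketch of Thompson sampling on a Bernoulli bandit, the simplest setting behind the approaches named above. The arm probabilities, horizon, and variable names are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed): three arms with unknown Bernoulli reward
# probabilities; the true values below exist only to drive the simulation.
true_probs = np.array([0.3, 0.5, 0.7])
n_arms = len(true_probs)

# Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(n_arms)  # 1 + observed successes
beta = np.ones(n_arms)   # 1 + observed failures

horizon = 1000
total_reward = 0

for t in range(horizon):
    # Exploration and exploitation in one step: sample a plausible mean
    # reward for each arm from its posterior, then act greedily on the
    # samples. Uncertain arms occasionally draw high values and get
    # explored; well-estimated good arms are usually exploited.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))

    # Pull the chosen arm and observe a Bernoulli reward.
    reward = rng.random() < true_probs[arm]
    total_reward += reward

    # Conjugate Beta-Bernoulli posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(f"total reward: {total_reward}")
print(f"posterior means: {alpha / (alpha + beta)}")
```

Contextual bandit variants extend this recipe by conditioning the reward posterior on the observed context (for example, via a Bayesian linear or neural reward model) before sampling and acting.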

Papers