Bandit Feedback
Bandit feedback, in which only the reward of the chosen action is observed, is a central challenge in online learning and optimization. Current research develops efficient algorithms for settings such as constrained Markov decision processes (CMDPs), combinatorial bandits, and linear MDPs, often using Thompson sampling, optimism-based methods, and Frank-Wolfe techniques to manage the exploration-exploitation trade-off inherent in bandit feedback. These advances matter for real-world problems where full-information feedback is impractical or costly, such as online advertising, recommendation systems, and network optimization. A major focus is the design of algorithms with provable regret bounds and low computational complexity, driving progress in both theoretical understanding and practical applications.
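To make the feedback model concrete, here is a minimal sketch of Thompson sampling for a Bernoulli multi-armed bandit: in each round only the pulled arm's reward is revealed, and only that arm's Beta posterior is updated. The arm means and horizon are illustrative assumptions for the demo, not taken from any of the papers listed below.

```python
import numpy as np

def thompson_sampling(true_means, horizon, rng=None):
    """Thompson sampling for Bernoulli bandits under bandit feedback:
    each round, only the chosen arm's reward is observed."""
    rng = rng or np.random.default_rng(0)
    k = len(true_means)
    alpha = np.ones(k)  # Beta posterior: 1 + successes per arm
    beta = np.ones(k)   # Beta posterior: 1 + failures per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its posterior and
        # play the argmax; posterior uncertainty drives exploration.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        # Bandit feedback: a reward is observed for this arm only.
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Illustrative (assumed) arm means and horizon for the demo.
means = [0.3, 0.5, 0.7]
T = 10_000
reward = thompson_sampling(means, T)
print(f"average reward: {reward / T:.3f}  (best arm mean: {max(means)})")
```

With enough rounds the average reward approaches the best arm's mean, illustrating the sublinear-regret behavior that the algorithms surveyed above aim to guarantee in richer settings.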
Papers
A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback
Guanyu Nie, Yididiya Y Nadew, Yanhui Zhu, Vaneet Aggarwal, Christopher John Quinn
Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics
Brendan Lucier, Sarath Pattathil, Aleksandrs Slivkins, Mengxiao Zhang
Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation
Uri Sherman, Tomer Koren, Yishay Mansour