Bandit Feedback
Bandit feedback is the setting in which a learner observes only the reward of the action it chose, not the rewards the other actions would have yielded, and it is a central challenge in online learning and optimization. Because unchosen actions reveal nothing, the learner must balance exploring poorly understood actions against exploiting those that currently look best. Current research develops efficient algorithms for settings including constrained Markov decision processes (CMDPs), combinatorial bandits, and linear MDPs, often using Thompson sampling, optimistic algorithms, and Frank-Wolfe methods to manage this exploration-exploitation trade-off. These advances matter for real-world problems with limited feedback, such as online advertising, recommendation systems, and network optimization, where obtaining full information is impractical or costly. A major focus is algorithms with provable regret bounds and low computational cost, which drives progress in both theory and practice.
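
To make the feedback model concrete, here is a minimal sketch of Thompson sampling on a simulated Bernoulli multi-armed bandit, the simplest setting in which the exploration-exploitation dilemma appears. It is an illustration under stated assumptions, not an algorithm from any specific paper in this area: the function name `thompson_sampling`, the arm means, the horizon, and the uniform Beta(1, 1) prior are all hypothetical choices made for the example. The point to notice is that only the chosen arm's statistics are updated each round, which is exactly what bandit feedback means.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated bandit.

    true_means is a hypothetical list of per-arm success probabilities.
    It is used only by the simulator and to compute regret; the learner
    never reads it directly.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1, per arm
    failures = [0] * n_arms   # observed rewards of 0, per arm
    pulls = [0] * n_arms
    total_reward = 0

    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior
        # (assumed Beta(1, 1) prior), then act greedily on the draws.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)

        # Bandit feedback: the environment reveals the reward of the
        # chosen arm only. Nothing is learned about the other arms.
        reward = 1 if rng.random() < true_means[arm] else 0

        pulls[arm] += 1
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1

    # Pseudo-regret: expected reward lost relative to always playing
    # the best arm, computed from the true means.
    best = max(true_means)
    regret = sum(pulls[a] * (best - true_means[a]) for a in range(n_arms))
    return pulls, total_reward, regret


if __name__ == "__main__":
    pulls, total_reward, regret = thompson_sampling([0.2, 0.5, 0.7], horizon=5000)
    print("pulls per arm:", pulls)
    print("total reward:", total_reward)
    print("pseudo-regret:", round(regret, 1))
```

Running the script prints how often each arm was pulled along with the accumulated pseudo-regret. In a full-information setting the learner would see all three rewards every round and would not need to explore; here, the posterior sampling step is what drives exploration of arms whose means are still uncertain.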