Rate-Optimal Regret

Rate-optimal regret in reinforcement learning (RL) concerns designing algorithms whose cumulative regret, the gap between an agent's accumulated performance and that of an optimal policy, scales as well as possible with problem parameters such as the time horizon and the size of the state and action spaces. Current research emphasizes efficient algorithms for a range of settings, including linear Markov decision processes (MDPs), contextual bandits with preference or direct feedback, and zero-inflated reward structures, often using techniques such as upper confidence bounds (UCB), Thompson sampling, and information-directed sampling. These advances improve the sample efficiency and robustness of RL algorithms in practical applications, particularly when data is limited or reward functions are complex, as in learning from human feedback.
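To make the notion of regret concrete, here is a minimal sketch of the UCB principle on a stochastic multi-armed bandit, the simplest setting where rate-optimal regret is well understood. Cumulative pseudo-regret after T steps is R(T) = sum over t of (mu* - mu_{a_t}), where mu* is the best arm's mean, and UCB1 attains the O(log T) growth that matches the Lai-Robbins lower bound up to constants. The Bernoulli arms, horizon, and function name below are illustrative assumptions, not taken from any specific paper listed here.

```python
import numpy as np

def ucb1(means, horizon, seed=None):
    """Run UCB1 on a Bernoulli bandit; return cumulative pseudo-regret per step."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)            # number of pulls per arm
    estimates = np.zeros(k)         # empirical mean reward per arm
    regret = np.zeros(horizon)
    best = max(means)
    for t in range(horizon):
        if t < k:
            arm = t                 # play each arm once to initialize
        else:
            # Optimism in the face of uncertainty: empirical mean plus bonus
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(estimates + bonus))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret[t] = (regret[t - 1] if t else 0.0) + best - means[arm]
    return regret

# Example: for fixed gaps, cumulative regret should grow roughly like log(T).
print(ucb1([0.4, 0.5, 0.6], horizon=5000, seed=0)[-1])
```

The same optimism principle underlies many of the UCB-style algorithms for linear MDPs and contextual bandits surveyed below, with the per-arm confidence bonus replaced by a confidence set over value functions or parameter vectors.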

Papers