Regret of exploratory policy improvement and $q$-learning [2411.01302]