Paper ID: 2409.14557

Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning

Jia Wan, Sean R. Sinclair, Devavrat Shah, Martin J. Wainwright

We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs, characterized by a partition of the state space into two components: exogenous states, which evolve stochastically in a manner unaffected by the agent's actions, and endogenous states, which can be affected by actions and evolve according to deterministic dynamics involving both the endogenous and exogenous states. Exo-MDPs provide a natural model for a variety of applications, including inventory control, portfolio management, power systems, and ride-sharing. While the Exo-MDP structure may seem restrictive on the surface, our first result establishes that any discrete MDP can be represented as an Exo-MDP. The underlying argument reveals that the transition and reward dynamics can be written as linear functions of the exogenous state distribution, so Exo-MDPs are instances of linear mixture MDPs; this yields a representational equivalence between discrete MDPs, Exo-MDPs, and linear mixture MDPs. The connection between Exo-MDPs and linear mixture MDPs leads to algorithms that are nearly sample-optimal, with regret guarantees scaling with the (effective) size $d$ of the exogenous state space, independently of the sizes of the endogenous state and action spaces, even when the exogenous state is {\em unobserved}. In particular, with an unobserved exogenous state of dimension $d$, we establish a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$. We also establish a matching regret lower bound of $\Omega(H^{3/2}d\sqrt{K})$ for non-stationary Exo-MDPs and a lower bound of $\Omega(Hd\sqrt{K})$ for stationary Exo-MDPs. We complement our theoretical findings with an experimental study on inventory control problems.
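As a concrete illustration of the Exo-MDP structure described in the abstract, the following is a minimal sketch of an inventory-control instance, assuming random demand plays the role of the exogenous state and on-hand inventory the endogenous state. The variable names, demand distribution, capacity, and reward form are our own illustrative choices and are not taken from the paper.

```python
import numpy as np

# Hypothetical inventory-control Exo-MDP (notation ours, not the paper's).
# Exogenous state: random demand, drawn independently of the agent's action.
# Endogenous state: on-hand inventory, which evolves deterministically given
# the current inventory, the order placed, and the realized demand.

rng = np.random.default_rng(0)

def sample_demand():
    """Exogenous dynamics: demand is stochastic and unaffected by actions."""
    return int(rng.integers(0, 5))      # demand in {0, ..., 4}

def step(inventory, order, demand, capacity=20):
    """Endogenous dynamics: deterministic given (inventory, order, demand)."""
    stocked = min(inventory + order, capacity)
    next_inventory = max(stocked - demand, 0)
    reward = min(stocked, demand) - 0.1 * order   # sales revenue minus ordering cost
    return next_inventory, reward

inventory, total_reward = 5, 0.0
for t in range(10):                     # one trajectory of horizon H = 10
    demand = sample_demand()            # exogenous state (may be unobserved by the agent)
    order = 3                           # a fixed, arbitrary policy for illustration
    inventory, reward = step(inventory, order, demand)
    total_reward += reward
print(total_reward)
```

In this sketch the agent's action (the order quantity) never influences the demand process, while the next inventory level is a deterministic function of the current inventory, the action, and the realized demand, matching the exogenous/endogenous decomposition the abstract describes.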

Submitted: Sep 22, 2024