Paper ID: 2401.06738
(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Anh Dang, Reza Babanezhad, Sharan Vaswani
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size $1$) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp(-\frac{T}{\sqrt{\kappa}}) + \sigma \right)$ convergence, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower-bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{T}\right)$ rate. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O(\exp(-\frac{T}{\kappa}) + \frac{\sigma^2}{T})$ rate. We empirically demonstrate the effectiveness of the proposed algorithms.
Submitted: Jan 12, 2024