Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm that minimizes a function by stepping along noisy gradient estimates, typically computed on a random subset of the data, rather than along the exact gradient. This makes it particularly useful in machine learning for training large models, where computing the exact gradient over the full dataset is computationally prohibitive. Current research focuses on improving SGD's efficiency and convergence properties: developing variants such as Adam; incorporating momentum, adaptive learning rates, and line search methods; and analyzing its behavior in high-dimensional and non-convex settings. These advances matter for training complex models such as deep neural networks and for improving machine learning applications in fields ranging from natural language processing to healthcare.
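As a rough illustration of the basic update together with the momentum technique mentioned above, here is a minimal sketch of minibatch SGD on a small synthetic least-squares problem. The function name `sgd_momentum`, the hyperparameter values, and the synthetic data are all assumptions chosen for illustration; they are not taken from any of the papers listed below.

```python
import numpy as np

def sgd_momentum(X, y, lr=0.02, momentum=0.9, batch_size=16, epochs=50, seed=0):
    """Fit a linear model by minibatch SGD with heavy-ball momentum (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)  # model parameters
    v = np.zeros(d)  # velocity: decaying sum of past update steps
    for _ in range(epochs):
        perm = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this minibatch only;
            # this is a noisy estimate of the full-data gradient.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            v = momentum * v - lr * grad
            w = w + v
    return w

# Hypothetical synthetic data: 256 samples, 3 features, small label noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * data_rng.normal(size=256)

w_hat = sgd_momentum(X, y)
print(np.round(w_hat, 2))
```

Setting `momentum=0.0` reduces this to plain minibatch SGD. Adaptive methods such as Adam replace the single global `lr` with per-coordinate step sizes derived from running gradient statistics.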
Papers
On the Convergence of Gradient Descent for Large Learning Rates
Alexandru Crăciun, Debarghya Ghoshdastidar
SGD with Clipping is Secretly Estimating the Median Gradient
Fabian Schaipp, Guillaume Garrigos, Umut Simsekli, Robert Gower
Training Artificial Neural Networks by Coordinate Search Algorithm
Ehsan Rokhsatyazdi, Shahryar Rahnamayan, Sevil Zanjani Miyandoab, Azam Asilian Bidgoli, H. R. Tizhoosh
Stochastic Hessian Fittings with Lie Groups
Xi-Lin Li
Communication-Efficient Distributed Learning with Local Immediate Error Compensation
Yifei Cheng, Li Shen, Linli Xu, Xun Qian, Shiwei Wu, Yiming Zhou, Tie Zhang, Dacheng Tao, Enhong Chen
Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing
Dominik Wagner, Basim Khajwal, C.-H. Luke Ong
RQP-SGD: Differential Private Machine Learning through Noisy SGD and Randomized Quantization
Ce Feng, Parv Venkitasubramaniam
Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit
Stefana Anita, Gabriel Turinici
How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry
AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size
Petr Ostroukhov, Aigerim Zhumabayeva, Chulu Xiang, Alexander Gasnikov, Martin Takáč, Dmitry Kamzolov
Non-convergence to global minimizers for Adam and stochastic gradient descent optimization and constructions of local minimizers in the training of artificial neural networks
Arnulf Jentzen, Adrian Riekert