Distributed Training
Distributed training accelerates machine learning model training by splitting the workload across multiple computing nodes. Current research focuses on improving efficiency by addressing straggler nodes (slow devices that hold up the rest), communication bottlenecks (especially in Parameter Server and AllReduce architectures), and the particular needs of specific model types such as Graph Neural Networks. These advances matter for training increasingly large and complex models, and they enable faster development and deployment of AI applications in domains ranging from recommendation systems to power grid optimization.
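To make the AllReduce idea above concrete, the following is a minimal single-process sketch of synchronous data-parallel training. It is an illustration under stated assumptions, not the method of any particular paper: the workers are simulated as list entries, the toy model is a one-parameter linear fit, and the names `local_gradient`, `all_reduce_mean`, and `train` are hypothetical helpers introduced here. A real system would perform the reduction over a network with a collective-communication library.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for an all-reduce: every worker ends up with the mean.

    Assumption: this runs in one process; a real all-reduce exchanges
    the values between nodes.
    """
    return sum(values) / len(values)


def train(shards: List[List[Sample]], lr: float = 0.01, steps: int = 100) -> float:
    """Synchronous data-parallel gradient descent on a single weight."""
    w = 0.0
    for _ in range(steps):
        # Each simulated worker computes a gradient on its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so replicas stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w


# Toy data following y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w_final = train(shards)
print(w_final)
```

Because the shards here are equal in size, the averaged gradient matches the full-batch gradient, so the result is the same as single-node training. The sketch also shows why stragglers matter in the synchronous setting: the update cannot be applied until every worker has contributed its gradient, so the slowest worker sets the pace of each step.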