Straggler Mitigation
Straggler mitigation addresses a key performance bottleneck in distributed machine learning: slow or failing worker nodes ("stragglers") that delay synchronous training steps and inflate overall computation time. Current research focuses on techniques such as gradient coding, asynchronous algorithms (e.g., asynchronous federated learning), and dynamic resource allocation, often tailored to specific architectures (parameter servers, hierarchical systems, serverless environments). Effective straggler mitigation is crucial for scaling machine learning to larger datasets and more complex models, affecting both the speed and the feasibility of training across applications ranging from deep learning to federated learning.
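To make the gradient-coding idea concrete, here is a minimal sketch of the classic 3-worker code that tolerates one straggler: each worker transmits a fixed linear combination of two partition gradients, and the coordinator can recover the exact full gradient sum from any two responses. The encoding matrix, decoding vectors, and toy gradients below are illustrative assumptions, not taken from any particular paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # gradient dimension (toy value)
g = rng.standard_normal((3, d))      # g[i] = gradient of data partition i

# Encoding matrix B: row i is the linear combination worker i sends.
# Each worker covers s + 1 = 2 partitions, so any one worker is redundant.
B = np.array([
    [0.5, 1.0,  0.0],   # worker 0 sends g0/2 + g1
    [0.0, 1.0, -1.0],   # worker 1 sends g1 - g2
    [0.5, 0.0,  1.0],   # worker 2 sends g0/2 + g2
])

coded = B @ g            # what each worker would transmit

# Decoding vectors: for every pair of surviving workers there is a
# combination a with a @ B[survivors] = all-ones, i.e. it recovers sum(g).
decode = {
    (0, 1): np.array([2.0, -1.0]),
    (0, 2): np.array([1.0,  1.0]),
    (1, 2): np.array([1.0,  2.0]),
}

true_sum = g.sum(axis=0)
for survivors, a in decode.items():
    recovered = a @ coded[list(survivors)]
    assert np.allclose(recovered, true_sum)
    print(f"workers {survivors} alone recover the full gradient sum")
```

The trade-off is extra computation per worker (each processes s + 1 partitions to tolerate s stragglers) in exchange for never waiting on the slowest node. Asynchronous approaches sidestep decoding entirely: the server applies each worker's update as it arrives, typically downweighting stale gradients rather than blocking on slow workers.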