Cluster Scheduling
Cluster scheduling allocates computational resources such as GPUs across multiple jobs in a high-performance computing environment, with the aim of minimizing job completion times and maximizing resource utilization. Recent research integrates techniques such as reinforcement learning to build more efficient and adaptable scheduling policies, and addresses challenges such as network contention and the need for interpretable models. These improvements matter most for computationally intensive workloads, particularly deep learning and large-scale scientific simulations, where shorter job completion times shorten research and development cycles.
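As a rough illustration of the two objectives named above, the following toy sketch computes average job completion time and GPU utilization for a fixed job order. It is not taken from any of the papers listed below. It assumes identical GPUs, jobs that all arrive at time 0, no preemption, and no backfilling (a job never starts before one ahead of it in the order). The function name `simulate` and the job list are hypothetical.

```python
import heapq

def simulate(jobs, total_gpus):
    """Run jobs in the given order on a pool of identical GPUs.

    jobs: list of (gpus_needed, duration) tuples, all submitted at time 0.
    Returns (average completion time, GPU utilization over the makespan).
    """
    running = []  # min-heap of (end_time, gpus) for jobs not yet released
    free = total_gpus
    now = 0.0
    completions = []
    for gpus, duration in jobs:
        if gpus > total_gpus:
            raise ValueError("job requests more GPUs than the cluster has")
        # Advance time to the next job finish until this job fits.
        while free < gpus:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        free -= gpus
        heapq.heappush(running, (now + duration, gpus))
        completions.append(now + duration)
    makespan = max(completions)
    busy_gpu_time = sum(g * d for g, d in jobs)
    avg_completion = sum(completions) / len(completions)
    return avg_completion, busy_gpu_time / (total_gpus * makespan)

# Hypothetical workload: (gpus_needed, duration) on a 4-GPU cluster.
jobs = [(4, 10), (1, 2), (1, 2), (2, 4)]

# First-in-first-out: run jobs in submission order.
fifo_avg, fifo_util = simulate(jobs, total_gpus=4)

# Shortest-job-first: run jobs in order of increasing duration.
sjf_avg, sjf_util = simulate(sorted(jobs, key=lambda j: j[1]), total_gpus=4)

print(f"FIFO: avg completion {fifo_avg:.1f}, utilization {fifo_util:.2f}")
print(f"SJF:  avg completion {sjf_avg:.1f}, utilization {sjf_util:.2f}")
```

On this workload both orders finish at the same time and therefore reach the same utilization, yet shortest-job-first gives a much lower average completion time (5.5 versus 12.0). The two objectives can move independently, which is one reason scheduling policies have to weigh them against each other.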
Papers
Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling
Boyang Li, Zhiling Lan, Michael E. Papka
A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Pengyu Yang, Jing Yang, Shaobo Li, Minyi Guo