Computing Cluster

Computing clusters are high-performance computing systems composed of interconnected nodes, primarily used for parallel processing of computationally intensive tasks. Current research focuses on optimizing cluster utilization through improved scheduling algorithms (e.g., leveraging machine learning for queue time prediction and resource allocation), efficient resource partitioning strategies (including hierarchical approaches for GPUs), and minimizing communication bottlenecks in distributed training of large models (like LLMs) via techniques such as sparsity and asynchronous methods. These advancements are crucial for accelerating scientific discovery across diverse fields, from AI model training and large-scale simulations to medical image analysis and materials science, by enabling faster and more efficient computation at scale.

Papers