High-Performance Computing Systems
High-performance computing (HPC) systems are crucial for tackling computationally intensive scientific problems and, increasingly, for powering large-scale AI applications. Current research focuses on optimizing HPC system design for AI workloads, including efficient resource allocation, novel cooling strategies (such as liquid cooling), and improved performance modeling for AI architectures such as transformers and large language models. This work develops advanced techniques, including machine learning-driven auto-tuning, optimized communication frameworks, and efficient I/O management, to address the distinct demands of AI and scientific simulations, ultimately enabling faster, more energy-efficient, and more reliable computation across a wide range of scientific disciplines.
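To make the idea of machine learning-driven auto-tuning concrete, here is a minimal sketch of a surrogate-guided tuner. It is not the method of any paper listed below: the search space (tile sizes and thread counts), the `measure_runtime` cost function, and the k-nearest-neighbor surrogate are all illustrative assumptions standing in for real kernel timings and a real learned model.

```python
import random

def measure_runtime(tile, threads):
    # Hypothetical stand-in for timing an HPC kernel at one configuration;
    # a real auto-tuner would compile and run the kernel here.
    return (tile - 64) ** 2 / 512 + (threads - 8) ** 2 / 4 + 1.0

def knn_predict(history, tile, threads, k=3):
    # Surrogate model: predict runtime as the mean runtime of the k
    # nearest previously measured configurations (a minimal ML stand-in
    # for the learned models used in practice).
    nearest = sorted(history,
                     key=lambda h: (h[0] - tile) ** 2 + (h[1] - threads) ** 2)[:k]
    return sum(t for _, _, t in nearest) / len(nearest)

def autotune(budget=15, seed=0):
    rng = random.Random(seed)
    # Assumed search space: tile size x OpenMP-style thread count.
    space = [(tile, threads) for tile in (16, 32, 64, 128, 256)
                             for threads in (1, 2, 4, 8, 16, 32)]
    # Seed the surrogate with a few random measurements.
    history = [(t, th, measure_runtime(t, th))
               for t, th in rng.sample(space, 5)]
    for _ in range(budget - 5):
        tried = {(t, th) for t, th, _ in history}
        # Measure the untried configuration the surrogate predicts fastest.
        cand = min((c for c in space if c not in tried),
                   key=lambda c: knn_predict(history, *c))
        history.append((*cand, measure_runtime(*cand)))
    return min(history, key=lambda h: h[2])  # best (tile, threads, runtime)

best_tile, best_threads, best_time = autotune()
```

The design point this illustrates is the trade-off the papers below study at scale: each real measurement is expensive, so a cheap learned surrogate decides which configurations are worth measuring within a fixed budget.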
Papers
Perspectives on AI Architectures and Co-design for Earth System Predictability
Maruti K. Mudunuru, James A. Ang, Mahantesh Halappanavar, Simon D. Hammond, Maya B. Gokhale, James C. Hoe, Tushar Krishna, Sarat S. Sreepathi, Matthew R. Norman, Ivy B. Peng, Philip W. Jones
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels
Ali TehraniJamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, Ali Jannesari
Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems
Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, David Beckingsale, Todd Gamblin, Bronis de Supinski
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda