Single GPU
Single-GPU computing remains a crucial research area focused on optimizing the performance and energy efficiency of machine learning workloads, including large language model (LLM) inference, image generation, and other computationally intensive algorithms. Current work emphasizes efficient memory management, alternative attention mechanisms (such as linear attention), and optimized kernel designs that maximize throughput and minimize latency, often targeting specific architectures like Transformers and diffusion models. These advances matter because they enable cost-effective deployment of powerful AI models on readily available hardware, broadening access to advanced computational capabilities and accelerating progress across scientific and industrial applications.
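To illustrate why linear attention helps on memory-constrained single-GPU setups: standard attention materializes an n×n score matrix, while linear attention reassociates the product so cost scales linearly in sequence length. The sketch below is a minimal NumPy illustration using the common φ(x) = elu(x) + 1 feature map; it is a generic formulation for exposition, not the method of any specific paper listed here.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: one common choice that keeps features positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Reassociate phi(Q) (phi(K)^T V) so cost is O(n * d * d_v),
    # avoiding the O(n^2) score matrix of standard attention.
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                    # (d, d_v) summary of keys/values
    Z = Qf @ Kf.sum(axis=0)          # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

# Small demo: equivalent to normalized phi(Q) phi(K)^T V computed quadratically
rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

For illustration only: production kernels fuse these steps and stream the (d, d_v) state across the sequence rather than storing full activations.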
Papers
Forecasting GPU Performance for Deep Learning Training and Inference
Seonho Lee, Amar Phanishayee, Divya Mahajan
LiNR: Model Based Neural Retrieval on GPUs at LinkedIn
Fedor Borisyuk, Qingquan Song, Mingzhou Zhou, Ganesh Parameswaran, Madhu Arun, Siva Popuri, Tugrul Bingol, Zhuotao Pei, Kuang-Hsuan Lee, Lu Zheng, Qizhan Shao, Ali Naqvi, Sen Zhou, Aman Gupta
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui
Characterizing and Understanding HGNN Training on GPUs
Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan
Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models
Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, Daniele Faccio
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu
NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
Rachid Karami, Hemanth Kota, Sheng-Chun Kao, Hyoukjun Kwon
MHLR: Moving Haar Learning Rate Scheduler for Large-scale Face Recognition Training with One GPU
Xueyuan Gong, Yain-whar Si, Zheng Zhang, Xiaochen Yuan, Ke Wang, Xinyuan Zhang, Cong Lin, Xiaoxiang Liu