Paper ID: 2402.06787
ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
Liangyu Zhao, Saeed Maleki, Aashaka Shah, Ziyue Yang, Hossein Pourreza, Arvind Krishnamurthy
As modern DNN models grow ever larger, collective communications among accelerators (allreduce, etc.) have emerged as a significant performance bottleneck. Designing efficient communication schedules is challenging, given today's highly diverse and heterogeneous network fabrics. In this paper, we present ForestColl, a tool that generates performant schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretically optimal throughput. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabric, including both switching fabrics and direct connections. We evaluated ForestColl on multi-box AMD MI250 and NVIDIA DGX A100 platforms. ForestColl's schedules delivered up to 130% higher performance than the vendors' own optimized communication libraries, RCCL and NCCL, and achieved a 20% speedup in LLM training. ForestColl also outperforms other state-of-the-art schedule generation techniques, producing schedules that are up to 61% more efficient while generating them orders of magnitude faster.
Submitted: Feb 9, 2024
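
To make the tree-based scheduling idea in the abstract concrete, here is a toy sketch: model the fabric as a graph, build a spanning tree of it, aggregate values leaf-to-root, then broadcast the result root-to-leaf. This is only an illustration of tree-based allreduce in general, not ForestColl's throughput-optimal tree construction; all names (`bfs_spanning_tree`, `tree_allreduce_sum`) are hypothetical.

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """Return (child -> parent map, BFS visit order) for a tree rooted at `root`."""
    parent, order = {root: None}, [root]
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                q.append(v)
    return parent, order

def tree_allreduce_sum(adj, values, root=0):
    """Allreduce(sum): aggregate up the spanning tree, then broadcast down."""
    parent, order = bfs_spanning_tree(adj, root)
    acc = dict(values)
    for v in reversed(order):          # leaf-to-root aggregation pass
        if parent[v] is not None:
            acc[parent[v]] += acc[v]
    total = acc[root]                  # root now holds the global sum
    return {v: total for v in order}   # root-to-leaf broadcast pass

# Example: 4 accelerators in a ring; every node ends with the global sum.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(tree_allreduce_sum(ring, {0: 1, 1: 2, 2: 3, 3: 4}))
# {0: 10, 1: 10, 3: 10, 2: 10}
```

A single BFS tree like this can easily bottleneck on a few links; the paper's contribution is constructing sets of spanning trees matched to the (possibly heterogeneous, switched) topology so that the aggregate schedule achieves theoretically optimal throughput.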