Expert Parallelism

Expert parallelism aims to accelerate the training and inference of large-scale machine learning models, particularly those with Mixture-of-Experts (MoE) architectures, by placing different experts on different processing units and routing each token to the device that hosts its assigned expert. Current research focuses on reducing the communication overhead of these parallel systems, exploring techniques such as novel scheduling algorithms, adaptive expert placement, and optimized communication patterns. These advances are crucial for training and deploying increasingly complex models: they affect both the scalability of research on large language models and the performance of real-world applications that require high-throughput inference.
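
As a rough illustration of the routing pattern involved, the sketch below simulates expert-parallel dispatch in a single process with NumPy, assuming a top-1 gate and one toy expert (a linear layer) per "device". The names `num_devices`, `expert_weights`, and `dispatched` are illustrative and not drawn from any particular library; in a real system the grouping and regrouping steps are replaced by all-to-all collectives across accelerators, which is exactly the communication the research above tries to minimize.

```python
# Minimal single-process sketch of expert-parallel dispatch (assumed setup,
# not a specific library's API): a top-1 gate assigns each token to an expert,
# tokens are grouped by the device owning that expert, each expert processes
# only its own tokens, and outputs are written back in original token order.
import numpy as np

rng = np.random.default_rng(0)

num_devices = 4          # one expert hosted per "device" in this toy setup
d_model = 8
num_tokens = 16

# Each device owns one expert; here an expert is just a weight matrix.
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(num_devices)]

tokens = rng.standard_normal((num_tokens, d_model))

# Gate: score every token against every expert, route each token to its top-1 expert.
gate_logits = tokens @ rng.standard_normal((d_model, num_devices))
assignment = gate_logits.argmax(axis=1)   # expert id chosen for each token

# Dispatch (simulated all-to-all): group token indices by the owning device.
dispatched = {e: np.flatnonzero(assignment == e) for e in range(num_devices)}

# Each device runs its expert only on the tokens it received.
outputs = np.zeros_like(tokens)
for expert_id, token_idx in dispatched.items():
    if token_idx.size:
        outputs[token_idx] = tokens[token_idx] @ expert_weights[expert_id]

# Combine (a second all-to-all in a real system): results are back in the
# original token order, ready for the next layer.
print(outputs.shape)  # (16, 8)
```

In a distributed run, the two grouping steps become the dispatch and combine all-to-alls, so their cost scales with how evenly the gate spreads tokens across devices, which is why scheduling and expert-placement strategies matter.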

Papers