Tensor Parallelism

Tensor parallelism is a technique for distributing the computation of large deep learning models across multiple devices, primarily to overcome per-device memory limits and to accelerate training and inference. Current research focuses on optimizing tensor parallelism for large language models (LLMs), in particular on reducing communication bottlenecks through computation-communication overlap and more efficient all-reduce algorithms, and on combining it with other parallelism strategies (e.g., data, pipeline, and expert parallelism) and with model architectures such as Mixture-of-Experts. These advances are crucial for training and deploying increasingly large models efficiently, and they affect both the scalability of scientific research and the development of practical applications that require high-performance AI.
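
To make the mechanism concrete, here is a minimal single-process sketch of Megatron-style tensor parallelism for an MLP block, written in plain NumPy: the first weight matrix is split column-wise across shards, the second row-wise, and the per-shard partial outputs are summed, which is the role the all-reduce collective plays on real multi-device hardware. The shard count, tensor shapes, and function names are illustrative assumptions, not taken from any particular paper or library, and real implementations would use device placement and collectives such as torch.distributed.all_reduce instead of an in-process sum.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU; element-wise, so it can be applied per shard.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def tensor_parallel_mlp(x, w1, w2, num_shards):
    """Simulate a tensor-parallel MLP on one process (illustrative sketch).

    w1 is split column-wise and w2 row-wise; each shard computes its partial
    output locally, and the final sum over shards stands in for the
    all-reduce that real multi-device setups perform.
    """
    w1_shards = np.split(w1, num_shards, axis=1)  # column-parallel first layer
    w2_shards = np.split(w2, num_shards, axis=0)  # row-parallel second layer

    partials = []
    for w1_i, w2_i in zip(w1_shards, w2_shards):
        h_i = gelu(x @ w1_i)          # local compute, no communication needed
        partials.append(h_i @ w2_i)   # local partial result of the second matmul
    return np.sum(partials, axis=0)   # stands in for the all-reduce (sum)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 64))      # (batch, d_model)
    w1 = rng.standard_normal((64, 256))   # (d_model, d_ff)
    w2 = rng.standard_normal((256, 64))   # (d_ff, d_model)

    sharded = tensor_parallel_mlp(x, w1, w2, num_shards=4)
    reference = gelu(x @ w1) @ w2         # unsharded computation

    # The sharded and unsharded results agree up to floating-point error.
    assert np.allclose(sharded, reference)
    print("max abs difference:", np.abs(sharded - reference).max())
```

The design choice being illustrated is why the column-then-row split is standard: the element-wise activation between the two matrix multiplications can be applied independently on each shard, so the only communication required for the whole block is a single all-reduce at the end.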

Papers