Parallel Inference

Parallel inference aims to accelerate the computationally intensive process of running machine learning models, particularly large language models (LLMs) and diffusion models, by distributing the workload across multiple processors. Current research focuses on optimizing parallel execution strategies such as tensor parallelism (splitting individual layers across devices), pipeline parallelism (assigning consecutive groups of layers to different devices), and non-autoregressive decoding (generating multiple tokens in parallel), as well as on efficient expert routing and aggregation for Mixture-of-Experts models. These advances are crucial for deploying complex models on resource-constrained devices such as mobile phones and for enabling high-throughput, low-latency applications such as interactive image generation and real-time language processing.
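
As a rough illustration of the first of these strategies, the sketch below simulates column-parallel tensor parallelism for a single linear layer on one machine with NumPy: the weight matrix is split column-wise across hypothetical "devices", each shard computes its slice of the output, and the slices are concatenated (the step a distributed system would perform with an all-gather collective). The shard layout, shapes, and names are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, d_in, d_out, num_devices = 4, 8, 16, 2

x = rng.standard_normal((batch, d_in))   # activations, replicated on every "device"
W = rng.standard_normal((d_in, d_out))   # full weight matrix of one linear layer

# Column-parallel split: device i holds its own contiguous block of output columns.
shards = np.split(W, num_devices, axis=1)

# Each "device" computes a partial output using only its shard of the weights.
partial_outputs = [x @ W_shard for W_shard in shards]

# All-gather: concatenating the partial outputs reproduces the full-layer result.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_reference = x @ W

assert np.allclose(y_parallel, y_reference)
print("tensor-parallel output matches the single-device result:", y_parallel.shape)
```

Because each shard touches only a fraction of the weights, the per-device memory and compute drop roughly in proportion to the number of devices, at the cost of the communication step that reassembles the output.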

Papers