Parallel Inference
Parallel inference aims to accelerate the computationally intensive process of running machine learning models, particularly large language models (LLMs) and diffusion models, by distributing the workload across multiple processors. Current research focuses on optimizing parallel strategies such as tensor parallelism, pipeline parallelism, and non-autoregressive decoding, as well as on developing efficient routing and aggregation techniques for Mixture-of-Experts models. These advances are crucial for deploying complex models on resource-constrained devices like mobile phones and for enabling high-throughput, low-latency applications such as interactive image generation and real-time language processing.
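As a rough illustration of one of these strategies, the sketch below simulates column-wise tensor parallelism for a single linear layer within one process. The function name, shapes, and shard count are illustrative assumptions rather than details from any listed paper; a real deployment would place each weight shard on a separate device and gather the partial outputs with a collective operation.

```python
# Minimal single-process sketch of column-wise tensor parallelism for a linear
# layer: the weight matrix is split into shards (standing in for devices),
# each shard computes a partial output, and the partials are concatenated.
# All names and sizes are illustrative, not taken from a specific paper.
import torch


def column_parallel_linear(x, weight, num_shards=2):
    """Compute x @ weight by splitting weight along its output (column) dim."""
    shards = torch.chunk(weight, num_shards, dim=1)     # one shard per "device"
    partial_outputs = [x @ shard for shard in shards]   # independent matmuls
    return torch.cat(partial_outputs, dim=-1)           # stand-in for all-gather


if __name__ == "__main__":
    x = torch.randn(4, 8)    # batch of 4 activations, hidden size 8
    w = torch.randn(8, 16)   # full weight matrix
    assert torch.allclose(column_parallel_linear(x, w), x @ w, atol=1e-6)
```

Splitting along the output dimension keeps each shard's matmul independent, so the only communication needed is the final gather of partial results.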