Inference Acceleration
Inference acceleration focuses on improving the speed and efficiency of running large-scale machine learning models, such as large language models (LLMs), large vision-language models (LVLMs), and graph neural networks (GNNs), without sacrificing accuracy. Current research emphasizes techniques like model pruning, efficient attention mechanisms, parallelization strategies (including CPU-based acceleration and distributed inference), and novel decoding methods (e.g., multi-head decoding and lookahead strategies). These advances are crucial for deploying computationally intensive models in resource-constrained environments and for improving the responsiveness and scalability of AI applications across domains.
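To make the lookahead-style decoding idea concrete, here is a minimal, self-contained sketch of greedy draft-then-verify (speculative) decoding, assuming a cheap drafter and an expensive target model that could score a drafted block in one pass. The names draft_model, target_model, and speculative_decode are hypothetical stand-ins rather than an API from any particular paper, and the toy models are deterministic rules standing in for real networks.

```python
VOCAB_SIZE = 100

def draft_model(prefix):
    # Hypothetical cheap drafter: a deterministic toy next-token rule.
    return (sum(prefix) * 31 + len(prefix)) % VOCAB_SIZE

def target_model(prefix):
    # Hypothetical expensive target: agrees with the drafter most of the time.
    tok = draft_model(prefix)
    return tok if sum(prefix) % 7 else (tok + 1) % VOCAB_SIZE

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy draft-then-verify loop: the drafter proposes k tokens, the
    target verifies them (one batched pass in a real system), and the
    longest agreeing prefix is accepted."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx, draft = list(out), []
        for _ in range(k):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify: accept drafted tokens while the target's greedy choice matches.
        accepted = 0
        ctx = list(out)
        for tok in draft:
            if target_model(ctx) != tok:
                break
            out.append(tok)
            ctx.append(tok)
            accepted += 1
        # 3) On a mismatch, emit one target token so decoding always advances.
        if accepted < k:
            out.append(target_model(out))
    return out[len(prompt):len(prompt) + num_tokens]

print(speculative_decode([1, 2, 3], num_tokens=12))
```

The speedup in a real system comes from the verification step: the target model scores all k drafted positions in a single batched forward pass, so each accepted token costs roughly 1/k of a full sequential target step.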