Inference Speedup
Inference speedup for large language models (LLMs) and other deep learning architectures is a research area that aims to reduce computational cost and latency without sacrificing accuracy. Current efforts focus on structured pruning, prompt engineering (internalizing prompts, skill-localized tuning), and decoding methods such as early exit and speculative parallel decoding. These techniques make large models cheaper and faster to deploy in real-world applications, with impact on fields including natural language processing, computer vision, speech recognition, and causal inference.
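To make the decoding idea concrete, here is a minimal sketch of the greedy form of speculative decoding: a cheap draft model proposes a few tokens, and the expensive target model keeps the longest prefix it agrees with. Everything here is an assumption made for illustration. `target_model` and `draft_model` are hypothetical toy functions over integer tokens, not real networks, and the sketch omits both the sampling-based acceptance rule and the extra "bonus" token used in published variants.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large, slow model."""
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small, fast model.

    It agrees with the target except when the prefix length is a multiple of 4.
    """
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the sequence and the number of
    verification rounds (each round would be one batched target pass)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))

        # 2. The target checks each proposed position. A real system scores
        #    all positions in a single forward pass; here we loop for clarity.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # replace with the target's token
                break                      # discard the rest of the proposal
        seq.extend(accepted)
    return seq, rounds


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 12)
spec_seq, spec_rounds = speculative_decode(target_model, draft_model, prompt, 12)
```

In this greedy setting the output matches what the target model would produce on its own, because every kept token is one the target itself would have chosen. Any saving comes from needing fewer target passes than generated tokens, which depends on how often the draft agrees with the target.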