Inference Engine
Inference engines are software systems that execute trained machine learning models, particularly large language models (LLMs), efficiently, balancing speed, memory usage, and accuracy. Current research emphasizes faster inference through techniques such as model quantization (reducing the numeric precision of model weights and activations), optimized hardware acceleration (e.g., specialized kernels and FPGAs), and architectural approaches such as sparse attention and early exits; two of these ideas are sketched below. These advances are crucial for deploying LLMs in resource-constrained environments, such as embedded systems and mobile devices, and for real-time applications that demand high throughput and low latency.
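To make quantization concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization using only NumPy. The function names (`quantize_int8`, `dequantize_int8`) and the per-tensor scaling scheme are illustrative assumptions, not the API of any particular inference engine; production systems typically use finer-grained (per-channel or per-group) scales.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 using a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024), dtype=np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"memory: {w.nbytes / 2**20:.0f} MB -> {q.nbytes / 2**20:.0f} MB")
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

The 4x memory reduction (float32 to int8) is what lets quantized models fit on memory-constrained devices; the reconstruction error printed at the end is the precision being traded away.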
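Early exiting can be sketched in a similarly simplified form: intermediate classifier heads let the engine stop after an early layer once its prediction is sufficiently confident, trading a little accuracy for latency. Everything below, including the toy tanh "layers", the `exit_heads`, and the 0.9 threshold, is a hypothetical stand-in for illustration, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CLASSES, N_LAYERS = 64, 10, 6

# Toy "layers": random linear maps with a nonlinearity, standing in for
# transformer blocks. Each layer gets a small linear exit head.
layers = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_LAYERS)]
exit_heads = [rng.standard_normal((DIM, CLASSES)) * 0.1 for _ in range(N_LAYERS)]

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def forward_with_early_exit(h: np.ndarray, threshold: float = 0.9):
    """Run layers in order, exiting as soon as an exit head is confident."""
    probs = None
    for i, (w, head) in enumerate(zip(layers, exit_heads)):
        h = np.tanh(h @ w)            # toy transformer block
        probs = softmax(h @ head)     # intermediate prediction at this depth
        if probs.max() >= threshold:  # confident enough: skip remaining layers
            return probs, i + 1
    return probs, N_LAYERS            # no early exit: full depth was used

probs, used = forward_with_early_exit(rng.standard_normal(DIM))
print(f"exited after {used}/{N_LAYERS} layers, top prob {probs.max():.2f}")
```

With random untrained weights the confidence threshold is rarely reached, so this toy run usually traverses all layers; in a trained model, easy inputs exit early and only hard inputs pay for full depth.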