Low Latency
Low latency, the minimization of delay in information processing, is a critical objective across diverse fields, driving research into efficient algorithms and hardware architectures. Current efforts focus on faster inference for large language models (LLMs) through techniques such as speculative decoding and efficient GPU resource allocation. In parallel, low-latency solutions for speech processing, image recognition, and other real-time workloads are being built on spiking neural networks and specialized hardware such as FPGAs. Low latency is crucial for real-time responsiveness in applications ranging from autonomous vehicles and interactive virtual reality to hearing aids and industrial IoT systems, where it directly shapes performance and user experience.
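To make the speculative decoding idea mentioned above concrete: a small draft model proposes several tokens cheaply, and the large target model only verifies them, so the expensive model contributes more than one accepted token per invocation and end-to-end latency drops. The sketch below is a minimal, greedy illustration of that idea under simplifying assumptions; `speculative_decode`, `draft_next`, `target_next`, and the toy models are hypothetical placeholders, not the API of any particular framework or of the papers listed here.

```python
# Minimal sketch of greedy speculative decoding (illustrative only).
# `draft_next` and `target_next` stand in for a small draft model and a
# large target model; both map a token sequence to its next token.

from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_decode(
    prompt: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    num_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Generate `num_tokens` tokens: the draft model proposes `draft_len`
    tokens per step, and the target model keeps the matching prefix."""
    seq = list(prompt)
    generated = 0
    while generated < num_tokens:
        # 1) Draft model proposes a short continuation.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft_next(seq + proposal))

        # 2) Target model verifies the proposal token by token and
        #    accepts the longest prefix it agrees with.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_next(seq + proposal[:i]) == tok:
                accepted += 1
            else:
                break

        # 3) Keep the accepted prefix, then append one token from the
        #    target model (the correction at the first mismatch, or a
        #    fresh token if the whole proposal was accepted).
        seq.extend(proposal[:accepted])
        seq.append(target_next(seq))
        generated += accepted + 1
    return seq[: len(prompt) + num_tokens]


if __name__ == "__main__":
    # Toy stand-ins: the "target" always emits last token + 1; the
    # "draft" agrees most of the time but drifts on some lengths.
    target = lambda s: (s[-1] + 1) % 100
    draft = lambda s: (s[-1] + 1) % 100 if len(s) % 3 else (s[-1] + 2) % 100
    print(speculative_decode([0], draft, target, num_tokens=10))
```

In this toy run the target model is queried once per verified position rather than once per generated token in strict sequence; in a real system those verification queries are batched into a single forward pass, which is where the latency saving comes from.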
Papers
SpikeCP: Delay-Adaptive Reliable Spiking Neural Networks via Conformal Prediction
Jiechen Chen, Sangwoo Park, Osvaldo Simeone
Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN Serving with Workload Awareness
Zeyuan Tan, Xiulong Yuan, Congjie He, Man-Kit Sit, Guo Li, Xiaoze Liu, Baole Ai, Kai Zeng, Peter Pietzuch, Luo Mai