Based Inference

Based inference, the process of running machine learning models, particularly large language models (LLMs), is the subject of significant optimization efforts aimed at reducing computational demands and latency. Current research focuses on developing lightweight LLMs, applying techniques such as optimized attention mechanisms and quantization, and designing frameworks for efficient inference on resource-constrained devices such as smartphones and microcontrollers. These advances are crucial for extending the accessibility and applicability of LLMs beyond high-performance computing environments, with impact on fields ranging from natural language processing to embedded systems. Research also explores methods for secure and efficient inference, including partially oblivious techniques that balance security and performance.
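To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization, one of the simplest schemes used to shrink model memory for inference. This is an illustrative toy in NumPy, not the method of any specific paper or library; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map floats to int8 with one scale.
    # The largest absolute weight maps to 127; everything else scales linearly.
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; error is bounded by half a scale step.
    return q.astype(np.float32) * scale

# int8 storage is 4x smaller than float32, at a small rounding cost.
w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
```

Production systems typically refine this with per-channel scales, zero points for asymmetric distributions, or calibration on activation statistics, but the round-and-rescale core is the same.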

Papers