Paper ID: 2405.15198

RAEE: A Robust Retrieval-Augmented Early Exiting Framework for Efficient Inference

Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Deploying large language models for inference remains challenging because of their high computational overhead. Early exiting accelerates inference by adaptively reducing the number of layers executed. Existing methods typically train internal classifiers to decide whether to exit at an intermediate layer. However, such classifier-based early exiting frameworks require significant effort to train the classifiers and, at best, achieve only comparable performance. To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exiting framework for efficient inference. First, this paper demonstrates that the early exiting problem can be modeled as a distribution prediction problem, where the distribution is approximated using the exiting information of similar data. Then, this paper details the process of collecting exiting information to build the retrieval database. Finally, based on the pre-built retrieval database, RAEE leverages the exiting information of the retrieved similar data to guide the backbone model to exit at the layer predicted by the approximated distribution. Experimental results demonstrate that RAEE can significantly accelerate inference. More importantly, RAEE also achieves robust zero-shot performance on 8 downstream tasks.
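
The retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `predict_exit_layer`), the use of cosine similarity, the similarity-weighted vote, the tie-break toward the earlier layer, and the toy database are all assumptions made for illustration. The abstract states only that the exit-layer distribution is approximated from the exiting information of similar data.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_exit_layer(query, database, k=3):
    """Approximate an exit-layer distribution from the k most similar entries.

    `database` is a hypothetical list of (embedding, exit_layer) pairs, where
    exit_layer is the layer recorded for that example when the database was built.
    Returns (predicted_layer, distribution); (None, {}) if no neighbor has weight.
    """
    # Rank all stored examples by similarity to the query and keep the top k.
    scored = sorted(
        ((cosine(query, emb), layer) for emb, layer in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Assumed weighting scheme: each neighbor votes for its exit layer
    # with a weight equal to its (non-negative) similarity.
    weights = defaultdict(float)
    for sim, layer in scored:
        weights[layer] += max(sim, 0.0)

    total = sum(weights.values())
    if total == 0.0:
        return None, {}

    dist = {layer: w / total for layer, w in weights.items()}
    # Pick the most probable layer; ties go to the earlier layer (assumption).
    best = max(dist, key=lambda layer: (dist[layer], -layer))
    return best, dist


# Toy database: two examples that exited at layer 4, two at layer 10.
toy_db = [([1.0, 0.0], 4), ([0.9, 0.1], 4), ([0.0, 1.0], 10), ([0.1, 0.9], 10)]
layer, dist = predict_exit_layer([1.0, 0.05], toy_db, k=3)
```

In this toy setting the query lies close to the two layer-4 examples, so most of the probability mass falls on layer 4 and the backbone model would stop there rather than run all layers.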

Submitted: May 24, 2024