Token Compression
Token compression aims to lower the computational cost and latency of large language models (LLMs) and other transformer-based architectures by reducing the number of input tokens they process. Current research focuses on efficient compression algorithms, including sentence-level encoding, token pruning, and token merging, often guided by attention mechanisms that prioritize the most informative tokens. These techniques are being applied across domains such as question answering, 3D object detection, and multimodal document understanding, delivering significant gains in inference speed and resource efficiency while maintaining comparable accuracy. These advances have substantial implications for deploying computationally intensive models in resource-constrained environments and for scaling them to increasingly large inputs.
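To make the general idea concrete, below is a minimal sketch of attention-guided token pruning in PyTorch. It is not taken from any of the listed papers; the function name, the `keep_ratio` parameter, and the use of per-token attention scores as an importance signal are illustrative assumptions.

```python
import torch

def prune_tokens(hidden_states: torch.Tensor,
                 attention_scores: torch.Tensor,
                 keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the most-attended tokens (illustrative sketch).

    hidden_states:    (batch, seq_len, dim) token embeddings
    attention_scores: (batch, seq_len) importance per token, e.g. attention
                      received from a [CLS]/query token (assumed available)
    keep_ratio:       fraction of tokens to retain
    """
    batch, seq_len, dim = hidden_states.shape
    k = max(1, int(seq_len * keep_ratio))
    # Select the k highest-scoring tokens, then restore their original order
    top_idx = attention_scores.topk(k, dim=1).indices.sort(dim=1).values
    index = top_idx.unsqueeze(-1).expand(-1, -1, dim)
    return hidden_states.gather(dim=1, index=index)

# Example: compress 16 tokens per sequence down to 8
x = torch.randn(2, 16, 64)
scores = torch.rand(2, 16)
print(prune_tokens(x, scores, keep_ratio=0.5).shape)  # torch.Size([2, 8, 64])
```

Merging-based variants replace the discarded tokens by averaging or pooling them into the retained ones rather than dropping them outright; the selection step shown above stays essentially the same.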
Papers
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
Ke Wang, Hong Xuan
SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, Mingjie Sun, Wenjin Wu, Quan Chen, Peng Jiang