Token Representation
Token representation in natural language processing (NLP) and computer vision concerns encoding textual or visual information into discrete units that machine learning models can process efficiently. Current research focuses on improving token representations to address issues such as bias mitigation, copyright protection, and computational efficiency, often employing transformer architectures and contrastive learning methods. These advances are crucial for improving model performance, interpretability, and fairness across applications including machine translation, hate speech detection, and visual tracking. Research is also actively exploring optimal tokenization strategies and efficient encoding techniques that reduce computational cost while maintaining accuracy.
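To make the idea concrete, here is a minimal sketch of turning text into discrete tokens and then into contextual token representations. It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, which are illustrative choices and not tied to any paper listed below.

```python
# Sketch: text -> discrete token IDs -> contextual token representations.
# Assumes `transformers` and `torch` are installed and the
# `bert-base-uncased` checkpoint can be downloaded.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Token representation encodes text as discrete units."

# Tokenization: map the string to sub-word token IDs (discrete units).
inputs = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

# Encoding: a transformer maps each token ID to a contextual vector.
with torch.no_grad():
    outputs = model(**inputs)
token_representations = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)
print(token_representations.shape)
```

Downstream work on fairness, copyright attribution, or efficiency typically operates on these per-token vectors or on the tokenization step itself.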
Papers
Collapsed Language Models Promote Fairness
Jingxuan Xu, Wuyang Chen, Linyi Li, Yao Zhao, Yunchao Wei
CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs
Qichao Ma, Rui-Jie Zhu, Peiye Liu, Renye Yan, Fahong Zhang, Ling Liang, Meng Li, Zhaofei Yu, Zongwei Wang, Yimao Cai, Tiejun Huang