Instruction Dataset
Instruction datasets are collections of (instruction, output) pairs used to fine-tune large language models (LLMs) so that they follow diverse instructions more reliably and generalize to unseen tasks. Current research focuses on building high-quality datasets efficiently, through programmatic generation, data selection based on diversity and quality metrics (e.g., k-means clustering and gradient analysis), and translation of existing English datasets into other languages. These methods matter because they reduce reliance on expensive human annotation, which makes it practical to develop capable, adaptable LLMs across more domains and languages.
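As a rough illustration of diversity-based selection, the sketch below clusters a handful of (instruction, output) pairs with k-means and keeps one representative per cluster. It is a minimal sketch under stated assumptions, not the method of any paper listed here: the toy bag-of-words vectors stand in for the sentence embeddings a real pipeline would likely use, and the names `embed`, `kmeans`, and `select_diverse` are hypothetical helpers introduced for this example.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vocab):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    counts = Counter(tokens(text))
    return [float(counts[w]) for w in vocab]


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            for p in points]


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


def select_diverse(pairs, k):
    """Keep the pair closest to each cluster center (one per cluster)."""
    vocab = sorted({w for p in pairs for w in tokens(p["instruction"])})
    points = [embed(p["instruction"], vocab) for p in pairs]
    centers, labels = kmeans(points, k)
    chosen = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if idx:
            chosen.append(min(idx, key=lambda i: dist2(points[i], centers[j])))
    return [pairs[i] for i in sorted(chosen)]


# Illustrative data only; not drawn from any real dataset.
pairs = [
    {"instruction": "Translate hello to French", "output": "bonjour"},
    {"instruction": "Translate goodbye to French", "output": "au revoir"},
    {"instruction": "Translate thanks to French", "output": "merci"},
    {"instruction": "Sum the numbers 1 and 2", "output": "3"},
    {"instruction": "Sum the numbers 3 and 4", "output": "7"},
]
selected = select_diverse(pairs, k=2)
```

With this toy input the selection keeps one translation instruction and one arithmetic instruction, which is the intended effect: near-duplicate instructions collapse into a single representative while distinct task types survive.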
Papers
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
LLaSA: Large Language and E-Commerce Shopping Assistant
Shuo Zhang, Boci Peng, Xinping Zhao, Boren Hu, Yun Zhu, Yanjia Zeng, Xuming Hu