Instruction Datasets

Instruction datasets are collections of task instructions paired with desired outputs, used to fine-tune large language models (LLMs) so that they follow diverse user instructions more reliably. Current research focuses on building larger, higher-quality datasets, often through automated generation, and on optimizing their composition for specific tasks or model architectures. Data selection methods under study include curriculum learning, which orders training examples, and submodular optimization, which picks a representative subset. This work supports LLM applications across many domains, including voice assistants, multimodal models, and specialized fields such as biomedicine and cybersecurity.
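To make the idea of submodular data selection concrete, here is a minimal sketch in Python. It assumes a facility-location objective with word-overlap (Jaccard) similarity between instructions; the overview above does not say which objective or similarity measure any particular paper uses, and published work typically relies on learned embeddings instead. The names `InstructionExample`, `jaccard`, and `facility_location_select`, as well as the toy pool, are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    """One instruction-tuning record: a task instruction and its desired output."""
    instruction: str
    output: str


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def facility_location_select(examples, k):
    """Greedily pick up to k examples that maximize facility-location coverage.

    The objective is f(S) = sum over all j of max over i in S of sim(i, j),
    which is monotone submodular, so the greedy rule is a standard choice.
    """
    n = len(examples)
    sim = [
        [jaccard(examples[i].instruction, examples[j].instruction) for j in range(n)]
        for i in range(n)
    ]
    selected = []
    best_cover = [0.0] * n  # how well each example is covered by the selected set
    for _ in range(min(k, n)):
        best_gain, best_idx = -1.0, None
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain: improvement in coverage if candidate c is added.
            gain = sum(max(sim[c][j] - best_cover[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_gain, best_idx = gain, c
        selected.append(best_idx)
        best_cover = [max(best_cover[j], sim[best_idx][j]) for j in range(n)]
    return [examples[i] for i in selected]


# Toy pool (made-up data): two near-duplicate translation tasks and
# two near-duplicate summarization tasks.
pool = [
    InstructionExample("Translate this sentence to French", "(French translation)"),
    InstructionExample("Translate this sentence to German", "(German translation)"),
    InstructionExample("Summarize the following article", "(article summary)"),
    InstructionExample("Summarize the following report", "(report summary)"),
]

subset = facility_location_select(pool, k=2)
for ex in subset:
    print(ex.instruction)
```

With a budget of two, the greedy rule selects one translation task and one summarization task rather than two near-duplicates, which is the diversity effect that motivates submodular selection for instruction data.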

Papers