High Quality
High-quality data is essential to the success of machine learning models, which motivates research into efficient and reliable methods for data creation, curation, and evaluation. Current efforts apply algorithms and model architectures such as diffusion models, generative adversarial networks (GANs), and large language models (LLMs) to improve data quality across domains including image generation, speech processing, and natural language processing. These advances improve the performance and reliability of machine learning systems and enable new applications in fields ranging from medical imaging to robotics. Robust evaluation metrics and automated quality control methods are also a key area of focus.
Papers
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
Mohammed Khalil, Mohammed Sabry
CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare
Jingwei Zhu, Minghuan Tan, Min Yang, Ruixue Li, Hamid Alinejad-Rokny
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai
OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
Steffen Kleinle, Jakob Prange, Annemarie Friedrich
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need
Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting