Dataset Creation
Dataset creation for machine learning, particularly in complex domains such as natural language processing and computer vision, is an active research area focused on improving data quality, efficiency, and representativeness. Current efforts include developing automated pipelines for data generation and annotation, using large language models to streamline that process, and applying novel techniques such as auction mechanisms to optimize resource allocation. These advances aim to make machine learning models more reliable and generalizable, with applications in fields ranging from legal tech and finance to healthcare and industrial automation.
Papers
Instruction-based Image Manipulation by Watching How Things Move
Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia
Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection
Arij Riabi, Virginie Mouilleron, Menel Mahamdi, Wissam Antoun, Djamé Seddah