Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for data that realistically represents real-world scenarios. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using various model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn supports both scientific research and practical applications.
Papers
Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations
Anton Klenitskiy, Anna Volodkevich, Anton Pembek, Alexey Vasilev
Toward Enhancing Vehicle Color Recognition in Adverse Conditions: A Dataset and Benchmark
Gabriel E. Lima, Rayson Laroca, Eduardo Santos, Eduil Nascimento Jr., David Menotti
A Dataset for Mechanical Mechanisms
Farshid Ghezelbash, Amir Hossein Eskandari, Amir J Bidhendi
Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development
Yuncheng Jiang, Yiwen Hu, Zixun Zhang, Jun Wei, Chun-Mei Feng, Xuemei Tang, Xiang Wan, Yong Liu, Shuguang Cui, Zhen Li
Sequential Federated Learning in Hierarchical Architecture on Non-IID Datasets
Xingrun Yan, Shiyuan Zuo, Rongfei Fan, Han Hu, Li Shen, Puning Zhao, Yong Luo
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal
Historical Printed Ornaments: Dataset and Tasks
Sayan Kumar Chaki, Zeynep Sonat Baltaci, Elliot Vincent, Remi Emonet, Fabienne Vial-Bonacci, Christelle Bahier-Porte, Mathieu Aubry, Thierry Fournel
RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions
Gregory Kell, Angus Roberts, Serge Umansky, Yuti Khare, Najma Ahmed, Nikhil Patel, Chloe Simela, Jack Coumbe, Julian Rozario, Ryan-Rhys Griffiths, Iain J. Marshall
Privacy-preserving datasets by capturing feature distributions with Conditional VAEs
Francesco Di Salvo, David Tafler, Sebastian Doerrich, Christian Ledig
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
WAS: Dataset and Methods for Artistic Text Segmentation
Xudong Xie, Yuzhe Li, Yang Liu, Zhifei Zhang, Zhaowen Wang, Wei Xiong, Xiang Bai
StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality
Alexandra Kapp, Edith Hoffmann, Esther Weigmann, Helena Mihaljević
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model
Zhichao Zhang, Wei Sun, Xinyue Li, Jun Jia, Xiongkuo Min, Zicheng Zhang, Chunyi Li, Zijian Chen, Puyi Wang, Fengyu Sun, Shangling Jui, Guangtao Zhai