Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing novel annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using a range of model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
Utilizing Large Language Models to Synthesize Product Desirability Datasets
John D. Hastings, Sherri Weitl-Harms, Joseph Doty, Zachary L. Myers, Warren Thompson
CAFE A Novel Code switching Dataset for Algerian Dialect French and English
Houssam Eddine-Othman Lachemat, Akli Abbas, Nourredine Oukas, Yassine El Kheir, Samia Haboussi, Absar Showdhury Shammur
The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather
Markus Schön, Jona Ruof, Thomas Wodtko, Michael Buchholz, Klaus Dietmayer
Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications
Scarlett Raine, Frederic Maire, Niko Suenderhauf, Tobias Fischer
A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
Caspar Oesterheld, Emery Cooper, Miles Kodama, Linh Chi Nguyen, Ethan Perez
Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process
Quentin Bateux, Jonathan Koss, Patrick W. Sweeney, Erika Edwards, Nelson Rios, Aaron M. Dollar
Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI
Maraz Mia, Darius Derakhshan, Mir Mehedi A. Pritom
Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset
Mohammad Saiful Islam, Mohamed Sami Rakha, William Pourmajidi, Janakan Sivaloganathan, John Steinbacher, Andriy Miranskyy
High-resolution optical and acoustic remote sensing datasets of the Puck Lagoon, Southern Baltic
Łukasz Janowski, Dimitrios Skarlatos, Panagiotis Agrafiotis, Paweł Tysiąc, Andrzej Pydyn, Mateusz Popek, Anna M. Kotarba-Morley, Gottfried Mandlburger, Łukasz Gajewski, Mateusz Kołakowski, Alexandra Papadaki, Juliusz Gajewski
Graph Neural Networks in Supply Chain Analytics and Optimization: Concepts, Perspectives, Dataset and Benchmarks
Azmine Toushik Wasi, MD Shafikul Islam, Adipto Raihan Akib, Mahathir Mohammad Bappy
Harnessing Smartphone Sensors for Enhanced Road Safety: A Comprehensive Dataset and Review
Amith Khandakar, David G. Michelson, Mansura Naznine, Abdus Salam, Md. Nahiduzzaman, Khaled M. Khan, Ponnuthurai Nagaratnam Suganthan, Mohamed Arselene Ayari, Hamid Menouar, Julfikar Haider
JPEG AI Image Compression Visual Artifacts: Detection Methods and Dataset
Daria Tsereh, Mark Mirgaleev, Ivan Molodetskikh, Roman Kazantsev, Dmitriy Vatolin
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
Kayo Yin, Chinmay Singh, Fyodor O. Minakov, Vanessa Milan, Hal Daumé III, Cyril Zhang, Alex X. Lu, Danielle Bragg
Training objective drives the consistency of representational similarity across datasets
Laure Ciernik, Lorenz Linhardt, Marco Morik, Jonas Dippel, Simon Kornblith, Lukas Muttenthaler