Data Set
Datasets are central to training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistency, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using a range of model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, and in turn how well those models serve scientific research and practical applications.
Papers
A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics
Maximilian Alber, Stephan Tietz, Jonas Dippel, Timo Milbich, Timothée Lesort, Panos Korfiatis, Moritz Krügener, Beatriz Perez Cancer, Neelay Shah, Alexander Möllers, Philipp Seegerer, Alexandra Carpen-Amarie, Kai Standvoss, Gabriel Dernbach, Edwin de Jong, Simon Schallenberg, Andreas Kunft, Helmut Hoffer von Ankershoffen, Gavin Schaeferle, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan
ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction
Léane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, Akiko Aizawa
Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment
Haoyi Xiu, Xin Liu, Taehoon Kim, Kyoung-Sook Kim
SensorQA: A Question Answering Benchmark for Daily-Life Monitoring
Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimunić Rosing, Larry Heck
AutoFish: Dataset and Benchmark for Fine-grained Analysis of Fish
Stefan Hein Bengtson, Daniel Lehotský, Vasiliki Ismiroglou, Niels Madsen, Thomas B. Moeslund, Malte Pedersen
Advancing the Understanding of Fine-Grained 3D Forest Structures using Digital Cousins and Simulation-to-Reality: Methods and Datasets
Jing Liu, Duanchu Wang, Haoran Gong, Chongyu Wang, Jihua Zhu, Di Wang
SafeAug: Safety-Critical Driving Data Augmentation from Naturalistic Datasets
Zhaobin Mo, Yunlong Li, Xuan Di
AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models
Junfeng Jiao, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar
Reading Between the Lines: A dataset and a study on why some texts are tougher than others
Nouran Khallaf, Carlo Eugeni, Serge Sharoff
IUST_PersonReId: A New Domain in Person Re-Identification Datasets
Alireza Sedighi Moghaddam, Fatemeh Anvari, Mohammadjavad Mirshekari Haghighi, Mohammadali Fakhari, Mohammad Reza Mohammadi
Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset
Neil Shah, Shirish Karande, Vineet Gandhi
An Overview and Discussion of the Suitability of Existing Speech Datasets to Train Machine Learning Models for Collective Problem Solving
Gnaneswar Villuri, Alex Doboli
Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset
Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab