Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
Cross-Target Stance Detection: A Survey of Techniques, Datasets, and Challenges
Parisa Jamadi Khiabani, Arkaitz Zubiaga
ALPEC: A Comprehensive Evaluation Framework and Dataset for Machine Learning-Based Arousal Detection in Clinical Practice
Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke, Philipp Walter, Gjergji Kasneci
A quest through interconnected datasets: lessons from highly-cited ICASSP papers
Cynthia C. S. Liem, Doğa Taşcılar, Andrew M. Demetriou
Robust estimation of the intrinsic dimension of data sets with quantum cognition machine learning
Luca Candelori, Alexander G. Abanov, Jeffrey Berger, Cameron J. Hogan, Vahagn Kirakosyan, Kharen Musaelian, Ryan Samson, James E. T. Smith, Dario Villani, Martin T. Wells, Mengjia Xu
PoTATO: A Dataset for Analyzing Polarimetric Traces of Afloat Trash Objects
Luis Felipe Wolf Batista (UL), Salim Khazem, Mehran Adibi, Seth Hutchinson, Cedric Pradalier
Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech
Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn Ladewig, Rus Heywood, Jordan R. Green
E2MoCase: A Dataset for Emotional, Event and Moral Observations in News Articles on High-impact Legal Cases
Candida M. Greco, Lorenzo Zangari, Davide Picca, Andrea Tagarelli
L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context
Pritika Rohera, Chaitrali Ginimav, Akanksha Salunke, Gayatri Sawant, Raviraj Joshi
Explaining Datasets in Words: Statistical Models with Natural Language Parameters
Ruiqi Zhong, Heng Wang, Dan Klein, Jacob Steinhardt