Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports
Luoyao Chen, Revant Teotia, Antonio Verdone, Aidan Cardall, Lakshay Tyagi, Yiqiu Shen, Sumit Chopra
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian Küch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Köhler, Johannes Leveling
ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments
Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Payal Arvind Kasat, Somak Aditya, Pawan Goyal
POLIPHONE: A Dataset for Smartphone Model Identification from Audio Recordings
Davide Salvi, Daniele Ugo Leonzio, Antonio Giganti, Claudio Eutizi, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Nguyen Van Dinh, Thanh Chi Dang, Luan Thanh Nguyen, Kiet Van Nguyen
CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds
Min-Hsuan Yeh, Ruyuan Wan, Ting-Hao 'Kenneth' Huang
CoCoHD: Congress Committee Hearing Dataset
Arnav Hiray, Yunsong Liu, Mingxiao Song, Agam Shah, Sudheer Chava
Multilingual Topic Classification in X: Dataset and Analysis
Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Jose Camacho-Collados
Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset
Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, Frederic Font
RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations
Kaichen Zhou, Yang Cao, Taewhan Kim, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
Xiao Wang, Fuling Wang, Yuehang Li, Qingchuan Ma, Shiao Wang, Bo Jiang, Chuanfu Li, Jin Tang