Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistency, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks for analysis and benchmark creation. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
CHEW: A Dataset of CHanging Events in Wikipedia
Hsuvas Borkakoty, Luis Espinosa-Anke
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
Melanie Walsh, Anna Preus, Maria Antoniak
360 in the Wild: Dataset for Depth Prediction and View Synthesis
Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon
CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation
Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
USDC: A Dataset of User Stance and Dogmatism in Long Conversations
Mounika Marreddy, Subba Reddy Oota, Venkata Charan Chinni, Manish Gupta, Lucie Flek
EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal
Modan Tailleur, Julien Pinquier, Laurent Millot, Corsin Vogel, Mathieu Lagrange
EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
Yeonsu Kwon, Jiho Kim, Gyubok Lee, Seongsu Bae, Daeun Kyung, Wonchul Cha, Tom Pollard, Alistair Johnson, Edward Choi
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein
How to design a dataset compliant with an ML-based system ODD?
Cyril Cappi, Noémie Cohen, Mélanie Ducoffe, Christophe Gabreau, Laurent Gardes, Adrien Gauffriau, Jean-Brice Ginestet, Franck Mamalet, Vincent Mussot, Claire Pagetti, David Vigouroux
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
Yuchen Yang, Yingxuan Duan
Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humor
Veedant Jain, Felipe dos Santos Alves Feitosa, Gabriel Kreiman
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models
Akchay Srivastava, Atif Memon
CU-Net: a U-Net architecture for efficient brain-tumor segmentation on BraTS 2019 dataset
Qimin Zhang, Weiwei Qi, Huili Zheng, Xinyu Shen
Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding
Yidan Sun, Jianfei Yu, Boyang Li
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Naihao Deng