Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
Using generative AI to investigate medical imagery models and datasets
Oran Lang, Doron Yaya-Stupp, Ilana Traynis, Heather Cole-Lewis, Chloe R. Bennett, Courtney Lyles, Charles Lau, Christopher Semturs, Dale R. Webster, Greg S. Corrado, Avinatan Hassidim, Yossi Matias, Yun Liu, Naama Hammel, Boris Babenko
Sea Ice Extraction via Remote Sensed Imagery: Algorithms, Datasets, Applications and Challenges
Anzhu Yu, Wenjun Huang, Qing Xu, Qun Sun, Wenyue Guo, Song Ji, Bowei Wen, Chunping Qiu
Datasets for Portuguese Legal Semantic Textual Similarity: Comparing weak supervision and an annotation process approaches
Daniel da Silva Junior, Paulo Roberto dos S. Corval, Aline Paes, Daniel de Oliveira
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
PubChemQC B3LYP/6-31G*//PM6 dataset: the Electronic Structures of 86 Million Molecules using B3LYP/6-31G* calculations
Maho Nakata, Toshiyuki Maeda
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Gunhee Kim, Jung-Woo Ha
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi
Exploiting Large Neuroimaging Datasets to Create Connectome-Constrained Approaches for more Robust, Efficient, and Adaptable Artificial Intelligence
Erik C. Johnson, Brian S. Robinson, Gautam K. Vallabha, Justin Joyce, Jordan K. Matelsky, Raphael Norman-Tenazas, Isaac Western, Marisel Villafañe-Delgado, Martha Cervantes, Michael S. Robinette, Arun V. Reddy, Lindsey Kitchell, Patricia K. Rivlin, Elizabeth P. Reilly, Nathan Drenkow, Matthew J. Roos, I-Jeng Wang, Brock A. Wester, William R. Gray-Roncal, Joan A. Hoffmann
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
Vijay Viswanathan, Luyu Gao, Tongshuang Wu, Pengfei Liu, Graham Neubig
Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese
Rodrigo Wilkens, Leonardo Zilio, Aline Villavicencio
IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions
Wenhao Yu, Meng Jiang, Peter Clark, Ashish Sabharwal
S\={a}mayik: A Benchmark and Dataset for English-Sanskrit Translation
Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Atul Kumar Singh, Ganesh Ramakrishnan, G. Anil Kumar, Jitin Singla
Human Body Pose Estimation for Gait Identification: A Comprehensive Survey of Datasets and Models
Luke K. Topham, Wasiq Khan, Dhiya Al-Jumeily, Abir Hussain
TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation
Yiming Ai, Zhiwei He, Kai Yu, Rui Wang