Data Set
Datasets are the basis for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research focuses on building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need for data that realistically represents real-world scenarios. This work includes developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using a range of model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
PASSION for Dermatology: Bridging the Diversity Gap with Pigmented Skin Images from Sub-Saharan Africa
Philippe Gottfrois, Fabian Gröger, Faly Herizo Andriambololoniaina, Ludovic Amruthalingam, Alvaro Gonzalez-Jimenez, Christophe Hsu, Agnes Kessy, Simone Lionetti, Daudi Mavura, Wingston Ng'ambi, Dingase Faith Ngongonda, Marc Pouly, Mendrika Fifaliana Rakotoarisaona, Fahafahantsoa Rapelanoro Rabenja, Ibrahima Traoré, Alexander A. Navarini
Intellectual Property Protection for Deep Learning Model and Dataset Intelligence
Yongqi Jiang, Yansong Gao, Chunyi Zhou, Hongsheng Hu, Anmin Fu, Willy Susilo
UEVAVD: A Dataset for Developing UAV's Eye View Active Object Detection
Xinhua Jiang, Tianpeng Liu, Li Liu, Zhen Liu, Yongxiang Liu
VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation
Haochen Zhang, Nader Zantout, Pujith Kachana, Zongyuan Wu, Ji Zhang, Wenshan Wang
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Pengjun Xie, Philip S. Yu, Fei Huang, Jingren Zhou
CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research
Sian-Yao Huang, Cheng-Lin Yang, Che-Yu Lin, Chun-Ying Huang
MIC: Medical Image Classification Using Chest X-ray (COVID-19 and Pneumonia) Dataset with the Help of CNN and Customized CNN
Nafiz Fahad, Fariha Jahan, Md Kishor Morol, Rasel Ahmed, Md. Abdullah-Al-Jubair
MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets
Nassim Oufattole, Teya Bergamaschi, Aleksia Kolo, Hyewon Jeong, Hanna Gaggin, Collin M. Stultz, Matthew B.A. McDermott
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking
Pranav Singh Chib, Pravendra Singh
AIDOVECL: AI-generated Dataset of Outpainted Vehicles for Eye-level Classification and Localization
Amir Kazemi, Qurat ul ain Fatima, Volodymyr Kindratenko, Christopher Tessum
GS-Blur: A 3D Scene-Based Dataset for Realistic Image Deblurring
Dongwoo Lee, Joonkyu Park, Kyoung Mu Lee
How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces?
Weiguo Gao, Ming Li
Emory Knee Radiograph (MRKR) Dataset
Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi
Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Jordan Meyer, Nick Padgett, Cullen Miller, Laura Exline
High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented Transformer
Mingxian Li, Hao Sun, Yingtie Lei, Xiaofeng Zhang, Yihang Dong, Yilin Zhou, Zimeng Li, Xuhang Chen
Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets
Andoni Cortés, Clemente Rodríguez, Gorka Velez, Javier Barandiarán, Marcos Nieto