Available Datasets
Available datasets are crucial for training and evaluating machine learning models, particularly in computer vision, natural language processing, and other data-driven fields. Current research emphasizes addressing bias, protecting privacy, and improving the quality and diversity of datasets across domains including healthcare, agriculture, and environmental monitoring. This work often involves developing new algorithms and model architectures, such as those based on transformers and variational autoencoders, to make better use of existing data and to generate synthetic datasets. High-quality, ethically sourced datasets are essential for advancing machine learning research and its practical applications, supporting reproducibility, and mitigating potential societal harms.
Papers
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian
Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets
Christof Tinnes, Alisa Welter, Sven Apel