Benchmark Dataset

Benchmark datasets are curated collections of data used to evaluate algorithms and models under shared conditions across scientific domains. Current research develops datasets for multimodal analysis (e.g., combining image, text, and audio data), for difficult settings such as low-resource languages and complex biological images, and for probing failure modes such as model hallucination and bias. Because competing methods are scored on the same data, these datasets support objective comparisons, expose the limitations of existing methods, and guide progress in machine learning and related fields.

Papers

June 13, 2024

DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
A B M Ashikur Rahman, Saeed Anwar, Muhammad Usman, Ajmal Mian
Benchmark Dataset Model Hallucination Language Model Hallucination Generative Prowess

June 12, 2024

M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari
Neural Machine Translation Benchmark Dataset Modal Translation Document Level Neural Machine Translation Document Translation

June 9, 2024

TTM-RE: Memory-Augmented Document-Level Relation Extraction
Chufan Gao, Xuan Wang, Jimeng Sun
Training Data Benchmark Dataset Document Level Relation Extraction

June 7, 2024

The ULS23 Challenge: a Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography
M. J. J. de Grauw, E. Th. Scholten, E. J. Smit, M. J. C. M. Rutten, M. Prokop, B. van Ginneken, A. Hering
MAESTRO Dataset Challenge Task Benchmark Dataset Computed Tomography Lesion Segmentation Baseline Model 3D Lesion

June 6, 2024

June 4, 2024

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan
Benchmark Dataset Baseline Result Singing Voice Singing Voice Synthesis Deepfake Technique Deepfake Audio Singing Voice Deepfake Detection

May 13, 2024

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, Chris Callison-Burch
Adversarial Attack Real Time Benchmark Dataset Benchmark Datasets Machine Generated Machine Generated Text Robust Evaluation

May 7, 2024

Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method
Peisong He, Leyao Zhu, Jiaxing Li, Shiqi Wang, Haoliang Li
Benchmark Dataset Video Dataset Defect Detection Generated Video Diffusion Based Video Generation

May 3, 2024

Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo
Nakul Rampal, Kaiyu Wang, Matthew Burigana, Lingxiang Hou, Juri Al-Johani, Anna Sackmann, Hanan S. Murayshid, Walaa Abdullah Al-Sumari, Arwa M. Al-Abdulkarim, Nahla Eid Al-Hazmi, Majed O. Al-Awad, Christian Borgs, Jennifer T. Chayes, Omar M. Yaghi
Data Set Benchmark Dataset Single Label Multi Hop Multi Hop Question Answering QA Datasets

May 1, 2024

April 29, 2024

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Visual Question Answering Benchmark Dataset Visual Question

April 24, 2024

ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees
Jakub Adamczyk, Jakub Poziemski, Pawel Siedlecki
Classification Code Benchmark Dataset Chemical Data Cheminformatics Task Honey Bee

April 21, 2024

SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities
Kunato Nishina, Yusuke Matsui
Medical LLM Text to Image Model Benchmark Dataset Vector Graphic Quantitative Evaluation SVG Generation

April 17, 2024

Fast Polypharmacy Side Effect Prediction Using Tensor Factorisation
Oliver Lloyd, Yi Liu, Tom R. Gaunt
Benchmark Dataset Tensor Factorization Cost Model Adverse Drug Event Polypharmacy Side Effect

April 12, 2024

Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian
Aleksa Cvetanović, Predrag Tadić
Fine Tuning Question Answering Transformer Model Yes No Question Benchmark Dataset Monolingual Pre Trained

UrbanSARFloods: Sentinel-1 SLC-Based Benchmark Dataset for Urban and Open-Area Flood Mapping

From Tissue Plane to Organ World: A Benchmark Dataset for Multimodal Biomedical Image Registration using Deep Co-Attention Networks

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis

A Survey on the Real Power of ChatGPT

A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
