Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
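To make the Item Response Theory mention concrete: in IRT-based benchmark evaluation, each test item gets a discrimination and a difficulty parameter, and a model's latent ability is estimated from its pattern of correct and incorrect answers rather than from raw accuracy alone. The sketch below is purely illustrative (the function names, item parameters, and grid-search estimator are not from any of the listed papers); it implements the standard two-parameter logistic (2PL) model with a simple maximum-likelihood ability estimate.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate via a coarse grid search.
    responses: list of 0/1 outcomes; items: list of (a, b) pairs."""
    if grid is None:
        grid = [i / 100.0 for i in range(-400, 401)]  # theta in [-4, 4]

    def log_lik(theta):
        ll = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll

    return max(grid, key=log_lik)

# Hypothetical benchmark: four items of increasing difficulty b,
# each with its own discrimination a.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 2.0)]
responses = [1, 1, 1, 0]  # correct on the three easier items only
theta_hat = estimate_ability(responses, items)
```

Two models with identical accuracy can receive different ability estimates under this scheme, because solving highly discriminative or difficult items carries more evidential weight than solving easy ones; this is one reason the surveyed work favors such metrics over plain accuracy.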
Papers
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction Methods
Shuming Liang, Yu Ding, Zhidong Li, Bin Liang, Siqi Zhang, Yang Wang, Fang Chen
Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction
Christoph Brosch, Alexander Bouwens, Sebastian Bast, Swen Haab, Rolf Krieger
Scaling up the Evaluation of Collaborative Problem Solving: Promises and Challenges of Coding Chat Data with ChatGPT
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi, Lei Liu, Michael Flor
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes
Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Guoqing Zhang, Keita Fukuyama, Kazumasa Kishimoto, Tomohiro Kuroda
An Axiomatic Study of the Evaluation of Enthymeme Decoding in Weighted Structured Argumentation
Jonathan Ben-Naim, Victor David, Anthony Hunter
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
Measuring short-form factuality in large language models
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinson's disease
Peter Drotár, Jiří Mekyska, Irena Rektorová, Lucia Masarová, Zdeněk Smékal, Marcos Faundez-Zanuy
Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models
Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia