Paper ID: 2111.10391

Data Excellence for AI: Why Should You Care

Lora Aroyo, Matthew Lease, Praveen Paritosh, Mike Schaekermann

The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithms rather than the data on which those models operate. If "data is the new oil," we are still missing work on the refineries by which the data itself could be optimized for more effective use.

Submitted: Nov 19, 2021