Paper ID: 2410.09416 • Published Oct 12, 2024
Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
This study evaluates the capability of Vision-Language Models (VLMs) in image
data annotation by comparing their performance on the CelebA dataset in terms
of quality and cost-effectiveness against manual annotation. Annotations from
the state-of-the-art LLaVA-NeXT model on 1000 CelebA images are in 79.5%
agreement with the original human annotations. Incorporating re-annotations of
disagreed cases into a majority vote boosts AI annotation consistency to 89.1%
and even higher for more objective labels. Cost assessments demonstrate that AI
annotation significantly reduces expenditures compared to traditional manual
methods -- representing less than 1% of the costs for manual annotation in the
CelebA dataset. These findings support the potential of VLMs as a viable,
cost-effective alternative for specific annotation tasks, reducing both
financial burden and ethical concerns associated with large-scale manual data
annotation. The AI annotations and re-annotations utilized in this study are
available on this https URL
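The abstract's consolidation step (re-annotating disagreed cases and taking a majority vote over passes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the binary attribute encoding, and the dictionary-based annotation format are all assumptions made for the example.

```python
from collections import Counter

def majority_vote(labels):
    # Most common label across annotation passes (ties resolve to
    # the first label encountered, per Counter.most_common ordering).
    return Counter(labels).most_common(1)[0][0]

def consolidate(initial, reannotations):
    """Combine a first-pass VLM annotation with re-annotations.

    `initial` maps attribute -> binary label from the first VLM pass;
    `reannotations` maps attribute -> list of labels from repeated
    passes on disagreed cases (hypothetical format, for illustration).
    """
    final = dict(initial)
    for attr, votes in reannotations.items():
        final[attr] = majority_vote([initial[attr], *votes])
    return final

def agreement_rate(ai_labels, human_labels):
    # Fraction of attributes where AI and human annotations match,
    # analogous to the agreement percentages reported in the abstract.
    matches = sum(ai_labels[a] == human_labels[a] for a in human_labels)
    return matches / len(human_labels)
```

For example, if the first pass labels "Eyeglasses" as 1 but two re-annotation passes both return 0, the majority vote flips the final label to 0, which is how repeated passes can raise agreement with the original human annotations.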