Paper ID: 2306.02411

Topological data quality via 0-dimensional persistence matching

Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz

Data quality is crucial for the successful training, generalization and performance of artificial intelligence models. We propose to measure data quality for supervised learning using topological data analysis techniques. Specifically, we provide a novel topological invariant based on persistence matchings induced by inclusions and using $0$-dimensional persistent homology. We show that such an invariant is stable. We provide an algorithm and relate it to images, kernels, and cokernels of the induced morphisms. Also, we show that the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. This approach enables us to explain why the chosen dataset will lead to poor performance.

Submitted: Jun 4, 2023