Paper ID: 2312.05287
Human-in-the-Loop Estimation of Cluster Count in Datasets via Similarity-Driven Nested Importance Sampling
Gustavo Perez, Daniel Sheldon, Grant Van Horn, Subhransu Maji
Identifying the number of clusters serves as a preliminary goal for many data analysis tasks. A common approach to this problem is to vary the number of clusters in a clustering algorithm (e.g., 'k' in $k$-means) and pick the value that best explains the data. However, the count estimates can be unreliable, especially when the image similarity is poor. Human feedback on the pairwise similarity can be used to improve the clustering, but existing approaches do not guarantee accurate count estimates. We propose an approach to estimate the number of clusters in a large dataset given an approximate pairwise similarity. Our framework samples edges guided by the pairwise similarity, and we collect human feedback on the sampled pairs to construct a statistical estimate of the cluster count. On the technical front, we develop a nested importance sampling approach that yields (asymptotically) unbiased estimates of the cluster count together with confidence intervals that can guide human effort. Compared to naive sampling, our similarity-driven sampling produces more accurate count estimates and tighter confidence intervals. We evaluate our method on a benchmark of six fine-grained image classification datasets, achieving low error rates on the estimated number of clusters with significantly less human labeling effort than baselines and alternative active clustering approaches.
Submitted: Dec 8, 2023
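
To make the sampling idea in the abstract concrete, the sketch below shows one plausible way a similarity-driven nested estimator of the cluster count can be assembled, using the identity $C = \sum_i 1/|\mathrm{cluster}(i)|$ and importance-weighted "same cluster?" queries. The function name `estimate_cluster_count`, the similarity-proportional proposal with floor `eps`, the sample sizes, and the plug-in reciprocal step are illustrative assumptions, not the paper's exact estimator.

```python
# A minimal sketch (not the authors' released code) of a similarity-driven
# nested estimator for the number of clusters. It relies on the identity
#     C = sum_i 1 / |cluster(i)|
# and estimates each anchor's cluster size from importance-weighted
# "same cluster?" queries, answered here by an oracle standing in for
# human feedback. The proposal distribution, sample sizes, and the
# plug-in 1/size step are illustrative assumptions.
import numpy as np


def estimate_cluster_count(similarity, same_cluster, n_anchors=50,
                           n_queries_per_anchor=30, eps=1e-3, seed=None):
    """similarity: (N, N) array of approximate pairwise similarities in [0, 1].
    same_cluster(i, j) -> bool: human (or oracle) answer for the pair (i, j).
    Returns a point estimate of the cluster count and a rough 95% interval."""
    rng = np.random.default_rng(seed)
    N = similarity.shape[0]
    anchors = rng.choice(N, size=n_anchors, replace=True)  # outer, uniform sample
    per_anchor = []
    for i in anchors:
        # Inner importance sample: pick comparison partners j != i with
        # probability proportional to similarity plus a small floor eps,
        # so every item keeps nonzero mass and the weights stay bounded.
        q = similarity[i].astype(float) + eps
        q[i] = 0.0
        q /= q.sum()
        partners = rng.choice(N, size=n_queries_per_anchor, p=q)
        # Horvitz-Thompson-style estimate of the anchor's cluster size;
        # the leading 1 counts the anchor itself.
        weights = [float(same_cluster(i, j)) / q[j] for j in partners]
        size_hat = 1.0 + np.mean(weights)
        # Plug-in reciprocal: biased for finite inner samples, but the bias
        # vanishes as n_queries_per_anchor grows (asymptotically unbiased).
        per_anchor.append(N / size_hat)
    vals = np.asarray(per_anchor)
    est = float(vals.mean())
    se = float(vals.std(ddof=1) / np.sqrt(len(vals)))
    return est, (est - 1.96 * se, est + 1.96 * se)  # normal-approximation CI


# Toy usage: ground-truth labels answer the pairwise queries, and a noisy
# label-agreement matrix plays the role of the approximate similarity.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 20, size=500)                  # 20 true clusters
    sim = (labels[:, None] == labels[None, :]).astype(float)
    sim = np.clip(sim + rng.normal(0.0, 0.2, sim.shape), 0.0, 1.0)
    est, ci = estimate_cluster_count(sim, lambda i, j: labels[i] == labels[j], seed=1)
    print(f"estimated clusters: {est:.1f}, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Intuitively, the similarity-guided proposal concentrates queries on likely same-cluster pairs, which is what would tighten the intervals relative to uniform (naive) pair sampling in the abstract's comparison.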