Paper ID: 2405.19461
Clustering-Based Validation Splits for Model Selection under Domain Shift
Andrea Napoli, Paul White
This paper considers the problem of model selection under domain shift. Motivated by principles from distributionally robust optimisation (DRO) and domain adaptation theory, it proposes that the training-validation split should maximise the distribution mismatch between the two sets. With the maximum mean discrepancy (MMD) adopted as the measure of mismatch, the partitioning problem is shown to reduce to kernel k-means clustering. A constrained clustering algorithm is presented, which leverages linear programming to control the size, label, and (optionally) group distributions of the splits. The algorithm does not require additional metadata and comes with convergence guarantees. In experiments, the technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation (DG) and unsupervised domain adaptation (UDA) tasks. Analysis also shows the MMD between the training and validation sets to be strongly rank-correlated ($\rho=0.63$) with test domain accuracy, further substantiating the validity of this approach.
Submitted: May 29, 2024
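
The quantity the abstract proposes to maximise, the MMD between the training and validation sets, can be illustrated with a minimal sketch. This is not the authors' constrained clustering algorithm: it only computes a biased estimate of the squared MMD and compares a random split with a cluster-aligned split on synthetic data. The RBF kernel, the bandwidth `gamma`, the two-blob toy data, and the function names `rbf_kernel` and `mmd2` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) (kernel choice is an assumption)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(x, y, gamma=0.5):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


# Toy data (assumed): two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=-2.0, scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
data = np.vstack([blob_a, blob_b])

# Random split: each half mixes both blobs, so the two sets look alike.
perm = rng.permutation(len(data))
random_train, random_val = data[perm[:100]], data[perm[100:]]
mmd_random = mmd2(random_train, random_val)

# Cluster-aligned split: one blob per set, so the mismatch is large.
mmd_cluster = mmd2(blob_a, blob_b)

print(f"squared MMD, random split:  {mmd_random:.4f}")
print(f"squared MMD, cluster split: {mmd_cluster:.4f}")
```

On data like this, the cluster-aligned split yields a much larger MMD than the random split, which is the kind of mismatch the paper argues a validation split should exhibit. The paper's method additionally constrains the size, label, and group distributions of the splits, which this sketch does not attempt.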