Data Similarity

Data similarity research focuses on understanding and leveraging relationships between datasets to improve the efficiency and performance of machine learning algorithms, particularly in distributed and federated learning settings. Current work explores methods for quantifying and exploiting data similarity, including approaches based on gradient and Hessian similarity, as well as techniques such as soft deduplication for managing redundant data in large corpora. These advances aim to reduce communication costs, speed up convergence, and make training large models more efficient, with impact on natural language processing and other fields that depend on large-scale data analysis. Research is also addressing the limitations of relying solely on data similarity assumptions by developing algorithms that perform well even under heterogeneous data.
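
As a concrete illustration of one such measure, the sketch below estimates data similarity between two clients in a federated setting via the cosine similarity of their local gradients at a shared model. The model (linear least squares), the helper names (`local_gradient`, `gradient_similarity`), and the synthetic data are hypothetical placeholders for illustration, not taken from any particular paper.

```python
# Hypothetical sketch: gradient-based data similarity between two clients.
# Assumes a shared linear model with mean-squared-error loss; names and
# data are illustrative only.
import numpy as np

def local_gradient(w, X, y):
    """Gradient of (1/(2n)) * ||Xw - y||^2 at weights w for one client."""
    n = len(y)
    return X.T @ (X @ w - y) / n

def gradient_similarity(w, data_a, data_b):
    """Cosine similarity of two clients' local gradients at shared weights w.

    Values near 1 suggest the clients' data induce similar update directions
    (high data similarity); values near 0 or below suggest heterogeneity.
    """
    g_a = local_gradient(w, *data_a)
    g_b = local_gradient(w, *data_b)
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b) + 1e-12
    return float(g_a @ g_b / denom)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)
    # Two clients drawing from a similar distribution: their gradients at a
    # common starting point should roughly align (similarity near 1).
    X1 = rng.normal(size=(100, 5)); y1 = X1 @ w_true + 0.1 * rng.normal(size=100)
    X2 = rng.normal(size=(100, 5)); y2 = X2 @ w_true + 0.1 * rng.normal(size=100)
    w0 = np.zeros(5)
    print(gradient_similarity(w0, (X1, y1), (X2, y2)))
```

In distributed optimization, a measure like this can motivate design choices such as how often clients communicate: when gradient similarity is high, local updates transfer well across clients and communication can be sparser, whereas low similarity signals the heterogeneous-data regime the paragraph above mentions.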

Papers