Corpus Similarity Measure
Corpus similarity measures quantify how close two text corpora are, both semantically and in the distribution of their words, so that researchers can assess how comparable and generalizable linguistic analyses are across datasets. Recent work focuses on making these measures more robust and reliable across diverse languages and resource levels, investigates how data cleaning techniques affect them, and develops automatic evaluation methods for comparing different metrics. These advances matter for the validity of corpus-based research, particularly in low-resource settings, and they support cross-lingual and cross-domain comparisons in natural language processing applications.
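As a rough illustration of what a distributional measure of this kind can look like, the sketch below computes Spearman's rank correlation over the word frequencies of two corpora. This is one well-known frequency-based approach, not necessarily the method used in the papers listed here. The function names (`tokenize`, `corpus_similarity`), the simple regex tokenizer, and the `top_n` cutoff are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens are good enough for a sketch.
    return re.findall(r"\w+", text.lower())


def average_ranks(values):
    # Rank values from smallest to largest, giving tied values their mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    # Plain Pearson correlation; returns 0.0 when either side is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(corpus_a, corpus_b, top_n=500):
    # Spearman's rho = Pearson correlation of the frequency ranks,
    # computed over the top_n most frequent words of the combined corpora.
    freq_a = Counter(tok for doc in corpus_a for tok in tokenize(doc))
    freq_b = Counter(tok for doc in corpus_b for tok in tokenize(doc))
    vocab = [word for word, _ in (freq_a + freq_b).most_common(top_n)]
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words to compare")
    ranks_a = average_ranks([freq_a[w] for w in vocab])
    ranks_b = average_ranks([freq_b[w] for w in vocab])
    return pearson(ranks_a, ranks_b)


pets = ["the cat sat on the mat", "the dog sat"]
finance = ["stocks fell sharply", "markets fell"]
same_score = corpus_similarity(pets, pets)
cross_score = corpus_similarity(pets, finance)
```

Scores fall between -1 and 1, and a corpus compared with itself scores highest. Real corpora would need far more text than these toy lists, and the tokenizer and vocabulary cutoff would have to be chosen per language, which is part of why robustness across languages and resource levels is an open research question.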
Papers
October 19, 2024
March 13, 2024
October 23, 2023
November 29, 2022
June 9, 2022