Paper ID: 2405.19220

WRDScore: New Metric for Evaluation of Natural Language Generation Models

Ravil Mussabayev

Evaluating natural language generation models, particularly for method name prediction, poses significant challenges. A robust metric must account for the versatility of method naming, considering both semantic and syntactic variations. Traditional overlap-based metrics, such as ROUGE, fail to capture these nuances. Existing embedding-based metrics often suffer from imbalanced precision and recall, lack normalized scores, or make unrealistic assumptions about sequences. To address these limitations, we leverage the theory of optimal transport and construct WRDScore, a novel metric that strikes a balance between simplicity and effectiveness. In the WRDScore framework, we define precision as the maximum degree to which the predicted sequence's tokens are included in the reference sequence, token by token. Recall is calculated as the total cost of the optimal transport plan that maps the reference sequence to the predicted one. Finally, WRDScore is computed as the harmonic mean of precision and recall, balancing these two complementary metrics. Our metric is lightweight, normalized, and precision-recall-oriented, avoiding unrealistic assumptions while aligning well with human judgments. Experiments on a human-curated dataset confirm the superiority of WRDScore over other available text metrics.
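To make the three ingredients named in the abstract concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the paper's reference implementation. The abstract does not fix these details, so the following are assumptions: every token carries equal mass, the transport cost is cosine distance, recall is taken as one minus the optimal transport cost, and precision averages each predicted token's best match in the reference. The function names (`cosine`, `min_cost_transport`, `wrdscore_sketch`) and the toy embedding vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def min_cost_transport(cost, supply, demand):
    """Total cost of an optimal transport plan for integer supplies and demands
    with equal sums, via successive shortest paths (Bellman-Ford) on a flow network."""
    n, m = len(supply), len(demand)
    src, snk, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]   # node -> indices into edges
    edges = []                          # [target, residual capacity, cost]; k ^ 1 is the reverse edge

    def add_edge(u, v, cap, c):
        graph[u].append(len(edges)); edges.append([v, cap, c])
        graph[v].append(len(edges)); edges.append([u, 0, -c])

    for i in range(n):
        add_edge(src, i, supply[i], 0.0)
        for j in range(m):
            add_edge(i, n + j, min(supply[i], demand[j]), cost[i][j])
    for j in range(m):
        add_edge(n + j, snk, demand[j], 0.0)

    total = 0.0
    while True:
        dist = [math.inf] * size
        prev_edge = [-1] * size
        dist[src] = 0.0
        for _ in range(size - 1):       # Bellman-Ford relaxation rounds
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k in graph[u]:
                    v, cap, c = edges[k]
                    if cap > 0 and dist[u] + c < dist[v] - 1e-12:
                        dist[v], prev_edge[v], changed = dist[u] + c, k, True
            if not changed:
                break
        if dist[snk] == math.inf:       # no augmenting path left: all mass is shipped
            return total
        push, v = math.inf, snk         # bottleneck capacity along the shortest path
        while v != src:
            k = prev_edge[v]
            push = min(push, edges[k][1])
            v = edges[k ^ 1][0]
        v = snk
        while v != src:                 # augment flow along the path
            k = prev_edge[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total += push * dist[snk]

def wrdscore_sketch(pred_vecs, ref_vecs):
    """Harmonic mean of a max-match precision and an OT-based recall (assumed forms)."""
    n, m = len(ref_vecs), len(pred_vecs)
    sim = [[cosine(r, p) for p in pred_vecs] for r in ref_vecs]
    # Precision (assumed form): best reference match per predicted token, averaged.
    precision = sum(max(sim[i][j] for i in range(n)) for j in range(m)) / m
    # Recall (assumed form): 1 - optimal transport cost from reference to prediction.
    # Uniform masses 1/n and 1/m are scaled to integers: m units per reference
    # token, n units per predicted token, n * m units in total.
    cost = [[1.0 - s for s in row] for row in sim]
    recall = 1.0 - min_cost_transport(cost, [m] * n, [n] * m) / (n * m)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy, made-up 2-d "embeddings" for a reference name and a shorter prediction.
reference = [[1.0, 0.0], [0.0, 1.0]]
prediction = [[1.0, 0.0]]
print(wrdscore_sketch(prediction, reference))
```

In this toy case the single predicted token matches a reference token exactly, so precision is high, while half of the reference mass has to travel to a dissimilar token, which lowers recall. With non-negative toy vectors both quantities stay in [0, 1]; real embeddings with negative cosine similarities would need a normalization choice that this sketch does not attempt.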

Submitted: May 29, 2024