Paper ID: 2211.11041
Pragmatic Constraint on Distributional Semantics
Elizaveta Zhemchuzhina, Nikolai Filippov, Ivan P. Yamshchikov
This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in their frequency and in their semantics. Namely, tokens that correspond one-to-one with a single semantic concept have statistical properties different from those of semantically ambiguous tokens. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
Submitted: Nov 20, 2022
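The abstract's first claim — that token frequencies follow Zipf's law, i.e. the frequency of the r-th most common token is roughly proportional to 1/r^s — can be checked empirically. The sketch below is illustrative and not from the paper: it uses naive whitespace tokenization (the paper argues the effect is robust to the tokenizer choice) and estimates the Zipf exponent s by a least-squares fit in log-log space.

```python
from collections import Counter
import math

def zipf_exponent(text):
    """Estimate the Zipf exponent s from token frequencies in `text`.

    Zipf's law posits freq(rank r) ~ C / r**s, so log-frequency is
    approximately linear in log-rank; s is minus the fitted slope.
    """
    # Naive whitespace tokenization (illustrative assumption; any
    # tokenizer would do for checking the qualitative shape).
    counts = Counter(text.split())
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Ordinary least-squares slope of log-freq vs. log-rank.
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic Zipfian text: token "w<i>" appears about 100/i times,
# so the fitted exponent should come out close to 1.
sample = " ".join(f"w{i}" for i in range(1, 21) for _ in range(100 // i))
s = zipf_exponent(sample)
```

On natural-language corpora the estimated s typically lands near 1, which is the classical Zipf regime the paper takes as its starting point.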