How Do Word Meanings Change Significantly Over Time and in Context? Data Science Has an Answer

A Bayes by backprop approach for conditional word embedding

Linguists, social scientists, and natural language processing (NLP) researchers want to understand whether and how the meanings of words vary across textual and historical contexts. In NLP research, “word embedding” refers to the mapping of word meanings to vectors of real numbers, so that the similarity between two words’ meanings can be measured as the similarity between their vectors. This allows researchers to determine whether the meaning of a word like “awesome” has shifted over time in a statistically significant way. Data scientists have already developed word embedding techniques to analyze language evolution, but most models split training sets according to decades or to marked shifts in the meaning of a particular word.

In a new paper, Rujun Han, a CDS alum now at the University of Southern California; Michael Gill, a former CDS Faculty Fellow now at Facebook; Arthur Spirling, Associate Professor of Politics and Data Science; and Kyunghyun Cho, Assistant Professor of Computer Science and Data Science, propose a novel method that leverages document metadata to comprehensively model how a word’s meaning changes over time and in relation to similar terms. They focus on the problem of how to reason about similarities between word meaning vectors in a statistical way. The researchers’ new approach allows for testing hypotheses about the meanings of terms, determining whether one term is near or far from another, and assessing the statistical significance of one word’s meaning relative to another’s.

By using a probabilistic deep learning method called Bayes by backprop, the researchers estimate the uncertainty of each word’s meaning vector. The method conditions on covariates (document metadata, such as the time period of a speech) and on surrounding context words to learn a probability distribution over each word’s vector. This allows for more substantial analysis of relations between vectors, including whether one word’s meaning differs significantly from another’s, or whether a word’s own meaning has shifted significantly over time.
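The core idea of Bayes by backprop can be illustrated with a minimal sketch: instead of a single fixed embedding, each word gets a Gaussian posterior, parameterized by a mean and a log standard deviation, from which vectors are sampled via the reparameterization trick. The names and values below are illustrative assumptions, not the paper’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters for one word's embedding.
# In practice both are learned by gradient descent on an ELBO objective.
dim = 4
mu = rng.normal(size=dim)        # posterior mean of the embedding vector
log_sigma = np.full(dim, -1.0)   # posterior log std dev (the "uncertainty")

def sample_embedding(mu, log_sigma, rng):
    """Draw one embedding via the reparameterization trick:
    w = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * eps

# Sampling many embeddings turns "a word's vector" into a distribution,
# which is what makes statistical statements about meaning possible.
samples = np.stack([sample_embedding(mu, log_sigma, rng) for _ in range(1000)])
print(samples.mean(axis=0))  # close to mu
print(samples.std(axis=0))   # close to exp(log_sigma)
```

Because every embedding is a distribution rather than a point, any quantity derived from it (such as the similarity between two words) inherits an uncertainty estimate.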

For their training set, Spirling and collaborators used U.K. Parliament speech records from 1935–2012. For each word, they considered the six surrounding words as context. To demonstrate how covariates affected results, the researchers compared the meanings of “sterling” and “pound” with respect to “currency.” They found that “pound” developed a much closer relationship with “currency” than “sterling” did around 1970, which coincides with the U.K.’s abandonment of the sterling area. This analysis quantifies how contemporaneous politics affected the meanings of specific words.
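A comparison like the “pound” versus “sterling” one can be sketched as follows: sample vectors from each word’s posterior, compute cosine similarities to “currency,” and check whether the interval for the difference excludes zero. The embeddings below are synthetic stand-ins chosen for illustration; real values would come from the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative (synthetic) posterior means for three words in one time
# slice; a shared std dev keeps the sketch simple.
mu = {
    "currency": np.array([1.0, 0.2, 0.0]),
    "pound":    np.array([0.9, 0.3, 0.1]),
    "sterling": np.array([0.2, 1.0, 0.4]),
}
sigma = 0.05

def similarity_samples(a, b, n=2000):
    """Propagate embedding uncertainty into a distribution over
    cosine similarities by sampling both posteriors."""
    sa = mu[a] + sigma * rng.normal(size=(n, 3))
    sb = mu[b] + sigma * rng.normal(size=(n, 3))
    return np.array([cosine(u, v) for u, v in zip(sa, sb)])

pound = similarity_samples("pound", "currency")
sterl = similarity_samples("sterling", "currency")
diff = pound - sterl
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% interval for sim(pound) - sim(sterling): [{lo:.2f}, {hi:.2f}]")
```

If the interval lies entirely above zero, the claim that “pound” is closer to “currency” than “sterling” is statistically significant in this sense, which is the style of hypothesis test the paper enables.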

In the future, the researchers “believe the proposed approach will serve as a more rigorous tool in social science and other domains.”

By Paul Oliver