Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

The spread of neologisms follows pre-existing cultural boundaries, say researchers

A dialect is a particular form of language that is limited to a specific location or population group. Linguists are fascinated by these variations because they are determined both by geography and by demographics. So studying them can produce important insights into the nature of society and how different groups within it interact.

That’s why linguists are keen to understand how new words, abbreviations and usages spread on new forms of electronic communication, such as social media platforms. It is easy to imagine that the rapid spread of neologisms could one day lead to a single unified dialect of netspeak. An interesting question is whether there is any evidence that this is actually happening.

Today, we get a fascinating insight into this problem thanks to the work of Jacob Eisenstein at the Georgia Institute of Technology in Atlanta and a few pals. These guys have measured the spread of neologisms on Twitter and say they have clear evidence that online language is not converging at all. Indeed, they say that electronic dialects are just as common as ordinary ones and seem to reflect the same fault lines in society.

Eisenstein and co begin with a sample from the Twitter stream of 107 million geo-located messages from more than 2.7 million different user accounts in the US. They removed all the advertising and marketing messages from this dataset and then associated each of the remaining users with one of the 200 largest metropolitan areas in the US.

They then listed the words most frequently mentioned and focused on the 2600 whose frequency changed significantly between 2009 and 2012. Since each appearance of a word is geo-located, this allowed the team to see how the change in usage varied from one metropolitan area to another.

The results provide a fascinating insight into the evolution of electronic language. For example, the abbreviation ikr, meaning “I know, right?”, occurs six times more frequently in the Detroit area than in the US overall; the phonetic spelling suttin, meaning “something”, occurs five times more frequently in New York City; and the emoticon ^-^, meaning nervous or shy and of Korean origin, is four times more common in Southern California.
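A comparison such as “six times more frequently in the Detroit area than in the US overall” can be read as a ratio of relative frequencies. The following is a minimal sketch of that arithmetic, not the method used in the paper: the function name regional_ratio and the toy token counts are invented purely for illustration.

```python
from collections import Counter


def regional_ratio(region_counts: Counter, national_counts: Counter, term: str) -> float:
    """Return how many times more frequent `term` is in a region than nationally.

    Both arguments map tokens to raw counts. The result is
    (region share of term) / (national share of term).
    """
    region_total = sum(region_counts.values())
    national_total = sum(national_counts.values())
    if region_total == 0 or national_counts[term] == 0:
        raise ValueError("need a non-empty region and a term seen nationally")
    # Cross-multiply the two shares so the division happens only once.
    return (region_counts[term] * national_total) / (region_total * national_counts[term])


# Toy counts, invented for illustration only (not data from the study).
detroit = Counter({"ikr": 6, "the": 994})        # 6 in 1,000 tokens
national = Counter({"ikr": 10, "the": 9990})     # 10 in 10,000 tokens

ratio = regional_ratio(detroit, national, "ikr")
print(f"ikr is {ratio:.1f}x as frequent in the region as nationally")
```

With these made-up counts the term accounts for 0.6 percent of regional tokens and 0.1 percent of national tokens, so the ratio comes out at six.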

By measuring how often these terms appear each week, Eisenstein and co can show how the usage changes over time across the country. For example, the word ion, which is short for “I don’t”, as in “ion even care”, became increasingly popular between 2009 and 2012 but was largely confined to the southeast US.
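The weekly measurement described above can be pictured as grouping messages by week and region and asking what share of them contain a given term. The sketch below assumes a simple (date, region, text) record and a plain share-of-messages measure; both are illustrative assumptions, as are the function name weekly_share and the sample messages, and the paper's own statistical treatment is more involved.

```python
from collections import defaultdict
from datetime import date


def weekly_share(messages, term):
    """Share of messages containing `term`, per (ISO year, ISO week, region).

    `messages` is an iterable of (date, region, text) tuples; this record
    layout is an assumption made for the example.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, region, text in messages:
        iso = day.isocalendar()          # (ISO year, ISO week, ISO weekday)
        key = (iso[0], iso[1], region)
        totals[key] += 1
        # Whole-token match on whitespace-split, lower-cased text.
        if term in text.lower().split():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Invented sample messages, all falling in ISO week 9 of 2010.
sample = [
    (date(2010, 3, 1), "Atlanta", "ion even care"),
    (date(2010, 3, 2), "Atlanta", "good morning"),
    (date(2010, 3, 3), "Chicago", "good morning"),
]

shares = weekly_share(sample, "ion")
for key in sorted(shares):
    print(key, shares[key])
```

Plotting these shares week by week for each region would give the kind of picture the article describes, with a term rising in one part of the country while staying flat elsewhere.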

At the beginning of the study, the abbreviation ctfu, which stands for cracking the fuck up or laughing, appeared mainly in the Cleveland area but by 2012 was being used in Pennsylvania and the mid-Atlantic. However, ctfu is rare in the large cities to the west of Cleveland, such as Detroit and Chicago.

An important question for linguists is why these terms might become popular in some areas but not others. In other words, what is the link between places that use neologisms in a similar way?

Eisenstein and co say one factor seems to be geography: places that are closer together are more likely to share the same word usage. That makes sense because people communicate more often with others who are nearby.

But the team also say that new words tend to be shared between metropolitan areas that have a similar racial mix. In fact, the proportion of African-Americans is the strongest predictor of similar usage. “Examples of linguistically linked city pairs that are geographically distant but demographically similar include Washington D.C. and New Orleans (high proportions of African-Americans), Los Angeles and Miami (high proportions of Hispanics), and Boston and Seattle (relatively few minorities, compared with other large cities),” say Eisenstein and pals. “Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.”

That’s interesting work that provides an important insight into the way language evolves. “By tracking the popularity of words over time and space, we can harness large-scale data to uncover the hidden structure of language change,” say Eisenstein and co.

It also opens the way to more complex and detailed computational studies of language evolution. For example, Eisenstein and co say that geo-located tweets may allow them in future to study linguistic diversity within metropolitan areas. And in addition to changes in word frequency, they hope to study other forms of language change such as orthography (rules of spelling), syntax (word order) and pragmatics (the way context contributes to meaning).

Clearly, interesting times lie ahead for linguists with a computational bent.

Ref: arxiv.org/abs/1210.5268 : Diffusion of Lexical Change in Social Media



The Physics arXiv Blog

An alternative view of the best new ideas in science. About: http://tinyurl.com/p6ypk56
