Love thy neighbor? Measuring immigrant integration in world cities using Twitter data

CDS Data Science Fellow Bruno Gonçalves examines immigrant integration by analyzing over 350 million multilingual tweets from around the globe

Although many cities today are multicultural hotspots, immigrant integration remains an ongoing challenge, primarily because successful integration depends on several factors: obtaining an education, finding employment, mastering the key languages of the new country, and more. How can we measure the current state of immigrant integration?

Researchers typically use metrics like spatial segregation to assess how integrated — or isolated — immigrant groups are, relative to the wider community. But the rise of social media means that researchers can also analyze the spatial segregation of languages through data from platforms like Twitter.

Not only does Twitter data “have the particularity of extending beyond national borders,” explained Gonçalves in his recent co-authored paper in PLOS, but it can also “quantify the spatial integration of immigrant communities” by analyzing the spatio-temporal patterns of different languages in a given geographical location.

With this in mind, the researchers collected over 350 million tweets posted by 14.5 million users between 2010 and 2015 to examine immigrant integration in 53 cities.

After extracting the UserID, geographical coordinates, date, time, and text of every tweet, they applied a set of filters to confirm that each user actually lives in the place where they are tweeting (i.e., they are not just a visitor). These filters included calculating the number of consecutive months in which each user was active, and the minimum number of hours each user spent in the geographical area where their tweets came from.
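
To make the first of these filters concrete, here is a minimal Python sketch of counting a user's consecutive months of activity. It is an illustration only, not the researchers' code: the function names, the input layout (a dictionary mapping each user to a list of tweet timestamps), and the three-month threshold are all assumptions, since the article does not give the actual cut-offs.

```python
from datetime import datetime


def longest_active_streak(timestamps):
    """Return the longest run of consecutive calendar months containing a tweet."""
    # Convert each timestamp to a running month index (year * 12 + month),
    # so that neighbouring months always differ by exactly 1.
    months = sorted({ts.year * 12 + ts.month for ts in timestamps})
    if not months:
        return 0
    best = current = 1
    for prev, nxt in zip(months, months[1:]):
        # Extend the streak if the next active month is adjacent, else restart.
        current = current + 1 if nxt - prev == 1 else 1
        best = max(best, current)
    return best


def likely_residents(tweets_by_user, min_months=3):
    """Keep users whose longest streak reaches min_months (a hypothetical threshold)."""
    return {
        user
        for user, stamps in tweets_by_user.items()
        if longest_active_streak(stamps) >= min_months
    }


# Hypothetical example data: user "a" tweets in three consecutive months,
# user "b" tweets in two months separated by a gap.
tweets = {
    "a": [datetime(2014, 11, 1), datetime(2014, 12, 5), datetime(2015, 1, 9)],
    "b": [datetime(2014, 1, 1), datetime(2014, 3, 1)],
}
residents = likely_residents(tweets)
```

In this toy example only user "a" passes the filter. A fuller version would also apply the second criterion described above, the minimum number of hours spent in the area, which is left out here because the article does not say how it was computed.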

Then, the researchers used CLD2 (Compact Language Detector 2, from the Chromium project) to identify the language of each user’s tweets. In addition to accounting for mutually intelligible languages and dialectal varieties, the researchers also labeled the official language of each city that they were examining as the “Local” language.

“After defining the Local languages in each city,” the researchers said, “we assign[ed] to each user its most frequent language. In case of bilingual/multilingual users, we set as user’s language the one which differs from English or Local unless there are only two languages in [their] dictionary.”
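
One way to read this assignment rule is sketched below in Python. The quoted wording leaves some room for interpretation, so the sketch makes its assumptions explicit: a user with one or two detected languages simply gets their most frequent one, while a user with three or more gets their most frequent language that is neither English nor a Local language. The function name, the language codes, and the fallback behaviour are hypothetical and not taken from the paper.

```python
from collections import Counter


def assign_user_language(tweet_languages, local_languages, english="en"):
    """Pick one language per user under an assumed reading of the quoted rule."""
    counts = Counter(tweet_languages)
    if not counts:
        return None  # no detected language for this user
    # Languages ordered from most to least frequent.
    ranked = [lang for lang, _ in counts.most_common()]
    # Assumption: with only one or two languages, keep the most frequent one.
    if len(ranked) <= 2:
        return ranked[0]
    # Assumption: otherwise prefer the most frequent language that is
    # neither English nor one of the city's Local languages.
    excluded = set(local_languages) | {english}
    others = [lang for lang in ranked if lang not in excluded]
    # Fall back to the most frequent language if nothing else remains.
    return others[0] if others else ranked[0]


# Hypothetical users in a city whose Local language is Spanish ("es").
bilingual = assign_user_language(["es", "es", "en"], {"es"})
multilingual = assign_user_language(["en", "en", "en", "es", "es", "ar"], {"es"})
```

Here the bilingual user is labeled with Spanish, their most frequent language, while the multilingual user is labeled with Arabic even though most of their tweets are in English, which is the effect the quoted rule appears to aim for.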

The researchers also decided to remove English from their analysis because it’s the world’s lingua franca. “Moreover,” the researchers added, “the role of English is dominant mainly in the worst links in terms of integration.”

After discarding English, their investigation yielded some fascinating results. “Arabic rises as the most common spatially segregated community,” the researchers explained, “followed by French-speaking communities that are spatially concentrated in other European countries such as Germany and Turkey.”

On a more positive note, they point out that London leads in hosting diverse communities, followed by San Francisco, Tokyo, Los Angeles, Manchester, and New York.

The researchers caution, however, that Twitter data is only a partially representative sample of the population because the platform itself carries several biases, from the overrepresentation of young people to the possibility that certain communities, such as Chinese immigrants, may not use Twitter because it is inaccessible in their country of origin. Still, as Gonçalves and his co-authors remind us, “the important question here is not whether we can find all the [immigrant] communities, but whether we are able to say something meaningful about those detected.”

By Cherrie Kwok