Love thy neighbor? Measuring immigrant integration in world cities using Twitter data

CDS Data Science Fellow Bruno Gonçalves examines immigrant integration by analyzing over 350 million multilingual tweets from around the globe

Although many cities today are multicultural hotspots, immigrant integration remains an ongoing challenge, largely because successful integration depends on several factors: obtaining an education, finding employment, learning the key languages of the new country, and more. How can we measure the current state of immigrant integration?

Researchers typically use metrics like spatial segregation to assess how integrated — or isolated — immigrant groups are, relative to the wider community. But the rise of social media means that researchers can also analyze the spatial segregation of languages through data from platforms like Twitter.
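To make "spatial segregation" concrete, here is a minimal sketch of one textbook measure, the index of dissimilarity, computed over grid cells of a city. This is an illustration only and not necessarily the metric used in the study; the function name and the cell identifiers are hypothetical.

```python
def dissimilarity_index(group_counts, rest_counts):
    """Index of dissimilarity between a group and the rest of the population.

    Both arguments map a cell id (hypothetical, e.g. a grid square) to a
    head count. Returns a value from 0 (same spatial distribution) to
    1 (the two populations never share a cell).
    """
    total_group = sum(group_counts.values())
    total_rest = sum(rest_counts.values())
    if total_group == 0 or total_rest == 0:
        raise ValueError("both populations must be non-empty")
    cells = set(group_counts) | set(rest_counts)
    return 0.5 * sum(
        abs(group_counts.get(c, 0) / total_group - rest_counts.get(c, 0) / total_rest)
        for c in cells
    )
```

A value near 1 means the group is concentrated in cells where few others live, while a value near 0 means it is spread across the city in the same proportions as everyone else.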

Not only does Twitter data “have the particularity of extending beyond national borders,” explained Gonçalves in his recent co-authored paper in PLOS, but it can also “quantify the spatial integration of immigrant communities” by analyzing the spatio-temporal patterns of different languages in a given geographical location.

With this in mind, the researchers collected over 350 million tweets posted by 14.5 million users between 2010 and 2015 to examine immigrant integration in 53 cities.

After extracting the UserID, geographical coordinates, date, time, and text of every tweet, they applied a series of filters to confirm that each user actually lives in the place where they are tweeting (i.e., they are not just a visitor). These filters included calculating the number of consecutive months of activity for each user and the minimum number of hours each user spent in the geographical area where their tweets originated.
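The residency filters described above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the thresholds (`min_months`, `min_hours`), and the use of distinct active clock hours as a stand-in for "hours spent" are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def longest_month_streak(timestamps):
    """Longest run of consecutive calendar months with at least one tweet."""
    months = sorted({ts.year * 12 + ts.month for ts in timestamps})
    best = run = 0
    prev = None
    for m in months:
        run = run + 1 if prev is not None and m == prev + 1 else 1
        best = max(best, run)
        prev = m
    return best


def distinct_active_hours(timestamps):
    """Count distinct clock hours with a tweet (a crude proxy for time spent)."""
    return len({ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps})


def likely_residents(tweets, min_months=3, min_hours=5):
    """Return user ids that pass both filters.

    `tweets` is an iterable of (user_id, datetime) pairs already restricted
    to one city. The default thresholds are placeholders, not the study's.
    """
    by_user = defaultdict(list)
    for user_id, ts in tweets:
        by_user[user_id].append(ts)
    return {
        user
        for user, stamps in by_user.items()
        if longest_month_streak(stamps) >= min_months
        and distinct_active_hours(stamps) >= min_hours
    }
```

Under these assumptions, a tourist who tweets heavily for one week fails the month-streak test, while someone who posts only occasionally over a long period can still fail the hours test.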

Then, the researchers used CLD2 (Compact Language Detector 2, originally developed for Chromium) to identify the language of each user’s tweets. In addition to accounting for mutually intelligible languages and dialectal varieties, the researchers also labeled the official language of each city that they were examining as the “Local” language.
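The labeling step can be pictured as a small mapping applied to each detected language code. The sketch below is an assumption-laden illustration: the function name is hypothetical, the detector call itself is left out to avoid dependencies, and the example equivalence (Indonesian treated as Malay) is a placeholder rather than a grouping taken from the study.

```python
def label_language(code, local_codes, equivalents=None):
    """Map a detected language code to the label used for analysis.

    code         language code reported by a detector such as CLD2
    local_codes  set of codes treated as official languages of the city
    equivalents  optional dict folding mutually intelligible varieties
                 into one canonical code (hypothetical grouping)
    """
    equivalents = equivalents or {}
    canonical = equivalents.get(code, code)
    local_canonical = {equivalents.get(c, c) for c in local_codes}
    return "Local" if canonical in local_canonical else canonical


# Illustrative only: treat Indonesian ("id") as Malay ("ms").
EXAMPLE_EQUIVALENTS = {"id": "ms"}
```

For a city with more than one official language, both codes go into `local_codes`, so a tweet in either one is counted as “Local”.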

“After defining the Local languages in each city,” the researchers said, “we assign[ed] to each user its most frequent language. In case of bilingual/multilingual users, we set as user’s language the one which differs from English or Local unless there are only two languages in [their] dictionary.”
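One way to read the assignment rule quoted above is sketched below. The wording of the rule is ambiguous, so this is a single interpretation and not the authors' implementation: users with at most two languages get their most frequent one, and users with three or more get their most frequent language that is neither English nor Local. The function name and the `"en"` / `"Local"` labels are assumptions.

```python
from collections import Counter


def assign_user_language(labels, english="en", local="Local"):
    """Pick one language per user from the labels of that user's tweets."""
    counts = Counter(labels)
    if not counts:
        return None
    if len(counts) > 2:
        # Multilingual user: prefer the most used language that is
        # neither English nor the Local language.
        minority = {
            lang: n for lang, n in counts.items() if lang not in (english, local)
        }
        if minority:
            return max(minority, key=minority.get)
    # One or two languages: fall back to the most frequent one.
    return counts.most_common(1)[0][0]
```

Under this reading, a user who tweets mostly in the Local language and English but occasionally in Turkish is counted as a Turkish speaker, which is what lets minority-language communities show up in the data at all.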

The researchers also decided to remove English from their analysis because it’s the world’s lingua franca. “Moreover,” the researchers added, “the role of English is dominant mainly in the worst links in terms of integration.”

After discarding English, their investigation yielded some fascinating results. “Arabic rises as the most common spatially segregated community,” the researchers explained, “followed by French-speaking communities that are spatially concentrated in other European countries such as Germany and Turkey.”


On a more positive note, they point out that London leads in hosting diverse communities, followed by San Francisco, Tokyo, Los Angeles, Manchester, and New York.


The researchers caution, however, that Twitter data is only a partially representative sample of the population because the platform itself carries several biases, from the overrepresentation of young people to the possibility that certain communities, such as Chinese immigrants, may not use Twitter because it is inaccessible in their country of origin. Still, as Gonçalves and his co-authors remind us, “the important question here is not whether we can find all the [immigrant] communities, but whether we are able to say something meaningful about those detected.”

Click here to learn more about this study.

By Cherrie Kwok

Center for Data Science

This is the official research blog of the NYU Center for…

NYU Center for Data Science

Written by

Official account of the Center for Data Science at NYU, home of the Master’s and Ph.D. in Data Science.

Center for Data Science

This is the official research blog of the NYU Center for Data Science (CDS). Established in 2013, we are a leading data science training and research facility, offering an MS in Data Science and, as of 2017, a Ph.D. in Data Science, making NYU one of the nation’s first universities to offer one.

