Using “Big Data” to forecast migration
A tale of high expectations, promising results and a long road ahead
by Jasper Tjaden, Andres Arau, Muertizha Nuermaimaiti, Imge Cetin, Eduardo Acostamadiedo, Marzia Rango.
Act 1 — High Expectations
“Data is the new oil,” they say. ‘Big Data’ is even bigger than that. The “data revolution” will contribute to solving societies’ problems and help governments adopt better policies and run more effective programs. In the migration field, digital trace data are seen as a potentially powerful tool to improve migration management processes (visa applications, asylum decisions, geographic allocation of asylum seekers, integration support, “smart borders,” etc.).1
Forecasting migration is one particular area where big data seems to excite data nerds (like us) and policymakers alike. If there is one way big data has already made a difference, it is its ability to bring different actors together — data scientists, business people and policy makers — to sit through countless slides with numbers, tables and graphs. Traditional migration data sources, like censuses, administrative data and surveys, have never quite managed to generate the same level of excitement.
Many EU countries are currently heavily investing in new ways to forecast migration. Relatively large numbers of asylum seekers in 2014, 2015 and 2016 strained the capacity of many EU governments. Better forecasting tools are meant to help governments prepare in advance.
In a recent European Migration Network study, 10 out of the 22 EU governments surveyed said they make use of forecasting methods, many using open source data for “early warning and risk analysis” purposes. The 2020 European Migration Network conference was dedicated entirely to the theme of forecasting migration, hosting more than 15 expert presentations on the topic. The recently proposed EU Pact on Migration and Asylum outlines a “Migration Preparedness and Crisis Blueprint” which “should provide timely and adequate information in order to establish the updated migration situational awareness and provide for early warning/forecasting, as well as increase resilience to efficiently deal with any type of migration crisis.” (p. 4) The European Commission is currently finalizing a feasibility study on the use of artificial intelligence for predicting migration to the EU; Frontex — the EU Border Agency — is scaling up efforts to forecast irregular border crossings; EASO — the European Asylum Support Office — is devising a composite “push-factor index” and experimenting with forecasting asylum-related migration flows using machine learning and data at scale. In Fall 2020, during Germany’s EU Council Presidency, the German Interior Ministry organized a workshop series around Migration 4.0 highlighting the benefits of various ways to “digitalize” migration management. At the same time, the EU is investing substantial resources in migration forecasting research under its Horizon2020 programme, including QuantMig, ITFLOWS, and HumMingBird.
Is all this excitement warranted?
Yes, it is.
Act 2 — Promising Results
Indeed, a growing number of studies hint at the potential of big data for migration forecasting. Let’s take two prominent examples: the use of Google Trends and Facebook networks.
Google Trends
Google Trends is a powerful and publicly accessible tool for online search data. It is a good proxy for what over a billion users worldwide are curious about (i.e. what they search for on the Google search engine).2 Migrants may use the internet to prepare for a journey, or at any point during the journey. This means search data may be an effective way of gaining insight into migration plans and patterns. Several studies have explored the use of Google Trends for predicting migration flows.3
Google Trends data have been used for forecasting migration within the U.S. and to OECD countries, as well as for predicting Latin American migration to Spain, among others.
The Pew Research Center used searches for ‘destination’ countries such as ‘Greece’ and ‘Italy’ among Arabic speakers in Turkey to estimate movements of asylum seekers from the Middle East (mainly from Syria and Iraq) to EU countries. Results showed a clear correlation between Google searches related to Greece and monthly asylum applications in Greece. Andre Groger, Tobias Heidland, and Marcus Bohme used Google Trends search keywords that were semantically linked with the word “migration” to improve the performance of migration models. They concluded that Google Trends is a ‘novel’ way of measuring the intent to migrate and a better way to get real-time predictions of migration movements. Research looking at Google Trends queries and migration flows from Latin America to Spain found that online searches are correlated with records of subsequent migration flows.
Based on these promising case studies, we at IOM’s Global Migration Data Analysis Centre (IOM GMDAC) set out to build a model that works globally, across all pairs of countries in the world. Digging into the data revealed the many complications involved in using online search data: how to technically query the data, which language settings and specific search terms to use, and which data to use for “ground truthing” (i.e. comparing searches with official migration data).4
Google search data appear to be correlated with official migration statistics for some migration corridors (see Figure 1 below), but not others (see Figure 2 below). Over our 5-month research project, we iteratively tried to find the best approach for each origin-destination country pair (‘corridor’) in the world.
We found that there is no ‘universal’ approach at the global scale: each migration corridor seems to need its own query approach. In other words, to get comparable data for many countries, the way search data are pulled from the Google API has to be adjusted for different contexts.
Aspects like language settings, search term selection and other query-related choices are key to optimizing the use of Google Trends to anticipate migratory movements.
The plot above shows the strong correlation between the migration-related search terms on Google and migrant and refugee flows from Syria to Sweden, based on OECD data (our ground truth data). The correlation is quite consistent across the years.
But let’s not get too excited too soon. We also found many unique bilateral migration corridors which showed no clear relationship between Google Trends and official migration statistics, such as China to Korea (Figure 2) and Iraq to Canada.
Granted, a 5-month research project won’t answer all our questions. Going forward, we will need to take into account other contextual factors (internet access, literacy rates, levels of socio-economic development and geographical proximity between origin and destination, among others) to improve the model and understand why online searches can be useful to predict migration in some cases but not others. So far, the exercise has shown the potential but also the limitations of using “big data” for migration forecasting, underlining the need for realistic expectations among policy makers.
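To make the corridor-by-corridor check more concrete, here is a minimal sketch of how one might test whether searches lead official flows for a single corridor. It is not our actual pipeline: the numbers are made up, and the names `pearson`, `best_lag`, `search_index` and `official_flows` are hypothetical. It simply shifts the search series by a few months and keeps the lag with the highest Pearson correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no signal
    return cov / sqrt(var_x * var_y)


def best_lag(search_index, official_flows, max_lag=6):
    """Return (lag, r) for the lag at which searches best anticipate flows.

    A lag of k means searches in month t are compared with flows in month t + k.
    Both series are assumed to cover the same months.
    """
    results = []
    for lag in range(max_lag + 1):
        searches = search_index[: len(search_index) - lag]
        flows = official_flows[lag:]
        results.append((lag, pearson(searches, flows)))
    return max(results, key=lambda pair: pair[1])


# Synthetic monthly data for one hypothetical corridor (not real statistics).
# The flows are built to follow the search index with a two-month delay.
search_index = [10, 12, 15, 30, 55, 80, 60, 40, 25, 20, 18, 15]
official_flows = [900, 950, 1000, 1200, 1500, 3000, 5500, 8000, 6000, 4000, 2500, 2000]

lag, r = best_lag(search_index, official_flows)
print(f"best lag: {lag} months, correlation: {r:.2f}")
```

In practice this would be repeated for every corridor and every candidate query (language, search terms), which is where the lack of a universal approach becomes visible. A high correlation alone does not make a forecast, so any such result still needs checking against held-out data.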
The Facebook Connectedness Index
Can Facebook data be used to forecast migration?
With over 2.71 billion users around the globe, Facebook remains the most widely used social media platform in the world. Facebook data have been used to monitor stocks of migrants globally, and have successfully anticipated the increase of Venezuelan migrants and refugees in Colombia and Spain. Facebook data have also been used to assess the cultural assimilation of Mexican immigrants in the U.S. In August 2020, Facebook made available data from its Social Connectedness Index (SCI), which measures the ‘frequency’ and ‘density’ of Facebook friendships and social ties around the world. For now, we only have a snapshot of networks at one point in time. If these data were available over time, changes in cross-country contacts could possibly be used to forecast global migration patterns. Preliminary results are encouraging (Figure 3).
We combined Facebook’s SCI data with the United Nations Population Division’s estimates of international migrant stocks to explore whether these data have the potential to predict migration from one country to another. We used Facebook’s Connectedness Index between countries as a proxy for migration networks (social ties) and tested its correlation with the bilateral stock of migrants for the same country pair. The assumption is that changes in the networks across countries go hand in hand with changes in international migration patterns.
Indeed, as Figure 3 shows, higher numbers of migrants from one country living in another (bilateral stocks) are associated with a higher probability of Facebook friendship links between users in both locations. Country pairs with a large migrant population, like Mexico-United States or Morocco-Spain, have a greater Connectedness Index than country pairs with lower bilateral migration numbers, like India-Argentina or Nigeria-Norway.
To further explore this relationship, we tested whether the positive correlation holds for other countries in other regions. We found that a 1-percent increase in networks is associated with a 0.7-percent increase in bilateral migrant stocks across all available country pairs. This is good news for forecasting. As more data become available over time, a large increase in network size in any country may indicate an increase in immigration into the country before official statistics become available.5
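A “1 percent goes with 0.7 percent” relationship is what one gets from the slope of a regression of log stocks on log network size, often called an elasticity. The sketch below illustrates that reading under our own assumptions: the data are synthetic and constructed to have an elasticity of exactly 0.7, the names `loglog_slope`, `network_size` and `migrant_stock` are hypothetical, and the exact model specification behind the figure above is not described here.

```python
from math import log


def loglog_slope(network, stocks):
    """OLS slope of log(stocks) on log(network), i.e. the elasticity."""
    xs = [log(v) for v in network]
    ys = [log(v) for v in stocks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic country pairs (not real SCI or UN data), built so that
# stock = 50 * network ** 0.7 holds exactly.
network_size = [100, 1_000, 5_000, 20_000, 100_000]
migrant_stock = [50 * s ** 0.7 for s in network_size]

elasticity = loglog_slope(network_size, migrant_stock)

# Implied change in stocks when the network measure grows by 1 percent.
pct_change = (1.01 ** elasticity - 1) * 100

print(f"elasticity: {elasticity:.2f}")
print(f"a 1% larger network goes with {pct_change:.2f}% larger stocks")
```

With real data the points would scatter around the fitted line, and because the SCI is currently a single snapshot, the slope describes differences between country pairs rather than changes over time.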
Act 3 — A Long Road Ahead
So far, so promising. Yet it will be a long time before forecasts based on these digital data meet policy makers’ expectations.
The Google Flu case is a good example to explain why. In 2009, researchers from Google used Google searches to “nowcast” the flu. In 2014, a number of renowned researchers showed that, after initial success, the Google Flu tool “failed spectacularly,” missing the peak of the flu season by 140 percent. The Google Flu example is not a story of failure, but perhaps of moderation and complementarity. Google did not save the day, but it became a mainstream tool that the CDC now uses in combination with other data sources.
Will we see the same for the case of forecasting migration? Google Flu was released 12 (!) years ago. The migration community has just now started to explore big data and all sorts of digital data sources for forecasting purposes.
Additional momentum from the policy side is helpful, but there is a risk of frustration when exceedingly high expectations are not met quickly. We should not get ahead of ourselves; we should accept that we are in an exploratory stage. It may take more than a decade for governments and national statistical offices to mainstream digital data as one instrument among others.
There are many more hurdles ahead. The migration community is facing several additional challenges. First, what people search for when they have the flu is more specific than what they search for when they want to migrate. Second, migration is a global phenomenon, but there are many countries around the world with limited internet connectivity. Third, to test whether digital data can deliver accurate forecasts, these must first be compared with reality in the form of accurate “traditional” migration statistics. This comparison is difficult, as many countries in the world do not even systematically compile information on who is coming into their countries and who is leaving on an annual or monthly basis.
There are several other issues with many forms of non-traditional data. Many of these are related to the fact that these are user data collected by private companies. Facebook, for example, is less popular nowadays among younger cohorts. Changes in the user base may have dramatic implications for the ability to use these data for forecasting. Social media data are not always reliable and are not representative of the general population, though there are ways to correct for this bias. Using private data for public policy purposes also raises serious ethical and fundamental rights issues, particularly when it comes to politically sensitive topics such as migration.
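One common way to correct for an unrepresentative user base is post-stratification: estimate the quantity of interest within each demographic group, then weight the groups by their share of the real population rather than their share of the platform. The sketch below is only an illustration of that idea, not necessarily the correction used in any particular study. All numbers are invented, and the function and variable names are hypothetical.

```python
def poststratified_share(sample_counts, sample_positive, population_shares):
    """Weight each group's observed rate by its true population share."""
    estimate = 0.0
    for group, pop_share in population_shares.items():
        group_rate = sample_positive[group] / sample_counts[group]
        estimate += pop_share * group_rate
    return estimate


# Invented example: platform users skew older than the population.
# "positive" could be any per-user signal, e.g. a migration-related indicator.
sample_counts = {"18-29": 100, "30-49": 500, "50+": 400}
sample_positive = {"18-29": 30, "30-49": 50, "50+": 20}

# Assumed true age structure of the population (would come from a census).
population_shares = {"18-29": 0.3, "30-49": 0.4, "50+": 0.3}

# Naive estimate: treat platform users as if they were the population.
naive = sum(sample_positive.values()) / sum(sample_counts.values())

# Adjusted estimate: re-weight group rates to the population structure.
adjusted = poststratified_share(sample_counts, sample_positive, population_shares)

print(f"naive share: {naive:.3f}, post-stratified share: {adjusted:.3f}")
```

The correction only works for characteristics that are observed on both sides (here, age group) and cannot fix groups that are missing from the platform altogether.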
Epilog: Patience and realism
One likely trend is that results from big data forecasts will be combined with, or cross-validated by, expert judgment. As previous research shows, there is no “one size fits all” method for migration forecasting; each method has its own set of strengths and weaknesses. Expert opinion on future migration flows can be valuable because big data may be big, but it is also “thin,” meaning that other, essential, contextual information is often lacking.
Results from forecasts should first be contextualized, qualitatively analyzed by experts and then presented to policymakers alongside other sources. On the sidelines of a conference, Teddy Wilkin from EASO shared this story: “When my analysts detected a spike in searches for Italy among Nigerians, the machine took this at face value. In reality, we found that searches for Italy correlated with Champions League soccer games involving Italian teams.” According to him, EASO does not let policy makers into the “machine room.”
It is true that data scientists and policymakers have never had so many interactions on this issue, yet they are often still speaking different languages. We have a long way to go before digital data become a reliable mainstream forecasting tool, but our experiences in just the last year have been very encouraging. The momentum is there, but it needs to be sustained. Our hope is that policymakers and donor organizations don’t lose interest after the first gold rush. It is patience, realism and persistence that will allow the migration community to get the most value out of innovative data sources for migration forecasting — and other purposes.