Using Data Science To Understand Migration

By Mr. Data Science

Photo by Atul Pandey on Unsplash

Throughout this article, we will explore migration data to gain a better understanding of migration drivers. Since migration remains a contentious political issue, we will refrain from giving opinions and focus on the data instead. To investigate migration drivers we will use a couple of datasets (all of them csv files):

  • The country data set was downloaded from Kaggle
  • The happiness reports (5 files) were also downloaded from Kaggle

The goals for this article are to:

  1. demonstrate some useful data science techniques such as combining datasets, generating correlation heat maps, and applying k-means to a dataset
  2. combine the data into a unified dataframe that can be analyzed. We will explore, for example, the correlation between happiness score and net migration
  3. explore the data to gain a better understanding of the differences between countries that have net positive and net negative migration

According to Wikipedia, the term data analysis was first used in the early 1960s and the term data science appeared in the mid-1980s. An important distinction was made between statistics and data analysis: data analysis should be empirical, more like a science than mathematics. This idea was also central to the use of the term data science. One important idea that is not always discussed in data science books or course tutorials is that part of the role of the data scientist/analyst is to make the data useful to decision makers and people who perhaps do not have a technical background. This is why data visualization is such an important topic. In this article, we will primarily focus on the practical application of some data science techniques rather than a theoretical discussion of a particular algorithm.

Applied data science is the process of using data and data science techniques to solve problems or better understand issues. It is very powerful but you should know it is not perfect. It is limited by the data, the accuracy of the data and the completeness of the data. These are issues we will encounter in this article. However, even with these limitations, data science can offer a more objective understanding of an issue. This can be particularly useful with an issue like migration which can produce powerful emotions.

In addition to some EDA (exploratory data analysis), this article will also use the k-means algorithm to identify possible clusters within the migration data. Given multiple features within the dataset, there may be patterns or clusters that can help us better understand the drivers of migration. The k-means algorithm has many practical applications within business and other areas where decision making is essential [1].

In this article, we will rely on pandas, numpy, matplotlib, seaborn, and scikit-learn. Make sure you can download, install, and import them before proceeding.

Let's start by importing pandas.

import pandas as pd

There are six files we will be exploring in total:

df_2015 = pd.read_csv("migration_data/2015.csv")
df_2016 = pd.read_csv("migration_data/2016.csv")
df_2017 = pd.read_csv("migration_data/2017.csv")
df_2018 = pd.read_csv("migration_data/2018.csv")
df_2019 = pd.read_csv("migration_data/2019.csv")

df_countries = pd.read_csv("migration_data/countries.csv")

The first five datasets are the World Happiness Report for the years 2015 to 2019. The sixth dataset is unrelated to the first five and has a very different structure; it contains additional information about different countries.

Combining the data into a single dataframe is an interesting challenge. It requires some work because the five happiness reports are all in different formats. We will standardize the structure of the data in the five happiness reports then concatenate the files into one large file. We will then need to remove duplicates.

The country dataframe is in a completely different format as well. Rather than concatenating the data onto the end of the combined happiness dataframes we will merge the country dataframe on a common key, the country column. This will provide more features for each country and may be interesting to analyze. If you are familiar with SQL it is like using an inner join on two tables.
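To illustrate the inner-join behaviour before we apply it to the real data, here is a sketch using two tiny made-up frames (the country names and values are purely illustrative, not taken from the article's datasets):

```python
import pandas as pd

# Two tiny made-up frames sharing a 'country' key, mimicking the shape of
# df_happiness and df_countries. The values here are illustrative only.
happiness = pd.DataFrame({'country': ['Albania', 'Belgium', 'Chad'],
                          'happiness': [4.7, 6.9, 4.3]})
countries = pd.DataFrame({'country': ['Albania', 'Belgium', 'Denmark'],
                          'population': [2877797, 11589623, 5831404]})

# An inner join keeps only the countries present in BOTH frames,
# like an SQL INNER JOIN on the country column.
combined = pd.merge(happiness, countries, how='inner', on='country')
print(combined)
```

'Chad' and 'Denmark' are dropped because each appears in only one of the two frames; this is exactly what will happen to unmatched countries when we merge the real data later.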

First let’s check that the five happiness reports have the same number of rows:

df_2015.shape
(158, 12)
df_2016.shape
(157, 13)
df_2017.shape
(155, 12)
df_2018.shape
(156, 9)
df_2019.shape
(156, 9)

Each row in these reports represents one country, and the row counts differ, so the set of countries covered varies from report to report. For example:

df_difference_1 = df_2015[~df_2015['Country'].isin(df_2016['Country'])].dropna(how='all')
df_difference_1['Country'].head(5)
21                  Oman
90     Somaliland region
93            Mozambique
96               Lesotho
100            Swaziland
Name: Country, dtype: object

Countries like Oman and Mozambique did not come into existence or cease to exist between 2015 and 2016. It seems the annual reports are not complete; perhaps data was not available from some countries in some years. The structure of the data also varies from year to year:

df_2015.columns
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

df_2017.columns
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')

So the first task is to standardize the columns for the five annual happiness reports. This means each report should have the same columns in the same order with the same names. We can use the following code to drop columns that are not common to all five reports:

df_2015 = df_2015.drop(['Region', 'Happiness Rank', 'Family', 'Standard Error',
                        'Trust (Government Corruption)', 'Dystopia Residual'], axis=1)
df_2016 = df_2016.drop(['Region', 'Happiness Rank', 'Family', 'Lower Confidence Interval',
                        'Upper Confidence Interval', 'Trust (Government Corruption)',
                        'Dystopia Residual'], axis=1)
df_2017 = df_2017.drop(['Happiness.Rank', 'Whisker.high', 'Family', 'Whisker.low',
                        'Trust..Government.Corruption.', 'Dystopia.Residual'], axis=1)
df_2018 = df_2018.drop(['Overall rank', 'Social support', 'Perceptions of corruption'], axis=1)
df_2019 = df_2019.drop(['Overall rank', 'Social support', 'Perceptions of corruption'], axis=1)
df_2015.head(1)

This leaves six features common to all dataframes, so we can now simplify and standardize the feature names. The five dataframes can then be concatenated and duplicate rows (by country name) removed, leaving the most recent row for each country that appears at least once in the five reports.

df_2015 = df_2015.rename(columns={"Country": "country", "Happiness Score": "happiness", "Economy (GDP per Capita)": "gdp", "Health (Life Expectancy)": "health", "Freedom": "freedom", "Generosity": "generosity"})
df_2016 = df_2016.rename(columns={"Country": "country", "Happiness Score": "happiness", "Economy (GDP per Capita)": "gdp", "Health (Life Expectancy)": "health", "Freedom": "freedom", "Generosity": "generosity"})
df_2017 = df_2017.rename(columns={"Country": "country", "Happiness.Score": "happiness", "Economy..GDP.per.Capita.": "gdp", "Health..Life.Expectancy.": "health", "Freedom": "freedom", "Generosity": "generosity"})
df_2018 = df_2018.rename(columns={"Country or region": "country", "Score": "happiness", "GDP per capita": "gdp", "Healthy life expectancy": "health", "Freedom to make life choices": "freedom", "Generosity": "generosity"})
df_2019 = df_2019.rename(columns={"Country or region": "country", "Score": "happiness", "GDP per capita": "gdp", "Healthy life expectancy": "health", "Freedom to make life choices": "freedom", "Generosity": "generosity"});
dataframes = [df_2015,df_2016,df_2017,df_2018,df_2019]
df_happiness = pd.concat(dataframes)

This dataframe contains duplicate countries because many countries appeared in more than one report:

df_happiness.shape
(782, 6)

The next snippet deletes all duplicate countries, keeping only the last (most recent) row for each country:

df_happiness = df_happiness.drop_duplicates(subset=['country'], keep='last')
df_happiness.shape
(170, 6)
df_happiness.head()

To sum up: we've combined the five happiness report dataframes; since they had some differing features we had to drop some columns, and we also simplified and standardized the column names. Concatenating the files introduced duplicates into the dataframe (up to five rows for some countries), so we used the drop_duplicates function to keep just the last row for each country. The final dataframe contains the most recent row for each country. We still have the individual annual report dataframes, but we also have a dataframe covering every country that appeared at least once in any of the five reports.

We also have df_countries, which is structured very differently from the happiness reports.


Joining this data to the df_happiness dataframe we created above would give us a dataframe with more features for each country; this could be useful in examples 2 and 3 when we try to analyze the data.

This data sometimes uses the European convention of a decimal comma rather than a decimal point, so for example 0.23 is written as 0,23.

We need to do two things to this dataframe:

  • change the numbers to use ‘.’ rather than ‘,’
  • simplify the column names where necessary

The easiest way to fix the numbers is to re-load the data. The read_csv function has a number of parameters that can be used to format dates and numbers:

df_countries = pd.read_csv("migration_data/countries.csv", decimal=",")
df_countries.head(2)

We can rename the columns, just to keep everything standardized.

df_countries = df_countries.rename(columns={
    "Country": "country",
    "Population": "population",
    "Area (sq. mi.)": "area",
    "Pop. Density (per sq. mi.)": "pop_density",
    "Coastline (coast/area ratio)": "coastline",
    "Net migration": "migration",
    "Infant mortality (per 1000 births)": "infant_mortality",
    "Literacy (%)": "literacy",
    "Phones (per 1000)": "phones",
    "Arable (%)": "arable",
    "Crops (%)": "crops",
    "Other (%)": "other",
    "Climate": "climate",
    "Birthrate": "birthrate",
    "Deathrate": "deathrate",
    "Agriculture": "agriculture",
    "Industry": "industry",
    "Service": "service",
})
df_countries.head(2)

We’ll also drop the columns GDP and Region because we already have a GDP column and we want to focus on specific countries rather than regions:

df_countries = df_countries.drop(['Region','GDP ($ per capita)'],axis=1)

Dealing with Nulls:

First count the nulls in each column:

df_countries.isna().sum()
country              0
population           0
area                 0
pop_density          0
coastline            0
migration            3
infant_mortality     3
literacy            18
phones               4
arable               2
crops                2
other                2
climate             22
birthrate            3
deathrate            4
agriculture         15
industry            16
service             15
dtype: int64

Five columns have more than ten nulls each, so we will remove those columns:

df_countries = df_countries.drop(['literacy','climate', 'agriculture', 'industry', 'service'],axis=1)

The pandas documentation on joining, concatenating and merging is a useful read. In this case we will combine two dataframes based on a shared column (country), using how="inner":


Initially the attempt to merge the two datasets failed; the problem was some leading and trailing whitespace in the country column. Python and pandas treat 'Albania', ' Albania' and 'Albania ' as different values, so the next two lines of code strip any leading or trailing whitespace. This fixed the problem:

df_countries['country'] = df_countries['country'].str.strip()
df_happiness['country'] = df_happiness['country'].str.strip()
df_combined = pd.merge(df_happiness, df_countries, how="inner", on=['country'])
df_combined.sort_values(by=['country']).head()

We have lost some rows because no matching country was found during the merge operation. We can check for any remaining nulls:
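If you want to see exactly which countries failed to match, pandas' merge supports an indicator flag. Here is a sketch on tiny stand-in frames (the country names are invented, not the article's actual unmatched rows):

```python
import pandas as pd

# Tiny stand-ins: 'Chad' appears only on the left frame,
# 'Denmark' only on the right frame.
left = pd.DataFrame({'country': ['Albania', 'Chad'],
                     'happiness': [4.7, 4.3]})
right = pd.DataFrame({'country': ['Albania', 'Denmark'],
                      'population': [2877797, 5831404]})

# An outer merge with indicator=True adds a '_merge' column telling us
# whether each row matched both sides, the left only, or the right only.
check = pd.merge(left, right, how='outer', on='country', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'country'].tolist()
print(unmatched)
```

Running the same pattern on df_happiness and df_countries would list the countries silently dropped by the inner join.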

df_combined.isna().sum()
country             0
happiness           0
gdp                 0
health              0
freedom             0
generosity          0
population          0
area                0
pop_density         0
coastline           0
migration           0
infant_mortality    0
phones              1
arable              0
crops               0
other               0
birthrate           1
deathrate           1
dtype: int64

Only a few nulls remain, so we can drop those rows.

df_combined = df_combined.dropna()

We now have a combined dataset free of nulls with data from the happiness reports and the country data file. We did have to drop some features and some countries but we still have a usable dataframe.

In part 2 we will do some data analysis to better understand the topic of migration.

We are interested in migration so we can check the correlation between migration and the other features and plot this as a heatmap:

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 6))

corr = df_combined.corr(numeric_only=True)  # use numeric columns only; 'country' is a string

sns.heatmap(corr, xticklabels = corr.columns, yticklabels = corr.columns, cmap="viridis",annot=True);

Correlation of course doesn’t imply causation but it can help identify possible dependencies between the different features. In this case the bright yellow/green represents a strong positive correlation and dark blue/purple is a strong negative correlation. The top left corner for example shows a strong positive correlation between happiness, GDP and health. Interestingly there is some correlation between freedom and happiness but the correlation is weaker, perhaps suggesting freedom is less important for people to feel happy. Correlation can also be negative between two or more variables when an increase in one variable corresponds to a decrease in the other variable. This can be seen with happiness, gdp and health which all have a negative correlation with infant mortality. So poorer countries (as measured by low GDP) tend to have higher rates of infant mortality.

So what correlates with migration? To read this kind of diagram you can look at either the horizontal migration row or the vertical migration column, whichever is easier. Migration does not have a very strong correlation with any of the features. The strongest correlations are, surprisingly, phone ownership (0.27), followed by population density in the migrant's country (0.25) and, perhaps less surprisingly, the GDP of the migrant's country (0.22). So poorer countries with high population density are more likely to show emigration rather than immigration, and of the poorer countries, those where people own cell phones seem to have higher migration rates. The lack of an obvious strong correlation between migration and any single feature may be telling us that there is no quick fix a rich country could apply to reduce migration or the build-up of migrants at its border. Note that there is very little correlation between happiness and migration (0.12): whether or not people are happy in a country does not seem to be a factor in the levels of emigration from, or immigration into, that country, at least according to this data.
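A quick way to read off these numbers without squinting at the heatmap is to rank every feature by its correlation with migration. Here is a sketch on a toy frame (the column names mirror the article's, but the values are invented; on the real data you would call this on df_combined):

```python
import pandas as pd

# Toy numeric frame standing in for df_combined's numeric columns.
# Values are invented so that 'phones' tracks migration closely.
df = pd.DataFrame({'migration': [1.0, 2.0, 3.0, 4.0],
                   'phones':    [10.0, 20.0, 30.0, 41.0],
                   'happiness': [5.0, 5.1, 4.9, 5.2]})

# Correlation of every feature with migration, ranked by absolute strength
mig_corr = df.corr(numeric_only=True)['migration'].drop('migration')
ranked = mig_corr.abs().sort_values(ascending=False)
print(ranked)
```

The top of the ranking is the feature most strongly (positively or negatively) associated with migration.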

We can also look at which countries produce migrants and which countries attract migrants. We’ll create some new dataframes which are subsets of the dataframe df_combined. This sub-division of a dataframe can be done with this code:

df_combined_pos = df_combined[df_combined['migration'] > 0]
df_combined_neg = df_combined[df_combined['migration'] < 0]

According to the OECD glossary of statistical terms, net migration is negative when the number of emigrants exceeds the number of immigrants. So the dataframe df_combined_neg contains all the countries where emigration is greater than immigration, and df_combined_pos contains the countries that attract more migrants than they lose.

We can take a look at some of the countries in these dataframes:


According to this data, the country that attracts the most immigration (as a proportion of its total population) is Afghanistan, which seems surprising given the political and security situation there. We can't dismiss the possibility that the data contains errors, or that the metric is not measured the way we expect. Other countries in the top five destinations for migrants are less surprising; for example, this English-language Luxembourg Government website states that almost half the population of Luxembourg were not born in Luxembourg.

The countries that produce the most migrants include:


Comparing the average happiness for countries with high emigration and high immigration:

neg_mean = df_combined_neg['happiness'].mean().round(2)
pos_mean = df_combined_pos['happiness'].mean().round(2)

print('Countries with more emigration have a mean happiness score of: {0}'.format(neg_mean))
print('Countries with more immigration have a mean happiness score of: {0}'.format(pos_mean))
Countries with more emigration have a mean happiness score of: 5.29
Countries with more immigration have a mean happiness score of: 6.42

As we saw with the correlation heat map there isn’t a strong correlation between a country’s happiness score and net migration. Perhaps as we expect, countries that attract migrants have a slightly higher happiness score than the countries where the migrants come from, but the difference is relatively small.

There are 40 countries with zero net-migration:

df_combined_zero = df_combined[df_combined['migration'] == 0]
df_combined_zero.shape
(40, 18)

These countries include:


It could be that these countries have almost no immigration or emigration, or it could be that they have high levels of emigration and immigration which cancel each other out. The data we have does not allow us to investigate this further; we only have net migration.

The average happiness score for these countries is:


Surprisingly this is lower than the mean score for countries with negative net migration.
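The figure above was presumably computed the same way as the earlier means. Here is a sketch on a toy frame (the column names match the article's dataframe, but the values are invented, so the printed number is not the article's result):

```python
import pandas as pd

# Toy stand-in for df_combined_zero: countries with exactly zero net migration.
df_combined_zero = pd.DataFrame({'country': ['A', 'B', 'C'],
                                 'migration': [0.0, 0.0, 0.0],
                                 'happiness': [4.8, 5.2, 5.0]})

zero_mean = df_combined_zero['happiness'].mean().round(2)
print('Countries with zero net migration have a mean happiness score of: {0}'.format(zero_mean))
```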

zero_mean_pop = df_combined_zero['pop_density'].mean().round(2)
neg_mean_pop = df_combined_neg['pop_density'].mean().round(2)

print('Countries with zero net migration have a mean population density score of: {0}'.format(zero_mean_pop))
print('Countries with more emigration have a mean population density score of: {0}'.format(neg_mean_pop))
Countries with zero net migration have a mean population density score of: 96.9
Countries with more emigration have a mean population density score of: 117.98

While happiness does not seem to be a very important factor in emigration, population density might be. Countries with negative net migration seem to have higher mean population densities.

In the next part we’ll apply a k-means algorithm to the data. This is an unsupervised algorithm which clusters or divides the data into groups. We can look to see if those groups correspond to migration in any way.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

We can now investigate this data further using some machine learning. Let’s see if we can cluster the countries into groups using the k-means algorithm:

k-means uses a distance metric to group and label data, so it is usually best to standardize the data; otherwise the calculations would be dominated by numerically large features such as population and area:

df_no_country = df_combined.drop(['country'],axis=1)
scaler = StandardScaler()
X_std = scaler.fit_transform(df_no_country)
X_std[0:2] #look at the first two 'rows'
array([[ 1.24500532e+00, 1.12265085e+00, 1.72761796e-01,
1.56996543e+00, 3.76045246e-01, -2.72897775e-01,
-3.01688840e-01, -2.56251376e-01, -2.01227542e-01,
1.01102844e-02, -5.51766143e-01, -5.69945423e-01,
-1.10935417e+00, -6.44947796e-01, 1.14935482e+00,
1.12946418e+00, -1.16162547e+00],
[-9.72992421e-01, -1.23701245e+00, -1.52273283e+00,
4.00866128e-01, -1.58166947e-03, -2.91106905e-01,
-3.89806556e-01, -2.47445648e-01, -1.54417992e-01,
-6.43071551e-02, 1.66639298e+00, -8.83977507e-01,
-1.10654813e+00, -6.77723520e-01, 1.16306047e+00,
1.40753569e+00, 1.66796712e+00]])

In example 2 we divided the data into three groups based on net migration. So let’s use k = 3 again and see how k-means clusters the data:
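Before committing to k = 3, the silhouette_score we imported earlier can be used to sanity-check the choice of k. Here is a sketch on synthetic data (make_blobs stands in for X_std, so the numbers will differ from the article's data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_std: 3 well-separated clusters in 5 features
X, _ = make_blobs(n_samples=150, centers=3, n_features=5,
                  cluster_std=1.0, random_state=111)
X = StandardScaler().fit_transform(X)

# Higher silhouette score = tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=111, n_init=10).fit(X)
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the article's X_std you would loop over k the same way and compare the scores before settling on a value.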

kmeans = KMeans(n_clusters=3, random_state=111)
kmeans.fit(X_std)  # fit the model to the standardized data
pd.Series(kmeans.labels_).value_counts()
2    72
0    45
1    31
dtype: int64
preds = kmeans.labels_
kmeans_df = df_combined.copy()  # copy so the original dataframe is untouched
kmeans_df['Labels'] = preds

We now have a data set with labels for each country, the labels are 0,1,2. We can now analyze these groups looking for differences:

The number of countries with each label:

kmeans_df['Labels'].value_counts()
2    72
0    45
1    31
Name: Labels, dtype: int64

Again we can subdivide the dataframe, this time using the new labels:

df_0 = kmeans_df[kmeans_df['Labels'] == 0]
df_1 = kmeans_df[kmeans_df['Labels'] == 1]
df_2 = kmeans_df[kmeans_df['Labels'] == 2]

Some differences in mean migration numbers:

  • group 1 has the highest mean net migration (3.49)
  • group 0 has a positive mean net migration (0.32), but only just above zero
  • group 2 has a negative mean net migration (-1.21), so emigration is higher than immigration
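These per-group means come from a simple groupby. Here is a sketch on a miniature made-up version of kmeans_df (the values are invented to mimic the three groups, so the printed means are illustrative only):

```python
import pandas as pd

# Miniature made-up stand-in for kmeans_df: one row per country,
# with the k-means cluster label in the 'Labels' column.
kmeans_df = pd.DataFrame({'country':   ['A', 'B', 'C', 'D', 'E', 'F'],
                          'migration': [3.5, 3.4, 0.3, 0.4, -1.2, -1.3],
                          'Labels':    [1, 1, 0, 0, 2, 2]})

# Mean net migration per cluster label
group_means = kmeans_df.groupby('Labels')['migration'].mean().round(2)
print(group_means)
```

Passing a list of columns instead of a single column name profiles every feature per group in one call.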

It is interesting that the three groups created by k-means have a positive mean net-migration, a negative mean net-migration and a mean net-migration of around zero. This is very close to the manual divisions we created in Example 2 above. But the algorithm was not looking at only the net migration values.

Interestingly group 2 has a higher mean happiness score than group 0 even though group 2 countries tend to have negative net-migration. Again as we saw before the happiness score alone can not be used to predict migration but even so this seems counter-intuitive.

Other differences include:

  • countries that attract migrants tend to be richer and have better healthcare; in particular, infant mortality is much lower than in the countries migrants come from
  • countries with higher emigration tend to score lower on the freedom measure, but the difference is small; as with happiness, countries with a negative mean net-migration tend to score better on freedom than the group 0 countries, which have a net-migration of around zero

Perhaps we can say that concepts like happiness and freedom are less important to migrants than the more practical issues of money and health care.

  1. Raghupathi, K., "10 Interesting Use Cases for the K-Means Algorithm", Medium, retrieved 04/06/2021.

I’m just a nerdy engineer that has too much time on his hands and I’ve decided to help people around the world learn about data science!
