Climate Change and Migration: Central America, A Case Study

Jabran Zahid
Sep 17, 2019

by: H. Jabran Zahid in collaboration with Peace Rising. Check out my GitHub repository for more details and the data and code.

INTRODUCTION

The United Nations High Commissioner for Refugees estimates that more than 70 million people worldwide are displaced. Nearly 1 in every 100 people around the world has been forced from home. While these numbers are alarming, they cannot convey the unimaginable impact such displacement has on individuals, families and communities. Identifying vulnerable communities in advance would provide a means to focus resources that could help mitigate the suffering.

Maria Lila Meza Castro, a 39-year-old migrant woman from Honduras, runs from tear gas with her twin daughters in front of the border wall between the U.S. and Mexico, in Tijuana, on Nov. 25. Credit: (Kim Kyung-Hoon/Reuters)

The causes of displacement are varied and often not well understood. However, there is a scientific consensus that humans are driving climate change at seemingly catastrophic rates, and its destabilizing effects can already be witnessed in the major migrations we see today; the term “climate refugee” will soon be part of our cultural lexicon. The synergy between climate change and other factors contributing to the displacement of people was recently recognized when the UN General Assembly adopted the Global Compact on Refugees in 2018. The compact notes that “climate, environmental degradation and natural disasters increasingly interact with the drivers of refugee movements.” Climate models predict that climate change is accelerating. This imminent threat makes identifying vulnerable communities all the more urgent.

Peace Rising is a non-profit organization which aims to use predictive modeling to help the international community leverage existing resources to target regions most vulnerable to displacement. In collaboration with Peace Rising, I have made a modest contribution to this goal. Here I will discuss a machine learning model which predicts migration patterns observed in Honduras, Guatemala and El Salvador using semi-structured remote sensing and aggregate global Geographic Information System (GIS) data from 2002–2017. This model provides an important step toward exploring the causal link between the various input features and migration patterns. This work was carried out over three weeks as part of the Insight Data Science Fellows program in Boston.

THE XGBOOST MODEL

The goal of this modeling effort is to provide an interpretive framework for understanding which factors are the best predictors of migration patterns, i.e. most strongly correlated with them. Strongly correlated features may provide a means for identifying vulnerable populations. This work is an exploratory effort; it is a first step in assessing the feasibility of such modeling efforts.

To model migration patterns, I build a classification model targeting relative population change. I use GIS data on conflict and protest, climate projections, agriculture and food security, human and economic development, equity and fractionalization, water and the environment, nonrenewable resources, and governance and institutions as input features to the model. I use XGBoost with a walk-forward validation scheme, scored with the Area Under Curve (AUC) metric. I optimize hyperparameters using the hyperopt package in Python. The year-to-year walk-forward validation AUC score of the hyperparameter-optimized XGBoost model averages 0.63 (year-to-year scores range between 0.51 and 0.81). Details of the data and methodology are provided at the end of this post.
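To make the validation scheme concrete, here is a minimal sketch of walk-forward splitting and AUC scoring in plain Python. It is an illustration under stated assumptions, not the code from the repository: the function names are hypothetical, the training window is assumed to expand to include every earlier year, and AUC is assumed to mean the area under the ROC curve.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def walk_forward_splits(
    years: Sequence[int], min_train: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training years, test year) pairs in chronological order.

    Assumption: an expanding window, so each test year is predicted
    from all earlier years and never from later ones.
    """
    ordered = sorted(years)
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # Ties between a positive and a negative count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def walk_forward_auc(
    years: Sequence[int],
    fit_predict: Callable[[List[int], int], Tuple[Sequence[int], Sequence[float]]],
) -> Dict[int, float]:
    """Score each test year separately.

    `fit_predict` is a hypothetical callback: it trains a classifier on the
    training years and returns (true labels, predicted scores) for the test
    year. A gradient-boosted tree classifier would be plugged in here.
    """
    results = {}
    for train_years, test_year in walk_forward_splits(years):
        labels, scores = fit_predict(train_years, test_year)
        results[test_year] = auc_score(labels, scores)
    return results
```

Averaging the values returned by `walk_forward_auc` gives a single summary number, while the per-year values expose the kind of year-to-year spread quoted above.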

DRIVERS OF MIGRATION

The primary motivation of the modeling effort is to provide a framework for identifying and interpreting features which are correlated with migration patterns. The need for interpretability motivated me to use XGBoost, a decision-tree-based algorithm. The feature importance score indicates how useful each feature is in making decisions when constructing trees, thus providing a direct means for identifying important features.

The importance of all input features sorted by feature type. The legend is color matched, denoting the category each feature belongs to. The 12 features corresponding to the precipitation (rain) and temperature data are for each of the twelve months. Of the 19 BioClim features, 17 are shown (BI02 and BI07 could not be engineered from the data). The six governance features are Worldwide Governance Indicators.

The figure above indicates the relative importance of various types of features used in this analysis. The most important feature categories are the population data, followed by arms sales, governance, BioClim variables, temperature and precipitation data. I expected the precipitation data to have greater feature importance. However, these data are used together with the temperature data to calculate the BioClim variables and thus are correlated with those features. The covariance of these features may help explain why precipitation data alone was not more important.

Six most important features for predicting migration.

The six most important features in predicting migration in and out of regions are shown above. The single most important feature for predicting whether population will increase or decrease is the previous year's population change. The dependence on the previous year's population change is not surprising; migrations likely take years and are thus correlated in time.

Arms transfers were the second most important feature. While the number of arms may not itself be an indicator of migration, arms transfers are subject to larger geopolitical trends which are likely correlated with migration.

The third and sixth most important features are the government effectiveness and political stability and absence of violence, respectively. As noted in the preceding links, government effectiveness reflects the “perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government’s commitment to such policies.” The sixth feature reflects “perceptions of the likelihood of political instability and/or politically motivated violence, including terrorism.” Perceptions of political stability and effectiveness of government to provide services appear to be important factors in predicting migration.

The fourth and fifth features in terms of importance are BioClim variables. These features are the mean temperature of the driest quarter and the total precipitation in the coldest quarter. Here, a quarter refers to three consecutive months. These features are especially interesting as they suggest a dependence of migration on climate. Understanding this dependence is important given that these effects will be exacerbated by climate change, a situation noted by the UN General Assembly's recent adoption of the Global Compact on Refugees. This dependence on climate may provide an opportunity to place future quantitative studies of migrations in the broader context of climate modeling and prediction.

The feature importance results reveal potentially interesting relations between input features and migration patterns. I emphasize that these results are tentative and significant additional effort is required to validate the model. Furthermore, these results demonstrate correlation between migration and input features; they do not (and cannot) address the fundamental question of what is causing these migrations. Such interpretations require more sophisticated analysis combining quantitative modeling with domain expertise.

This work demonstrates the feasibility of predicting migration patterns from data. The potential power of such modeling efforts is all the more impressive given the global scope of the available data sets. Such models are a powerful quantitative tool for identifying vulnerable populations. These models may be an important tool for immigration organizations wishing to concentrate resources to help mitigate the factors leading to population displacement.

FUTURE DIRECTIONS AND OUTLOOK

Here I outline some directions that may help to make the model and interpretation more robust. An exploration of feature importance based on examination of the relationships between individual features and the target labels may be useful for identifying the underlying causes driving migration. An exploration of the correlation between the features may also reveal underlying patterns which may provide further insights. Such analyses can be straightforwardly carried out (e.g., by calculating the correlation matrix).
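As a rough illustration of the correlation analysis mentioned above, the sketch below computes a Pearson correlation matrix for a handful of feature columns in plain Python. The function names and the toy columns are hypothetical; on the real feature table a library routine such as `pandas.DataFrame.corr` or `numpy.corrcoef` would do the same job.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("columns must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(
    columns: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Correlation between every pair of named feature columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns for illustration only; these are not values from the data set.
toy = {
    "rain_july": [4.0, 3.0, 2.0, 1.0],
    "temp_july": [1.0, 2.0, 3.0, 4.0],
    "bioclim_example": [2.0, 4.0, 6.0, 8.0],
}
matrix = correlation_matrix(toy)
```

Pairs with coefficients close to 1 or -1 carry largely redundant information, which is one way importance can end up shared between a raw feature and a variable derived from it.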

The mean walk-forward AUC score of 0.63 may be improved by inclusion of features from additional data sets or engineering of existing features. Additional data on climate, governance, violence, infrastructure and food security are likely to be the most useful. Moreover, there is significant year-to-year variation of the walk-forward AUC validation score. Understanding the drivers of this variation may help to identify limitations of the model and also provide insight into the underlying factors driving migration.

I used a classification approach due to the low significance of the validation score when trying to predict the relative change in population directly, i.e. when setting up the problem as a regression. If the inclusion of additional data improves the predictability of the model, exploration of a regression model is warranted. The relationship between various features and population change is likely dependent on the magnitude of population change. Large rapid migrations may be caused by different factors than more sustained migrations which reflect broader demographic trends (e.g., urbanization). A regression model which accounts for the magnitude of population change may capture more nuanced relationships between the features and target labels. Such a model is easy to build using XGBoost and the existing code.

The model I have developed is a proof-of-concept. Given the scope and complexity of the problem of tracking displaced populations, the success of the model in predicting migration patterns is remarkable. Further development of such modeling approaches combined with domain expertise will be valuable for understanding the drivers of displacement. Ultimately, such efforts could provide important, actionable insights for targeting the most vulnerable communities. Given the imminent and ostensibly inevitable impact of climate change, the potential for quantitative modeling approaches to help address the problem of forced migrations should not be underestimated.

DATA EXTRACTION AND METHODOLOGY

Peace Rising has assembled a 300 GB global geospatial-temporal database consisting of data on conflict and protest, climate projections, agriculture and food security, human and economic development, equity and fractionalization, water and the environment, nonrenewable resources, and governance and institutions. These data are in GIS format as vector and raster data. The data are associated with locations via a latitude and longitude tag. The population data, which are the target for prediction, span 2002–2017. Not all data are available for this time period. When missing data are interspersed in the time interval of interest, I interpolate as a function of time. When data are missing from the beginning or end of the period of interest, I adopt the values which are nearest in time.
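The gap-filling rule described above can be sketched as follows. This is a minimal illustration, assuming linear interpolation between the nearest observed years (the text only says the interpolation is a function of time); the function name `fill_years` is hypothetical.

```python
from typing import Dict, List, Sequence


def fill_years(known: Dict[int, float], years: Sequence[int]) -> List[float]:
    """Return one value per requested year.

    Inside the observed range: linear interpolation between the bracketing
    observed years (the linear form is an assumption).
    Outside the observed range: the value nearest in time.
    """
    if not known:
        raise ValueError("need at least one observed year")
    observed = sorted(known)
    first, last = observed[0], observed[-1]
    filled = []
    for year in years:
        if year <= first:
            filled.append(known[first])
        elif year >= last:
            filled.append(known[last])
        else:
            lo = max(y for y in observed if y <= year)
            hi = min(y for y in observed if y >= year)
            if lo == hi:  # the year itself was observed
                filled.append(known[lo])
            else:
                weight = (year - lo) / (hi - lo)
                filled.append(known[lo] + weight * (known[hi] - known[lo]))
    return filled
```

For example, a feature observed only in 2004 and 2006 is held at its 2004 value for 2002 and 2003, takes the midpoint in 2005, and is held at its 2006 value from 2007 onward.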

Here is an example of the 5 km x 5 km square grid vector overlaid on the Oak Ridge National Laboratory LandScan global population distribution data. The LandScan data provide global population estimates on an annual basis.

I extract the spatio-temporal data for Guatemala, Honduras and El Salvador. All data extraction was done using QGIS, an open-source GIS application. I extract data on a 5 km x 5 km grid by creating a vector overlay and taking the intersection between this overlay and the various vector data. For raster data, I use the zonal statistics tool in QGIS to calculate mean values at each grid point. This grid extraction method yields 8600 individual points per year. There are a total of 59 input features generated from the subset of the GIS data which I was able to access. I generate the target labels from the population data by calculating the year-to-year, spatially resolved relative population change.

Population of 2017 relative to population of 2016. The blue and red points show regions of population increase and decrease, respectively.

The figure above shows an example of spatial distribution of the targets for 2017, i.e. the population change the model predicts. The points are the population in 2017 relative to the population in 2016. The regions which are shown in red are areas in which the population has declined. The areas in blue are regions of population increase. I have normalized the population data of each year to remove the average long term growth observed across all countries. Thus, these trends in the figure above are interpreted as migration patterns. The spatial coherence indicates that these trends are not due to noise, though they may be subject to unknown systematic uncertainties associated with LandScan data. I proceed under the assumption that any systematic uncertainties do not significantly impact the analysis. I produce similar maps quantifying the year-to-year variation for the full temporal span of the data set.
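A minimal sketch of how such a target could be computed is given below. The exact normalization is not spelled out in this post, so the code makes a loud assumption: the average growth is removed by dividing out the ratio of total population between the two years. The function name is hypothetical.

```python
from typing import List, Sequence


def relative_change(prev: Sequence[float], curr: Sequence[float]) -> List[float]:
    """Relative population change per grid cell with overall growth removed.

    Assumption: overall growth is the ratio of summed population between
    the two years; a cell that merely keeps pace with it scores 0.
    """
    if len(prev) != len(curr):
        raise ValueError("both years must cover the same grid cells")
    total_prev = sum(prev)
    if total_prev <= 0:
        raise ValueError("previous year has no population")
    growth = sum(curr) / total_prev
    changes = []
    for p, c in zip(prev, curr):
        if p <= 0:
            changes.append(float("nan"))  # undefined for an empty cell
        else:
            changes.append(c / (p * growth) - 1.0)
    return changes
```

Under this convention positive values correspond to the blue regions of the map (population increase beyond the average trend) and negative values to the red regions.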

Heuristic diagram illustrating feature vector construction. The columns and rows indicate the years that are modeled. The blue squares show the target data and the red squares show the feature data. For each year, I use the previous year's data as feature inputs to predict the labels, which are the relative change in the population. For example, I predict changes in the population of 2003 relative to 2002 using 2002 data as feature inputs. Similarly, I predict the population of 2017 relative to 2016 using 2016 data as feature inputs.

The diagram above shows the modeling scheme. To predict population change for year N, I use year N-1 as inputs. Given this modeling scheme, I predict population changes for each year from 2003 to 2017. Thus, I have 8600 grid points per year multiplied by 15 years of data as my parent sample. The goal of this modeling effort is to identify factors contributing to migration. From my parent sample, I select samples where the relative population increased or decreased by more than 25%. This selection results in 33,733 samples (about a quarter of the parent sample).
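The sample construction can be sketched in a few lines. This is an illustration, not the repository code: the names are hypothetical, the 25% cut is assumed to apply to the magnitude of the relative change, and the label convention (1 for increase, 0 for decrease) is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

# (target year, features taken from the previous year, binary label)
Sample = Tuple[int, List[float], int]


def build_samples(
    features: Dict[int, Sequence[Sequence[float]]],
    changes: Dict[int, Sequence[float]],
    threshold: float = 0.25,
) -> List[Sample]:
    """Pair year N-1 features with the year N population change label.

    features[year][cell] is the feature vector of a grid cell;
    changes[year][cell] is that cell's change relative to year - 1.
    Cells whose change magnitude is at or below the threshold are dropped.
    """
    samples = []
    for year in sorted(changes):
        previous = features.get(year - 1)
        if previous is None:
            continue  # no inputs available for this target year
        for cell, change in enumerate(changes[year]):
            if abs(change) > threshold:  # NaN fails this test and is skipped
                label = 1 if change > 0 else 0
                samples.append((year, list(previous[cell]), label))
    return samples


# Size of the parent sample and the selected fraction quoted in the text.
parent_size = 8600 * 15
selected_fraction = 33733 / parent_size
```

With 8600 grid points and 15 target years the parent sample holds 129,000 rows, so the 33,733 selected samples are roughly 26% of it.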

I am attempting to model movement of people in space and time. Given the complexity and difficulty of the problem, I framed the problem as a simple classification. Technical details regarding the setup of the problem and model choices are provided on the GitHub repository for this project. In summary, I do a binary classification where the two classes are whether population increased or decreased.

For the sake of posterity, I note that the data extraction was challenging and required significant effort. The preliminary results presented here are subject to uncertainties. However, the data extraction is robust and provides a solid foundation for any future modeling efforts.