Predicting migration flow through Europe
Earlier this year the main route that migrants and refugees were taking to central Europe was through the Balkan countries, from Greece up to Germany. UNHCR has been collecting data about daily arrivals in the Balkan countries since the beginning of September 2015. We created a situational awareness dashboard to help operations understand the context they are working in. It combined data sets on daily arrivals and border statuses with any important news articles.
Red Cross National Societies on the route asked whether it would be possible to estimate daily arrivals through the Balkans so that they could prepare their capacity efficiently. Previously there had been instances of a lot of people volunteering on a particular day and then only a small number of migrants arriving.
From observing the situational awareness map we could see trends in the arrival numbers, with bulges travelling up through the region, and we were interested to see whether these could be modelled and forecasts made.
We looked to build a regression model whose explanatory variables were the historical arrival data for the countries earlier on the route ('upstream'). For example, for a three day forecast of Macedonia we would only look at arrivals into Greece three days prior.
Due to the large number of independent variables we were modelling against, we were worried about overfitting the model, but found that lasso (least absolute shrinkage and selection operator), a penalised regression technique, would remove redundant variables.
We ran this over the historical data and it produced some quite promising results, as seen below. This is a model trained on December to mid-January data, forecasting data for mid-January to mid-February.
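As a rough illustration of the idea, here is a minimal sketch in plain Python of building lagged upstream features and fitting a lasso by coordinate descent. It is not the code we used: the function names, the number of lags, the penalty value lam and the synthetic figures are all assumptions made up for the example.

```python
import random


def lagged_rows(upstream, target, horizon, n_lags):
    """Pair target[t] with upstream values from `horizon` or more days earlier."""
    X, y = [], []
    for t in range(horizon + n_lags - 1, len(target)):
        X.append([upstream[t - horizon - k] for k in range(n_lags)])
        y.append(target[t])
    return X, y


def soft_threshold(value, lam):
    """Shrink `value` towards zero by `lam`; this is what zeroes out weak predictors."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_lasso(X, y, lam, n_sweeps=100):
    """Minimise (1/2n) * sum((y - b0 - X.w)^2) + lam * sum(|w|) by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre the data so the intercept can be recovered at the end.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual with all weights at zero
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in Xc]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            resid = [r - c * delta for c, r in zip(col, resid)]
            w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return intercept, w


# Synthetic demo (made-up numbers, in thousands of people per day):
# "Macedonia" arrivals are 80% of "Greece" arrivals three days earlier.
rng = random.Random(0)
greece = [rng.uniform(0, 10) for _ in range(60)]
macedonia = [0.0] * 3 + [0.8 * greece[t - 3] for t in range(3, 60)]

X, y = lagged_rows(greece, macedonia, horizon=3, n_lags=3)
intercept, weights = fit_lasso(X, y, lam=0.1)
print(weights)  # the weight on the 3-day lag should dominate
```

The penalty is what makes the difference from ordinary regression here: weights that do not earn their keep are pushed to exactly zero, so the redundant lags drop out of the model.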
The model was most reliable predicting countries towards the end of the journey and for the shorter forecasts. A snapshot of the website for these dates can be found here.
Interestingly, the models gave indications of the average travel time through the Balkans route at each stage. Earlier this year it was taking around 3 days to reach Macedonia from the Greek Islands, and migrants would then reach Austria in the following 2 to 3 days.
We created a workflow where we would create new models once a month. These were then placed inside the website, which would access the latest HXLated daily arrivals information, run the models on the latest data and display the forecasts to the user. This meant that, to produce the latest forecasts, we just had to make sure the daily arrivals Google spreadsheet was kept up to date.
We were aware this model had some weaknesses, and these were highlighted more quickly than we had hoped. The models were based on the last 40 days of arrival data in an attempt to capture the current context. If the context changed, for example through border closures, then we would have to wait another 40 days to get a model that captured the current political climate. Sure enough, the borders were closed after a couple of months, and modelling is currently yielding poor results, especially due to the lower numbers passing through.
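One plausible way to read a travel time out of a fitted model, sketched below, is to look at which lag of the upstream country keeps the largest weight after the lasso has removed the rest. The function name and the example coefficients are hypothetical, and this is an assumption about how such a reading could be made rather than a record of exactly what we did.

```python
def dominant_lag(coefs_by_lag):
    """Return the lag (in days) with the largest absolute coefficient.

    `coefs_by_lag` maps a lag in days to the weight the model gave that lag.
    Returns None when every coefficient has been shrunk to zero.
    """
    nonzero = {lag: c for lag, c in coefs_by_lag.items() if c != 0.0}
    if not nonzero:
        return None
    return max(nonzero, key=lambda lag: abs(nonzero[lag]))


# Made-up weights for Greek arrivals in a hypothetical Macedonia model.
example = {1: 0.0, 2: 0.1, 3: 0.7, 4: 0.05}
print(dominant_lag(example))
```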
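The sketch below shows roughly what the forecasting step of such a workflow could look like using only the Python standard library: read an HXL-tagged CSV export, pull out one country's series, and apply a stored model to the most recent days. The column hashtags, the sample rows and the model weights are all hypothetical; the real spreadsheet layout and models may differ.

```python
import csv
import io


def read_hxl(text):
    """Parse HXL-tagged CSV text into a list of dicts keyed by hashtag."""
    rows = list(csv.reader(io.StringIO(text)))
    for i, row in enumerate(rows):
        cells = [c.strip() for c in row]
        # The HXL hashtag row is the first row whose non-empty cells all start with '#'.
        if any(cells) and all(c.startswith("#") for c in cells if c):
            return [dict(zip(cells, r)) for r in rows[i + 1:] if any(r)]
    raise ValueError("no HXL hashtag row found")


def forecast_next(model, series):
    """Apply a stored (intercept, weights) model to the latest observations.

    weights[k] multiplies the value observed k days before the latest one.
    """
    intercept, weights = model
    if len(series) < len(weights):
        raise ValueError("not enough history for this model")
    return intercept + sum(w * series[-1 - k] for k, w in enumerate(weights))


# Hypothetical export: the hashtags and figures are made up for illustration.
SAMPLE = """Date,Country,Arrivals
#date,#country+name,#affected+arrivals
2016-01-10,Greece,3000
2016-01-11,Greece,2500
2016-01-12,Greece,4000
"""

records = read_hxl(SAMPLE)
greece_series = [
    int(r["#affected+arrivals"]) for r in records if r["#country+name"] == "Greece"
]
stored_model = (10.0, [0.5, 0.25])  # hypothetical intercept and lag weights
print(forecast_next(stored_model, greece_series))
```

Because the code looks columns up by hashtag rather than by header text or position, the spreadsheet headers can be renamed or reordered without breaking the forecasts.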
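To make the 40 day lag concrete, here is a small sketch of a rolling training window and of how much of that window reflects the new situation after a change such as a border closure. The function names are made up for the example; only the 40 day window comes from our setup.

```python
def training_window(rows, window_days=40):
    """Keep only the most recent `window_days` daily rows for model fitting."""
    return rows[-window_days:]


def share_from_new_context(days_since_change, window_days=40):
    """Fraction of the training window observed after the context changed."""
    return min(max(days_since_change, 0), window_days) / window_days


for days in (0, 10, 40):
    print(days, share_from_new_context(days))
```

Ten days after a closure, three quarters of the training data still describes the old situation, which is why the models were slow to catch up.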
In the end, I don’t think the models had any impact on Red Cross operations, as they became redundant very quickly after their introduction, but I think the exercise shows value in a few areas:
That in some cases population movements can be modelled by penalised regression.
The value in UNHCR collecting daily arrivals figures.
That a workflow making use of HXL can be set up to produce forecast figures without much effort to maintain.
Timeliness of data is important: often we would only have the data in HXL form (a manual step by our team) 24 hours late, making the 1 day forecast redundant. UNHCR releasing their data in HXL on HDX would speed this up.
We have some ideas on how to improve this in future, including exploring other penalised regression algorithms that could yield better results and looking at how weather patterns affect arrivals into Greece.