Using Infrastructure to Forecast the Spread of Disease in Mexico

Dimitri Kisten
8 min read · Mar 13, 2020

Using Infrastructure to Forecast the Spread of Disease (project found here)

Having formerly worked in healthcare, I was naturally inclined to pursue a project involving public health. I had never done such a project before, so I learned a lot about the challenges of public health data science, gained insight into how to improve my model, and found other areas of public health that I would like to explore. This article is the first of a two-part series. It outlines my thought process before starting the project, my methodology, and my findings. The second part (found here) covers the challenges and lessons learned about data science in conjunction with public health.

Idea Generation

The idea for my project stemmed from the observation that disease spreads at different rates in different countries. My original idea was to build a model encompassing the whole world, but I realized this would likely produce overgeneralized results that offer no insight: we already know that rich countries have better healthcare, and that poorer countries have worse healthcare and therefore higher rates of disease. Instead, I decided to explore the relationship between the number of cases of a given disease and a country's infrastructure. I wanted to see whether any novel relationships could be observed that would show where spending should go to limit the spread of disease, and to predict the magnitude of an outbreak given certain levels of infrastructure. For example, if country X increases spending on water and sanitation by some amount Y, how does this affect the spread of disease? How does the number of cases decrease or increase for every $1 million spent on pharmaceuticals? On transportation? While the answer may be obvious for first world countries, countries whose infrastructure changes erratically from year to year with no clear direction for investment, particularly politically unstable countries with limited resources, need to concentrate their resources in areas that will directly benefit the health of their population.

The Data

For this project I gathered infrastructure and disease data from two organizations: the World Health Organization (WHO) and the Organisation for Economic Co-operation and Development (OECD). The data covered spending on both healthcare and non-healthcare infrastructure; the variables I selected from it are outlined below. I also gathered data from these sources on several communicable diseases, including Mumps, Measles, Rubella, Pertussis, Malaria, Zika, Tuberculosis, Hepatitis, and Cholera. For most countries the available data spanned roughly 20 years, from 1999 to 2019.

Country Selection

In order to find relationships between infrastructure inputs and disease rates, I needed a country that exhibited significant year-to-year variation or volatility both in its infrastructure and in the number of cases of a disease. If I studied countries with relatively stable infrastructure and stable case counts, there would be nothing to discover. This excluded many first world countries with stable infrastructure, as well as extremely poor countries that do not have the resources to change significantly over time. To compare volatility between countries I used the Coefficient of Variation (COV) to gauge how ‘unstable’ a country was with regard to infrastructure or disease rates. I used this metric because it is unitless and therefore comparable across datasets, unlike metrics such as variance, which depend on units and only make sense within a single dataset. Figure 1 shows the calculation for COV, which is simply the standard deviation divided by the mean.

Figure 1
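As a small illustration of the formula in Figure 1, the COV can be computed in a few lines of Python. This is a sketch rather than the code used in the project: the yearly figures are invented for illustration, and the choice of the population standard deviation (pstdev) is an assumption, since the article does not say which variant was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the standard deviation divided by the mean (a unitless ratio)."""
    avg = mean(values)
    if avg == 0:
        raise ValueError("COV is undefined when the mean is zero")
    # Assumption: population standard deviation; a sample version would use stdev().
    return pstdev(values) / avg


# Hypothetical yearly figures, NOT taken from the WHO or OECD data.
stable = [100, 102, 98, 101, 99]
volatile = [100, 160, 40, 180, 20]

cov_stable = coefficient_of_variation(stable)
cov_volatile = coefficient_of_variation(volatile)

# Changing the unit (e.g. dollars to thousandths of a dollar) leaves the COV unchanged.
cov_rescaled = coefficient_of_variation([v * 1000 for v in volatile])

print(f"stable: {cov_stable:.3f}  volatile: {cov_volatile:.3f}")
```

Both series have the same mean, but the volatile one has a much larger COV, and rescaling the units does not change it, which is what makes the metric usable for comparing countries and indicators.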

By examining the COVs for each country I was able to provide some validation for my original hypothesis. As shown in Figure 2, many of the countries with the highest COVs are developing countries or countries that are politically unstable. Again, a high Coefficient of Variation means that a country's infrastructure is changing more relative to other countries.

Figure 2

In addition to changing infrastructure, I needed a country with volatile rates for a given disease and a significant number of cases. A country could have 10 cases one year and 0 the next and still have a high COV, so my analysis also took into account the sum of cases over the 20-year period.
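The two-part screen described here, high volatility plus a meaningful number of cases, might look something like the sketch below. The function name, the thresholds, and the case counts are all hypothetical; the article does not state which cutoffs were actually used.

```python
from statistics import pstdev


def screen_disease_series(yearly_cases, min_cov, min_total_cases):
    """Summarise a case series and flag whether it is both volatile and sizeable."""
    total = sum(yearly_cases)
    avg = total / len(yearly_cases)
    cov = pstdev(yearly_cases) / avg if avg else 0.0
    return {
        "cov": cov,
        "total": total,
        "keep": cov >= min_cov and total >= min_total_cases,
    }


# Hypothetical series: the first mirrors the "10 cases, then 0" example from the text.
tiny_outbreak = [10, 0]
large_outbreak = [5000, 20000, 1000, 15000]

# Hypothetical cutoffs chosen only for illustration.
tiny = screen_disease_series(tiny_outbreak, min_cov=0.5, min_total_cases=1000)
large = screen_disease_series(large_outbreak, min_cov=0.5, min_total_cases=1000)

print(tiny)
print(large)
```

The tiny series has a high COV but is rejected because it has only 10 cases in total, while the large series passes both checks.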

Using those metrics I selected Mexico for my analysis and modeling. Mexico had an infrastructure COV of 26, which was relatively high compared to the worldwide range of 2 to 33. It had the highest COV in Medical Technology worldwide, the third highest COV for number of hospitals, and a relatively high COV of 1.3 for healthcare spending against a worldwide maximum of 2.3. The diseases I included in the analysis were Mumps and Pertussis. As shown in Figure 3 below, which lists the number of cases of each disease in Mexico, Mumps and Pertussis vary a lot year over year, making them good candidates for my study.

Figure 3

Selecting the Features for My Model

Selecting features for my model was part science and part discretion. I looked at infrastructure metrics that were highly correlated with my target, Total Cases (Mumps & Pertussis), and also at features whose effects I personally wanted to study. Additionally, I removed features that were logically collinear with each other. For example, there were many metrics for energy consumption, such as gas and oil, which are obviously correlated, so I selected only one metric per category of infrastructure. The features I included in my model are listed below.

  • ICT goods exports (% of total goods exports)
  • Individuals using the Internet (% of population)
  • Fixed telephone subscriptions (per 100 people)
  • Air transport, freight (million ton-km)
  • Industrial design applications, nonresident, by count
  • Public private partnerships investment in water and sanitation (current US dollars)
  • Public private partnerships investment in transport (current US dollars)
  • Electricity production from natural gas sources (% of total)
  • ICT service exports (% of service exports, BoP)
  • Healthcare Expenditure Percent GDP
  • Pharmaceutical Spending US Dollars / Capita
  • CT Scan Device Counts
  • PET Scan Device Counts
  • Doctors per 1000 People
  • Nurses per 1000 People
  • Hospital Count
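The "science" half of the selection described above, ranking metrics by their correlation with total cases and keeping one metric per infrastructure category, could be sketched as follows. The feature names, categories, and numbers are invented for illustration, and the rule of keeping the most strongly correlated metric in each category is an assumption; the actual selection also involved personal judgment.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pick_one_per_category(features, categories, target):
    """For each category keep the feature with the largest |correlation| to the target."""
    best = {}
    for name, series in features.items():
        r = pearson(series, target)
        category = categories[name]
        if category not in best or abs(r) > abs(best[category][1]):
            best[category] = (name, r)
    return {category: name for category, (name, _) in best.items()}


# Hypothetical data, NOT the project's dataset.
total_cases = [10, 20, 30, 40, 50]
features = {
    "gas_consumption": [1, 2, 3, 4, 5],
    "oil_consumption": [1, 3, 2, 5, 4],
    "internet_users": [5, 4, 3, 2, 1],
}
categories = {
    "gas_consumption": "energy",
    "oil_consumption": "energy",
    "internet_users": "communication",
}

selected = pick_one_per_category(features, categories, total_cases)
print(selected)
```

Using the absolute value of the correlation keeps strongly negative relationships in play as well, since a metric that falls as cases rise is just as informative as one that rises with them.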

Modeling

Because I was forecasting and including exogenous variables to predict my target, the number of total cases, I decided to fit my data to an Autoregressive Integrated Moving Average with Exogenous variables (ARIMAX) time series model. I expected accurate predictions to be difficult given how my study was set up: I was looking for relationships in infrastructure and outbreak data that appeared to be random. Because the data was limited and volatile, the model was not able to predict well. Additionally, I cannot precisely gauge how well the model performs because I have such a limited dataset (20 data points) to train and test on. My baseline model, shown below in Figure 4, which predicted the mean of total cases (10175), came closer than my model's predictions, which had an RMSE of 23867. I was also forced to difference the series 3 times per the Augmented Dickey-Fuller test, which is not ideal. I will likely use a Vector Autoregression (VAR) model in the future.

Figure 4
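Two of the mechanics mentioned in this section, scoring a mean-value baseline with RMSE and differencing a series, can be shown with a short sketch. The numbers are invented and are not the project's data, the "model" predictions are placeholders rather than ARIMAX output, and the stationarity test itself (the Augmented Dickey-Fuller test) is not reproduced here.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mean_baseline(train, horizon):
    """Predict the training mean for every future step."""
    avg = sum(train) / len(train)
    return [avg] * horizon


def difference(series, order=1):
    """Apply first differencing `order` times; each pass drops one observation."""
    for _ in range(order):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


# Hypothetical case counts split into a training and a test period.
train, test = [100, 300, 200, 400], [250, 350]

baseline_rmse = rmse(test, mean_baseline(train, len(test)))

# Placeholder forecasts standing in for a poorly fitting model.
model_rmse = rmse(test, [400, 100])

# Differencing three times shortens a 5-point series to 2 points.
thrice_differenced = difference([1, 4, 9, 16, 25], order=3)

print(f"baseline RMSE: {baseline_rmse:.1f}  model RMSE: {model_rmse:.1f}")
print(thrice_differenced)
```

The last step shows why repeated differencing is costly on a short series: each pass removes one observation, so differencing 3 times leaves only 17 of the original 20 data points.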

Although I was not able to build a robust model, I was able to assess the original hypothesis by looking at the coefficients of each feature, shown below in Figure 5.

Figure 5

Analysis & Findings

The most important finding from this project was that the relationships between infrastructure and the spread of disease were not as expected. As shown by the coefficients above and by Figures 6, 7, and 8 below, better infrastructure did not go along with a more limited spread of disease. Figure 6 shows that the number of cases increased as access to the internet increased. Figure 7 shows the same relationship for the number of hospitals, and Figure 8 shows it for investment in sanitation, water, and transportation.

My earlier hypothesis was that things like the internet, which increases access to information about diagnosis, prevention, and awareness, or wider access to telephones and electricity, might help ease the burden of disease. This was not the case. These ‘good’ things may actually be bad in terms of communicable disease because they can make society more close-knit and more social, whereas a population that is more spread apart would be easier to quarantine. This is an interesting theory to look into.

Additionally, I noticed that as healthcare infrastructure improved, the number of cases rose. This leads me to hypothesize that my ideas about infrastructure are shaped by the perspective of American infrastructure, which has an abundance of resources relative to the rest of the world. First world countries likely have the resources to operate in a more prepared manner: even without outbreaks, America spends heavily on research, pharmaceuticals, medicine, etc. A country like Mexico, by contrast, may only ramp up healthcare spending once outbreaks grow, which is a more reactive approach. Therefore the hypothesized causal relationship may actually run backwards.

Figure 6

Figure 7

Figure 8
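One rough way to probe the idea that spending reacts to outbreaks, rather than preventing them, would be to compare lagged correlations in both directions. This was not part of the original analysis: the two series below are invented so that spending follows cases by exactly one year, and even a strong lagged correlation would only be suggestive, not proof of causation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical yearly series: spending in year t+1 equals cases in year t.
cases = [10, 50, 20, 80, 30, 60]
spending = [5, 10, 50, 20, 80, 30]

# "Reactive" direction: do this year's cases line up with NEXT year's spending?
reactive = pearson(cases[:-1], spending[1:])

# "Preventive" direction: does this year's spending line up with NEXT year's cases?
preventive = pearson(spending[:-1], cases[1:])

print(f"cases -> next-year spending: {reactive:.2f}")
print(f"spending -> next-year cases: {preventive:.2f}")
```

In this constructed example the reactive direction is perfectly correlated and the preventive direction is weaker. On real data with only 20 yearly observations, any such comparison would be noisy and should be read with caution.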

Next Steps

Moving forward, I would like to examine these relationships between infrastructure and disease in a more specific manner. Instead of using many different inputs, I would like to focus on a single category, for example transportation, energy, public sanitation, communication, or healthcare infrastructure alone. How a country operates in each of these areas differs greatly from country to country and cannot be generalized. How a country makes decisions, deploys a budget, and reports statistics all play an important role in how we direct our models, and these are things I would have to look deeper into. For those reasons modeling a single country can be tricky, let alone the entire world. Although my model and findings did not confirm my original theory, they did provide valuable data and insight into where I need to look next.
