5 Challenges in Public Health & Data Science

Mar 13, 2020

My thoughts going into my public health and infrastructure project (which I did a write-up on here and which can be found on my GitHub here) were, to say the least, naive. We’ve made tons of technological advances over the last two decades, and I believed gathering and handling data on infrastructure and disease would be very simple. Additionally, my assumptions about why my hypothesis might prove true ignored a lot of caveats that I only realized after performing my analysis. Nonetheless, going through the project taught me a lot about the aspects of my methodology and thesis to consider in future projects. Below is a list of things I learned while working on my public health project.

1. Data collection is still in its infant stage

The data I was looking for was not something I expected to be hard to find. It wasn’t very granular in nature: mostly counts and budget numbers. I wanted spending numbers on infrastructure, both healthcare-related and non-healthcare-related. Although data for the United States was fairly robust, the same cannot be said for many other countries, which have only recently started reporting their numbers. Many developing countries have only started publicly reporting infrastructure spending in the last 5–10 years, making it very hard to draw any conclusions from how they budget. This was one of the main problems I faced when modeling: the sheer lack of data, both in scope (number of countries available) and in volume. Most of the data, even for first world countries, only went back to 1999, making it hard to produce a viable model. My hunch is that much of this data exists but is not public. Many governments haven’t published it and are likely keeping it for internal use. While it may be possible to dig deeper and request the information through government agencies, I believe there should be a push to make government spending and infrastructure data available to everyone, for every country. I think there is huge potential for solving a country’s problems by examining how its government spending relates to the lives of its citizens.

In regard to healthcare, there is a similarly strong case for collecting more granular data. I was shocked at how much data was missing for diseases that have been around for so long. This raises the point that many developing countries do not have the means or infrastructure to collect data, as there is no centralized agency or methodology for doing so. It also raises a question, discussed further below: how accurate can our models be if data collection practices are not uniform across countries? With so many countries lacking the capability to collect data, they are excluded from meaningful data analysis that could potentially save lives.
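One way to make the scope and volume problem concrete before modeling is to count how many years each country actually reports. The following is only a minimal sketch: the rows are made-up placeholders rather than figures from my project, and the names `records`, `coverage_by_country`, `usable_countries`, and `min_years` are hypothetical.

```python
from collections import defaultdict

# Placeholder (country, year, value) rows, not real figures.
# None marks a year with no published number.
records = [
    ("USA", 1999, 10.0), ("USA", 2000, 11.0), ("USA", 2001, 12.0),
    ("MEX", 1999, None), ("MEX", 2000, 3.0), ("MEX", 2001, 3.5),
    ("XYZ", 2000, None), ("XYZ", 2001, 1.0),
]

def coverage_by_country(rows):
    """Count, per country, the years that have a published value."""
    counts = defaultdict(int)
    for country, _year, value in rows:
        counts[country] += 1 if value is not None else 0
    return dict(counts)

def usable_countries(rows, min_years):
    """Return the countries with at least `min_years` of published values."""
    coverage = coverage_by_country(rows)
    return sorted(c for c, n in coverage.items() if n >= min_years)

print(coverage_by_country(records))
print(usable_countries(records, min_years=2))
```

Running a check like this first shows how many countries survive a given minimum history, which is a quick way to see whether a multi-country model is realistic at all.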

2. Data from verified sources can still be questionable.

Another assumption going into this project was that data from reliable sources such as the WHO and OECD would contain the most accurate information I could find. However, this was not the case once I explored data for several diseases. I noticed that in Mexico there would be an outbreak of a disease, and then the following year there would be 0 reported cases. While this is common for diseases that have been eradicated, it is suspicious for diseases with tens of thousands of cases the year prior. It was even more unusual because years where data was not available were clearly labeled ‘Null’ or ‘Not available,’ so filling a year with 0 reported cases was somewhat fishy. It became stranger still when further research showed that some years with 0 reported cases actually had a few hundred to thousands of cases. This would make a significant difference in my model and in other studies that take this information as fact. It brings up multiple questions that need to be addressed when dealing with government data, or any data for that matter: How is the data collected? What constitutes a value or count for a certain metric? What are the process and requirements for submitting ‘official’ data? Why is data missing for diseases in recent years when there was data the year prior? Why is data filled with 0 if that was not the case? Ultimately, the quality of our data determines the quality and accuracy of our model, and these are things I plan on diving deeper into before modeling in future projects.
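A simple check along these lines could be run before modeling: flag any year that reports 0 cases immediately after a year with a large count, while leaving explicitly missing years alone. This is a rough sketch, not the check I actually ran. The case counts are invented, and `flag_suspicious_zeros` and its `threshold` are hypothetical names and values chosen only for illustration.

```python
def flag_suspicious_zeros(cases_by_year, threshold=1000):
    """Return years reporting 0 cases right after a year with >= threshold cases.

    `cases_by_year` maps year -> reported count, with None meaning the source
    labeled the year as missing (e.g. 'Null' or 'Not available').
    """
    flagged = []
    years = sorted(cases_by_year)
    for prev_year, year in zip(years, years[1:]):
        if year - prev_year != 1:
            continue  # only compare consecutive years
        prev_cases = cases_by_year[prev_year]
        cases = cases_by_year[year]
        if prev_cases is None or cases is None:
            continue  # explicit gaps are not the problem being checked here
        if cases == 0 and prev_cases >= threshold:
            flagged.append(year)
    return flagged

# Invented example series, for illustration only.
example = {2005: 12000, 2006: 0, 2007: None, 2008: 300, 2009: 0}
print(flag_suspicious_zeros(example))
```

A flagged year is not proof of an error, since a count can legitimately fall to zero. It is only a prompt to cross-check that year against another source before trusting it.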

3. Domain knowledge is king.

Although it may seem obvious, domain knowledge about the problem you are trying to solve is arguably more valuable than any coding or modeling you do (in my opinion). Many of my ideas were framed in general terms and ignored limitations I would have thought of had I done more research or had more domain knowledge on hand. For example, had I dug deeper into how governments in developing countries operate differently from the government of the United States, I would probably have taken a different approach. My initial idea was that by increasing spending on healthcare, governments would help control the magnitude of an outbreak. This idea rests on the assumption that Mexico’s government operates in a preparatory manner rather than a reactionary one. In first world countries, health is treated as a vital part of improving quality of life. For example, America will pour money into pharmaceuticals, research, and education regardless of whether disease cases are increasing. We are constantly looking to improve healthcare. In developing countries, the causal relationship seems to be different. Healthcare spending doesn’t always go up, but it seems to go up with the number of outbreaks, meaning it could be a response to an outbreak rather than preparation for one: it is reactionary. This is likely the case in many developing countries that do not have the resources to continually fund healthcare and instead wait until it is necessary. This is just a hypothesis based on the findings of my project, but thinking through relationships like that, the ins and outs of the subject we are dealing with, is critical for idea generation and for avoiding redundant endeavors.
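One rough way to probe the preparatory versus reactionary question is to compare two lagged correlations: cases in one year against spending in the next, and spending in one year against cases in the next. The sketch below assumes yearly series of equal length. The numbers are invented so that spending jumps the year after an outbreak, and `pearson` and `lagged_corr` are hypothetical helpers, not code from my project.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def lagged_corr(leader, follower, lag=1):
    """Correlation between leader[t] and follower[t + lag]."""
    return pearson(leader[:len(leader) - lag], follower[lag:])

# Invented yearly series, for illustration only.
cases = [100, 900, 200, 150, 1200, 300]
spending = [5, 5, 9, 6, 5, 11]

reactive = lagged_corr(cases, spending)      # do outbreaks precede spending?
preparatory = lagged_corr(spending, cases)   # does spending precede outbreaks?
print(f"cases -> next-year spending: {reactive:.2f}")
print(f"spending -> next-year cases: {preparatory:.2f}")
```

A strong positive value for the first number alongside a weak value for the second would be consistent with reactionary spending, but with so few years of data it remains a hint rather than evidence of causation.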

4. The catch-22 of data collection

Besides domain knowledge, there are other aspects to consider as we enter a world where data collection keeps increasing. One I did not consider was that the variable I was measuring depends on the variable I was using to predict it. This was the case for hospital counts: as the number of hospitals increases, so does the amount of reporting. So it could be that the number of disease cases has stayed relatively the same for years but is only being reported recently because of better access to healthcare, which opens a can of worms. As we now know from the COVID-19 pandemic, reporting cases is very tricky, and the numbers don’t paint an accurate picture, even in the United States. A lack of cases is more often than not related to the scope and speed of testing, both of which are lacking. Modeling becomes even harder when this is factored in. This is something to consider for all data science endeavors. How do we scale our data as data becomes more accessible and easier to collect? Is there an actual pattern in the data, or is it just a function of collecting more data? These questions are critical for moving forward with modeling, and they likely can’t be fully assessed now because our assumptions need future data to test against.
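To see how much of an apparent trend could come from reporting capacity alone, one option is to compare the change in raw case counts with the change in cases per hospital. This is a simplified sketch with invented numbers. Treating hospital count as a proxy for reporting capacity is itself an assumption, and `cases_per_facility` and `relative_change` are hypothetical names.

```python
def cases_per_facility(cases, hospitals):
    """Divide each year's reported cases by that year's hospital count."""
    return [c / h for c, h in zip(cases, hospitals)]

def relative_change(series):
    """Change from the first to the last value, as a fraction of the first."""
    return (series[-1] - series[0]) / series[0]

# Invented yearly series, for illustration only.
reported_cases = [100, 150, 200, 300]
hospital_counts = [10, 15, 20, 30]

raw_trend = relative_change(reported_cases)
adjusted_trend = relative_change(cases_per_facility(reported_cases, hospital_counts))
print(f"raw change in cases: {raw_trend:.0%}")
print(f"change in cases per hospital: {adjusted_trend:.0%}")
```

In this toy series, reported cases triple while cases per hospital stay flat, which is the pattern you would expect if the growth were driven by more reporting rather than more disease. Real data will rarely be this clean, and a proper adjustment would also need testing volume and population.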

5. Be specific in your ideas

The overarching lesson from everything discussed is to be specific in your ideas. With all the caveats of a broad hypothesis, it is almost impossible to make sure every hole is addressed accurately, which makes for a sloppy end product. There were so many holes when considering 15+ aspects of infrastructure, and further holes in collecting that data and the disease data. Creating very specific ideas allows you to manage and dive deeper into your data and create more focused and meaningful work. I originally wanted to produce a project that covered the whole world. I then narrowed it down to Mexico, and the diseases to pertussis and mumps, and even that ‘scaled down’ version still had lots of gaps and questions regarding the quality of the data. A better angle would be to analyze a single input of infrastructure, for example, examining the effects of transportation on the spread of disease. While it is easy to fall into the trap of trying to explain everything and do everything, this can often lead to an end product that is not as robust, accurate, or meaningful as a project with a more concentrated scope.

Despite the gaps in my thought process and modeling, this project showed me how to improve my idea generation process, how to perform better exploratory analysis, and how to better prepare my data before building a model. Public health & data science is a relatively young combination; however, I believe that thinking about how we handle the above challenges and working at them will be critical to producing meaningful work in this space, something I look forward to.

Written by Dimitri Kisten