Machine learning and data science tools are more accessible than ever. But along with learning the tools themselves, it’s just as important to learn how to effectively explore data and figure out its limitations before you feed the data into your modeling tools.
You’d be surprised how often people jump into building models without looking at the data. This is a mistake. To build effective models, you need to understand how the data was collected and where it has gaps. This is equally important whether you are working with a few hundred rows of data in an Excel spreadsheet or a terabyte-sized image classification data set.
Every real-world data set will be full of weirdness because data is collected in the real world and the real world is weird. This is definitely true of all the data we see reported daily during the current COVID-19 epidemic. It’s hard to collate numbers from all over the world on a daily basis and get them right, so the numbers you see reported exhibit all the same gaps and issues you should expect to see in any other real-world data set.
Let’s look at some of the COVID data being reported and see how we would go wrong if we tried to build a model on top of it without examining it first.
Lesson 1: How the Data Was Collected Will Create Strong Patterns in the Data
The international standard for COVID reporting is for each country to report the number of deaths that occur in hospitals on a daily basis. This makes it possible to compare how the disease is impacting different countries.
Let’s take a look at the daily numbers reported by the United Kingdom:
Notice that the deaths reported follow a perfect weekly cycle. They drop significantly at the beginning of each week, on Sunday and Monday. This is a really interesting finding that might have huge implications in a model. Maybe something is different about staffing, supplies or treatment on those days, leading to different outcomes?
The problem is that this weekly cycle is fake. It’s an artifact of how the data is collected and reported.
Once a day, each medical facility reports its total number of deaths to a central authority. The overall rise in deaths reported by the UK is the sum of those numbers minus yesterday’s sum.
This causes two important side effects:
- The sum for a single day can be (and usually is) incomplete. If any medical facility fails to report a number in time or under-reports, those deaths will be missing from the overall UK total and will eventually get lumped into a future day’s total when that facility catches up.
- There is a 1-day lag between each facility reporting and the UK-wide sums being reported to the public.
The explanation for the weekly cycle is simple. Hospitals don’t all have full staffing on weekends, so they don’t have the bandwidth to perfectly report their numbers in time. Slow reporting causes a drop over the weekend and then a corresponding rise after the weekend. And because of the one-day lag in reporting, that shows up in the data as a drop on Sunday and Monday instead of on Saturday and Sunday.
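To make the mechanics concrete, here is a minimal Python sketch of the bookkeeping described above. The cumulative totals are made-up numbers, not real UK figures, and the function names are only illustrative. The data assumes a steady 100 deaths a day where some weekend deaths are reported late and show up as a dip on Sunday and Monday followed by a catch-up on Tuesday and Wednesday.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def daily_increases(cumulative):
    """Turn running totals into day-over-day increases (today's sum minus yesterday's)."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]

def mean_by_weekday(start, values):
    """Average the values per day of week; `start` is the date of values[0]."""
    buckets = defaultdict(list)
    for offset, value in enumerate(values):
        day = start + timedelta(days=offset)
        buckets[WEEKDAYS[day.weekday()]].append(value)
    return {name: sum(v) / len(v) for name, v in buckets.items()}

def rolling_mean(values, window=7):
    """Trailing average over full windows only."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Made-up cumulative totals: Saturday April 4 through Saturday April 18, 2020.
cumulative = [1000, 1060, 1120, 1260, 1400, 1500, 1600, 1700,
              1760, 1820, 1960, 2100, 2200, 2300, 2400]

increases = daily_increases(cumulative)                  # first entry is Sunday April 5
by_day = mean_by_weekday(date(2020, 4, 5), increases)    # Sun and Mon average 60.0, Tue and Wed 140.0
smoothed = rolling_mean(increases)                       # every value is 100.0
```

Grouping by weekday exposes the cycle, and a 7-day window spans exactly one full cycle, so the late reports and the catch-up cancel out in the average. That is one common way to look past a reporting artifact like this one, though it will not recover which day each death actually occurred.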
This is a common issue with data sets — how they are collected can create patterns in the data that are even stronger than the data itself. For example, many freely-available image data sets are created by grad students working on their PhDs. So if you grab a random data set off the web with images of cars, you are probably going to get a lot of pictures of compact cars in campus parking lots and not many pictures of large trucks. But in the US, pickup trucks outsell cars almost 3-to-1!
Lesson 2: Investigate Outliers
Data sets will almost always have outlier values (points significantly outside the range of the rest of the data), but you don’t always want to include them in your analysis. This is because outliers can be the result of a simple typo or the result of an extraordinary event happening. It’s important to look at outliers to understand if you should include or exclude them in your analysis.
Here’s the number of new COVID cases China reported each day, according to worldometers.info:
There is a huge outlier on February 12 where they report 14,108 new cases of the disease. This daily increase is several times larger than the number of cases reported on any other day.
If you blindly built a model from this data, that outlier would throw everything off. Conversely, if you assume that the outlier represents a true event, you might be misled into thinking that there is something special about February 12th that caused an increase in infections.
The real reason for the jump is that China changed its reporting methodology on February 12th. Before that date, China was only reporting cases of the disease confirmed by an RNA-based virus test. But due to testing bottlenecks, doctors had also been screening patients for COVID using chest X-rays to look for tell-tale lung symptoms. On February 12, China back-reported cases that had only been confirmed via X-ray, causing the huge jump in reported cases. Those cases didn’t all happen on February 12. That’s just when they were added to the count.
The explanation is easy enough to find if you look for it. Armed with that information, you can decide how to treat that outlier. But you would never know if you didn’t look at the data carefully before you started modeling.
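A quick way to surface points worth investigating is to flag values that sit far from the median. The sketch below is one simple approach, using the median absolute deviation (MAD), and not the only reasonable one. Apart from the 14,108 spike mentioned above, the daily counts are invented placeholders, and the function name and threshold are assumptions chosen for illustration.

```python
from statistics import median

def flag_outliers(values, threshold=5.0):
    """Return the indices of values more than `threshold` MADs away from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - center) / mad > threshold]

# Placeholder daily counts; only 14108 comes from the article.
new_cases = [2000, 2500, 3000, 2800, 2600, 14108, 2400, 2100, 1900]

suspicious_days = flag_outliers(new_cases)  # [5], the position of the spike
```

The code only tells you where to look. Whether to keep, drop, or redistribute a flagged value still depends on finding out why it happened.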
Lesson 3: Normalize Geographic Data
One basic tip that people forget all the time is that data collected by geographic region almost always makes more sense when you normalize it by population or some other representative factor. After all, 300 cases of a disease are a much bigger deal in a village with 500 people than in a city of 8 million.
For example, here’s a map shaded by the number of COVID cases in each London Borough as of April 8, 2020, using the government-provided statistics:
The problem is that different boroughs have different populations. When you color areas on a map using only counts, you almost always end up recreating a population map.
On this map, Croydon at the very south end of the city is the same color as Southwark in the center of the city. But Croydon has 20% more people than Southwark and covers a larger area. If both areas have the same number of cases, it doesn’t make sense to say that both areas are equally impacted since the rate of infection in Croydon would be lower.
The solution is to normalize the map by another factor, like population. Simply divide the number of cases in each borough by the population of that borough to get an occurrence rate. Using occurrence rate, you’ll get a more understandable map that estimates which areas are most heavily affected:
When you divide by population, you can see that Croydon has a medium occurrence rate while Southwark by the river is one of the hardest-hit areas.
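The normalization itself is a single division. Here is a minimal sketch with hypothetical figures: two unnamed boroughs with the same case count, one with 20% more residents than the other, plus the village and city example from earlier. None of these numbers are real borough statistics.

```python
def cases_per_100k(cases, population):
    """Convert raw case counts into cases per 100,000 residents for each area."""
    return {area: 100_000 * cases[area] / population[area] for area in cases}

# Hypothetical values for illustration only.
cases = {"Borough A": 500, "Borough B": 500, "Village": 300, "City": 300}
population = {
    "Borough A": 300_000,
    "Borough B": 360_000,   # 20% more people than Borough A
    "Village": 500,
    "City": 8_000_000,
}

rates = cases_per_100k(cases, population)
for area, rate in rates.items():
    print(f"{area}: {rate:,.1f} per 100k")
```

With identical counts, Borough A comes out at about 166.7 cases per 100,000 and Borough B at about 138.9, so a map shaded by rate would color them differently even though a map shaded by count would not.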
This effect is even stronger when looking at national-level data in countries like the US where the population is very unevenly distributed. In the US, almost everyone lives near the east coast, the west coast or in Texas. Most of the rest of the country is lightly populated in comparison. If you draw a map of the US without normalizing the data, you’ll probably just end up drawing a map of where people live.
Lesson 4: Treat Your Surprising Results with Suspicion and Quadruple Check Them
No matter how hard you work to understand your data and build an accurate model, there are an infinite number of ways that your model can go wrong by accident. So if you feed data into your model and get an amazing or unexpected result, it’s worth being extra skeptical and walking through that case in detail to see if you missed anything.
One of the models informing the US COVID response is the model created by IHME. This model predicts the peak of the COVID epidemic and its total demand on the healthcare system. For the US, they have predicted roughly 60,000 total deaths:
They recently added predictions for the UK as well, though they are more preliminary. For the UK, they are initially predicting an even higher toll of 66,000 deaths:
This is a truly extraordinary prediction. The model is saying that the USA with a population of around 330 million people will have fewer deaths than the UK with a population of about 66 million (1/5th!). With such a big difference, it seems reasonable to be skeptical until we understand the reasoning.
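A back-of-the-envelope check makes the gap easier to see. This sketch uses only the rounded figures quoted above (roughly 60,000 and 66,000 predicted deaths, and populations of about 330 million and 66 million); the function name is just for illustration.

```python
def deaths_per_100k(deaths, population):
    """Predicted deaths per 100,000 residents."""
    return 100_000 * deaths / population

us = deaths_per_100k(60_000, 330_000_000)
uk = deaths_per_100k(66_000, 66_000_000)

print(f"US: {us:.1f}, UK: {uk:.1f}, ratio: {uk / us:.1f}")
# US: 18.2, UK: 100.0, ratio: 5.5
```

On a per-capita basis, the model is predicting a death rate in the UK about five and a half times the US rate. A difference that large is possible, but it is exactly the kind of result that deserves a second look.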
This skepticism has nothing to do with the skill of the team building the model or the quality of their work. Predicting rare events with certainty is incredibly difficult. Models help us get our heads around how different variables might drive outcomes, but they are just models. They aren’t simulations that we blindly trust. So whenever you see an extreme prediction, you should try to understand why it happened.
The first couple of days after this new model was published, the real numbers reported by the UK were lower than the model’s lowest predicted range. And indeed a couple of days later, IHME refreshed this graph with a much wider confidence band representing a much more uncertain prediction:
Even with the new confidence range, this model is still predicting numbers above other similar models created by other teams. Prof. Neil Ferguson of Imperial College London has been quoted in the press as saying that this model is flawed for the UK because it is incorrectly modeling hospital utilization and that his model predicts lower numbers. So now we have a case where two separate models are coming up with two totally different ranges of predictions.
There’s nothing wrong with that. Looking at other models is a good way to check your assumptions and see if there are factors you haven’t considered in your model. The worst thing we could do is blindly trust any particular model. No model is going to handle every corner case perfectly.
Treat your own models with the same skepticism — if you are surprised by the result, assume that you made a mistake until you understand why you got the surprising result. Don’t blindly trust your model!
Update April 11, 2020: A few hours after I published this story, IHME updated its UK model and radically lowered its prediction for UK deaths from 66,314 to 37,494 (about a 43% reduction). That’s still a bit higher than predictions from other models, but much closer. So the moral of the story holds true: be skeptical of unexpected results!
Update April 14, 2020: IHME further lowered its prediction for UK deaths from 37,494 to 23,791 (36% lower than its last prediction).