Survivorship Bias in Database Systems
Algorithmic bias in machine learning systems has been a hot topic recently, but statistical bias more generally is as old as statistics itself.
In this post, I’ll cover a specific kind of selection bias called Survivorship Bias and some of its causes in the context of database systems.
What is Survivorship Bias?
Survivorship Bias happens when your data is the output of a hidden filtering process, so the records you can see (the "survivors") are not representative of the full population.
For example, let’s say we are evaluating a weight loss program, and we see that the average weight of participants is 210 pounds before the program, and 170 pounds after the program. Looks like the program works!
But consider that not every participant completes the program. Some drop out halfway through, and the heavier someone is, the more likely they are to drop out.
So the apparent effectiveness of the program is greatly exaggerated by the fact that the people who actually complete the program were not that heavy to start with.
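This effect is easy to reproduce with a small simulation. The numbers and the dropout model below are purely hypothetical, chosen only to mirror the scenario above: the program takes 20 pounds off anyone who finishes it, but heavier participants are more likely to drop out before the final weigh-in.

```python
import random

random.seed(0)

TRUE_EFFECT = 20  # pounds lost by anyone who finishes (assumed)
start_weights = [random.gauss(210, 30) for _ in range(10_000)]

finishers = []
for w in start_weights:
    # Assumed model: dropout probability rises with starting weight.
    p_dropout = min(0.9, max(0.1, (w - 150) / 150))
    if random.random() > p_dropout:
        finishers.append(w)

avg_before_all = sum(start_weights) / len(start_weights)
avg_before_finishers = sum(finishers) / len(finishers)
avg_after_finishers = avg_before_finishers - TRUE_EFFECT

# Comparing "average before (everyone)" against "average after (finishers
# only)" exaggerates the effect, because finishers started out lighter.
apparent_loss = avg_before_all - avg_after_finishers
print(f"true effect: {TRUE_EFFECT} lb, apparent loss: {apparent_loss:.1f} lb")
```

The apparent loss always exceeds the true effect here, because the hidden filter (dropout) removes the heaviest starting weights from the "after" measurement.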
In modern database systems, there are a number of common patterns that result in Survivorship Bias. Let’s examine two of them.
The Switching Cost Filter
When switching to a new data management system, data from an existing system often needs to be moved into the new system.
This has a cost.
Inevitably, there will be some selection process, whether implicit or explicit, that filters the existing data on some criteria in order to minimize the cost of moving data into the new system.
This can be seen most clearly in systems that track data over time.
For example, in 1999, the website boxofficemojo.com came into existence and started tracking the production budgets and box office revenues of movies.
The site also added data for movies that predate the system, going as far back as the 1930s.
The graph below shows a 10-year rolling average of the return on investment (ROI) for movies that have production budget data listed on the website.
On the face of things, it looks like the movie business was extremely profitable from the 1930s to the 1970s, but took a downward turn starting in the late 70s, and then leveled off right around the time that boxofficemojo.com came into existence in 1999.
A more likely explanation is that, for each film released before 1999, data was only added if people still cared about that film in 1999.
In other words, implicit selection criteria filtered out the flops.
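A quick simulation shows how this kind of filter inflates historical ROI. The film population and the "survival" rule below are invented for illustration; the only assumption that matters is that flops are less likely to be entered into the database after the fact.

```python
import random

random.seed(1)

# Simulate 1,000 hypothetical pre-1999 films: (budget, revenue) in millions.
films = []
for _ in range(1000):
    budget = random.uniform(1, 50)
    roi = random.lognormvariate(0, 1.2)  # revenue/budget, right-skewed
    films.append((budget, budget * roi))

def avg_roi(rows):
    return sum(rev / budget for budget, rev in rows) / len(rows)

# Implicit selection criterion (assumed): a film is only "remembered" and
# added to the database if it at least doubled its budget.
survivors = [(b, r) for b, r in films if r >= 2 * b]

print(f"all films:  {avg_roi(films):.2f}x")
print(f"survivors:  {avg_roi(survivors):.2f}x")
```

Whatever the true distribution of returns, filtering out the flops guarantees that the surviving sample looks more profitable than the industry actually was.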
(In case you’re wondering, that abrupt jump in 2009 was caused by the low-budget hit film Paranormal Activity, which had a domestic total gross that was about 7194 times its production budget).
Hard Delete vs Soft Delete
In transactional databases used by software applications, there are two common approaches to the deletion of data. One is to actually delete the data (hard delete), and the other is to merely flag the data as “deleted” (soft delete).
The hard delete approach is the simpler of the two, but it is problematic for statistical analysis or machine learning applications.
For example, say you are analyzing data from a blogging platform database, and you want to find out what separates popular blog posts from the rest.
If bloggers are in the habit of deleting their least popular posts, then it will appear that blog posts overall are more popular than they actually are.
Using the soft deletion approach would provide a more realistic picture.
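Here is a minimal sketch of the soft-delete pattern using an in-memory SQLite database. The table name, columns, and view counts are illustrative, not from any real blogging platform; the key idea is that a `deleted_at` column lets the application hide rows while the analyst can still query the full population.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        views INTEGER NOT NULL,
        deleted_at TEXT          -- NULL means the post is still live
    )
""")
# Two popular posts kept, two unpopular posts "deleted" by their authors.
conn.executemany(
    "INSERT INTO posts (views, deleted_at) VALUES (?, ?)",
    [(5000, None), (3000, None), (10, "2024-01-01"), (25, "2024-02-01")],
)

# What the application (and a naive analysis) sees: live posts only.
live_avg = conn.execute(
    "SELECT AVG(views) FROM posts WHERE deleted_at IS NULL"
).fetchone()[0]

# What the analyst can recover thanks to soft deletion: the full picture.
true_avg = conn.execute("SELECT AVG(views) FROM posts").fetchone()[0]

print(f"live posts only: {live_avg}, all posts: {true_avg}")
```

With hard deletes, only the first query would be possible, and the inflated average would be the only one available.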
Of course, there are potential ethical issues with soft deletion if it involves misleading the software user into believing their data is truly gone when it actually isn’t.
In some cases, soft deletion might even be outright illegal under regulations like the EU's General Data Protection Regulation (GDPR), which gives users the right to have their data erased.
If you are stuck with hard-deleted data, you can at the very least stay mindful of the potential for missing records and reframe your analysis accordingly.
These examples demonstrate two ways that Survivorship Bias can manifest in database systems. Hopefully this will help you spot instances of the bias and avoid the pitfalls caused by erroneous assumptions about the completeness of your data.