To confront or accept? NULL values
Recently, the prevalence of NULL values in my datasets has been the bane of my existence. Rows that are missing one or more column values are getting in the way of my EDA. Simply put, I want them to go away.
I’ve read a lot of documentation about best practices for dealing with NULL values in a dataset and have found there is no blanket solution. Unfortunately, just removing them isn’t always the best option.
Sure, when your dataset is enormous (10M rows, anyone?), you can afford to drop 35,000 rows that are missing a duration_length and not bat an eye. Who would notice, and frankly, who would care? I typically enter this process wanting to spend as little time (and write as little code) as possible. Here are some of the methods I’ve recently employed:
- Ignore them. Pretend they’re not there and move on with your life.
- Completely drop these NULL violators from the equation. Seriously, a dropna() is a favorite of mine.
- Replace the NULL values with a measure of central tendency (mean, median, etc.)
- Drop the offending column: keep the variables that are present and ignore the missing fields
- Model your data to predict what the NULL value would be and replace it
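The first few of these take one line each in pandas. Here’s a minimal sketch on a toy frame; the duration_length column comes from the examples in this post, while the platform column and the values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "platform": ["ios", "android", "ios", "android"],
    "duration_length": [82.0, np.nan, 91.0, np.nan],
})

# Option: drop any row with a NULL field
dropped = df.dropna()

# Option: replace NULLs with a central-tendency measure (median here)
filled = df.fillna({"duration_length": df["duration_length"].median()})

# Option: drop the offending column entirely
no_col = df.drop(columns=["duration_length"])
```

Each of these returns a new DataFrame by default, so the original stays intact while you compare the results.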
Pandas has, of course, provided robust documentation on how to perform these actions in Python, which is wonderfully quick and effective.
There are pros and cons to each of these methods. Simply dropping the rows that have a NULL field obviously means losing data, which can be dangerous. For example, say your dataset shows that the majority of Android users are missing a duration_length field for their session. Sure, we can delete them (iOS users make up the bulk of our users anyway), but now we have limited the representation of Android users in our dataset. The Android users who remain may exhibit an aggregate behavior entirely different from the real population; they are (potentially) Frankensteined versions of themselves. From a business standpoint, this is a big no-no. Additionally, NULL values may indicate a bug in your code that is preventing something from being logged accurately.
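Before dropping anything, it’s worth checking whether the missingness is concentrated in one segment. A quick sketch, with platform and the numbers invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "platform": ["ios"] * 6 + ["android"] * 4,
    "duration_length": [10, 12, 11, 13, 9, 10, np.nan, np.nan, np.nan, 8],
})

# Share of rows missing duration_length, per platform
missing_rate = df["duration_length"].isna().groupby(df["platform"]).mean()
```

If one group’s missing rate is far higher than the others’, a blanket dropna() will quietly skew your sample toward the groups that log cleanly.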
So what about replacing the value with a central tendency measure? Well, we all know that means can be entirely misleading on a dataset with high variance. Medians are also tricky if you want those outliers represented. In fact, replacing data in general is risky because, simply put, we are creating false values for real data. The goal is to estimate with some accuracy what’s not there without completely fucking everything up. (Oddly, that last statement may be the data scientist role described in a nutshell.)
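The mean-versus-median gap is easy to see with one outlier session. A tiny sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Four ordinary sessions, one missing value, one extreme outlier
s = pd.Series([30.0, 32.0, 31.0, np.nan, 900.0])

mean_fill = s.fillna(s.mean())      # mean is dragged up by the outlier
median_fill = s.fillna(s.median())  # median stays near the typical session
```

Here the mean fill (248.25) is nothing like any real session, while the median fill (31.5) at least resembles the typical value, which is exactly why high-variance columns make the mean a poor choice.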
Regression is an interesting way to approach NULL values, although it’s definitely not the quickest. Predicting what a value should be based on the values we do have is a decent proxy for having the real thing. Using criteria like these (male Android users in the Miami area typically have a duration_length between 75–95 seconds), we can make a reasonable prediction of what value should be there and pretend it’s real. I’m finding this is much easier to do in R than in Python, but I haven’t given up.
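A lightweight stand-in for full regression imputation is to fill each missing value with the median of its segment, which captures the same idea (users like this one have durations like that) without fitting a model. A sketch, where the platform and city columns and all values are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "platform": ["android", "android", "android", "ios", "ios"],
    "city": ["miami", "miami", "miami", "nyc", "nyc"],
    "duration_length": [80.0, 90.0, np.nan, 40.0, np.nan],
})

# Median duration per (platform, city) segment, broadcast back to every row
group_median = df.groupby(["platform", "city"])["duration_length"].transform("median")

# Fill each NULL with its own segment's median
df["duration_length"] = df["duration_length"].fillna(group_median)
```

A real regression (or scikit-learn's imputers) can use continuous predictors too, but the groupby-transform version is often good enough and stays in plain pandas.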
Whatever method you find yourself drawn to, I’d recommend trying at least one other before settling on it, to see how your data is affected. Although admittedly, I’m probably going to stick with my dropna() and move on with my life.