Two Ways to Investigate Missing Data
As teacher, I regularly assessed my students’ academic performance. I got a more accurate picture of my class’ and students’ performance by making sure every single student had a score for each assessment. Ensuring that I got data from each student I was responsible for involved very detailed routines.
As a Data Scientist you sometimes have no control over a data set’s completeness. How would you verify how complete the data set truly is? After I read in my data, I use .info(). Doing so provides a summary of how many non-null values are in each column. If I notice any differences in between columns, a further investigation is needed. Another way to get an exact count or missing values is to use .isna().sum().
We’ll use a publicly available data set from Kaggle. Below we will read-in our data and preview our data.
We can see there is already some “NaN” values in a few columns. Let’s get a summary of the card_df using .info().
In this case, we see there is lots of missing records. A more accurate, and my favorite way, to get an accurate assessment of which records are complete and/or incomplete is using .isna().sum(). Take a look at the results of below when cards_df.isna() is run.
Notice how executing this line of code results in either true/false — this Boolean result tells whether the record in that column is missing. You might think that you need to search through these results and hope that you catch every single true result. However, adding .sum() to the .isna() method gives a nice summary of how many records in each column is missing. Let’s try it:
Having an exact count of the missing values can be helpful in having a clearer idea of potential gaps in our data. There is a difference between values being left blank and values missing and while having count of missing values is important, so too is having an understanding of the data.Sometimes, as in the case of this data set, some values are intentionally missing.