John Pauler
Published in Learning Data · 7 min read · May 19, 2023

The Importance Of Data Cleaning & Data Validation

Hey data people,

I want to talk about everyone’s favorite topic, data validation and data cleaning.

I’m kidding of course.

It may not be the sexiest part of an Analyst’s workflow, but it’s a really important step toward deriving meaningful insights that include the proper caveats.

Entry level Analysts often overlook the data cleaning process and jump straight into the analysis.

It’s easy to spin up an Excel pivot table, use some quick SQL aggregate functions, or build a Power BI chart to slice and dice your data. But if there are unresolved issues in the underlying data set, that easy-to-perform analysis might be misleading.

This is where more seasoned Analysts shine, as they know how important the data validation step is. As a result, their analyses tend to be well-caveated and more trustworthy.

The Project: A Concrete Example

To talk through this one, I’m going to use a concrete example from a submission in the recent data visualization contest, the Maven Family Leave Challenge.

If you want to follow along with this one, you can download the data set at the Data Playground. It’s free for anyone to grab, and you don’t even need to register your email or anything like that.

>>Get the data set and follow along<<

The Maven Analytics Data Playground, illustrating how to get the data set to follow along.

This challenge entry was from Damon Woolsey. He did a great job of digging into the data to perform the appropriate validation work and pointed out a number of issues before he jumped into his core analysis. You can see his detailed project here.

Data Validation Steps

Let’s walk through some of the results of his data validation.

Damon writes, “While every company had data on paid maternity leave, some lacked data on unpaid maternity leave, and the vast majority lacked data on paid and unpaid paternity leave. Out of 1,600 companies, only 107 lacked data on unpaid maternity leave, while 1,312 lacked data on paid paternity leave and 1,537 lacked data on unpaid paternity leave.”

Great example here of a step every Analyst should take during their data validation, checking for missing values. As Damon points out, the data was robust when it came to paid maternity leave, but was much more sparse when it came to paternity leave.

I’ve added a chart below to visualize just how sparse the paternity data really was.

Showing the number of records with data for the 4 categories

If you’re using Excel, you could do this missing values analysis yourself using a Pivot Table or the COUNTIF( ) function by looking for records with #N/A values.
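If you prefer scripting, the same missing-values check is easy to sketch in Python. Everything here is illustrative — the column names and values are made up to mirror the shape of the challenge data, where NULLs were stored as the text ‘#N/A’:

```python
from collections import Counter

# Hypothetical rows mirroring the four leave columns in the data set,
# with NULLs stored as the text "#N/A" (as in the challenge file)
rows = [
    {"paid_maternity": 8,  "unpaid_maternity": 12,     "paid_paternity": "#N/A", "unpaid_paternity": "#N/A"},
    {"paid_maternity": 16, "unpaid_maternity": "#N/A", "paid_paternity": "#N/A", "unpaid_paternity": "#N/A"},
    {"paid_maternity": 12, "unpaid_maternity": 4,      "paid_paternity": 12,     "unpaid_paternity": "#N/A"},
]

# Tally how many records are missing in each column
missing = Counter()
for row in rows:
    for col, value in row.items():
        if value == "#N/A":
            missing[col] += 1

for col in rows[0]:
    print(f"{col}: {missing[col]} of {len(rows)} missing")
```

This is the scripting analogue of a COUNTIF( ) on ‘#N/A’: one pass over the data gives you the missing-value count for every column at once.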

Next, I want to talk about another very important step Damon took, which was a “gut check” of the data that was populated.

He summarizes his findings… “Nineteen companies were reported to have more paid paternity leave than paid maternity leave, which is highly unlikely. Western Digital, for example, was reported to offer 8 weeks of paid maternity leave and 12 weeks of paid paternity leave. Their website, however, states that they offer 12 weeks of paid leave to birth parents, spouses, and domestic partners.”

Awesome! This is the type of critical thinking that adds a lot of value to an analysis. It’s not about holes in the data, or outliers, or anything like that. Damon just used sound judgment to ask whether the data passed a sniff test. We talk about critical thinking as one of the most important skills an Analyst needs to develop and bring to their work, and this is a great example of it in action.

It’s interesting to call out that this type of critical thinking, at least in my opinion, will be one of the last things that gets programmed into an AI. Certainly we can write AI programs that check for NULLs, for duplicates, for outliers. That will be awesome because the Analyst will need to spend less time here, and more time on the fun stuff. But the critical thinking piece is a lot harder. It’s the part that I think adds a ton of value today, and that won’t be getting automated out any time soon.
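That said, once a human has spotted a rule worth checking — here, “paid paternity leave shouldn’t exceed paid maternity leave” — encoding the check itself is the easy, mechanical part. A sketch in Python (the 8 and 12 weeks for Western Digital are the reported values quoted above; the other companies are made up):

```python
# Each tuple: (company, paid maternity weeks, paid paternity weeks).
# Western Digital's figures are the reported values from the data set;
# the other two companies are invented for illustration.
companies = [
    ("Western Digital", 8, 12),
    ("Example Co", 16, 6),
    ("Another Co", 12, 12),
]

# Flag rows where paid paternity exceeds paid maternity,
# the pattern Damon found suspicious
suspicious = [name for name, maternity, paternity in companies
              if paternity > maternity]
print(suspicious)  # ['Western Digital']
```

The hard part — knowing that this comparison is worth making at all — is the critical thinking, and that stays with the Analyst.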

Those were two great examples of data validation Damon performed. Next we’ll quickly run through some of the data cleaning steps he went through using Excel, and share some of the formulas you can use to do this on your own…

Data Cleaning

Damon says he “Removed excess whitespace using the TRIM() function.”

This one is really valuable, and also very simple. In case you haven’t used it before, here’s an illustration…

A walkthrough of Excel’s TRIM( ) function used to trim leading and trailing spaces.

The formula just takes the text from some cell and removes any blank spaces that come before or after the text. Important to note that it doesn’t remove single spaces between words, though it does collapse repeated spaces between words down to one (there are other formulas for more surgical edits). This one is really handy because it can catch stray spaces (especially trailing spaces) that you wouldn’t see with the naked eye. It’s also a very simple formula: just type TRIM( ) and inside the parentheses include the cell you want to trim. Done.
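If you’re working in Python instead of Excel, there’s a close equivalent. Note the subtle difference: `str.strip()` only touches leading and trailing whitespace, while Excel’s TRIM also collapses repeated interior spaces, which a split-and-rejoin reproduces:

```python
raw = "  Collins   Aerospace  "  # stray leading, trailing, and doubled spaces

# str.strip() removes leading and trailing whitespace only
stripped = raw.strip()

# To mimic Excel's TRIM (which also collapses repeated interior
# spaces down to single spaces), split on whitespace and rejoin
trimmed = " ".join(raw.split())

print(repr(stripped))  # 'Collins   Aerospace'
print(repr(trimmed))   # 'Collins Aerospace'
```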

Next Damon used COUNTBLANK( ) to check for blank cells. This is always a great step. In this case, there were none because the NULL cells were filled with the value ‘#N/A’. But it’s a great step, and an easy way to catch blanks, which do show up quite often.
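A quick illustration of why COUNTBLANK( ) found nothing here — a cell containing the text ‘#N/A’ is not blank, so it doesn’t get counted. In Python terms (with an invented sample column):

```python
# Invented sample column: two truly empty cells, one "#N/A" text cell
column = ["Western Digital", "", "#N/A", "Collins Aerospace", ""]

# Like COUNTBLANK, count only truly empty cells -- the "#N/A" text
# is populated content, so it is NOT counted as blank
blanks = sum(1 for cell in column if cell == "")
print(blanks)  # 2
```

This is why it’s worth running both checks: a blank count for genuinely empty cells, and a separate count for placeholder text like ‘#N/A’.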

Damon also checked to see the range of the numerical data. I’m not sure how he did this one, but there are a few ways. This is a great step just to make sure there aren’t any outliers that don’t make sense. For example, multi-year maternity leaves or negative values.

To pull this off, you could use the MIN( ) and MAX( ) functions. This is how I would probably do it, because this method scales quickly if you have lots of columns to look at, and the concept also translates to other tools like SQL.

To use the MIN( ) or MAX( ) functions, just type the function and inside the parentheses include the range of cells you want to find the minimum or maximum value for. MIN( ) and MAX( ) work the exact same way.
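The same range check translates directly to Python’s built-in `min()` and `max()`. Here’s a sketch with made-up values, flagging the two failure modes mentioned above — negatives and implausibly long leaves:

```python
# Invented sample column of paid maternity leave, in weeks
paid_maternity_weeks = [8, 12, 16, 52, 0, 6]

# Range check: find the extremes of the column
lo, hi = min(paid_maternity_weeks), max(paid_maternity_weeks)
print(lo, hi)  # 0 52

# Sanity thresholds (illustrative): no negative leave,
# and nothing longer than two years
assert lo >= 0, "negative leave value found"
assert hi <= 104, "leave longer than two years found"
```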

The next thing Damon talked about is using COUNTIF( ) to check for duplicates. This is another one that is really simple and that I use all the time when cleaning data.

To use COUNTIF( ) the two arguments that you’ll need are the range that you want to count values within, and then the value that you would like to count. In the example here, we’re looking for duplicates in column B, so we use the range $B$2:$B$1602. Note the use of the dollar signs to fix the range so that we can type the formula on the first row of data in cell A2 and then just drop it down to work on all rows. The second argument it takes is the value, shown here as B53. So in the formula shown in cell A53, we’re searching column B within the range specified, and counting the number of times the value in B53 shows up.

Illustrating Excel’s COUNTIF( ) function used to look for duplicate values in a column

Most of the time, the value returned by this formula is 1. In those cases, it means you don’t have any duplicates in the column. But sometimes you’ll see other values, like the value of 2 shown on rows 54 and 55. This means the same value, Collins Aerospace, shows up 2 times in the column. This method is great because it scales quickly, and you can filter your data set for values other than 1 to quickly surface the duplicates.
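In Python, `collections.Counter` does the whole COUNTIF-style duplicate check in one shot. A sketch (Collins Aerospace is the duplicate named above; the other names are made up):

```python
from collections import Counter

# Sample company column; "Collins Aerospace" appears twice,
# mirroring the duplicate found in the data set
names = ["Western Digital", "Collins Aerospace", "Example Co",
         "Collins Aerospace"]

# Count occurrences of each name, like COUNTIF over the whole column
counts = Counter(names)

# Keep only names appearing more than once -- the duplicates
duplicates = {name: n for name, n in counts.items() if n > 1}
print(duplicates)  # {'Collins Aerospace': 2}
```

Like the Excel version, this scales to any column length, and the result is already filtered down to just the problem values.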

Wrapping Up

I hope these examples help, and that you can employ these methods in your own data validation and cleaning work in the near future.

I was really impressed with Damon’s critical thinking and his process, especially after I realized that he only recently transitioned into analytics. Could have fooled me!

For anyone looking to hire Analysts, Damon is open to work and his natural abilities and passion for data would be a great addition to any Analytics team.

>> Get in touch with Damon on LinkedIn <<

If you want to go deeper with data validation and data prep, check out this free guide (a blog post, nothing to download or register for) from our lead Power BI instructor Aaron Parry:

>> Data Prep Steps: How to QA Data <<

And if you want to see more analyses from this challenge and hear some of our instructors discuss the pros and cons of some of the top entries, you can check out our Finalist voting round video below.

To Damon, thanks for your willingness to be an example here. Talking about these things concretely can really help folks as they are learning and building their own data skills.

For anyone who read this far, thank you. Feel free to drop a comment with any thoughts or ideas for things you would like to see next from us.

-John

John Pauler
Learning Data

Editor of the Learning Data publication. Lead SQL instructor at Maven Analytics.