Data Analysis With SQL

Data Cleaning With SQL

A full picture of how I’ve cleaned and wrangled data for a real project

Dan Marques
9 min read · Mar 25, 2023

Data cleaning is a crucial process in any data analysis project. It involves identifying and correcting or removing inaccuracies, inconsistencies, and errors in the data. While it may seem like a tedious and time-consuming task, data cleaning is essential for obtaining reliable insights and making informed decisions based on data. Without proper data cleaning, the quality and accuracy of your analysis may be compromised, leading to incorrect conclusions and ineffective strategies.

To solve this problem we need a powerful tool. SQL provides a fast and efficient way to identify and address issues in large datasets. SQL can be used to filter out irrelevant data, standardize formatting, detect and remove duplicates, and handle missing values. By leveraging SQL for data cleaning, you can streamline the process and ensure that your analysis is based on high-quality, accurate data. In this post, I will present the problem and explore some common data-cleaning techniques using SQL and how they can help you improve the quality and reliability of your data analysis.

For that we’ll use SQL skills like Joins, Temp Tables, CTEs, Window Functions, Aggregate Functions, Creating Views, Converting Data Types, and more.

The Files

Here are the files with all of the SQL code that we’ll discuss.

GitHub Repository

Let’s dive into it.

The Project Workflow

In every project, I always start by using the same framework to develop a good outcome. It is simple but efficient.

It goes like this:

  1. Understand the problem
  2. Collect the data
  3. Clean the data
  4. Analyze the data
  5. Present Results
  6. Document everything

Today we’ll focus on the first 3 steps of this workflow. The other 3 steps are covered here.

1. Understanding the Problem

The CEO of the group of companies where I work has requested an analysis of all employees on the construction sites since January 2022. This group of companies includes a real estate developer and construction company, as well as separate administrative offices and construction sites that are treated as individual companies. The group operates in different cities in 3 states of Brazil.

In a meeting, some questions were defined to be answered:

  • Has the average salary decreased or increased since January 2022?
  • How effective is our HR program to reduce the gender gap?
  • How are our salaries distributed across the states?
  • How standardized is our pay policy across the states?
  • How experienced is our engineering team?
  • In what function groups do we spend the most?
  • What construction sites spent the most in salaries for the period?

He also asked for some recommendations if anything worth pursuing by the company came up.

With these questions on the table, I knew the direction I had to go to try and execute the task.

2. Collect the Data

Collecting data is not an easy task. Data tends to be spread all over the company, and my case was no different.

First, there was the technical part: we needed to connect to the ERP system, which runs on Microsoft SQL Server databases. I used SQL Server Management Studio to connect to it.

Second, I had to spend a long time digging through a very messy data system that needed far more attention than it was getting.

Third, it required constant back-and-forth communication with other departments like IT and HR.

But it got done, and I never expect anything less in any project.

After all that, we found the data we needed to keep going.

Four tables will be used [companies, employees, functions, salaries]. The ‘salaries’ table contains the monthly payroll information of all employees, not just those currently employed.


As the original information is confidential, the data has been replaced for demonstration purposes, but the same method used for the company will be applied here.

3. Clean the Data

After connecting, we can start exploring our data and begin cleaning it.

3.1 Create Temporary Datasets

Temporary datasets are needed because we never want to mess with the original database. Once you do that, it’s hard to explain to IT and stakeholders why you are changing things that you shouldn’t.

We’ll join all of the tables and create a new one that merges their data in a single place.

Looking at the tables and chatting with HR, I found that the ‘salaries’ table is the natural primary table to join all of the others onto.

-- Join the four source tables on their key columns and materialize the result into a working table

SELECT *
INTO emp_dataset
FROM salaries
LEFT JOIN companies
ON salaries.comp_name = companies.company_name
LEFT JOIN functions
ON salaries.func_code = functions.function_code
LEFT JOIN employees
ON salaries.employee_id = employees.employee_code_emp

After that, I look at which columns I really need for the results and create a second “TEMP” table that will be cleaned and contain all of the needed information.

-- Select only relevant columns for further analysis
-- Create a unique identifier code by combining the columns 'employee_id' and 'date' and call it 'id'
-- Convert the column 'date' to DATE type because it was previously configured as TIMESTAMP
-- Transform this new table into a dataset (df_employee) for analysis

SELECT CONCAT(employee_id, CAST(date AS date)) AS id,
CAST(date AS date) AS month_year,
employee_id,
employee_name,
[GEN(M_F)], -- Bracketed so SQL Server doesn't try to parse 'GEN(M_F)' as a function call. We will rename this column later
age,
salary,
function_group,
company_name,
company_city,
company_state,
company_type,
const_site_category
INTO df_employee
FROM emp_dataset

3.2 Check Data Quality

Now it’s time to check the data quality, which ultimately determines the quality of the analysis.

After performing a quick visual analysis, we identified some problems that need to be corrected.


We have missing values, inconsistent entries, standardization problems, and more. We’ll investigate further because there may be hidden issues among the columns.
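The exact queries are in the GitHub repository; a sketch of the kind of profiling run at this stage (using the columns of ‘df_employee’ defined above) could look like this:

-- Count distinct values per text column to spot inconsistent entries and standardization issues
SELECT COUNT(DISTINCT [GEN(M_F)]) AS distinct_genders,
COUNT(DISTINCT company_state) AS distinct_states,
COUNT(DISTINCT company_type) AS distinct_company_types,
COUNT(DISTINCT const_site_category) AS distinct_site_categories
FROM df_employee

-- List the distinct values of a single column to see the inconsistent entries themselves
SELECT DISTINCT company_state
FROM df_employee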

3.3 Start Cleaning

Start by fixing bad column names. In this case, we’ll rename the GEN(M_F) column to ‘gender’.

EXEC sp_rename 'df_employee.[GEN(M_F)]', 'gender', 'COLUMN'

Now use ‘TRIM’ to remove all unwanted spaces from all text columns. This is the beginning of standardization.

UPDATE df_employee
SET id = TRIM(id),
employee_id = TRIM(employee_id),
employee_name = TRIM(employee_name),
gender = TRIM(gender),
function_group = TRIM(function_group),
company_name = TRIM(company_name),
company_city = TRIM(company_city),
company_state = TRIM(company_state),
company_type = TRIM(company_type),
const_site_category = TRIM(const_site_category)

Check for ‘null’ values

SELECT *
FROM df_employee
WHERE id IS NULL
OR month_year IS NULL
OR employee_id IS NULL
OR employee_name IS NULL
OR gender IS NULL
OR age IS NULL
OR salary IS NULL
OR function_group IS NULL
OR company_name IS NULL
OR company_city IS NULL
OR company_state IS NULL
OR company_type IS NULL
OR const_site_category IS NULL

There are no ‘null’ values


Check for ‘empty’ values (in other databases this step may not be needed, but in this case, null and empty are different things).

SELECT *
FROM df_employee
WHERE id = ' '
OR month_year = ' '
OR employee_id = ' '
OR employee_name = ' '
OR gender = ' '
OR age = ' '
OR salary = ' '
OR function_group = ' '
OR company_name = ' '
OR company_city = ' '
OR company_state = ' '
OR company_type = ' '
OR const_site_category = ' '

Now we notice that there are missing values in the table.


We are going to check every column to look further into this problem.

Let’s start with the ‘salary’ column.
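The exact query is in the repository, but a check along these lines (a sketch, mirroring the blank-value comparison used above) surfaces the rows with a blank salary:

SELECT COUNT(*) AS blank_salary_rows
FROM df_employee
WHERE salary = ' '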


There are 70 rows with a missing salary. Now we repeat the check for every column.


We checked every single column and found out that there are two columns with missing values.

Let’s treat them the way I did in the real project.

  • ‘salary’: After speaking with HR, I was informed that anyone who does not have a salary value specified should not be included in the payroll reports and therefore should not be analyzed. In this case, we will delete all rows where this column is missing a value.
  • ‘const_site_category’: After speaking with the department responsible for registering companies, we found that when a company is not of the type ‘Construction Site’, the text input should not be filled in. Since the analysis only involves employees on construction sites, we will not have any issues with our analysis if we delete these rows.

Now that we’ve talked with the right people, we can delete the rows with these missing values.

-- Deleting rows of the detected missing values 

-- salary

DELETE FROM df_employee
WHERE salary = ' '

-- const_site_category

DELETE FROM df_employee
WHERE const_site_category = ' '

Now we step into a standardization problem. We have to check every column to better understand what we are dealing with.

Example:

The ‘gender’ column contains only two distinct values: “F” and “M”.

We need to change “F” to “Female” and “M” to “Male” to get a better reading later.
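A sketch of how that can be done (the full standardization pass for every column is in the repository):

-- Check the distinct values first
SELECT DISTINCT gender
FROM df_employee

-- Replace the abbreviations with the full words
UPDATE df_employee
SET gender = CASE gender
WHEN 'F' THEN 'Female'
WHEN 'M' THEN 'Male'
ELSE gender
END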

And of course, we’ll check the other columns, but for the sake of this article, they will not be shown here. You can always check the code.

Now that we have a standardized table, there is one last step before we begin our analysis. We will remove duplicates.

I perform this step after checking for missing values, as they can affect duplicate identification.

Our criterion will be based on the ‘id’ column that we already created: an employee cannot receive two salaries within the same month, and since our id combines the employee and the month, no id should appear twice in the table.

With a simple query, we can identify which ids are duplicated:
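Something along these lines works (a sketch; the query in the repository may differ):

SELECT id, COUNT(*) AS occurrences
FROM df_employee
GROUP BY id
HAVING COUNT(*) > 1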

Now we remove the duplicated rows for good.

-- Removing duplicate rows by creating a CTE with the window function ROW_NUMBER.
-- The duplicates are those that contain a repeated employee_id and month_year.
-- Afterwards, apply the DELETE statement with the condition that row_num is greater than 1.

WITH rncte AS
(SELECT *,
ROW_NUMBER()
OVER(
PARTITION BY month_year, employee_id
ORDER BY employee_id) row_num
FROM df_employee)
DELETE
FROM rncte
WHERE row_num > 1

The output shows that 173 duplicate rows were removed.

The last thing we do is ALWAYS check it all again. Double-checking is always a good thing in any process.

We have done all of this without changing the actual database, which is very important.
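A final sanity check might look something like this (a sketch that simply reuses the earlier checks):

-- No blank values should remain
SELECT COUNT(*) AS remaining_blanks
FROM df_employee
WHERE salary = ' '
OR const_site_category = ' '

-- No id should appear more than once
SELECT id, COUNT(*) AS occurrences
FROM df_employee
GROUP BY id
HAVING COUNT(*) > 1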


4. Analyze the Data

Now that we have a clean dataset, we can start the analysis right away and answer simple questions like “What is the average salary by month?”
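For example, a query along these lines (a sketch; cast ‘salary’ first if it is stored as text) answers that question directly:

SELECT month_year,
AVG(salary) AS avg_salary
FROM df_employee
GROUP BY month_year
ORDER BY month_year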

Here we can see the power of this tool.

But as I said before, the analysis will not be covered here, since this is a post to cover data cleaning.

I did some analysis in SQL and I’ll leave it in the GitHub repository for you to see.

The analysis of this project has been covered in this other post “Data Analysis with Python”.

Thanks

Thank you for reading this article, and I hope you find a good use for this knowledge.

If you have any other questions you can find me on LinkedIn or Instagram.

You can check my other projects:

Bye.
