SQL is very important for Data Cleaning

Ensuring High-Quality Data for Analysis via SQL

Published in

CodeX

8 min readJun 29, 2024

When building a machine learning model or conducting a data analysis project, ensuring data quality is crucial. The principle “garbage in, garbage out” emphasizes the importance of data cleaning, which involves identifying and rectifying data inconsistencies and errors. In this tutorial, we will explore data cleaning using SQL with a simple HR database containing two tables: employees and departments.

1. Understanding the Data

Before we begin cleaning the data, we need to understand the structure of our database. This will help us comprehend the relationships between tables and the data types used, guiding us in our cleaning and analysis tasks.

1.1 Show the Tables

First, let’s display the tables in our database.

SHOW TABLES;

+-----------------+
| Tables_in_hr_db |
+-----------------+
| departments     |
| employees       |
+-----------------+

We have two tables in our database: ‘departments’ and ‘employees’.

1.2 Describe the Tables

Next, we’ll describe the structure of these tables to understand the columns and data types.

DESCRIBE departments;
DESCRIBE employees;

+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| department_id   | int         | NO   | PRI | NULL    |       |
| department_name | varchar(50) | YES  |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+

+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| employee_id   | int          | NO   | PRI | NULL    |       |
| employee_name | varchar(100) | YES  |     | NULL    |       |
| department_id | int          | YES  | MUL | NULL    |       |
| gender        | varchar(10)  | YES  |     | NULL    |       |
| salary        | varchar(15)  | YES  |     | NULL    |       |
| hire_date     | date         | YES  |     | NULL    |       |
+---------------+--------------+------+-----+---------+-------+

The departments table has two columns: department_id and department_name. The employees table has six columns, including employee_id, employee_name, department_id, gender, salary, and hire_date. There is a one-to-many relationship between departments and employees.

1.3 Retrieve the Data

Let’s retrieve the first five rows from each table to get a sense of the data.

SELECT * FROM departments LIMIT 5;
SELECT * FROM employees LIMIT 5;

+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             1 | Human Resources |
|             2 | Marketing       |
|             3 | IT              |
+---------------+-----------------+

+-------------+---------------+---------------+--------+------------+------------+
| employee_id | employee_name | department_id | gender | salary     | hire_date  |
+-------------+---------------+---------------+--------+------------+------------+
|           1 | Sam Wilson    |             1 | M      | $62,000.00 | 2022-01-15 |
|           2 | Linda Jones   |             2 | Female | $78,000.00 | 2022-02-20 |
|           3 | Tom Hanks     |             1 | m      | $56,000.00 | 2022-03-10 |
|           4 | Angela White  |             3 | f      | $82,000.00 | 2022-04-05 |
|           5 | Peter Parker  |             2 | male   | $72,000.00 | 2022-05-12 |
+-------------+---------------+---------------+--------+------------+------------+

From this sample, we can see inconsistencies in the gender column and the presence of a dollar sign in the salary column, which could complicate calculations.

2. Cleaning the Data

2.1 Dealing with Duplicates

Duplicates can distort our analysis or model, so it’s essential to identify and handle them early.

To find duplicates, we can use the ORDER BY clause for a simple approach.

SELECT employee_name, department_id, gender, salary, hire_date 
FROM employees 
ORDER BY 1,2,3,4,5;

+----------------+---------------+--------+------------+------------+
| employee_name  | department_id | gender | salary     | hire_date  |
+----------------+---------------+--------+------------+------------+
| Angela White   |             3 | f      | $82,000.00 | 2022-04-05 |
| Linda Jones    |             2 | Female | $78,000.00 | 2022-02-20 |
| Linda Jones    |             2 | female | $78,000.00 | 2022-02-20 |
| Peter Parker   |             2 | male   | $72,000.00 | 2022-05-12 |
| Sam Wilson     |             1 | M      | $62,000.00 | 2022-01-15 |
| Sam Wilson     |             1 | m      | $62,000.00 | 2022-01-15 |
| Tom Hanks      |             1 | m      | $56,000.00 | 2022-03-10 |
| Tom Hanks      |             1 | M      | $56,000.00 | 2022-03-10 |
+----------------+---------------+--------+------------+------------+

For a more robust approach, we can use a subquery with the COUNT function to identify duplicates.

SELECT COUNT(*) 
FROM(
  SELECT employee_name, department_id, gender, salary, hire_date, COUNT(*) as quantity
  FROM employees 
  GROUP BY 1,2,3,4,5) subquery
WHERE quantity > 1;

+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+

The above query counts the number of duplicate records. To list these duplicates, we can modify the query to include the HAVING clause.

SELECT employee_name, department_id, gender, salary, hire_date, COUNT(*) as quantity
FROM employees 
GROUP BY 1,2,3,4,5
HAVING quantity > 1;

+---------------+---------------+--------+------------+------------+----------+
| employee_name | department_id | gender | salary     | hire_date  | quantity |
+---------------+---------------+--------+------------+------------+----------+
| Sam Wilson    |             1 | M      | $62,000.00 | 2022-01-15 |        2 |
| Linda Jones   |             2 | Female | $78,000.00 | 2022-02-20 |        2 |
| Tom Hanks     |             1 | m      | $56,000.00 | 2022-03-10 |        2 |
+---------------+---------------+--------+------------+------------+----------+

We have found the duplicates and can now decide whether to remove, update, or retain them based on our requirements.

2.2 Dealing with Missing Data and Nulls

Next, we will identify records with empty strings and NULL values.

-- Records with empty strings
SELECT * FROM employees WHERE gender = '';

-- Records with null values
SELECT * FROM employees 
WHERE 
employee_name IS NULL OR
department_id IS NULL OR
gender IS NULL OR
salary IS NULL OR
hire_date IS NULL;

+-------------+---------------+---------------+--------+------------+------------+
| employee_id | employee_name | department_id | gender | salary     | hire_date  |
+-------------+---------------+---------------+--------+------------+------------+
|           6 | Emma Brown    |             3 |        | $74,000.00 | 2022-06-18 |
+-------------+---------------+---------------+--------+------------+------------+

+-------------+---------------+---------------+--------+------------+------------+
| employee_id | employee_name | department_id | gender | salary     | hire_date  |
+-------------+---------------+---------------+--------+------------+------------+
|           8 | Jake Green    |             2 | NULL   | $68,000.00 | 2022-08-30 |
|           9 | Nick Blue     |          NULL | F      | $62,000.00 | 2022-09-05 |
+-------------+---------------+---------------+--------+------------+------------+

We found one record with an empty string in the gender column and two with NULL values. Depending on our dataset and requirements, we can handle missing data by filling it with the average, median, mode, or replacing it with a constant value.

Suppose we know that Jake Green is male; we can update his gender using a CASE statement.

SELECT employee_id, employee_name, department_id, salary, hire_date,
CASE WHEN gender IS NULL THEN 'M' END AS gender
FROM employees
WHERE employee_id = 8;

If NULL values are expected, we can use the COALESCE function to display a default value.

2.3 Correcting Categorical Data

The gender column contains inconsistent values like Female, F, f, etc. To ensure data quality, we need to standardize these values. First, let's print all distinct values in the gender column.

SELECT DISTINCT UPPER(gender) AS gender FROM employees;

+--------+
| gender |
+--------+
| M      |
| FEMALE |
| F      |
| MALE   |
|        |
| NULL   |
+--------+

Now, we will replace these values with m for males and f for females.

SELECT employee_id, employee_name, department_id, salary, hire_date, 
CASE 
WHEN UPPER(gender) IN ('M','MALE') THEN 'm'
WHEN UPPER(gender) IN ('FEMALE','F') THEN 'f'
ELSE '' END AS gender
FROM employees;

+-------------+---------------+---------------+------------+------------+--------+
| employee_id | employee_name | department_id | salary     | hire_date  | gender |
+-------------+---------------+---------------+------------+------------+--------+
|           1 | Sam Wilson    |             1 | $62,000.00 | 2022-01-15 | m      |
|           2 | Linda Jones   |             2 | $78,000.00 | 2022-02-20 | f      |
|           3 | Tom Hanks     |             1 | $56,000.00 | 2022-03-10 | m      |
|           4 | Angela White  |             3 | $82,000.00 | 2022-04-05 | f      |
|           5 | Peter Parker  |             2 | $72,000.00 | 2022-05-12 | m      |
|           6 | Emma Brown    |             3 | $74,000.00 | 2022-06-18 |        |
|           7 | Sophia Black  |             1 | $70,000.00 | 2022-07-25 | f      |
|           8 | Jake Green    |             2 | $68,000.00 | 2022-08-30 |        |
|           9 | Nick Blue     |          NULL | $62,000.00 | 2022-09-05 | f      |
|          10 | Sam Wilson    |             1 | $62,000.00 | 2022-01-15 | m      |
|          11 | Linda Jones   |             2 | $78,000.00 | 2022-02-20 | f      |
|          12 | Tom Hanks     |             1 | $56,000.00 | 2022-03-10 | m      |
|          13 | Mark Brown    |             3 | $80,000.00 | 2050-05-10 | m      |
+-------------+---------------+---------------+------------+------------+--------+

We also notice an employee, Mark Brown, with a hire date in the future (2050–05–10). Let’s correct it using a CASE clause.

2.4 Type Conversion

The salary column contains a dollar sign, which can hinder calculations. To calculate the average salary by department, we need to remove the dollar sign and cast the data type to decimal.

First, let’s attempt a query without cleaning the salary column.

SELECT department_name, AVG(salary) as "Average Salary"
FROM employees LEFT JOIN departments
ON employees.department_id = departments.department_id
GROUP BY department_name;

+-----------------+----------------+
| department_name | Average Salary |
+-----------------+----------------+
| Human Resources |              0 |
| Marketing       |              0 |
| IT              |              0 |
+-----------------+----------------+

The average salary is zero because of the dollar sign. Let’s fix this.

SELECT department_name, 
AVG(
CAST(REPLACE(salary, '$', '') AS DECIMAL(10,2))
) AS "Average Salary"
FROM employees LEFT JOIN departments
ON employees.department_id = departments.department_id
GROUP BY department_name;

+-----------------+----------------+
| department_name | Average Salary |
+-----------------+----------------+
| Human Resources |      60.000000 |
| Marketing       |      71.500000 |
| IT              |      78.666667 |
+-----------------+----------------+

The salary column is now correctly formatted and can be used for calculations.

3. Final Thoughts and Bonus

This tutorial provides an introduction to data cleaning with SQL, a critical step for any data project. Effective data cleaning requires understanding your data, identifying issues, and applying appropriate cleaning techniques. The best way to improve your data cleaning skills is through practice and engaging

BONUS

For further learning of data cleaning techniques, below points should help you for your roadmap.

Data Normalization: Implement and understand normalization techniques to prevent redundancy.
Regular Expressions: Using regex for complex data cleaning tasks.
Automation: Develop scripts or use tools like Python’s Pandas for automated cleaning.
Projects and Case Studies: Work on real-world datasets to apply and refine your skills. Join communities and participate in data cleaning challenges.

Data Cleaning using Pandas

A Beginner level methods to clean messy data.

medium.com

Feel free to show your support with Claps and Follow me for more articles on Data Science techniques. Thank you!!