Data Cleaning Using SQL

aakash
8 min readAug 12, 2023

--

Data cleaning is a critical step in the data preparation process that involves identifying and correcting errors, inconsistencies, and inaccuracies in your dataset to ensure that it is accurate, reliable, and suitable for analysis or modeling.

Data Cleaning using SQL

In order to make our data relevant for further analysis, the cleaning of data is a crucial stage in data analysis that is highly needed. Cleaning the data can be so challenging and sometimes tiring, and most importantly time-consuming, hence proper attention is needed to achieve its goal, which tends to be arriving at accurate, consistent, and complete data for analysis.

Through the process of cleaning, SQL (structured Query Language) was used, and hence we will be walking through step by step guide on how to achieve the outcome (clean data) using SQL.

The dataset used for the cleaning can be found on the Kaggle website and can be accessed through the link Click Here

The dataset contains 1300 rows and 12 columns.

Before starting the cleaning process I manually assessed the dataset, to look for inconsistencies, errors, null values, duplications, unwanted columns, irrelevant characters, and errors.

A SNIPPET OF THE DATASET

Data Cleaning Process

  1. Create Backup: Before starting the data cleaning it is good practice to create the backup of your data.

Firstly, create a new table with the same columns and inserted all the values which are presented in the dataset into the new backup table.

CREATE TABLE laptops_backup like laptops;

INSERT INTO laptops
SELECT * FROM laptops;
Backup of the Data.

2. Check the size of the dataset: To check the size of the dataset, calculate the row count of all the entries to know the exact number of rows in the dataset and the result shows 1303 rows and 13 columns.

SELECT COUNT(*) FROM laptops;
Total No. of Rows
Columns

3. Drop Non-Important Columns: Dropping the columns which are not important, in the following data there was a column which is named “Unamed:0”. (I got to know while manually assessing the dataset.)

ALTER TABLE laptops DROP COLUMN `Unnamed: 0`;
Column “Unnamed:0”

4. Drop Null Values: While manually assessing the dataset, I found that the dataset contains a few NULL rows so we need to clean that values. After executing the below query we have dropped around 32 rows in the dataset which contains null values.

With index_values AS(SELECT `index`
FROM laptops
WHERE Company IS NULL
AND TypeName IS NULL
AND Inches IS NULL
AND ScreenResolution IS NULL
AND Cpu IS NULL
AND Ram IS NULL
AND Memory IS NULL
AND GPU IS NULL
AND OpSys IS NULL
AND Weight IS NULL
AND price IS NULL)

DELETE laptops
FROM laptops
JOIN index_values ONlaptops.`index` = index_values.`index`
WHERE laptops.`index` = index_values.`index`;
Before Dropping the null values
After Dropping the null values

5. Check Duplicate Values: It is good to check for duplicate values in the data. But in this data, there are no duplicated values.

SELECT Company,TypeName,Inches,ScreenResolution,
Cpu,Ram,Memory,GPU,OpSys,Weight,price,COUNT(*)
FROM laptops_backup
GROUP BY Company,TypeName,Inches,ScreenResolution,Cpu,Ram,
Memory,GPU,OpSys,Weight,price
HAVING COUNT(*) > 1;
No Duplicates

6. Check Datatypes of Columns: While checking the datatypes of columns in this data, I got to know that the datatypes for some columns are inappropriate like Weight, Price, Ram, etc.

SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'laptops';
Columns with their Datatypes

i. Correcting the Datatype of the column named Inches

ALTER TABLE laptops MODIFY COLUMN Inches DECIMAL(10,1);

ii. Before Correcting the datatype of Price columns, Firstly, we need to round the Price Value.


UPDATE laptops
SET Price = ROUND(Price);

ALTER TABLE laptops MODIFY COLUMN Price INTEGER;
Before
After

iii. Weight column contains measurement values Kg which is inappropriate and unnecessary. So, I have decided to remove the Kg keyword and convert it into decimal values.

-- Removing the Kg keyword

UPDATE laptops
set Weight = REPLACE(Weight,'kg','');

-- Changing the datatype

ALTER TABLE laptops MODIFY COLUMN Weight Decimal(10,1);
Before
After

iv. Ramcolumn contains measurement values that are inappropriate and unnecessary. So, I have decided to remove the gb keyword and convert it into decimal values.

-- Removing the GB keyword

UPDATE laptops
set Ram = REPLACE(Weight,'GB','');

-- Changing the datatype

ALTER TABLE laptops MODIFY COLUMN Ram INTEGER;
Before
After

7. Cleaning the Remaining Column: I decided to clean the remaining columns and format them in a customized way so that they can be used in the data model and analysis.

i. Creating new columns for Resolution weight and height. So Firstly, we need to create columns named Resolution_weight and Resolution_height.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN resolution_width INTEGER AFTER ScreenResolution,
ADD COLUMN resolution_height INTEGER AFTER resolution_width;

-- Extracting the resolution_height
update laptops
set resolution_height = substring_index(substring_index(ScreenResolution,' ',-1),'x',1);

-- Extracting the resolution_width
update laptops
set resolution_width = substring_index(substring_index(ScreenResolution,' ',-1),'x',-1;
Before
After

ii. After creating the resolution columns, I got to know that the resolution column also contains information about the TouchScreen laptops. So, I decided to create a column for touchscreen information. This column contains two values which are 0 — indicates not a touchscreen laptop and 1- indicates touchscreen laptop.

-- adding a new column

ALTER TABLE laptops
ADD COLUMN is_touchscreen INTEGER AFTER ScreenResolution;

-- Updating the values in new column

UPDATE laptops
set is_touchscreen = CASE WHEN ScreenResolution LIKE '%touchscreen%' THEN 1 ELSE 0 END
1 — has touchscreen , 0 — doesn’t have touchscreen

iii. Next CPU column contains three pieces of information which are cpu brand, cpu name , cpu speed.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN cpu_brand VARCHAR(255) AFTER cpu,
ADD COLUMN cpu_name VARCHAR(255) AFTER cpu_brand,
ADD COLUMN cpu_speed DECIMAL(10,1) AFTER cpu_name;


-- updating the values in new column
UPDATE laptops
SET cpu_brand = substring_index(Cpu,' ',1);

update laptops
SET cpu_speed = replace(substring_index(Cpu,' ',-1),'GHz','');

update laptops
SET cpu_name = replace(replace(Cpu,cpu_brand,' ' ),substring_index(Cpu,' ',-1),' ');
Before
After

iv. Next Memory Column contains information about memory type (eg. SSD, HDD), primary memory, and secondary memory. So, I decided to create three new columns and I will extract the values from the Memory column.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN memory_type VARCHAR(255) AFTER memory,
ADD COLUMN primary_storage INT AFTER memory_type,
ADD COLUMN secondary_storage INT AFTER primary_storage;

-- updating the values in new column

UPDATE laptops
SET memory_type = CASE
WHEN Memory LIKE '%ssd%' AND Memory LIKE '%hdd%' THEN 'Hybrid'
WHEN Memory LIKE '%flash storage%' AND Memory LIKE '%hdd%' THEN 'Hybrid'
WHEN Memory LIKE '%ssd%' THEN 'SSD'
WHEN Memory LIKE '%hdd%' THEN 'HDD'
WHEN Memory LIKE '%flash storage%' THEN 'Flash Storage'
ELSE NULL
END

UPDATE laptops
SET primary_storage = regexp_substr(substring_index(memory,'+',1),'[0-9]+'),
secondary_storage = CASE WHEN memory LIKE '%+%' THEN regexp_substr(substring_index(memory,'+',-1),'[0-9]+') ELSE 0 END;

After creating columns, I got to know that some of the data Memory contains in TB but remaining in MB. So, I decided to convert TB values into MB values by multiplying with 1024(1 TB = 1024).

UPDATE laptops
SET primary_storage = CASE WHEN primary_storage <=2 THEN primary_storage*1024 ELSE primary_storage END,
secondary_storage = CASE WHEN secondary_storage <=2 THEN secondary_storage*1024 ELSE secondary_storage END;
Before
After

v. Next gpu column contains two pieces of information which are gpu brand, and gpu name.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN gpu_brand VARCHAR(255) AFTER Gpu,
ADD COLUMN gpu_name VARCHAR(255) AFTER gpu_brand;

-- updating the values in new column

UPDATE laptops
SET t1.gpu_brand = substring_index(Gpu,' ',1);

UPDATE laptops
SET t1.gpu_name = replace(Gpu,gpu_brand,'');
After
Before

vi. Editing the Operating system column because it contains the values with there version like windows 7,windows10.

- Editing the OS column

UPDATE laptops
SET t1.OpSys = CASE
when OpSys like '%mac%' Then 'macos'
when OpSys like 'windows%' Then 'windows'
when OpSys like '%linux%' Then 'linux'
when OpSys like 'No OS' Then 'N/A'
else 'other'
END;
Before
After

vii. The final step was taking down some columns which were unncessary and not needed for further analysis.

ALTER TABLE laptops
DROP COLUMN ScreenResolution,Cpu, Gpu, Memory;

Data after performing the cleaning:

Columns and thier datatypes

Conclusion:

In conclusion, the journey of data analysis is a rigorous expedition and indispensable practice of data cleaning. As highlighted throughout this blog, data cleaning serves as the guardian of data integrity, ensuring that the insights and conclusions drawn from our datasets are accurate, reliable, and meaningful.

After cleaning the dataset, it was more consistent and accurate for further use for analysis. Using SQL to clean the dataset was so challenging and along the line I got to learn a couple of things to which I was never exposed to.

If you enjoyed the blog buy me a coffee it would be appreciated.

Kindly leave your comments and remarks it will be highly appreciated.

Follow me on Linkedin

--

--

aakash

I have developed a passion for exploring complex datasets, building predictive models, and extracting meaningful insights.