Data Cleaning Using SQL

8 min readAug 12, 2023

Data cleaning is a critical step in the data preparation process that involves identifying and correcting errors, inconsistencies, and inaccuracies in your dataset to ensure that it is accurate, reliable, and suitable for analysis or modeling.

In order to make our data relevant for further analysis, the cleaning of data is a crucial stage in data analysis that is highly needed. Cleaning the data can be so challenging and sometimes tiring, and most importantly time-consuming, hence proper attention is needed to achieve its goal, which tends to be arriving at accurate, consistent, and complete data for analysis.

Through the process of cleaning, SQL (structured Query Language) was used, and hence we will be walking through step by step guide on how to achieve the outcome (clean data) using SQL.

The dataset used for the cleaning can be found on the Kaggle website and can be accessed through the link Click Here

The dataset contains 1300 rows and 12 columns.

Before starting the cleaning process I manually assessed the dataset, to look for inconsistencies, errors, null values, duplications, unwanted columns, irrelevant characters, and errors.

Data Cleaning Process

Create Backup: Before starting the data cleaning it is good practice to create the backup of your data.

Firstly, create a new table with the same columns and inserted all the values which are presented in the dataset into the new backup table.

CREATE TABLE laptops_backup like laptops;

INSERT INTO laptops 
SELECT * FROM laptops;

2. Check the size of the dataset: To check the size of the dataset, calculate the row count of all the entries to know the exact number of rows in the dataset and the result shows 1303 rows and 13 columns.

SELECT COUNT(*) FROM laptops;

Total No. of Rows

3. Drop Non-Important Columns: Dropping the columns which are not important, in the following data there was a column which is named “Unamed:0”. (I got to know while manually assessing the dataset.)

ALTER TABLE laptops DROP COLUMN `Unnamed: 0`;

4. Drop Null Values: While manually assessing the dataset, I found that the dataset contains a few NULL rows so we need to clean that values. After executing the below query we have dropped around 32 rows in the dataset which contains null values.

With index_values AS(SELECT `index`
    FROM laptops 
    WHERE Company IS NULL
      AND TypeName IS NULL
      AND Inches IS NULL
      AND ScreenResolution IS NULL
      AND Cpu IS NULL
      AND Ram IS NULL
      AND Memory IS NULL
      AND GPU IS NULL
      AND OpSys IS NULL
      AND Weight IS NULL
      AND price IS NULL)
      
DELETE laptops 
FROM laptops 
JOIN index_values ONlaptops.`index` = index_values.`index`
WHERE laptops.`index` = index_values.`index`;

Before Dropping the null values

After Dropping the null values

5. Check Duplicate Values: It is good to check for duplicate values in the data. But in this data, there are no duplicated values.

SELECT Company,TypeName,Inches,ScreenResolution,
        Cpu,Ram,Memory,GPU,OpSys,Weight,price,COUNT(*)
FROM laptops_backup
GROUP BY Company,TypeName,Inches,ScreenResolution,Cpu,Ram,
          Memory,GPU,OpSys,Weight,price
HAVING COUNT(*) > 1;

6. Check Datatypes of Columns: While checking the datatypes of columns in this data, I got to know that the datatypes for some columns are inappropriate like Weight, Price, Ram, etc.

SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'laptops';

i. Correcting the Datatype of the column named Inches

ALTER TABLE laptops MODIFY COLUMN Inches DECIMAL(10,1);

ii. Before Correcting the datatype of Price columns, Firstly, we need to round the Price Value.


UPDATE laptops
SET Price = ROUND(Price);

ALTER TABLE laptops MODIFY COLUMN Price INTEGER;

iii. Weight column contains measurement values Kg which is inappropriate and unnecessary. So, I have decided to remove the Kg keyword and convert it into decimal values.

-- Removing the Kg keyword

UPDATE laptops
set Weight = REPLACE(Weight,'kg','');

-- Changing the datatype

ALTER TABLE laptops MODIFY COLUMN Weight Decimal(10,1);

iv. Ramcolumn contains measurement values that are inappropriate and unnecessary. So, I have decided to remove the gb keyword and convert it into decimal values.

-- Removing the GB keyword

UPDATE laptops
set Ram = REPLACE(Weight,'GB','');

-- Changing the datatype

ALTER TABLE laptops MODIFY COLUMN Ram INTEGER;

7. Cleaning the Remaining Column: I decided to clean the remaining columns and format them in a customized way so that they can be used in the data model and analysis.

i. Creating new columns for Resolution weight and height. So Firstly, we need to create columns named Resolution_weight and Resolution_height.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN resolution_width INTEGER AFTER ScreenResolution,
ADD COLUMN resolution_height INTEGER AFTER resolution_width;

-- Extracting the resolution_height
update laptops 
set resolution_height = substring_index(substring_index(ScreenResolution,' ',-1),'x',1);

-- Extracting the resolution_width 
update laptops 
set resolution_width = substring_index(substring_index(ScreenResolution,' ',-1),'x',-1;

ii. After creating the resolution columns, I got to know that the resolution column also contains information about the TouchScreen laptops. So, I decided to create a column for touchscreen information. This column contains two values which are 0 — indicates not a touchscreen laptop and 1- indicates touchscreen laptop.

-- adding a new column

ALTER TABLE laptops
ADD COLUMN is_touchscreen INTEGER AFTER ScreenResolution;

-- Updating the values in new column

UPDATE laptops 
set is_touchscreen = CASE WHEN ScreenResolution LIKE '%touchscreen%' THEN 1 ELSE 0 END

1 — has touchscreen , 0 — doesn’t have touchscreen

iii. Next CPU column contains three pieces of information which are cpu brand, cpu name , cpu speed.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN cpu_brand VARCHAR(255) AFTER cpu,
ADD COLUMN cpu_name VARCHAR(255) AFTER cpu_brand,
ADD COLUMN cpu_speed DECIMAL(10,1) AFTER cpu_name;


-- updating the values in new column
UPDATE laptops 
SET cpu_brand = substring_index(Cpu,' ',1);

update laptops
SET cpu_speed = replace(substring_index(Cpu,' ',-1),'GHz','');

update laptops
SET cpu_name = replace(replace(Cpu,cpu_brand,' ' ),substring_index(Cpu,' ',-1),' ');

iv. Next Memory Column contains information about memory type (eg. SSD, HDD), primary memory, and secondary memory. So, I decided to create three new columns and I will extract the values from the Memory column.

-- Creating the new Columns

ALTER TABLE laptops
ADD COLUMN memory_type VARCHAR(255) AFTER memory,
ADD COLUMN primary_storage INT AFTER memory_type,
ADD COLUMN secondary_storage INT AFTER primary_storage;

-- updating the values in new column

UPDATE laptops
SET memory_type = CASE
  WHEN Memory LIKE '%ssd%' AND Memory LIKE '%hdd%' THEN 'Hybrid'
  WHEN Memory LIKE '%flash storage%' AND Memory LIKE '%hdd%' THEN 'Hybrid'
  WHEN Memory LIKE '%ssd%' THEN 'SSD'
  WHEN Memory LIKE '%hdd%' THEN 'HDD'
  WHEN Memory LIKE '%flash storage%' THEN 'Flash Storage'
  ELSE NULL
   END

UPDATE laptops
SET primary_storage = regexp_substr(substring_index(memory,'+',1),'[0-9]+'),
secondary_storage = CASE WHEN memory LIKE '%+%' THEN regexp_substr(substring_index(memory,'+',-1),'[0-9]+') ELSE 0 END;

After creating columns, I got to know that some of the data Memory contains in TB but remaining in MB. So, I decided to convert TB values into MB values by multiplying with 1024(1 TB = 1024).

UPDATE laptops
SET primary_storage = CASE WHEN primary_storage <=2 THEN primary_storage*1024 ELSE primary_storage END,
secondary_storage = CASE WHEN secondary_storage <=2 THEN secondary_storage*1024 ELSE secondary_storage END;

v. Next gpu column contains two pieces of information which are gpu brand, and gpu name.

-- Creating the new Columns

ALTER TABLE laptops 
ADD COLUMN gpu_brand VARCHAR(255) AFTER Gpu,
ADD COLUMN gpu_name VARCHAR(255) AFTER gpu_brand;

-- updating the values in new column 

UPDATE laptops 
SET t1.gpu_brand = substring_index(Gpu,' ',1);

UPDATE laptops
SET t1.gpu_name = replace(Gpu,gpu_brand,'');

vi. Editing the Operating system column because it contains the values with there version like windows 7,windows10.

- Editing the OS column

UPDATE laptops
SET t1.OpSys = CASE
               when OpSys like '%mac%' Then 'macos'
               when OpSys like 'windows%' Then 'windows'
               when OpSys like '%linux%' Then 'linux'
               when OpSys like 'No OS' Then 'N/A'
               else 'other'
              END;

vii. The final step was taking down some columns which were unncessary and not needed for further analysis.

ALTER TABLE laptops
DROP COLUMN ScreenResolution,Cpu, Gpu, Memory;

Data after performing the cleaning:

Conclusion:

In conclusion, the journey of data analysis is a rigorous expedition and indispensable practice of data cleaning. As highlighted throughout this blog, data cleaning serves as the guardian of data integrity, ensuring that the insights and conclusions drawn from our datasets are accurate, reliable, and meaningful.

After cleaning the dataset, it was more consistent and accurate for further use for analysis. Using SQL to clean the dataset was so challenging and along the line I got to learn a couple of things to which I was never exposed to.

If you enjoyed the blog buy me a coffee it would be appreciated.

Kindly leave your comments and remarks it will be highly appreciated.

Follow me on Linkedin

Data Cleaning Using SQL

Data Cleaning Process

Written by aakash