Data Analysis Project: Data Cleaning & Transformation Using SQL

Sari Saldi
7 min read · Aug 22, 2023


About Dataset

This dataset contains information about housing sales in Nashville, TN such as property, owner, sales, and tax information.

Purpose

The purpose of this data cleaning project is to enhance the quality, consistency, and usability of the dataset by implementing a structured and comprehensive data cleaning procedure, enabling more reliable and insightful analysis, reporting, and decision-making.

Note:

I’ll be using PostgreSQL as the tool to carry out these data cleaning tasks.

Data Cleaning Procedure

1) Standardize Date Format:

The goal is to ensure uniformity in date representation across the dataset, eliminating variations in date formats that might cause confusion or inaccuracies during analysis.

-- Standardize date format

select saledate
from housing;

-- Update the data type of the "saledate" column from timestamp to date

alter table housing
alter column saledate type date using saledate::date;
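To confirm the change took effect, we can look up the column's type in information_schema. This is a quick sketch; it assumes the table lives in the default public schema:

-- Verify the new data type of "saledate" (it should now be 'date')
select column_name, data_type
from information_schema.columns
where table_name = 'housing'
  and column_name = 'saledate';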

2) Populate Property Address Data:

The aim is to fill in missing or incomplete property address data, creating a more complete and accurate dataset for analysis.

First, I’ll check whether the “propertyaddress” column contains any missing values. After examining this column, it turns out that there are multiple instances of missing values. This isn’t something we want, so we need to address it. Notably, whenever two rows share the same “parcelid”, their “propertyaddress” values are also the same.
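A simple way to run that check is a null-filtered count; here’s a quick sketch:

-- Count rows where "propertyaddress" is missing
select count(*) as missing_addresses
from housing
where propertyaddress is null;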

Here’s an approach: if there’s a missing value in the “propertyaddress” column, we can fill it with the corresponding value from the same column where the “parcelid” matches.

To achieve this, I’ll employ a self-join technique.

/*
a) Test the query using coalesce and compare the original values that are
still null with the values that have been filled using the coalesce
function.
*/

select a.parcelid,
       a.propertyaddress,
       b.parcelid,
       b.propertyaddress,
       coalesce(a.propertyaddress, b.propertyaddress)
from housing a
join housing b
  on a.parcelid = b.parcelid
 and a.uniqueid <> b.uniqueid
 and a.propertyaddress is null
order by a.parcelid;

At first glance, this query might seem complex, but its purpose is to identify instances where the “propertyaddress” column contains missing values and to explore potential solutions. Let’s break down the query step by step.

  1. We select columns from two instances of the “housing” table, labeled as “a” and “b.” The columns include “parcelid” and “propertyaddress” from both instances.
  2. We perform a self-join on the “housing” table using the “parcelid” as the connecting factor. This self-join allows us to compare records within the same table.
  3. The condition a.uniqueid <> b.uniqueid ensures that we're comparing distinct records and not a record with itself.
  4. We focus on records where the “propertyaddress” in the instance “a” is missing (IS NULL).
  5. The COALESCE function combines the "propertyaddress" values from instances "a" and "b." If "propertyaddress" in "a" is missing, it's replaced with the value from "b."
  6. Finally, we order the results by “parcelid” for better clarity.

This query aids in identifying missing property addresses by leveraging the relationship between records with the same “parcelid” but distinct “uniqueid.” It also demonstrates how the COALESCE function can help us address missing data points using information from related records.
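To see COALESCE on its own, here is a one-line illustration with literal values (the address is made up):

-- coalesce returns its first non-null argument
select coalesce(null, '123 MADE UP ST');  -- returns '123 MADE UP ST'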


-- b) Update the column

update housing a
set propertyaddress = coalesce(a.propertyaddress, b.propertyaddress)
from housing b
where a.parcelid = b.parcelid
  and a.uniqueid <> b.uniqueid
  and a.propertyaddress is null;
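After the update, re-running the missing-value check is a quick way to verify the fix; the count below should now be zero (aside from any parcels that have no address on any of their rows):

-- Verify: how many property addresses are still missing?
select count(*) as still_missing
from housing
where propertyaddress is null;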

3) Breaking Out Address into Individual Columns:

This step involves dissecting the compound address field into separate columns for address, city, and state. This enhances data organization, simplifies querying, reduces confusion, and supports better geographical analysis.

Upon closer examination, it becomes evident that the “propertyaddress” column contains both an address and a city name, which are separated by a delimiter, typically a comma.
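PostgreSQL’s split_part function does the dissecting. Here’s how it behaves on a standalone address literal (the address is illustrative):

-- split_part(string, delimiter, n) returns the n-th piece of the string
select split_part('1808 FOX CHASE DR, GOODLETTSVILLE', ',', 1) as address,       -- '1808 FOX CHASE DR'
       trim(split_part('1808 FOX CHASE DR, GOODLETTSVILLE', ',', 2)) as city;    -- 'GOODLETTSVILLE'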

Property Address

-- a) Breaking out the propertyaddress column into address & city

alter table housing
add column property_address varchar(50);

update housing
set property_address = split_part(propertyaddress, ',', 1);

alter table housing
add column property_city varchar(50);

-- trim() strips the leading space left after the comma
update housing
set property_city = trim(split_part(propertyaddress, ',', 2));

Owner Address

-- b) Breaking out the owneraddress column into address, city & state

alter table housing
add column owner_address varchar(50);

update housing
set owner_address = split_part(owneraddress, ',', 1);

alter table housing
add column owner_city varchar(50);

-- trim() strips the leading space left after each comma
update housing
set owner_city = trim(split_part(owneraddress, ',', 2));

alter table housing
add column owner_state varchar(5);

update housing
set owner_state = trim(split_part(owneraddress, ',', 3));
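Before moving on, a quick spot check of the new columns helps confirm the split behaved as intended:

-- Sanity-check the split on a handful of rows
select owner_address, owner_city, owner_state
from housing
limit 10;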

4) Change “Sold as Vacant” Field Values:

By replacing “Y” and “N” with “Yes” and “No” in the “Sold as Vacant” field, the dataset gains clearer and more intuitive values, improving readability and interpretability.

-- a) Check how many distinct values the 'soldasvacant' column contains, and count each

select distinct soldasvacant, count(soldasvacant)
from housing
group by 1
order by 2;

-- b) Update the column

update housing
set soldasvacant = case when soldasvacant = 'Y' then 'Yes'
                        when soldasvacant = 'N' then 'No'
                        else soldasvacant
                   end;

-- and check again

select distinct soldasvacant, count(soldasvacant)
from housing
group by 1
order by 2;
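As a side note on design: if the schema were being redesigned, this flag could be stored as a proper boolean instead of text. The statement below is a sketch of that alternative, not something executed in this project:

-- Alternative sketch (an assumption, not part of the original cleanup):
-- convert the standardized Yes/No flag into a true boolean column
alter table housing
alter column soldasvacant type boolean using soldasvacant = 'Yes';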

5) Remove Duplicates Using Window Functions:

The objective is to identify and eliminate duplicate records in the dataset using advanced techniques like window functions. This ensures data accuracy and prevents overrepresentation of certain entries.

Note:

I won’t actually execute the query below on the real dataset, since deleting rows could remove data points we still need. It carries a certain level of risk, so before removing data it’s better to consider the potential effects on data quality. Furthermore, we initially established “uniqueid” as the primary key, which already guarantees the absence of duplicate values in that crucial field. Nevertheless, if we ever do need to eliminate duplicate rows from the table, this is the approach to follow.

-- Remove duplicates using a window function

with duplicate_values as (
select uniqueid,
       row_number() over (partition by parcelid,
                                       propertyaddress,
                                       saleprice,
                                       saledate,
                                       legalreference
                          order by uniqueid) as row_num
from housing
)
-- Preview first: select * from duplicate_values where row_num > 1
-- PostgreSQL can't delete rows through a CTE, so we match on the primary key instead
delete from housing
where uniqueid in (select uniqueid
                   from duplicate_values
                   where row_num > 1);

Note: If row_num is 2 for a given row, that row is the second occurrence within its group of duplicates.
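Before ever running the delete on real data, counting the rows it would remove is a cheap sanity check; this sketch reuses the same CTE:

-- Preview how many rows the delete would remove
with duplicate_values as (
select row_number() over (partition by parcelid, propertyaddress, saleprice,
                                       saledate, legalreference
                          order by uniqueid) as row_num
from housing
)
select count(*) as rows_to_remove
from duplicate_values
where row_num > 1;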

6) Delete Unused Columns:

Removing unnecessary or redundant columns streamlines the dataset and eliminates clutter. This simplifies analysis, reduces confusion, and optimizes data storage.

We are removing the “propertyaddress” and “owneraddress” columns because their data has already been transformed and organized in step 3, “Breaking Out Address into Individual Columns.”

alter table housing
drop column propertyaddress;

alter table housing
drop column owneraddress;
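To confirm the columns are really gone, we can list the table’s remaining columns; a quick sketch (again assuming the default public schema):

-- "propertyaddress" and "owneraddress" should no longer appear here
select column_name
from information_schema.columns
where table_name = 'housing';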

In the realm of data analysis, the journey toward meaningful insights starts with data cleaning, a process that transforms raw information into a refined asset. As we’ve worked through Nashville’s housing sales dataset, we’ve seen how each step, from standardizing dates to dissecting addresses and eliminating duplicates, contributes to a dataset that is not only accurate but also well suited to reliable analysis, reporting, and decision-making. By embracing the craft of data cleaning, we’ve gained a clearer lens through which to understand and leverage this valuable data. As you venture into your own data projects, may this walkthrough serve as a compass toward trustworthy, insightful, and well-prepared data. Happy data cleaning!
