Data Cleaning in SQL

Chinenye Agba
6 min read · Mar 18, 2023


Nashville Real Estate Data

Photo from nanaimore

Introduction

Data analysts are known to analyse data and draw useful insights that help businesses plan, strategise, improve processes, increase revenue and stay ahead of the competition. However, before any dataset can be analysed, it has to be cleaned and prepared for use.

Data Cleaning

This is the process of formatting and transforming dirty data into clean data. It includes appropriate transformation of incorrect, duplicate, incomplete and inconsistent data so it can be useful for analysis. It is a crucial step in the analytical process, ensuring data integrity and accuracy in the end result of any analysis.

About This Data

This project involves cleaning the Nashville real estate data with SQL. The housing information captured in the data, which can be found here, contains lots of errors, incorrect datatypes, inconsistencies and duplicates, which were corrected and formatted using SQL.

My Data Cleaning Guide:

  1. Check and convert fields to appropriate datatypes where necessary
  2. Add new relevant fields
  3. Check and fix inconsistencies in data
  4. Extract values and format fields into useful data
  5. Remove unwanted data and duplicates
  6. Create View of the cleaned relevant data

Data Cleaning Process

I downloaded the dataset from the link above as an Excel file. The table contained 56,477 records and 19 fields. I then loaded it into Microsoft SQL Server for cleaning. There, I previewed the entire table to familiarise myself with the fields (columns) of the data and their datatypes.

I immediately identified that the ‘Sale Date’ field was in datetime format, so I converted it to just a date, as no time data was included.

SaleDate

I had to create a new field to hold the converted data, followed by an UPDATE statement to populate the new field with the converted values. I used this approach because the direct conversion did not work on the existing field. See the script below.
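A minimal sketch of this step, assuming the table is named NashvilleHousing (the table and field names here are assumptions):

```sql
-- Add a new date-typed field to hold the converted values
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;

-- Populate the new field by converting the original datetime values
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);
```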

The result:

Converted SaleDate

Next, I checked the ‘Property Address’ field and found that it had some null records. However, these nulls were duplicates of other records with the same ‘Parcel ID’ but different ‘Unique ID’ values, and the missing address appeared to be present in one of these duplicates. So, to retrieve the missing addresses, I wrote the script below, joining the table to itself in the UPDATE statement on ‘Parcel ID’, where the ‘Unique ID’ values in the two copies are not equal. I also used the ISNULL function to replace the nulls with the corresponding values.
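A sketch of that self-join update, with the same assumed table and field names:

```sql
-- Fill each null address from a duplicate parcel record that has one
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;
```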

My next action was to split the ‘Property Address’ field in order to separate the street from the city name, because the data is more useful this way, especially for visualisation. The field originally looked like this:

I used the PARSENAME function in the script below. Since PARSENAME only splits values on periods, I first replaced the comma delimiter in the address with a period (full stop) using REPLACE. I created new fields to hold the split address, then in the UPDATE statement used PARSENAME to split the values into street and city and place them in their respective new fields.
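A sketch of the split, assuming the new fields are named PropertySplitStreet and PropertySplitCity:

```sql
-- New fields to hold the split address
ALTER TABLE NashvilleHousing
ADD PropertySplitStreet NVARCHAR(255),
    PropertySplitCity NVARCHAR(255);

-- PARSENAME splits on periods and counts parts from the right,
-- so the comma is first replaced with a period
UPDATE NashvilleHousing
SET PropertySplitStreet = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2),
    PropertySplitCity = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1);
```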

The result:

Next, I replicated the same step on the ‘Owner Address’ field. These addresses contained street, city and state names separated by comma delimiters.

See script below:
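A sketch with assumed field names, where street, city and state are parts 3, 2 and 1 counting from the right:

```sql
ALTER TABLE NashvilleHousing
ADD OwnerSplitStreet NVARCHAR(255),
    OwnerSplitCity NVARCHAR(255),
    OwnerSplitState NVARCHAR(255);

UPDATE NashvilleHousing
SET OwnerSplitStreet = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
    OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
    OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1);
```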

The result:

Next, I took a distinct count of the ‘Sold As Vacant’ field and discovered that the entries were Yes, Y, No and N. I then updated the table, replacing every Y with Yes and every N with No, with the script below.
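A sketch of both the check and the fix, with the same assumed names:

```sql
-- Distinct count of the entries in the field
SELECT SoldAsVacant, COUNT(*) AS entry_count
FROM NashvilleHousing
GROUP BY SoldAsVacant;

-- Standardise Y and N to Yes and No
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
        WHEN SoldAsVacant = 'Y' THEN 'Yes'
        WHEN SoldAsVacant = 'N' THEN 'No'
        ELSE SoldAsVacant
    END;
```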

Next, I carefully checked all relevant fields that would be useful for further analysis and confirmed they were formatted correctly.

At this point, I prepared to delete duplicates in the table with a CTE and the ROW_NUMBER function, partitioned by the fields that together identify a unique record; this filters out the duplicates. First, I confirmed there were duplicates by running a SELECT statement against the CTE. Once confirmed, I used a DELETE statement on the CTE to remove them. I deferred this deletion until I had recovered all useful data from these records before their permanent removal.
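A sketch of the deduplication, where the columns in the PARTITION BY clause are my assumption of what identifies a unique record:

```sql
-- Rows sharing all the partitioned values are duplicates;
-- row_num > 1 marks every copy after the first
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SalePrice,
                            SaleDate, LegalReference
               ORDER BY UniqueID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;
```

Running the CTE with a SELECT in place of the DELETE previews exactly which rows would be removed.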

Then, I dropped unimportant fields that were not relevant to further analysis, such as the original fields I had previously converted and split into new ones while transforming this data, and the ‘Tax District’ field.
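A sketch, with the dropped field names assumed:

```sql
-- Remove the original fields that were converted or split into new ones
ALTER TABLE NashvilleHousing
DROP COLUMN SaleDate, PropertyAddress, OwnerAddress, TaxDistrict;
```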

Next, I created new fields with clearer labels than the aliased names I used previously. I followed this with an UPDATE statement to transfer the data from the aliased fields to their respective new fields, as shown below.
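A sketch of the pattern for one pair of fields, with hypothetical names:

```sql
-- Hypothetical clearer labels; the pattern repeats for each renamed field
ALTER TABLE NashvilleHousing
ADD PropertyStreet NVARCHAR(255),
    PropertyCity NVARCHAR(255);

UPDATE NashvilleHousing
SET PropertyStreet = PropertySplitStreet,
    PropertyCity = PropertySplitCity;
```

In SQL Server, sp_rename can rename a column in a single step, but the add-copy-drop pattern keeps every change visible as a plain script.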

Once transferred, I dropped the old columns from the table with the ALTER TABLE statement.

Finally, I created a view of my preferred data with the script below, reordering the fields logically.
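A sketch of the view, with an illustrative column list:

```sql
-- The view name, field names and order here are assumptions
CREATE VIEW NashvilleHousingCleaned AS
SELECT UniqueID,
       ParcelID,
       PropertyStreet,
       PropertyCity,
       SaleDateConverted AS SaleDate,
       SalePrice,
       SoldAsVacant
FROM NashvilleHousing;
```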

Note: before running any script that was not a SELECT statement, I would first run it as a SELECT statement to preview what the output would look like. After each alteration, I previewed the table again with a SELECT statement to confirm that the intended change had been applied correctly.

The final result is the table in the screenshot below.

To view the cleaned data, click here.

Also, you can view the entire script on GitHub.

The cleaned data still contains some null records, because I was not able to retrieve those values from anywhere in the dataset. Nonetheless, the fields that are vital to the analysis of this data are now clean and properly formatted.

Conclusion

Clean data is the baseline for trustworthy data exploration and analysis. It is a crucial step that cannot be ignored; otherwise, you would be left with data that is all over the place and from which you cannot draw any useful meaning.

Cleaning data in SQL is fast and an optimised way to work on large amounts of data, a process that may be slower in a spreadsheet. This project has shown the beauty of the data cleaning process, especially in SQL, and how the end result is well-organised, meaningful data ready for further analysis.

Thanks for reading!

I will be glad to read your feedback and comments below.


Chinenye Agba

Data Analyst/Client Success Expert. Love making sense of Data and exploring the Dataverse, one case study at a time!