Data Cleaning With MSSQL

Matt Damberg
11 min readAug 24, 2023

--

Photo by Caspar Camille Rubin on Unsplash

Dirty data leads to errors which sets off a cascade of negative events affecting our stakeholders as well as our reputation and our bottom line. Accurate data helps us ensure that we are as accurate as we possibly can be when using it to help stakeholders make business decisions. So how do we ensure that the data we use is accurate? Enter data cleaning. It is one of the most import skills a data analyst can possess and in the following article I'm going share with you a data cleaning project I recently completed.

This project comes from Alex the Analyst and uses multiple tools to clean the data including JOINS, CASE statements, SUBSTRING, PARSENAME, Common Table Expressions and Windows functions including RowNumber.

The Data

SELECT COUNT (*)
FROM [NashvilleHousing]

As we can see, we have a total of 56,477 total rows. And one table with a total of 19 Columns

This is data set examines the real estate market in Nashville, TN and shows a wide variety of information on homes in the City. To begin, I first looked through the different columns to see what immediately stood out to me and found several different areas which would require attention.

Date Formatting Issues

The first was a data formatting issue with the column SalesDate. It was formatted in a Date/Time format which was not necessary as all of the time values were values of 0. I remedied this with a simple SELECT statement using the CONVERT function to change SalesDate to a Date format and then created a new column for these values. This is demonstrated in the query below.

--Here we are converting from a Date/Time format to a Date format.

Select [SaleDateConverted], CONVERT(DATE,[SaleDate])
FROM [dbo].[NashvilleHousing]

UPDATE [dbo].[NashvilleHousing]
SET SaleDate = CONVERT(DATE,[SaleDate])

--Creating a new column for the Date column by altering table and then updating database.

ALTER TABLE [dbo].[NashvilleHousing]
ADD SaleDateConverted Date

UPDATE [dbo].[NashvilleHousing]
SET SaleDateConverted = CONVERT(DATE,[SaleDate])

Finding and Populating Null Values

When looking through the table, I noticed that the column PropertyAddress had some NULL values present. Using a SELECT statement I pulled all the NULL values for the PropertyAddress Column.

--Pulling NULL values from the PropertyAddress Column. 

SELECT *
FROM [dbo].[NashvilleHousing]
WHERE [PropertyAddress] IS NULL
ORDER BY [ParcelID]

Upon further inspection, these NULL PropertyAddress values all had ParcelID’s that matched other row’s ParcelID’s which had addresses. The query directly below shows all of these cases while an individual example of one of these cases is demonstrated in the in the next query below that.

--This self join shows the PropertyAddresses' Parcel ID's  match and therefore should have the same address populate.

SELECT a.[ParcelID], a.[PropertyAddress], b.[ParcelID], b.[PropertyAddress]
FROM [dbo].[NashvilleHousing] a
JOIN [dbo].[NashvilleHousing] b
ON a.[ParcelID] = b.[ParcelID]
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.[PropertyAddress] IS NULL
 
-- A pair of matching columns was isolated for an example

SELECT
[ParcelID],
[PropertyAddress],
[OwnerAddress]
FROM [dbo].[NashvilleHousing]
WHERE [OwnerName] = '%COSTNER, FRED%
ORDER BY [ParcelID]

As you can see, both rows contain the same ParcelID and OwnerAddress but only one has a PropertyAddress. Knowing that the address of the property was not likely to change and that the ParcelID’s both matched, I decided to replace the NULL value with the matching PropertyAddress. I first use a SELECT Statement with ISNULL and then use UPDATE to add it to the table.

--Use ISNULL to populate the NULL addresses in a.PropertyAddress.

SELECT a.[ParcelID], a.[PropertyAddress], b.[ParcelID], b.[PropertyAddress], ISNULL(a.[PropertyAddress], b.[PropertyAddress])
FROM [dbo].[NashvilleHousing] a
JOIN [dbo].[NashvilleHousing] b
ON a.[ParcelID] = b.[ParcelID]
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.[PropertyAddress] IS NULL

--UPDATE Table to add addresses into table.

UPDATE a
SET [PropertyAddress] = ISNULL(a.[PropertyAddress], b.[PropertyAddress])
FROM [dbo].[NashvilleHousing] a
JOIN [dbo].[NashvilleHousing] b
ON a.[ParcelID] = b.[ParcelID]
AND a.[UniqueID ] <> b.[UniqueID]
WHERE a.[PropertyAddress] IS NULL

To double check our work, we can run the SELECT statement looking for NULL values again. If we have done this correctly then no records should be returned.

SELECT *
FROM [dbo].[NashvilleHousing]
WHERE [PropertyAddress] IS NULL

As we can see there are no more NULL values, indicating we have successfully populated the PropertyAdress column.

Splitting Up Address Records

Method 1: SUBSTRING, CHARINDEX

The next task I set myself upon was the ProprtyAddress Column itself. The column had Address and City crammed together into one value which would make it exceedingly difficult to work with as well as read. An example of this is provided below.

To break up the different values of PropertyAddress we will use the SUBSTRING as and the CHARINDEX function. Lets begin with the removing the address itself.

SUBSTRING returns a selected number of characters from a string. In the query below we first indicate that the SUBSTRING will come from the PropertyAddress column. We then see a number 1 which indicates the starting position of where we want to begin selecting string (in this case the first character hence the 1). Next, we then need to specify how many characters we want for our substring. This can be done with numbers or with other functions. Here we will use CHARINDEX. This function searches for a specific designated value within a string (which in our case is a comma). So the SUBSTRING will select string until it reaches the comma as set by the CHARINDEX. We then add a (-1) value to the end of the function instructing it to remove one character from the end. This is done to remove the comma delimiter that remained at the end of the Address seperating it from the City.

--Remove using SubString

SELECT
SUBSTRING([PropertyAddress], 1, CHARINDEX(',', [PropertyAddress])-1) AS Address
FROM [dbo].[NashvilleHousing]

Next, we pull the City value from the column. We will again do this by using SUBSTRING and CHARINDEX. Our starting point for this statement is going to be the comma where we ended our last statement, so instead of a number, we can just place the CHARINDEX from the previous query into this spot. However, we got rid of the comma by using the (-1) value, so we will need to add a (+1) in order to add the comma back. We then set the length for the string we want returned, to the length of the PropertyAddress.

SELECT
SUBSTRING([PropertyAddress], 1, CHARINDEX(',', [PropertyAddress])-1) AS Address, --Selects address from string.
SUBSTRING([PropertyAddress], CHARINDEX(',', [PropertyAddress])+1, LEN([PropertyAddress])) AS City--Selects City from string.
FROM [dbo].[NashvilleHousing]

After these queries were executed, new columns titled PropertySplitAddress and PropertySplitCity were added and UPDATED to the table along with the new values.

ALTER TABLE [dbo].[NashvilleHousing]
ADD PropertySplitAddress NVARCHAR(255) --Splits the Address

UPDATE [dbo].[NashvilleHousing]
SET PropertySplitAddress = SUBSTRING([PropertyAddress], 1, CHARINDEX(',', [PropertyAddress])-1)


ALTER TABLE [dbo].[NashvilleHousing]
ADD PropertySplitCity NVARCHAR(255) --Splits the City

UPDATE [dbo].[NashvilleHousing]
SET PropertySplitCity = SUBSTRING([PropertyAddress], CHARINDEX(',', [PropertyAddress])+1, LEN([PropertyAddress]))

After addressing the PropertyAddress problem, a similar text based issue was discovered in the OwnersAddress column. This column also contained multiple values which would be better separated. This time we need to separate out the Address, City and State from column.

Method 2: PARSENAME

This time, lets use a different method to separate these values from the OwnersAddress column. We can repurpose the PARSENAME function here to snip them out and later make them into their own individual columns. This function searches for chunks of string which are delimited by PERIODS. We will assign numerical values to the chunks of delimited string (1–3 in this case) and it will then read it from RIGHT to LEFT. If you look above, you can see an example of the OwnerAddress with a comma delimiter. However, as I've stated, the string is delimited by commas, so we will need to use REPLACE to swap them with periods.

As you can see from the query below, we use the REPLACE function to indicate that the commas should be replaced with periods. Next, we assign numerical values to each part of the delimited string, remembering that PARSENAME reads from RIGHT to LEFT. The value 3 is assigned to Address as it’s the last to be read, the value 2 is assigned to City as it’s the second to be read and the 1 value is assigned to State as it’s (you guessed it) the first to be read.

--Replacing periods with commas then Parsing out Address, City and State.

PARSENAME(REPLACE([OwnerAddress],',', '.') , 3) AS Address,
PARSENAME(REPLACE([OwnerAddress],',', '.') , 2) AS City,
PARSENAME(REPLACE([OwnerAddress],',', '.') , 1) AS State

FROM [dbo].[NashvilleHousing]

Now that we have separated these values we can ALTER our table and create new individual columns for these parts of the address. Lastly, we can then UPDATE the table to add the new values to the column.

--Now that Address, City and State are seperated, we can add them as columns to our table. 

ALTER TABLE [dbo].[NashvilleHousing]
ADD OwnerSplitAddress NVARCHAR(255)

UPDATE [dbo].[NashvilleHousing]
SET OwnerSplitAddress = PARSENAME(REPLACE([OwnerAddress],',', '.') , 3) --Adding the Address

ALTER TABLE [dbo].[NashvilleHousing]
ADD OwnerSplitCity NVARCHAR(255)

UPDATE [dbo].[NashvilleHousing]
SET OwnerSplitCity = PARSENAME(REPLACE([OwnerAddress],',', '.') , 2) --Adding the City


ALTER TABLE [dbo].[NashvilleHousing]
ADD OwnerSplitState NVARCHAR(255)

UPDATE [dbo].[NashvilleHousing]
SET OwnerSplitState = PARSENAME(REPLACE([OwnerAddress],',', '.') , 1) --Adding the State

Inconsistent Values

After solving several string based problems it was time to switch gears and tackle some inconsistent values. SoldAsVacant is a true or false column containing multiple different result values including ‘Y’, ‘N’, ‘Yes’ and ‘No’. This non-standardized format is confusing and will be difficult to work with. Lets change the possible result values to a set pair.

First, we will count each result type to see what value is used the most and set those as our standard. As you can see below, ‘Yes’ and ‘No’ have the highest count so they will be our values.

--Inconsistent result values for this field (Y,N,Yes,No)

Select
Distinct [SoldAsVacant],
Count([SoldAsVacant])
FROM [dbo].[ExtraNashvilletable]
GROUP BY [SoldAsVacant]

To attain this, a simple CASE statement is used to indicate that anytime a ‘Y’ value is found, it should be switched to a ‘Yes’ value, anytime an ’N’ value is found it should be switched to a ‘No’ value. If it is neither of these values it should remain the way it is, as it is already in the correct format. We then UPDATE our new values into the table as demonstrated below.

--Replaces 'Y' and 'N' with 'Yes' and 'No', leaves all other values. 

SELECT
[SoldAsVacant],
CASE WHEN [SoldAsVacant] = 'Y' THEN 'Yes'
WHEN [SoldAsVacant] = 'N' THEN 'No'
ELSE [SoldAsVacant]
END
FROM [dbo].[NashvilleHousing]

--Now we can UPDATE this into our table

UPDATE [dbo].[NashvilleHousing]
SET [SoldAsVacant] = CASE WHEN [SoldAsVacant] = 'Y' THEN 'Yes'
WHEN [SoldAsVacant] = 'N' THEN 'No'
ELSE [SoldAsVacant]
END

This standardizes all result values to ‘Yes’ and ‘No’. Lets do another COUNT to check our work. If it returns only Yes and No values then we have been successful.

Select 
SoldAsVacant,
Count(SoldAsVacant)
FROM [dbo].[ExtraNashvilletable]
Group by SoldAsVacant
Order by 2

Removing Duplicates

One of the most common tasks of data cleaning involves the finding and removal of duplicate records. Duplicates skew data, waste time, create poor customer relationships as well as inaccurate analytics and results in poor decision making which harms the performance of the business. Ensuring an accurate data set allows us as analysts and developers to assist stakeholders in making the best possible informed and actionable decisions for the business.

To find duplicate values for this table we will use a CTE or (Common Table Expression) as well as the Row_Number window function. The CTE will create a query which we can later SELECT from. The Row_Number will do just as its name suggests, assign each row a number. We begin by selecting columns for our CTE which should be unique to each record. We then query off of that CTE with a WHERE statement where the row number is greater than 1. The values that result are assigned a row number of 2 and indicates that these values, which should be unique, are appearing more than once and are therefore duplicates. This is demonstrated in the query below.

--Duplicates will be assigned a row number of 2

WITH RowNumCTE AS (
SELECT * ,
ROW_NUMBER () OVER(
PARTITION BY [ParcelID],
[PropertyAddress],
[SalePrice],
[SaleDate],
[LegalReference]
ORDER BY [UniqueID ]
) AS row_num
FROM [dbo].[NashvilleHousing]
)

Select *
FROM RowNumCTE
WHERE row_num >1

In the bottom left corner of the the picture below, we can see that we have 104 total duplicate rows.

Once we have found our duplicate values there are multiple ways of removing them from the data. As this is just a Data Project, I will be deleting them. However, deleting data is not common practice in the real world and should only be done in very specific situations and with the express permission and authority from your Database Administrator.

--The 104 duplicate rows are deleted

WITH RowNumCTE AS (
SELECT * ,
ROW_NUMBER () OVER(
PARTITION BY [ParcelID],
[PropertyAddress],
[SalePrice],
[SalePrice],
[LegalReference]
ORDER BY [UniqueID ]
) AS row_num
FROM [dbo].[ExtraNashvilletable]
)

DELETE
FROM RowNumCTE
WHERE row_num >1

Removing Unused Columns

Finally, we now have several original columns left over. These include the original PropertyAddress and OwnersAddress both of which we no longer need and SalesDate which was converted from Date/Time to Date. Additionally, we have another column present in the table titled TaxDistrict which is not relevant to what we need from the data set. For our final step, lets DROP these columns from the table.

--Drop original unsplit address columns, TaxDistrict and SalesDate

ALTER TABLE [dbo].[NashvilleHousing]
DROP COLUMN
[OwnerAddress],
[TaxDistrict],
[PropertyAddress]

ALTER TABLE [dbo].[NashvilleHousing]
DROP COLUMN SaleDate

Conclusion

In conclusion, data cleaning is arguably the most important step in the Data Analytics Life Cycle. By using clean data, we can find real trends, spot real opportunities, and make better choices for our stakeholders. So, data cleaning doesn’t just improve our analyses, it also helps us make smarter decisions for business growth and success. Just as a strong building needs a solid foundation, accurate and reliable data is crucial for making informed and actionable decisions.

If you enjoyed reading my post, please consider following me on LinkedIN to view my other portfolio projects as well as join me on my journey from an ER Nurse to the world of Data and Business Development.

www.linkedin.com/in/matt-damberg-a44836285

--

--