Data Analysis Project: Cleaning a Messy Job Postings Dataset using SQL.

Maria-Goretti Anike
5 min read · Jun 12, 2023


Hello everyone. I know, I know, it’s been a while, yeah. But hey, I’m back now, so let’s get to it, shall we? If you’d like to go straight to the cleaning process, you can head on to the third paragraph. But you could read on for a little backdrop to this story.

After I completed my first project and came down from the ‘oh-wow-I-actually-did-a-project-on-my-own’ high, the question that plagued me was, “Okay, what next?” I had an EDA project down, so I wanted my next project to be on Data Cleaning. I then watched a YouTube video by Data with Mo where he talked about project ideas and gave two different data cleaning ones: one for the FIFA ’21 Players dataset, and another for Data Science Job Postings on Glassdoor. Since a lot of people have worked on the first, I decided to go with the latter. I dragged my feet on this, I won’t lie, ‘cos I wanted to perform the cleaning using SQL, so I took time to learn some data cleaning queries and functions. And now, here we are, with my completed Data Cleaning project. Yay!

After downloading the dataset from Kaggle, I opened it in Excel:

[Image: the dataset as a CSV file]

I then imported it into SQL Server.

[Image: the dataset in SQL Server]

The goals of this cleaning exercise are to:

  1. Remove the dollar signs, the letter ‘K’ and the ‘(Glassdoor est.)’ from the Salary Estimate column. We should be left with only numbers in that column.
  2. Get rid of the Duplicate values in the dataset.
  3. Get rid of the numbers in the Company name.
  4. Create new City and State columns from the Location column.

The first thing I did was remove the index column, ‘cos it’s a leftover from the dataset’s Python export, and why would I need two index columns anyway? Also, working with spaced-out column names that need ‘[]’ around them irks me, so I renamed the columns to replace the spaces with underscores.
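If you’d like to follow along, here’s a rough sketch of that step. I’m assuming the table is named glassdoor_jobs and the stray index column is literally named index; adjust both to match your own import.

```sql
-- Drop the leftover index column from the Python export
ALTER TABLE glassdoor_jobs
DROP COLUMN [index];
GO

-- Swap the spaced-out names for underscored ones (one call per column)
EXEC sp_rename 'glassdoor_jobs.[Salary Estimate]', 'Salary_Estimate', 'COLUMN';
EXEC sp_rename 'glassdoor_jobs.[Company Name]', 'Company_Name', 'COLUMN';
EXEC sp_rename 'glassdoor_jobs.[Job Title]', 'Job_Title', 'COLUMN';
```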

As mentioned earlier, I removed the ‘$’ and ‘K’ from the Salary_Estimate column by replacing them with empty strings (‘’).
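In T-SQL, two nested REPLACE calls handle both characters in one pass; a minimal sketch (table name assumed, as before):

```sql
-- '$' and 'K' both become empty strings
SELECT REPLACE(REPLACE(Salary_Estimate, '$', ''), 'K', '') AS Salary_Estimate
FROM glassdoor_jobs;
```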

Now for the removal of ‘(Glassdoor est.)’. After going through a bit of stress, I found three different ways to remove it. Three ways, the same output. God! How I love SQL!
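For the curious, here are three candidates that produce the same output. The substring-based ones assume every value actually contains a ‘(’ (CHARINDEX returns 0 otherwise, and the arithmetic breaks):

```sql
-- Way 1: REPLACE the exact text with an empty string
SELECT REPLACE(Salary_Estimate, '(Glassdoor est.)', '')
FROM glassdoor_jobs;

-- Way 2: SUBSTRING up to the character just before the '('
SELECT SUBSTRING(Salary_Estimate, 1, CHARINDEX('(', Salary_Estimate) - 1)
FROM glassdoor_jobs;

-- Way 3: LEFT, same idea as Way 2 with less typing
SELECT LEFT(Salary_Estimate, CHARINDEX('(', Salary_Estimate) - 1)
FROM glassdoor_jobs;
```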

Then I added the new Salary_Estimate column to the table, trimmed the white spaces in the new column, and deleted the old column.
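Sketched out, that sequence looks something like this. Salary_Est is just a name I’m using for illustration, and TRIM needs SQL Server 2017+; on older versions, LTRIM(RTRIM(...)) does the same job.

```sql
-- 1. Add the replacement column
ALTER TABLE glassdoor_jobs ADD Salary_Est VARCHAR(50);
GO

-- 2. Populate it with the cleaned, trimmed values
UPDATE glassdoor_jobs
SET Salary_Est = TRIM(REPLACE(REPLACE(REPLACE(Salary_Estimate,
                     '(Glassdoor est.)', ''), '$', ''), 'K', ''));

-- 3. Drop the old column
ALTER TABLE glassdoor_jobs DROP COLUMN Salary_Estimate;
```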

Time to get rid of the duplicate values. While going through the dataset, I noticed that some ‘duplicate values’ were not really duplicates. Some job postings had the same information in all the columns except the Salary_Estimate column.

An example:

Here, we see that we have 167 rows of duplicate values, but once we add the Salary_Estimate column to the mix, here’s what we get.

Just 13 rows, showing that the other rows really did differ in salary. So I deleted these true duplicates and was left with 659 rows.
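Here’s the shape of both queries. The column lists below are a stand-in for ‘every column in the table’ (the real dataset has more), and ROW_NUMBER over a CTE keeps one row per group while deleting the rest:

```sql
-- Duplicates when salary is ignored
SELECT Job_Title, Company_Name, Location, COUNT(*) AS cnt
FROM glassdoor_jobs
GROUP BY Job_Title, Company_Name, Location
HAVING COUNT(*) > 1;

-- True duplicates: salary included this time
WITH dupes AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Job_Title, Salary_Est, Company_Name, Location
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM glassdoor_jobs
)
DELETE FROM dupes
WHERE rn > 1;   -- deleting from the CTE deletes from the underlying table
```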

Then I removed the numbers from the company name, added the new Company_Name column, and dropped the old column.
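In this dataset, the numbers are the Glassdoor rating glued onto the name after a line break, so one way (the one sketched here; your copy may use a different separator) is to keep only the text before the embedded CHAR(10):

```sql
-- Company_Name_Clean is an illustrative name
ALTER TABLE glassdoor_jobs ADD Company_Name_Clean VARCHAR(100);
GO

UPDATE glassdoor_jobs
SET Company_Name_Clean =
    CASE WHEN CHARINDEX(CHAR(10), Company_Name) > 0
         THEN TRIM(LEFT(Company_Name, CHARINDEX(CHAR(10), Company_Name) - 1))
         ELSE Company_Name
    END;

ALTER TABLE glassdoor_jobs DROP COLUMN Company_Name;
```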

I broke out the City and State names from the Location column, then created new columns for Location_City and Location_State.
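The split itself leans on CHARINDEX to find the comma; a sketch:

```sql
-- City: text before the comma; State: text after it
SELECT Location,
       SUBSTRING(Location, 1, CHARINDEX(',', Location) - 1) AS Location_City,
       SUBSTRING(Location, CHARINDEX(',', Location) + 1, LEN(Location)) AS Location_State
FROM glassdoor_jobs
WHERE CHARINDEX(',', Location) > 0;   -- comma-less rows are dealt with a few steps down
```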

I also did the same for the Headquarters column.

While doing this separation, I observed that some rows didn’t have a comma (,) at all, which made them impossible to split and left their new City and State values NULL.

So I decided to correct this by doing a self-join on the table, and replacing these NULL values with information from the Location column.
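The net effect of that fix boils down to something like the sketch below: when there’s no comma, the whole Location value is the best stand-in available. (My actual query did this via a self-join; this is the simpler equivalent.)

```sql
-- No comma means nothing was split off, so fall back to the raw Location
UPDATE glassdoor_jobs
SET Location_City = Location
WHERE Location_City IS NULL;
```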

Then I changed ‘-1’s in the Founded column to NULL, and then deleted unused columns.
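Both of those are one-liners; NULLIF makes the first especially tidy. The dropped columns below are illustrative, the ones whose contents now live elsewhere:

```sql
-- -1 is a placeholder, not a founding year; turn it into a real NULL
UPDATE glassdoor_jobs
SET Founded = NULLIF(Founded, -1);

-- Drop the columns we no longer need
ALTER TABLE glassdoor_jobs
DROP COLUMN Location, Headquarters;
```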

Whew! That’s that for that. If you read up to this point, thank you so much. You can go through my code for this project on my GitHub page here. You could also connect with me on Twitter and LinkedIn.

Till next time.

Bye!

