Company Layoffs Database: SQL Data Cleaning & EDA

rolanda azeem
5 min readJun 20, 2024

--

In this project, I will clean up a database and perform exploratory data analysis on it. I will do this by

  • removing duplicates
  • standardizing data
  • addressing null values or blanks
  • removing any rows and columns not needed
  • conduct an EDA

I start off by looking at a snapshot of our data. After first impressions, I create a duplicate of the database and store the original away to be referred to whenever needed.

snapshotting and duplicating the database

Using row_number() over(), I am able to suss out where a particular row has repeat datapoints in all columns meaning it is a duplicate.

highlighting duplicates within the dataset

I create a second duplicate table and populate it with the unique data points, removing the duplicates.

creating 2nd duplicate table
current duplicate table without duplicate data points

We proceed using the second duplicate table as our working dataset.

On to standardization, the column “company” contains words and spaces. The spaces sometimes come before the words or after, creating inconsistencies that may affect future analysis. I us the trim function to solve this issue.

using trim to cleanup spaces

I move to the next column “industry”. The industry column has entries under both “Crypto” and “Cryto Currency” which I decide are the same thing. Since “Cryto” appears more frequently, I would be using that. Industry also has some null values and blanks.

handling industry column

Next, I look at where a company in the same location appears more than once but has industry nulls or blanks in one of the entries. To make things easier, I will set the blanks to null values, the populate the nulls with industry values inferred from the existing data.

populating nulls with inferred values

After confirming that inferable entries have been made, I move to the next column “country”. There are 2 different “United States” because one has a period at the end. We need to remove said period so that the system recognizes them as one.

removing period United States

Checking the data types of the columns, I discover that date column is a text data type. I convert it to the more appropriate ‘date’ data type.

updating date column to date data type

My analysis will focus on the numeric aspect of the dataset, having null values the numeric columns will be problematic.

dropping unneeded rows and columns

Checking for null values in our 2 numeric columns, I find that 348 rows have both numeric columns null, this accounts for about 15% of our original data. Unfortunately, I can not infer these values from the existing data, nor can I confidently input a calculated estimate. I decided that leaving these rows in will affect my analysis so opt to remove them. Also, the row_num column I used to find the duplicates is no longer needed so I remove that as well.

Feeling content with the cleaning for now, I move on to my exploratory data analysis. Things I am curious about the data set include:

  • What is the timeframe of the data set ?
  • Which companies went completely under during the period ?
  • Which companies are laying off the most employees ?
  • Which industry laid off the most on average ?
  • Which countries laid off the most ?
  • How did layoffs progress over the period of interest ?
  • How the stage a company is in affected layoffs.
EDA 1

The data set is from 11th March 2020 to to 6th March 2023, starting at the peak of the world wide outbreak of the corona virus and afterwards.

At the top of the list of industries we have retail, food and finance.

Large companies like Amazon, Google and Meta laid off the most employees during the period. However, these companies also have a disproportionately large number of employees in general.

Consumer and retail are leading industries in layoffs.

USA contributing the most companies to the database naturally also has the most number of layoffs.

layoffs per year

The layoffs were lowest in 2020, but subsequent years saw higher numbers in the aftermath of the pandemic. Though 2023 shows a reduction in layoffs, only 3 months out of the year are accounted for.

The top 5 companies laying off employees over the period was quite volatile.

Finally, companies in Post-IPO stage laid off the most during the period.

Thank you for taking the time to to look through this project, I hope it helped you along your analytics journey. You can see more projects from me here , visit my Github for the full dataset. Let’s connect on LinkedIn.

--

--