Company Layoffs Database: SQL Data Cleaning & EDA
In this project, I will clean up a database and perform exploratory data analysis on it. I will do this by
- removing duplicates
- standardizing data
- addressing null values and blanks
- removing any rows and columns that are not needed
- conducting exploratory data analysis (EDA)
I start off by looking at a snapshot of our data. After first impressions, I create a duplicate of the database and store the original away to be referred to whenever needed.
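A minimal sketch of the staging-table step, assuming a MySQL setup where the raw data lives in a table named `layoffs` (the table names here are illustrative, not necessarily those of the original project):

```sql
-- Create a working copy with the same schema as the original,
-- so the raw table stays untouched as a reference.
CREATE TABLE layoffs_staging LIKE layoffs;

-- Populate the copy with every row from the original.
INSERT INTO layoffs_staging
SELECT * FROM layoffs;
```

All cleaning from here on happens in the staging table, never in the source data.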
Using ROW_NUMBER() OVER(), I am able to identify rows whose values repeat across every column, meaning they are duplicates.
I create a second duplicate table and populate it with the unique data points, removing the duplicates.
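The dedup step above might look like the following, assuming MySQL and a column list matching this kind of layoffs dataset (`company`, `location`, `industry`, and so on are assumptions about the schema):

```sql
-- Number each row within its group of identical values; any row
-- with row_num > 1 is an exact duplicate of an earlier row.
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Remove the duplicates, keeping one copy of each row.
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```

Partitioning by every column is what makes "duplicate" mean a full-row match rather than, say, two different layoff events at the same company.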
We proceed using the second duplicate table as our working dataset.
On to standardization: the “company” column contains words with inconsistent leading and trailing spaces, which could affect future analysis. I use the TRIM function to solve this issue.
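A one-line sketch of that fix (table name assumed from the earlier steps):

```sql
-- Strip leading and trailing spaces from company names.
UPDATE layoffs_staging2
SET company = TRIM(company);
```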
I move to the next column, “industry”. It has entries under both “Crypto” and “Crypto Currency”, which I decide are the same thing. Since “Crypto” appears more frequently, I standardize on that. Industry also has some null values and blanks.
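Collapsing the crypto variants could look like this (a sketch; the `LIKE 'Crypto%'` pattern assumes every variant starts with that prefix):

```sql
-- Fold "Crypto Currency" and similar variants into one label.
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';
```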
Next, I look at cases where a company in the same location appears more than once but has a null or blank industry in one of the entries. To make things easier, I first set the blanks to null values, then populate the nulls with industry values inferred from the matching rows.
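The blank-to-null conversion followed by a self-join fill might be sketched like this in MySQL (matching on `company` alone here for brevity; the real match could also include `location`):

```sql
-- Treat empty strings as NULL so both cases are handled uniformly.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Fill a missing industry from another row of the same company.
UPDATE layoffs_staging2 AS t1
JOIN layoffs_staging2 AS t2
    ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```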
After confirming that the inferable entries have been filled in, I move to the next column, “country”. There are two different “United States” values because one has a trailing period. Removing that period lets the system recognize them as one.
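A sketch of that fix using MySQL's TRIM with the TRAILING keyword:

```sql
-- Remove the trailing period so both spellings collapse into one.
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
```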
Checking the data types of the columns, I discover that the date column is stored as text. I convert it to the more appropriate ‘date’ data type.
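In MySQL this is typically a two-step change: parse the strings, then alter the column type. The `%m/%d/%Y` format string is an assumption about how the text dates are written:

```sql
-- Parse the text dates (assumed MM/DD/YYYY) into real DATE values.
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Now the column itself can safely become a DATE type.
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```

Running the ALTER before the UPDATE would fail on any string that MySQL cannot implicitly parse, which is why the order matters.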
My analysis will focus on the numeric side of the dataset, so null values in the numeric columns will be problematic.
Checking for nulls in our two numeric columns, I find that 348 rows have both columns null, which accounts for about 15% of the original data. Unfortunately, I cannot infer these values from the existing data, nor can I confidently input a calculated estimate. Since leaving these rows in would affect my analysis, I opt to remove them. The row_num column I used to find the duplicates is also no longer needed, so I remove that as well.
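A sketch of both removals, assuming the numeric columns are named `total_laid_off` and `percentage_laid_off`:

```sql
-- Drop rows where both numeric columns are missing; they add
-- nothing to a numbers-focused analysis.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

-- The helper column from the dedup step has served its purpose.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```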
Feeling content with the cleaning for now, I move on to my exploratory data analysis. Things I am curious about the data set include:
- What is the timeframe of the data set?
- Which companies went completely under during the period?
- Which companies laid off the most employees?
- Which industry laid off the most on average?
- Which countries laid off the most?
- How did layoffs progress over the period of interest?
- How did the stage a company is in affect layoffs?
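A few of these questions translate directly into aggregate queries. The sketch below assumes the column names used in the cleaning steps, and treats a `percentage_laid_off` of 1 as meaning 100% of staff let go:

```sql
-- Timeframe of the data set.
SELECT MIN(`date`) AS first_date, MAX(`date`) AS last_date
FROM layoffs_staging2;

-- Companies that went completely under (100% laid off).
SELECT company
FROM layoffs_staging2
WHERE percentage_laid_off = 1;

-- Total layoffs by company; swap "company" for industry, country,
-- or stage to answer the other grouping questions.
SELECT company, SUM(total_laid_off) AS total
FROM layoffs_staging2
GROUP BY company
ORDER BY total DESC;
```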
The data set runs from 11th March 2020 to 6th March 2023, beginning at the height of the worldwide coronavirus outbreak and covering the years that followed.
At the top of the list of industries are retail, food, and finance.
Large companies like Amazon, Google and Meta laid off the most employees during the period. However, these companies also have a disproportionately large number of employees in general.
Consumer and retail are leading industries in layoffs.
The USA, contributing the most companies to the database, naturally also has the highest number of layoffs.
Layoffs were lowest in 2020, but subsequent years saw higher numbers in the aftermath of the pandemic. Though 2023 shows a reduction in layoffs, only the first three months of that year are accounted for.
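The yearly progression can be read off a sketch like this (same assumed schema as above; note 2023 covers only January to March):

```sql
-- Total layoffs per calendar year.
SELECT YEAR(`date`) AS yr, SUM(total_laid_off) AS total
FROM layoffs_staging2
GROUP BY YEAR(`date`)
ORDER BY yr;
```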
The list of the top 5 companies laying off employees changed considerably from year to year over the period.
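One common way to see that year-to-year churn is to rank companies within each year, for example with DENSE_RANK (a sketch under the same schema assumptions):

```sql
-- Rank companies by total layoffs within each year, keep the top 5.
WITH company_year AS (
    SELECT company, YEAR(`date`) AS yr,
           SUM(total_laid_off) AS total
    FROM layoffs_staging2
    GROUP BY company, YEAR(`date`)
),
ranked AS (
    SELECT company, yr, total,
           DENSE_RANK() OVER (
               PARTITION BY yr ORDER BY total DESC
           ) AS rnk
    FROM company_year
)
SELECT company, yr, total
FROM ranked
WHERE rnk <= 5
ORDER BY yr, rnk;
```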
Finally, companies in Post-IPO stage laid off the most during the period.
Thank you for taking the time to look through this project; I hope it helps you along your analytics journey. You can see more projects from me here, and visit my GitHub for the full dataset. Let’s connect on LinkedIn.