Oluwagbenga Ajetomobi
4 min readApr 30, 2024

Data Analysis Project: Data Cleaning and Exploratory Data Analysis Of Employee Layoff Data Using MYSQL

Introduction:

In this data analysis project, I conducted a comprehensive analysis of an employee layoff dataset for the year 2022. The dataset contains information on company layoffs, including company name, location, industry, total laid off, percentage laid off, date, stage, country, and funds raised. I explored various aspects of the dataset to derive insights and identify trends using data analysis vocabularies and techniques. Link to the dataset here

Dataset Overview:

  • Dataset Name: Layoffs 2022
  • Records: 2362
  • Columns: company (text), location (text), industry (text), total_laid_off (INT), percentage_laid_off (text), date (text), stage (text), country (text), funds_raised_millions (INT)

Data Cleaning Summary:

Before proceeding with the analysis, the dataset underwent a thorough data cleaning process. The cleaning steps included:

1. Removing Duplicates:

  • Identified and removed duplicate entries based on multiple columns using ROW_NUMBER() OVER() function.
  • Created a staging table (layoffs_staging) to preserve the raw data.
  • Removed duplicate entries by deleting rows with row_num > 1.

2. Standardizing Data:

  • Identified and standardized industry names by updating inconsistent entries (e.g., “Crypto Currency” to “Crypto”).
  • Standardized country names by removing trailing periods (e.g., “United States.” to “United States”).
  • Converted the date column from text to date format using STR_TO_DATE() function

3. Handling Null Values:

  • Reviewed null values in relevant columns (total_laid_off, percentage_laid_off, funds_raised_millions) and decided to keep them as null for easier calculations during exploratory data analysis (EDA).

4. Removing Unnecessary Data:

  • Deleted rows with null values in critical columns (total_laid_off) to ensure data integrity.
  • Removed unnecessary columns (row_num) created during the data cleaning process.

Link to full data cleaning queries here.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial phase in any data analysis project. It involves understanding the structure, patterns, and trends within the dataset to derive meaningful insights. In this project, I conducted an EDA on the employee layoff dataset for the year 2022, focusing on various aspects such as company layoffs, industry trends, geographical distribution, and temporal patterns.

Exploratory Data Analysis (EDA):

  1. Maximum Total Layoffs: The maximum number of total layoffs in a single instance.

Query: SELECT MAX(total_laid_off) FROM layoffs_staging2;

Result: 5000 employees laid off in a single instance.

2. Percentage of Employees Laid Off: Identified the maximum and minimum percentage of employees laid off.

Query: SELECT MAX(percentage_laid_off), MIN(percentage_laid_off) FROM layoffs_staging2 WHERE percentage_laid_off IS NOT NULL;

Result: Maximum 100% layoffs observed in some companies.

3. Companies with 100% Layoffs: Identified companies where 100% of employees were laid off.

Query: SELECT * FROM layoffs_staging2 WHERE percentage_laid_off = 1;

Result: Mostly startups and companies that went out of business during the period.

4. Companies with the Biggest Layoffs: Identified companies with the biggest single layoffs and most total layoffs.

Queries:

SELECT company, total_laid_off FROM layoffs_staging2 ORDER BY total_laid_off DESC LIMIT 5;

SELECT company, SUM(total_laid_off) FROM layoffs_staging2 GROUP BY company ORDER BY 2 DESC LIMIT 10;

Results: Companies like BritishVolt and Quibi had significant layoffs.

5. Geographical Distribution: Analyzed layoffs based on location and country.

Queries:

SELECT location, SUM(total_laid_off) FROM layoffs_staging2 GROUP BY location ORDER BY 2 DESC LIMIT 10;

SELECT country, SUM(total_laid_off) FROM layoffs_staging2 GROUP BY country ORDER BY 2 DESC;

Results: Identified regions and countries with the highest number of layoffs.

6. Temporal Patterns: Examined layoffs trends over time, including yearly and monthly distributions.

Queries:

SELECT YEAR(date), SUM(total_laid_off) FROM layoffs_staging2 GROUP BY YEAR(date) ORDER BY 1 ASC;

SELECT SUBSTRING(date,1,7) as dates, SUM(total_laid_off) AS total_laid_off FROM layoffs_staging2 GROUP BY dates ORDER BY dates ASC;

Results: Identified trends and fluctuations in layoffs over the years and months.

7. Top Companies with Layoffs Per Year: Identified top companies with the most layoffs per year utilizing Common Table Expressions (CTE) and window functions.
Results: Identified companies with consistent high layoffs over multiple years.

Link to the full EDA queries here.

CONCLUSION:

The data cleaning process has successfully addressed issues such as duplicates, inconsistencies, and null values in the employee layoff dataset. By performing these operations, I have ensured that the dataset is now prepared for further analysis, such as trend analysis, pattern recognition, and predictive modeling, to gain valuable insights into the impact of layoffs on different industries, countries, and stages of companies.

The exploratory data analysis (EDA) provided valuable insights into the employee layoff dataset for the year 2022. By analyzing various aspects such as company layoffs, industry trends, geographical distribution, and temporal patterns, we can gain a deeper understanding of the dataset. These insights can inform further analysis and decision-making processes related to workforce management, industry trends, and economic impacts.