Employees Retention Analysis

14 min readFeb 26, 2024

Introduction

This data represents a (fake) company that wants to analyze their employees churn rate. This can be defined as the amount of employees leaving over a period of time.

Data Set

I got the data set from Kaggle.com
https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction?select=HR_comma_sep.csv

Each row (record) represents an employee.

Columns:
satisfaction_level 0–1 rating for how satisfied an employee is. 1 being the highest.
last_evaluation 0–1 rating for the employee’s last evaluation. 1 being the highest.
number_project number of projects an employee works on.
average_monthly_hours average number of hours an employee works per month.
time_spend_company how long an employee spent in the company.
work_accident whether or not an employee experienced an accident at work. 1 being that they experienced an accident.
left whether or not an employee has left. 1 being that they left.
promotion_last_5years whether or not an employee was promoted in the last 5 years. 1 being that they were promoted.
department which department an employee works in.
salary an employee’s salary. It’s divided into 3 categories: low-medium-high.

You can also check the files of this project on Github using this link:
https://github.com/OWanis/Employees-Retention-Analysis.git

Workflow

I will be using Excel’s Power Query to do the cleaning, and basic analysis of the data, before sending it to Power BI for insights and data visualization.

EDA (Exploratory Data Analysis)

I am using Excel’s Power Query to do the cleaning, and basic analysis of the data. This how the data looks like:

First, let’s rename some of the columns to correct spelling mistakes, and keep the headings consistent.

number_project renamed to number_of_projects
average_montly_hours renamed to average_monthly_hours
time_spend_company renamed to tenure
Work_accident renamed to work_accident
Department renamed to department

No need to change any of the data types for any of the columns.

I will be sharing the code Using M (Power Query’s coding language), but that doesn’t mean that all steps were done using manual coding, as some steps were done from the menus at the top of Power Query.

Let’s reveal some basic stats of the data.

= Table.RowCount(#"Renamed Columns")
// 14,999 rows

Let’s check if there are any null values within the table by using the Column Quality option in Power Query.

As shown, there are no empty or errored out values in any column.

Before removing the duplicate rows in the table, let’s examine if they might affect the data or not. We can do this by isolating the duplicated rows only as follows:

First, I grouped the table using all the columns together.
Then I used the Count Rows operation as an aggregation to check how many times each row is duplicated.
Last, I filtered that new Count column by removing the 1 value to isolate the duplicate rows only.

From the look of it, the numbers are really random which indicates that it’s almost impossible that that many employees entered similar results across all the columns. So it’s safe to remove the duplicate rows.

#"Removed Duplicates" = Table.Distinct(#"Renamed Columns"),
#"Number of Rows" = Table.RowCount(#"Removed Duplicates")

// 3,008 duplicate rows were removed
// 11,991 unique rows remaining

Next, I will calculate the basic stats for this data like the mean, median, standard deviation, etc…
Calculating 1 of these stats for 1 column in Power Query is quite easy, but what I want to do is create a table where I can group all these stats for each column, and that requires some manual coding in M.

let
  // first step is to include only the first 8 rows, since the last 2 rows not numeric
  NumberColumns = Table.SelectColumns(RenamedColumns, List.FirstN(Table.ColumnNames(RenamedColumns), 8)),
  
  // then we need to save the names of these columns in a list
  ColumnNames = Table.ColumnNames(NumberColumns),

  // last step is to create a column for each stat needed
  Averages = Table.FromRecords(
      // we will use each to operate on each column in the 8 columns
      List.Transform(ColumnNames, each [
            ColumnName = _,
            mean = List.Average(Table.Column(NumberColumns, _)),
            median = List.Median(Table.Column(NumberColumns, _)),
            std = List.StandardDeviation(Table.Column(NumberColumns, _)),
            min = List.Min(Table.Column(NumberColumns, _)),
            percentile25 = List.Percentile(Table.Column(NumberColumns, _), 0.25),
            percentile50 = List.Percentile(Table.Column(NumberColumns, _), 0.5),
            percentile75 = List.Percentile(Table.Column(NumberColumns, _), 0.75),
            max = List.Max(Table.Column(NumberColumns, _))
        ])
    )
in
    Averages

This results in a table.

We can observe that the mean value for the promotion_last_5years column is 0.017. Since this column is a binary column, this translates to only 1.7% of employees where promoted in the last 5 years, which seems pretty low.

The mean value for the left column is almost 0.17 which is also a binary column, so it translates to 17% of the employees left the company since this data was recorded.

Otherwise, we can see that no abnormal data is present. Most data seams within normal ranges.

Note: This process was too slow on Power Query, especially when loading the data back again to Excel. A better alternative for this was to use either Python (optimal) or even regular functions in Excel.

Analysis & Visualization

Next, I imported the data into Power BI to analyze and visualize our data.

In order to analyze why people stay or leave the company, I need to analyze each column of the data, how it affects the decision of people to leave (if it does) and how it affects the other data columns as well.

The first obvious thing to analyze is the relationship between the number_project and the average_monthly_hours and if they affect the decision of employees left column.

I want to create a Boxplot to showcase the distribution of data in the dataset, but Power BI doesn’t have one out of the box. So I used Python in Power BI instead to create one.

# dataset = pandas.DataFrame(left, average_monthly_hours, number_of_projects)
# dataset = dataset.drop_duplicates()

# Paste or type your script code here:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(16,9))
sns.boxplot(data=dataset, x='average_monthly_hours', y='number_of_projects', hue='left', orient='h')
ax.invert_yaxis()
plt.title('Monthly hours by number of projects', fontsize='14')
plt.show()

There are a few things to see here:

Most employees who worked on only 2 projects and left the company averaged a number of hours of less than 150 hours per months. This indicates that they were either new employees that didn’t carry much responsibility, or that they were employees that didn’t put in much effort.
All employees who worked on 7 projects left the company. They averaged more than 250 hours a month, which means they were overworked.
All employees who worked on 6 projects and left the company average more than 250 hours per months, which means they were overworked. The majority who didn’t leave averaged around 200 hours per months.

The conclusion here is that the number_project and the average_monthly_hours can directly affect the number of employees who leave the company. Most employees who left are either extremely underworked or overworked, with the exception of employees who work on 3 projects, which can’t be explained through this graph.

Let’s create a histogram to compare the employees who left the company versus the ones who stayed for each number_project.

We can analyze that:

Most employees who left the company worked on only 2 projects (the minimum). These can be either employees who didn’t work hard enough, or were through their notice periods and weren’t given much responsbilities at the end of their tenure.
All employees who worked on 7 projects left the company as shown earlier.
3 projects is a good number as most employees with that number stay. Note: This shows that the anomaly found in the distribution of that column in the boxplot can be ignored as the number of employees who left is so low.
The percentage of employees leaving starts to highly increase when they handle more than 4 projects. This shows that 3–4 projects for an employee is a good average.

Next, I analyzed the relationship between the average_monthly_hours and the satisfaction_level for employees who left vs stayed using a Scatterplot to show the distribution across the data.

Before we continue, let’s calculate the average number of hours per month. If you assume a work week of 40 hours and two weeks of vacation per year, then the average number of working hours per month of employees working Monday–Friday = 50 weeks * 40 hours per week / 12 months = 166.67 hours per month.

With the assumed average of 166.67 monthly hours added as a red line, we can categorize employees into the following:

The top-right blue section: Includes employees who stayed and have an 0.5 or more satisfaction level work (mostly satisfied). They have an average from around 150–280 hours. There’s an exception of mostly overworked (200+ hours) with 0.7+ satisfaction level who left the company.
The bottom-left orange section: Below the average of 166.67 monthly hours, there’s a big section of employees who left and worked around 150 or less monthly hours and were averagely unsatisfied (less than 0.5).
The bottom-right orange section: Mostly overworked employees (250+ hours) who left the company with very low satisfaction levels. Almost all employees who worked more than 288 hours left the company along all satisfaction levels.

As a conclusion, the average_monthly_hours can directly affect the satisfaction_level of employees, which in turns affects the percentage of employees leaving the company. A big chunk of employees who are unsatisfied are either underworked or overworked, that directly causes them to leave the company.

Before moving on, let’s check the distribution of the satisfaction_leveldata using a Histogram.

Nothing different strikes the eye here. Above 0.5 satisfaction, employees are mostly satisfied and decide to stay in the company. Employees who mostly leave are in the 3 groups mentioned in the previous chart.

Time to look at the tenure data. The number of years stayed at a company can directly affect whether an employee leaves or stays in a company.

First, let’s create the same Boxplot using Python in Power BI to check the distribution of the tenure data with the satisfaction_level data for employees who left vs stayed in the company. But let’s pair it with a Histogram as well so that we can be more informed about the data.

# dataset = pandas.DataFrame(tenure, satisfaction_level, left)
# dataset = dataset.drop_duplicates()

# Paste or type your script code here:

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16,9))
sns.boxplot(data=dataset, x='satisfaction_level', y='tenure', hue='left', orient='h')
ax.invert_yaxis()
plt.title('Satisfaction levels by Tenure', fontsize='14')
plt.show()

Observing the 2 graphs, we can conclude that:

Most employees who made into 2 years of work within that company stay. Their satisfaction levels range between 0.6–0.8 which is a good range. This means that the company provides a good environment for new comers.
On the other end of the spectrum, no employee with 7+ years of tenure in the company left. This can be due to many reasons, like them occupying high responsibility high-paid manegerial roles, or that the company does its best to keep them around.
Starting from the 4th year, the percentage of employees who leave starts to highly increase, crossing 50% in the 5th year. These might be employees who are not recognized enough by the company despite them spending a considerable amount of time there, or they are not performing well enough.
Notice that most employees who left at the 4th year have an incredibly low satisfaction levels. Also, at the 5–6 years mark, employees who stay have significant lower satisfaction levels in comparison to other years, while employees who leave are more satisfied on average.

In short, it seems like the mid levels of 4–6 years tenure are the ones that need more efforts from the company to help them stay longer.

The next factor to analyze is salary. First, let’s compare how the salary differs in comparison the amount of time you spent in the company tenure. This way we can identify if certain groups of people are over or under-paid.

Since the salary data is categorical (low, medium, high), then we should use a Histogram. I will separate it into 2 graphs, one for 2–6 years tenure, and the other one for 7–10 years tenure.

We can deduce the following:

Employees who spent 2–6 years in the company mostly take low-medium average salaries, with few exceptions taking high salaries.
Starting from the 6th year, the amount of employees who take low salaries starts to decrease in favor of medium salaries, and later high salaries.
The 7th year mark sees a sudden-rise in the percentage of employees earning mid-to-high salaries.
The 8th year mark sees a sudden-rise in the percentage of employees earning relatively low salaries, which requires more investigating.

Overall, nothing odd strikes me here, the more you spend time in the company, the more you earn.

Now, on to the last_evaluation. Let’s compare it with the average_monthly_hours, again adding the red line to indicate the average monthly hours that should be spent by an employee in the company. This way we can check an employee needs to be overworked to be well-evaluated by his superiors, or not necessarily.

We can divide this into 3 major groups:

Employees who work an average of as low as 130 hours to as high as 275 hours. These have an average from 0.5–1 evaluation scores, and have stayed in the company.
Employees who work an average of 125–160 hours with relatively low evaluation scores of 0.45–0.6 who have left the company. These might be either underperforming employees, or misused assets.
Employees who work an averge of 220–300 hours with relatively high evaluation scores of 0.8–1 who have left the company. These are overperforming overworked employees who most probably have the left the company for working an extreme amount of hours per months.

There’s a correlation between the average_monthly_hours worked and last_evaluation score. However, working more hours doesn’t necessarily equate to having more evaluation and vice versa.

As discussed before, the problem lies within either underworked underachieving employees, or overworked overachieving employees. But these are not the majority though, but rather a few exceptions.

Let’s analyze the promotions next. I plotted the promotion_last_5years against average_monthly_hours the to check the correlation there, again mainting the same red line of supposed average working hours by an employee.

There are a couple of observations here:

There are very few employees who worked the most hours that got promoted in the last 5 years, most of them left the company.
Most of those who were promoted stayed in the company.

In Conclusion, there is a direct correlation between promotions and employees leaving the company, especially with overworked employees who might believe they were unappreciated.

Now, let’s analyze if the tenure affects promotions. Earlier, there was a sudden shift upwards in salaries noticed for employees who spent 7+ years with the company.

We can use Power Query to check the percentage of promoted employees for each tenure year.

let
    #"Grouped Rows" = Table.Group(#"Changed Type", {"tenure"}, {{"Total", each Table.RowCount(_), Int64.Type}, {"Promoted", each List.Sum([promotion_last_5years]), type nullable number}}),
    // group rows by tenure, adding columns for total amount of employees for each tenure, and the total amount of employees promoted in each tenure
    #"Sorted Rows" = Table.Sort(#"Grouped Rows",{{"tenure", 
    #"Added Custom" = Table.AddColumn(#"Sorted Rows", "Custom", each [Promoted]/[Total]),
    // added a custom column to calculate the percentage
    #"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"Custom", Percentage.Type}})
in
    #"Changed Type1"

As expected, there’s a whooping increase in the number of promotions from an average of 2% in the 6th year up to a staggering 19% in the 7th year. This year seem like a mark in the company, where employees who survive till that checkpoint are heavily rewarded. The 8th and 10th year also have an above average percetange of promotions in comparison to the earlier years.

Let’s check each department and see if there are any differences of employees leaving from one department to another.

From the first sight, nothing seems odd here, all departments have a fairly close percentage of employees leaving. But let’s investigate a little bit more.

Using Power Query, let’s group all rows by departments to find the percentage of employees who left per each department.

let
    #"Grouped Rows" = Table.Group(#"Changed Type", {"department"}, {{"Total", each Table.RowCount(_), Int64.Type}, {"Left", each List.Sum([left]), type nullable number}}),
    // group by department, adding a column for the total number of employees and a column for the sum of employees who left
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each [Left]/[Total]),
    // add another column that calculates the percentage of employee who left based on the last 2 columns
    #"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"Custom", Percentage.Type}})
    // change type to percentage
in
    #"Changed Type1"

As shown, the percentage of employees leaving ranges from 11.93% to 18.8%. This shows that there are no unordinary differences between one department and another.

Lastly, after analyzing all the columns, let’s create a Heatmap to show the correlation between all departments. Let’s use Python as Power BI doesn’t have a native Heatmap visualization.

# dataset = pandas.DataFrame(average_montly_hours, last_evaluation, number_project, satisfaction_level, time_spend_company, Work_accident, promotion_last_5years, left)
# dataset = dataset.drop_duplicates()

# Paste or type your script code here:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 9))
sns.heatmap(dataset.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette('vlag', as_cmap=True))
plt.title('Correlation Heatmap', fontdict = {'fontsize':14}, pad=12);
plt.show()

We can observe the following:

There’s a positive correlation between average_monthly_hours, number_of_projects and last_evaluation. The more an employee works, the more projects they take on, and the higher their evaluation.
The left column is mainly inversly affected by the satisfaction_level. The less satisfied an employee, the more likely they will leave and vice versa.
The satisfaction_level column is affected by each column to some degree, which makes sense as an employee’s satisfaction can be deduced from a multitude of factors, like how many hours they are working, or if they were promoted lately or not.
The left column can also be affected (to some lesser extent) by other criteria like the tenure, as long-term employees seem to be more satisfied.

Summary

Employees seem to leave the company when they are either overworked and under-appreciated, or they are under-worked and not being relied upon. While some employees might be performing poorly, the main problem here to be addressed are hard working employees who handle many projects without aid allowing them to work for many hours, and they seem to get passed for promotion. This affects the employees’ satisfaction, and causes them to leave the company.

Also, it appears that employees who spend 6 or more years tend to stay with the company. This can be traced to the management’s system of rewarding long time employees through recognition and promotions, which makes them satisfied and willing to stay in the company, especially at the 7th year mark.

If you arrived here, thanks a lot for taking the time to read my analysis.