This is Why Your Employees Quit

12 min read · Oct 7, 2017

How to apply the data science pipeline to understand employee turnover

“Yeah, they all said that to me…”, Bob replied as we sat at Starbucks sipping our dark roast coffee. Bob is a friend of mine and was the owner of a multi-million dollar company. That’s right, “m-i-l-l-i-o-n”. He used to tell me stories about how his company’s productivity and growth had skyrocketed compared with previous years and how everything was going great. But recently, he had been noticing a decline within his company. In a five-month period, he lost one-fifth of his employees. At least a dozen of them, across every department, made phone calls or even left sticky notes on their desks informing him that they were leaving. Nobody knew what was happening. That year, he was contemplating filing for bankruptcy. Fast-forward seven months, and he is having a conversation with the co-founder of his company. The conversation ends with, “I quit…”

That is the last thing anybody wants to hear from their employees. In a sense, it’s the employees who make the company. It’s the employees who do the work. It’s the employees who shape the company’s culture. Long-term success, a healthy work environment, and high employee retention are all signs of a successful company. But when a company experiences a high rate of employee turnover, something is going wrong. Losing these innovative and valuable employees can cost the company a huge amount of money.

A healthy organization and culture are always a good sign of future prosperity. Recognizing and understanding the factors associated with employee turnover allows companies and individuals to limit it and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve a successful business.

“You don’t build a business. You build people, and people build the business.” — Zig Ziglar

Business Problem

Bob’s multi-million dollar company is about to go bankrupt and he wants to know why his employees are leaving.

Client

Bob the Boss

Objective

My goal is to understand what factors contribute most to employee turnover and create a model that can predict if a certain employee will leave the company or not.

OSEMN Pipeline

I’ll be following a typical data science pipeline, which is called “OSEMN” (pronounced awesome).

Obtaining the data is the first step in solving the problem.
Scrubbing or cleaning the data is the next step. This includes data imputation of missing or invalid data and fixing column names.
Exploring the data follows right after and gives further insight into what our dataset contains. This includes looking for outliers or odd values and understanding the relationship each explanatory variable has with the response variable, which we can do with a correlation matrix.
Modeling the data will give us our predictive power on whether an employee will leave.
INterpreting the data is last. With all the results and analysis of the data, what conclusions can be drawn? What factors contributed most to employee turnover? What relationships between variables were found?

Note: The data comes from the “Human Resources Analytics” dataset provided on Kaggle’s website. Here is the link to my original kernel

Note: THIS DATASET IS SIMULATED.

Part 1: Obtaining the Data

# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

# Read the analytics csv file and store our dataset into a dataframe called "df"
# (pd.read_csv replaces the removed pd.DataFrame.from_csv)
df = pd.read_csv('../input/HR_comma_sep.csv')

Part 2: Scrubbing the Data

Typically, cleaning the data requires a lot of work and can be a very tedious procedure. This dataset from Kaggle is super clean and contains no missing values. But still, I will have to examine the dataset to make sure that everything else is readable and that the observation values match the feature names appropriately.

# Check to see if there are any missing values in our data set
df.isnull().any()

# Get a quick overview of what we are dealing with in our dataset
df.head()
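
The later sections refer to columns such as turnover, satisfaction, and averageMonthlyHours, so the scrubbing step presumably includes a rename. Here is a minimal sketch of that step. The raw column names on the left are an assumption about the Kaggle file (they are not shown in this post), and workAccident and promotion are hypothetical target names, so adjust the mapping to whatever df.columns prints for you.

```python
import pandas as pd

# ASSUMPTION: raw names as they might appear in the CSV; verify with df.columns.
RENAME_MAP = {
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'evaluation',
    'number_project': 'projectCount',
    'average_montly_hours': 'averageMonthlyHours',
    'time_spend_company': 'yearsAtCompany',
    'Work_accident': 'workAccident',       # hypothetical target name
    'promotion_last_5years': 'promotion',  # hypothetical target name
    'sales': 'department',
    'left': 'turnover',
}

def rename_columns(frame):
    """Return a copy of `frame` with the readable column names used in this post."""
    return frame.rename(columns=RENAME_MAP)

# One made-up row, only to show the effect of the rename.
raw = pd.DataFrame([{
    'satisfaction_level': 0.38, 'last_evaluation': 0.53, 'number_project': 2,
    'average_montly_hours': 157, 'time_spend_company': 3, 'Work_accident': 0,
    'left': 1, 'promotion_last_5years': 0, 'sales': 'sales', 'salary': 'low',
}])
renamed = rename_columns(raw)
print(list(renamed.columns))
```

Columns that are not in the mapping (such as salary) are left untouched, so the call is safe to run even if some names already match.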

Part 3: Exploring the Data

3a. Statistical Overview:

The dataset has:

About 15,000 employee observations and 10 features
The company had a turnover rate of about 24%
Mean satisfaction of employees is 0.61

# Overview of summary (Turnover V.S. Non-turnover)
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean(numeric_only=True)

# Looks like about 76% of employees stayed and 24% of employees left.
# NOTE: When performing cross validation, it's important to maintain this turnover ratio
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate
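
The note about keeping the turnover ratio during cross validation can be sketched with scikit-learn's StratifiedKFold. This is only an illustration on made-up labels with the same 76/24 split, not the cross validation used in this post; y_demo and X_demo are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 24% leavers (1) and 76% stayers (0), mirroring the ratio above.
y_demo = np.array([1] * 240 + [0] * 760)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Turnover rate inside each held-out fold
fold_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_rates)
```

Each held-out fold keeps roughly the same share of leavers as the full data, which is the property a plain random split does not guarantee.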

3b. Correlation Matrix & Heatmap

Stop and Think:

What features affect our target variable the most (turnover)?
What features have strong correlations with each other?
Can we do a more in depth examination of these features?

Summary:

From the heatmap, there is a positive(+) correlation between projectCount, averageMonthlyHours, and evaluation. This could mean that the employees who spent more hours and did more projects were evaluated highly.

For the negative(-) relationships, turnover and satisfaction stand out as the most strongly correlated. I’m assuming that people tend to leave a company more when they are less satisfied.

# Correlation Matrix (numeric columns only)
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr

3c. Distribution Plots (Satisfaction & Evaluation & AverageMonthlyHours)

Summary: Let’s examine the distribution on some of the employee’s features. Here’s what I found:

Satisfaction: There is a huge spike for employees with low satisfaction and high satisfaction.
Evaluation: There is a bimodal distribution of employees for low evaluations (less than 0.6) and high evaluations (more than 0.8)
AverageMonthlyHours: There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours & more than 250 hours)
The evaluation and average monthly hour graphs both share a similar distribution.
Employees with lower average monthly hours were evaluated lower and vice versa.
If you look back at the correlation matrix, the high correlation between evaluation and averageMonthlyHours does support this finding.

Stop and Think:

Is there a reason for the high spike in low satisfaction of employees?
Could employees be grouped in a way with these features?
Is there a correlation between evaluation and averageMonthlyHours?

# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# histplot replaces the deprecated distplot (seaborn >= 0.11)
# Graph Employee Satisfaction
sns.histplot(df.satisfaction, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')

# Graph Employee Evaluation
sns.histplot(df.evaluation, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')

# Graph Employee Average Monthly Hours
sns.histplot(df.averageMonthlyHours, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')

3d. Salary V.S. Turnover

Summary: This is not unusual. Here’s what I found:

The majority of employees who left had either a low or medium salary.
Barely any employees with a high salary left
Employees with low to average salaries tend to leave the company.

Stop and Think:

What is the work environment like for low, medium, and high salaries?
What made employees with high salaries leave?
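
A quick way to check the salary findings numerically is to compute the share of leavers within each salary level. The sketch below uses a tiny made-up table rather than the real dataset, and turnover_rate_by is a hypothetical helper name.

```python
import pandas as pd

def turnover_rate_by(frame, column):
    """Share of employees who left (turnover == 1) within each level of `column`."""
    return frame.groupby(column)['turnover'].mean()

# Made-up rows for illustration only; not taken from the Kaggle data.
toy = pd.DataFrame({
    'salary':   ['low', 'low', 'low', 'low', 'medium', 'medium', 'medium', 'high', 'high', 'high'],
    'turnover': [1,     1,     0,     0,     1,        0,        0,        0,      0,      0],
})

salary_rates = turnover_rate_by(toy, 'salary')
print(salary_rates)
```

On the real dataframe the call would be turnover_rate_by(df, 'salary'), and the same helper works for columns such as department, projectCount, or yearsAtCompany.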

3e. Department V.S. Turnover

Summary: Let’s see more information about the departments. Here’s what I found:

The sales, technical, and support departments were the top 3 departments for employee turnover
The management department had the smallest amount of turnover

Stop and Think:

If we had more information on each department, can we pinpoint a more direct cause for employee turnover?

# Employee distribution
# Types of colors
color_types = ['#78C850','#F08030','#6890F0','#A8B820','#A8A878','#A040A0','#F8D030',
                '#E0C068','#EE99AC','#C03028','#F85888','#B8A038','#705898','#98D8D8','#7038F8']

# Count Plot (a.k.a. Bar Plot)
sns.countplot(x='department', data=df, palette=color_types).set_title('Employee Department Distribution');

# Rotate x-labels
plt.xticks(rotation=-45)

# Plot turnover rate for each department
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(y="department", hue='turnover', data=df).set_title('Employee Department Turnover Distribution');

3f. Turnover V.S. ProjectCount

Summary: This graph is quite interesting as well. Here’s what I found:

More than half of the employees with 2, 6, and 7 projects left the company
The majority of the employees who did not leave the company had 3, 4, or 5 projects
All of the employees with 7 projects left the company
There is an increase in employee turnover rate as project count increases

Stop and Think:

Why are employees leaving at the lower/higher spectrum of project counts?
Does this mean that employees with project counts of 2 or less are not worked hard enough or are not highly valued, and thus leave the company?
Are employees with 6+ projects getting overworked, and thus leaving the company?

ax = sns.barplot(x="projectCount", y="projectCount", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")

3g. Turnover V.S. Evaluation

Summary:

There is a bimodal distribution for employees who left.
Employees with low performance tend to leave the company more
Employees with high performance tend to leave the company more
The sweet spot for employees that stayed is within 0.6–0.8 evaluation

# Kernel Density Plot (fill= replaces the deprecated shade= argument)
fig = plt.figure(figsize=(15,4))
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'evaluation'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'evaluation'], color='r', fill=True, label='turnover')
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')
plt.legend()

3h. Turnover V.S. AverageMonthlyHours

Summary:

Another bimodal distribution for employees who left
Employees who worked fewer hours (~150 hours or less) left the company more
Employees who worked too many hours (~250 hours or more) left the company more
Employees who left were generally underworked or overworked.

# KDEPlot: Kernel Density Estimate Plot (fill= replaces the deprecated shade= argument)
fig = plt.figure(figsize=(15,4))
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'averageMonthlyHours'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'averageMonthlyHours'], color='r', fill=True, label='turnover')
plt.title('Employee Average Monthly Hours Distribution - Turnover V.S. No Turnover')
plt.legend()

3i. Turnover V.S. Satisfaction

Summary:

There is a tri-modal distribution for employees who left
Employees who had really low satisfaction levels (0.2 or less) left the company more
Employees who had low satisfaction levels (0.3~0.5) left the company more
Employees who had really high satisfaction levels (0.7 or more) left the company more

# KDEPlot: Kernel Density Estimate Plot (fill= replaces the deprecated shade= argument)
fig = plt.figure(figsize=(15,4))
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'satisfaction'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'satisfaction'], color='r', fill=True, label='turnover')
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
plt.legend()

3j. Satisfaction VS Evaluation

Summary: This is by far the most compelling graph. This is what I found:

There are 3 distinct clusters for employees who left the company

Cluster 1 (Hard-working and Sad Employee):

Satisfaction was below 0.2 and evaluations were greater than 0.75. This could be a good indication that employees who left the company were good workers but felt horrible at their job.

Question: What could be the reason for feeling so horrible when you are highly evaluated? Could it be working too hard? Could this cluster mean employees who are “overworked”?

Cluster 2 (Bad and Sad Employee):

Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.

Question: Could this cluster mean employees who “under-performed”?

Cluster 3 (Hard-working and Happy Employee):

Satisfaction was between 0.7~1.0 and evaluations were greater than 0.8. This could mean that employees in this cluster were “ideal”. They loved their work and were evaluated highly for their performance.

Question: Could this cluster mean that employees left because they found another job opportunity?

sns.lmplot(x='satisfaction', y='evaluation', data=df,
           fit_reg=False, # No regression line
           hue='turnover')   # Color by turnover (0 = stayed, 1 = left)
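
The three clusters described above can also be tagged directly with boolean masks, which makes it easy to count them or inspect their other features. This is a rough sketch that uses the approximate thresholds quoted in the text; label_cluster and the toy rows are hypothetical, and the cutoffs are eyeballed from the plot rather than fitted.

```python
import numpy as np
import pandas as pd

def label_cluster(frame):
    """Assign each row to one of the three clusters described above, or 'other'."""
    sat, ev = frame['satisfaction'], frame['evaluation']
    conditions = [
        (sat < 0.2) & (ev > 0.75),               # Cluster 1
        sat.between(0.35, 0.45) & (ev < 0.58),   # Cluster 2
        sat.between(0.7, 1.0) & (ev > 0.8),      # Cluster 3
    ]
    labels = ['hard-working and sad', 'bad and sad', 'hard-working and happy']
    return pd.Series(np.select(conditions, labels, default='other'), index=frame.index)

# Made-up employees who left, one per cluster plus one that fits none.
leavers = pd.DataFrame({
    'satisfaction': [0.10, 0.40, 0.85, 0.60],
    'evaluation':   [0.90, 0.50, 0.90, 0.70],
})
cluster_labels = label_cluster(leavers)
print(cluster_labels.value_counts())
```

On the real data this would be applied to df[df['turnover'] == 1], after which a groupby on the label could compare hours or project counts across clusters.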

3k. Turnover V.S. YearsAtCompany

Summary: Let’s see if there’s a point where employees start leaving the company. Here’s what I found:

More than half of the employees with 4 and 5 years at the company left
Employees with 5 years at the company deserve a particularly close look

Stop and Think:

Why are employees leaving mostly at the 3–5 year range?
Who are these employees that left?
Are these employees part-time or contractors?

ax = sns.barplot(x="yearsAtCompany", y="yearsAtCompany", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")

Part 4: Modeling the Data

I’ll be using a logistic regression algorithm to model the data. Since our classes are imbalanced, I would not worry too much about the accuracy of the model. Instead, we should focus more on precision and recall.
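
To make the point about accuracy concrete, here is a small sketch with made-up labels that mirror the 76/24 split. A hypothetical "model" that predicts that nobody ever leaves looks decent on accuracy while being useless for finding leavers. The function name precision_recall_accuracy is only for illustration.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary labels (1 = left)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions -> 0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, accuracy

# Made-up labels: 24 leavers and 76 stayers
y_true = [1] * 24 + [0] * 76
always_stay = [0] * 100  # baseline that never predicts turnover

precision, recall, accuracy = precision_recall_accuracy(y_true, always_stay)
print(precision, recall, accuracy)
```

The baseline scores 76% accuracy but has zero recall, so it never flags a single employee who actually left.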

# Import necessary packages
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
from sklearn.preprocessing import RobustScaler

# Create dummy variables for the department and salary columns.
# The first level of each column is dropped and serves as the baseline category.
df = pd.get_dummies(df, columns=['department', 'salary'], drop_first=True, dtype=int)
# Create train and test splits
target_name = 'turnover'
X = df.drop('turnover', axis=1)
robust_scaler = RobustScaler()
X = robust_scaler.fit_transform(X)
y = df[target_name]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)

# Fit a class-weighted logistic regression and report its metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
logis = LogisticRegression(class_weight="balanced")
logis.fit(X_train, y_train)
print("\n\n ---Logistic Model---")
# AUC is computed from predicted probabilities rather than hard 0/1 labels
logit_roc_auc = roc_auc_score(y_test, logis.predict_proba(X_test)[:, 1])
print("Logistic AUC = %2.2f" % logit_roc_auc)
print(classification_report(y_test, logis.predict(X_test)))

# Create ROC Graph
from sklearn.metrics import roc_curve
# Use the fitted model "logis" and its predicted probabilities for the positive class
probs = logis.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = roc_auc_score(y_test, probs)
plt.figure()
plt.plot(fpr, tpr, label='ROC Curve (area = %0.2f)' % roc_auc)
plt.plot([0,1], [0,1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

Part 5: Interpreting the Data

With all of this information, this is what Bob should know about his company and why his employees probably left:

Employees generally left when they were underworked (less than 150hr/month or 6hr/day)
Employees generally left when they were overworked (more than 250hr/month or 10hr/day)
Employees with either really high or low evaluations should be considered at risk of turnover
Employees with low to medium salaries make up the bulk of employee turnover
Employees with a project count of 2, 6, or 7 were at risk of leaving the company
Employee satisfaction is the strongest indicator of employee turnover.
Employees with 4 and 5 years at the company are at risk of leaving.

Potential Solution

Since satisfaction had the largest effect on employee turnover, the underlying problem may come down to a personal level. Alternatively, the problem may not lie with the employees at all but persist at a deeper level of the company (its core values and purpose).

Solution 1: Develop learning programs for managers. Then use analytics to gauge their performance and measure progress. Some advice:

Be a good coach
Empower the team and do not micromanage
Express interest in team member success
Have a clear vision / strategy for the team
Help team with career development

Solution 2:

We can rank employees by their probability of leaving, then allocate a limited incentive budget to the highest probability instances.
OR, we can allocate our incentive budget to the instances with the highest expected loss, for which we’ll need the probability of turnover.
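
The two ranking options in Solution 2 can give different priorities, which a few made-up numbers show. In this sketch p_leave stands in for a model's predicted probability of turnover and cost_if_lost is a hypothetical estimate of what losing that employee would cost; neither comes from the dataset.

```python
import pandas as pd

# Made-up employees, probabilities, and costs for illustration only.
staff = pd.DataFrame({
    'employee':     ['A', 'B', 'C', 'D'],
    'p_leave':      [0.90, 0.60, 0.30, 0.80],
    'cost_if_lost': [10_000, 50_000, 120_000, 20_000],
})

# Expected loss = probability of leaving x cost of losing the employee
staff['expected_loss'] = staff['p_leave'] * staff['cost_if_lost']

by_probability = staff.sort_values('p_leave', ascending=False)['employee'].tolist()
by_expected_loss = staff.sort_values('expected_loss', ascending=False)['employee'].tolist()

print(by_probability)
print(by_expected_loss)
```

Ranking by probability puts employee A first, while ranking by expected loss puts employee C first, because C is costly to lose even though C is the least likely to leave.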

What Now

This problem is about people decisions. When modeling the data, we should not be using this predictive metric as the sole decider of a solution. But we can use it to arm people with more relevant information for better decision making.

We would have to conduct more experiments or collect more data about the employees in order to come up with a more accurate finding. I would recommend gathering more variables from the database that could have more impact on determining employee turnover and satisfaction, such as distance from home, gender, and age.

Reverse Engineer the Problem

After trying to understand what caused employees to leave in the first place, we can form another problem to solve by asking ourselves

“What features caused employees to stay?”
“What features contributed to employee retention?”

There are endless problems to solve!

Any feedback or constructive criticism is greatly appreciated. Thank you :)

“You don’t build a business. You build people, and people build the business.” — Zig Ziglar

Why do you think employees leave?

Written by Randy Lao