Analyzing Employee Attrition with 88% Accuracy

Dhruval Patel
6 min read · May 21, 2022

Identifying the factors which influence the attrition of employees

Photo by Christina @ wocintechchat.com on Unsplash

What is Employee Attrition?

Employee attrition is the gradual reduction in a company’s headcount: employees leave faster than they are hired. It occurs when an employee retires or resigns and the position is not refilled.

Employee attrition can occur for a variety of reasons. These include dissatisfaction with perks or compensation, a lack of growth opportunities, and a poor working environment.

Employee attrition isn’t all bad, but it can be troublesome because it often drains talent from the organization and from the workforce as a whole.

Predicting employee attrition is therefore valuable. In this post we predict whether an employee will leave (or resign from) their current firm, using a logistic regression model.

Description

IBM is an American MNC operating in around 170 countries, with computing, software, and hardware as its major business verticals. Attrition is a major risk for service-providing organizations, where trained and experienced people are the company’s key assets. The organization would like to identify the factors that influence employee attrition.

Statistical Tasks

  1. Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn.
  2. Exploratory data analysis
  3. Find the age distribution of employees in IBM
  4. Explore attrition by age
  5. Explore data for Left employees
  6. Find out the distribution of employees in the education field
  7. Give a bar chart for the number of married and unmarried employees
  8. Build up a logistic regression model to predict which employees are likely to attrite

1. Import Libraries

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
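
The snippets that follow all use a DataFrame named df, but the post never shows how it is loaded. Here is a minimal sketch, assuming the data sits in a local CSV file. The file name IBM_Attrition_Data.csv and the helper load_attrition are hypothetical and not taken from the original notebook.

```python
import pandas as pd

def load_attrition(csv_path):
    """Read the attrition CSV and run a basic sanity check on its columns."""
    df = pd.read_csv(csv_path)
    # The analysis below relies on at least these columns being present.
    required = {"Age", "Attrition", "Department"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df

# Hypothetical file name; replace it with the path to your copy of the data.
# df = load_attrition("IBM_Attrition_Data.csv")
# df.info()
```

Calling df.info() right after loading is a quick way to confirm column types and spot missing values before plotting.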

2. Find the age distribution of employees in IBM

A histogram shows the age distribution. We set the figure size and then draw the histogram with seaborn.

Insight: Employees between the ages of 34 and 35 are the most numerous.

plt.figure(figsize=(10,6), dpi=80)
sns.histplot(data=df, x='Age', bins=42, kde=True).set_title('Age Distribution of Employee');
Age Distribution of Employee

3. Explore attrition by age

To see attrition by age, we filter the rows where Attrition is ‘Yes’, group them by age, and count them. A count plot then shows, for each age, how many employees left and how many stayed.

Insight: Employees aged 29 and 31 account for the most departures from IBM.

print(df[(df['Attrition'] == 'Yes')].groupby('Age')['Age'].count().sort_values(ascending=False))

plt.figure(figsize=(14,6), dpi=80)
sns.countplot(data=df, x='Age', hue='Attrition', order=df['Age'].value_counts().index, palette='seismic_r').set_title('Attrition by Age');
Attrition by Age

4. Explore data for Left employees

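Here we look at the overall attrition count, then break down the employees who left by department, age group, education, environment satisfaction, and job satisfaction.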

print(df.groupby('Attrition')['Attrition'].count())

plt.figure(figsize=(5,4), dpi=80)
sns.countplot(data=df, x='Attrition', palette='seismic_r');
Attrition

Attrition by Department —

print(df[(df['Attrition'] == 'Yes')].groupby('Department')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Department', palette='cubehelix', order = df['Department'].value_counts().index).set_title('Attrition by Department');
Attrition by Department

Attrition by Age group —

agerange = []
for age in df["Age"]:
    # Upper bounds are inclusive so that boundary ages (24, 31, 38, ...)
    # land in the bucket whose label contains them.
    if 18 <= age <= 24:
        agerange.append("18-24")
    elif 25 <= age <= 31:
        agerange.append("25-31")
    elif 32 <= age <= 38:
        agerange.append("32-38")
    elif 39 <= age <= 45:
        agerange.append("39-45")
    elif 46 <= age <= 52:
        agerange.append("46-52")
    elif 53 <= age <= 59:
        agerange.append("53-59")
    else:
        agerange.append("60-66")

df["AgeRange"] = agerange
print(df[(df['Attrition'] == 'Yes')].groupby('AgeRange')['AgeRange'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='AgeRange', palette='cubehelix').set_title('Attrition by Age Group');
Attrition by Age Group
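
The age buckets can also be built without an explicit loop. The following is a sketch using pandas' pd.cut, assuming the same seven labels. The helper name add_age_range and the toy ages are made up for illustration.

```python
import pandas as pd

def add_age_range(df, age_col="Age"):
    """Return a copy of df with an 'AgeRange' column built by pd.cut."""
    # pd.cut uses right-closed intervals by default: (17, 24], (24, 31], ...
    # so the upper bound of every label is included in its bucket.
    edges = [17, 24, 31, 38, 45, 52, 59, 66]
    labels = ["18-24", "25-31", "32-38", "39-45", "46-52", "53-59", "60-66"]
    out = df.copy()
    out["AgeRange"] = pd.cut(out[age_col], bins=edges, labels=labels)
    return out

demo = pd.DataFrame({"Age": [18, 24, 25, 31, 38, 59, 60]})
print(add_age_range(demo)["AgeRange"].tolist())
# ['18-24', '18-24', '25-31', '25-31', '32-38', '53-59', '60-66']
```

Ages outside 18 to 66 become NaN with this approach rather than falling into the last bucket, which makes unexpected values easier to notice.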

Attrition by Education —

print(df[(df['Attrition'] == 'Yes')].groupby('Education')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Education', order=df['Education'].value_counts().index, palette='cubehelix').set_title('Attrition by Education');
Attrition by Education

Attrition by Environment Satisfaction —

print(df[(df['Attrition'] == 'Yes')].groupby('EnvironmentSatisfaction')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='EnvironmentSatisfaction', order=df['EnvironmentSatisfaction'].value_counts().index, palette='cubehelix').set_title('Attrition by Environment Satisfaction');
Attrition by Environment Satisfaction

Attrition by Job Satisfaction —

print(df[(df['Attrition'] == 'Yes')].groupby('JobSatisfaction')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='JobSatisfaction', order=df['JobSatisfaction'].value_counts().index, palette='cubehelix').set_title('Attrition by Job Satisfaction');
Attrition by Job Satisfaction

Insight:

  • 237 workers quit their jobs (133 in R&D, 92 in Sales, and 12 in Human Resources).
  • Employees aged 25–31 make up the largest group of leavers (62 employees).
  • Employees with a bachelor’s degree account for the most departures by education level (99 employees).
  • Employees who are dissatisfied with their work environment account for the most departures by environment satisfaction (72 employees).
  • Employees who are highly satisfied with their jobs account for the most departures by job satisfaction (73 employees).
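
These figures are raw counts of leavers, and a large group can produce many leavers even if its members rarely quit. To judge which groups are more likely to leave, it helps to look at the attrition rate within each group. The following is a rough sketch, assuming Attrition still holds the strings 'Yes' and 'No'. The helper attrition_rate and the toy data are hypothetical.

```python
import pandas as pd

def attrition_rate(df, group_col, target_col="Attrition", positive="Yes"):
    """Share of employees in each group whose target equals `positive`."""
    left = df[target_col].eq(positive)          # True where the employee left
    rate = left.groupby(df[group_col]).mean()   # mean of booleans = proportion
    return rate.sort_values(ascending=False)

# Toy data for illustration only: one leaver in each department,
# but the departments differ in size.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "R&D", "R&D", "R&D", "R&D"],
    "Attrition":  ["Yes",   "No",    "Yes", "No",  "No",  "No"],
})
print(attrition_rate(toy, "Department"))
```

In the toy data both departments have one leaver, yet the rate is 0.5 for Sales and 0.25 for R&D. On the real data, attrition_rate(df, 'Department') would give the equivalent comparison.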

5. Find out the distribution of employees in the education field

To see how employees are distributed across education fields, we plot a histogram of the EducationField column.

plt.figure(figsize=(11,6), dpi=80)
sns.histplot(data=df, x='EducationField').set_title('Employees by Education Field');
Employees by Education Field

Insight: Life Sciences is the most common education field (606 employees).

6. Explore data for Marital Status

print(df[(df['Attrition'] == 'Yes')].groupby('MaritalStatus')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df, x='MaritalStatus', hue='Attrition', order=df['MaritalStatus'].value_counts().index, palette='seismic_r').set_title('Attrition by Marital Status');
Attrition by Marital Status

Insight: Single employees account for the most departures (120 employees).

We have completed the statistical tasks. The dataset also contains a few categorical variables, which must be converted to numerical values before we can fit a logistic regression model. Here we simply replace each category with an integer code.

# Map each category to an integer code and assign the result back
# (values not listed in a mapping would become NaN).
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df['Department'] = df['Department'].map({
    'Human Resources': 1, 'Research & Development': 2, 'Sales': 3})
df['EducationField'] = df['EducationField'].map({
    'Human Resources': 1, 'Life Sciences': 2, 'Marketing': 3,
    'Medical': 4, 'Other': 5, 'Technical Degree': 6})
df['MaritalStatus'] = df['MaritalStatus'].map({
    'Divorced': 1, 'Married': 2, 'Single': 3})
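
Integer codes give nominal categories an artificial order (Single = 3 is not "more" than Divorced = 1), and a linear model such as logistic regression will treat that order as meaningful. A common alternative is one-hot encoding. The sketch below uses pd.get_dummies and assumes the columns still hold their original strings. It is not what the original notebook did, and one_hot_encode is a hypothetical helper.

```python
import pandas as pd

def one_hot_encode(df, columns):
    """One-hot encode nominal columns, dropping one level per column."""
    # drop_first=True avoids perfectly collinear indicator columns.
    return pd.get_dummies(df, columns=columns, drop_first=True, dtype=int)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [29, 41, 35],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})
encoded = one_hot_encode(toy, ["MaritalStatus"])
print(encoded.columns.tolist())
# ['Age', 'MaritalStatus_Married', 'MaritalStatus_Single']
```

The dropped level ('Divorced' here) becomes the reference category against which the other coefficients are interpreted.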

7. Build a logistic regression model to predict which employees are likely to attrite
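
The model code in this section uses x_train, x_test, y_train, and y_test, which the post does not define. Here is a minimal sketch of one way to create them, assuming df has already been converted to numeric codes. The 80/20 split, random_state=42, the stratification, and the helper name make_train_test are all illustrative assumptions and not the author's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_train_test(df, target="Attrition", test_size=0.2, random_state=42):
    """Split df into train/test features and target."""
    # Keep only numeric predictors; string helper columns such as
    # 'AgeRange' are dropped by select_dtypes.
    features = df.drop(columns=[target]).select_dtypes(include="number")
    return train_test_split(
        features, df[target],
        test_size=test_size,
        random_state=random_state,
        stratify=df[target],   # keep the stay/leave ratio in both splits
    )

# x_train, x_test, y_train, y_test = make_train_test(df)
```

Stratifying on the target is a design choice worth considering here because leavers are the minority class, and an unstratified split can leave the test set with very few of them.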

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Fit the model on the training split
lr = LogisticRegression()
lr.fit(x_train, y_train)

# Check the accuracy on the training set
print('Accuracy =', lr.score(x_train, y_train)*100, '%')
# Predict the dependent variable (0 = stayed, 1 = left) for the test set
lr_y_pred = lr.predict(x_test)
# Predicted class probabilities; columns follow the order of lr.classes_
prob = lr.predict_proba(x_test)
print(prob)
# Accuracy score on the test set
print('Test Accuracy Score:', accuracy_score(y_test, lr_y_pred)*100, '%\n')
# Classification report
print('Classification Report\n', classification_report(y_test, lr_y_pred))
# Confusion matrix
print('Confusion Matrix\n', confusion_matrix(y_test, lr_y_pred))
Model Accuracy
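
Accuracy alone can flatter a model when one class dominates, which is typical of attrition data where most employees stay. A useful reference point is the accuracy obtained by always predicting the majority class. The following sketch shows the idea. The helper majority_baseline is hypothetical and the labels are toy values, not results from this dataset.

```python
import numpy as np

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

# Toy labels for illustration only: 8 stayed (0), 2 left (1).
toy_y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(majority_baseline(toy_y))  # 0.8
```

Comparing majority_baseline(y_test) with the test accuracy, and checking the recall for class 1 in the classification report, gives a clearer picture of how well the model identifies employees who leave.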
