Analyzing Employee Attrition with 88% Accuracy

6 min read · May 21, 2022

Identifying the factors which influence the attrition of employees

Photo by Christina @ wocintechchat.com on Unsplash

What is Employee Attrition?

Employee attrition is the gradual decrease in employee numbers: employees are leaving more quickly than they are being hired. It occurs when an employee retires, resigns, or is simply not replaced.

Employee attrition can occur for a variety of reasons. These include dissatisfaction with employee perks or compensation structures, a lack of staff growth possibilities, and even a terrible working environment.

Attrition is not always bad, but it can be a problem because it frequently drains talent from the organization and from the workforce as a whole.

That makes predicting attrition valuable. In this article we predict whether an employee will leave (or resign from) their present firm, using logistic regression.

Description

IBM is an American multinational operating in around 170 countries, with major business verticals in computing, software, and hardware. Attrition is a major risk for service-providing organizations, where trained and experienced people are the company's main asset. The organization would like to identify the factors that influence employee attrition.

Statistical Tasks

Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn.
Exploratory data analysis
Find the age distribution of employees in IBM
Explore attrition by age
Explore data for Left employees
Find out the distribution of employees in the education field
Give a bar chart for the number of married and unmarried employees
Build up a logistic regression model to predict which employees are likely to attrite

1. Import Libraries

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
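
The article does not show how `df` is created. A minimal sketch, assuming the dataset is a CSV export with an `Attrition` column; the helper name `load_attrition` and the in-memory sample are hypothetical, not taken from the original notebook:

```python
import io
import pandas as pd

def load_attrition(source):
    """Read the attrition data from a CSV path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    # Fail early if the target column used throughout the analysis is missing.
    if "Attrition" not in df.columns:
        raise ValueError("expected an 'Attrition' column")
    return df

# Tiny in-memory stand-in so the sketch runs without the real file.
sample_csv = io.StringIO(
    "Age,Attrition,Department\n"
    "41,Yes,Sales\n"
    "49,No,Research & Development\n"
)
df_sample = load_attrition(sample_csv)
print(df_sample.shape)
```

With the real data this would be a call such as `df = load_attrition(path_to_csv)`, where the path depends on where the file was saved.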

2. Find the age distribution of employees in IBM

We use a histogram to show the age distribution: set the figure size first, then plot the Age column with seaborn.

Insight: Employees between the ages of 34 and 35 are the most numerous.

plt.figure(figsize=(10,6), dpi=80)
sns.histplot(data=df, x='Age', bins=42, kde=True).set_title('Age Distribution of Employee');

3. Explore attrition by age

To see attrition by age, we keep the rows where Attrition is 'Yes', group them by age, and count them. A count plot then shows, for each age, how many employees stayed and how many left.

Insight: Employees aged 29 and 31 left IBM in the largest numbers.

print(df[(df['Attrition'] == 'Yes')].groupby('Age')['Age'].count().sort_values(ascending=False))

plt.figure(figsize=(14,6), dpi=80)
sns.countplot(data=df, x='Age', hue='Attrition', order=df['Age'].value_counts().index, palette='seismic_r').set_title('Attrition by Age');

4. Explore data for Left employees

We’ll explore a few things here.

print(df.groupby('Attrition')['Attrition'].count())

plt.figure(figsize=(5,4), dpi=80)
sns.countplot(data=df, x='Attrition', palette='seismic_r');

Attrition by Department —

print(df[(df['Attrition'] == 'Yes')].groupby('Department')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Department', palette='cubehelix', order = df['Department'].value_counts().index).set_title('Attrition by Department');

Attrition by Age group —

agerange = []
for age in df["Age"]:
    # Bounds are inclusive so boundary ages (24, 31, 38, ...) land in their labelled bin
    if 18 <= age <= 24:
        agerange.append("18-24")
    elif 25 <= age <= 31:
        agerange.append("25-31")
    elif 32 <= age <= 38:
        agerange.append("32-38")
    elif 39 <= age <= 45:
        agerange.append("39-45")
    elif 46 <= age <= 52:
        agerange.append("46-52")
    elif 53 <= age <= 59:
        agerange.append("53-59")
    else:
        agerange.append("60-66")

df["AgeRange"] = agerange

print(df[(df['Attrition'] == 'Yes')].groupby('AgeRange')['AgeRange'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='AgeRange', palette='cubehelix').set_title('Attrition by Age Group');
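
The same binning can be written without an explicit loop. A sketch using `pd.cut`, assuming integer ages between 18 and 66 and upper bounds that are inclusive, as the labels suggest; `add_age_range` is an illustrative name, not part of the original notebook:

```python
import pandas as pd

def add_age_range(df, age_col="Age"):
    """Return a copy of df with an AgeRange column built by pd.cut (illustrative helper)."""
    # Right-inclusive edges: (17, 24] covers 18-24, (24, 31] covers 25-31, and so on.
    edges = [17, 24, 31, 38, 45, 52, 59, 66]
    labels = ["18-24", "25-31", "32-38", "39-45", "46-52", "53-59", "60-66"]
    out = df.copy()
    out["AgeRange"] = pd.cut(out[age_col], bins=edges, labels=labels)
    return out

demo = pd.DataFrame({"Age": [18, 24, 25, 31, 45, 60]})
binned = add_age_range(demo)
print(binned["AgeRange"].astype(str).tolist())
```

Keeping the edges and labels side by side makes it easier to check that every boundary age falls in the bin its label promises.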

Attrition by Education —

print(df[(df['Attrition'] == 'Yes')].groupby('Education')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Education', order=df['Education'].value_counts().index, palette='cubehelix').set_title('Attrition by Education');

Attrition by Environment Satisfaction —

print(df[(df['Attrition'] == 'Yes')].groupby('EnvironmentSatisfaction')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='EnvironmentSatisfaction', order=df['EnvironmentSatisfaction'].value_counts().index, palette='cubehelix').set_title('Attrition by Environment Satisfaction');

Attrition by Job Satisfaction —

print(df[(df['Attrition'] == 'Yes')].groupby('JobSatisfaction')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='JobSatisfaction', order=df['JobSatisfaction'].value_counts().index, palette='cubehelix').set_title('Attrition by Job Satisfaction');

Insight:

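237 employees left the company (133 from R&D, 92 from Sales, and 12 from Human Resources).
Employees aged 25–31 are the largest age group among those who left (62 employees).
Employees with a bachelor’s degree are the largest education group among those who left (99 employees).
Employees who are dissatisfied with their work environment are the largest environment-satisfaction group among those who left (72 employees).
Employees who are highly satisfied with their jobs are the largest job-satisfaction group among those who left (73 employees).
These figures are raw counts of leavers, not attrition rates, so they partly reflect how many employees belong to each group.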
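
Counts of leavers and attrition rates can tell different stories, because a large group can produce many leavers even when few of its members leave. A small sketch of a per-group rate, run on made-up toy data; the helper name `attrition_rate` is an assumption for illustration:

```python
import pandas as pd

def attrition_rate(df, by):
    """Share of employees in each group of `by` whose Attrition is 'Yes'."""
    left = df["Attrition"].eq("Yes")
    # The mean of a boolean Series is the fraction of True values in each group.
    return left.groupby(df[by]).mean().sort_values(ascending=False)

# Toy data: group 3 has more leavers (2) than group 1 (1) but a lower rate.
toy = pd.DataFrame({
    "JobSatisfaction": [1, 1, 3, 3, 3, 3, 3, 3],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
})
rates = attrition_rate(toy, "JobSatisfaction")
print(rates)
```

On the real data, `attrition_rate(df, 'JobSatisfaction')` would need to run before `Attrition` is converted to 0/1.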

5. Find out the distribution of employees in the education field

To see how employees are distributed across education fields, we plot a histogram of the EducationField column.

plt.figure(figsize=(11,6), dpi=80)
sns.histplot(data=df, x='EducationField').set_title('Employees by Education Field');

Insight: Life Sciences is the most common education field (606 employees).

6. Explore data for Marital Status

print(df[(df['Attrition'] == 'Yes')].groupby('MaritalStatus')['Attrition'].count().sort_values(ascending=False))

plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df, x='MaritalStatus', hue='Attrition', order=df['MaritalStatus'].value_counts().index, palette='seismic_r').set_title('Attrition by Marital Status');

Insight: Single employees account for the most departures (120 employees).

That completes the exploratory tasks. Some columns are categorical, so before fitting a logistic regression model we need to convert them to numbers. Here we simply replace each category with an integer code.

# Map each category to an integer code and assign the result back to the column
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

df['Department'] = df['Department'].map({'Human Resources': 1, 'Research & Development': 2, 'Sales': 3})

df['EducationField'] = df['EducationField'].map({
    'Human Resources': 1, 'Life Sciences': 2, 'Marketing': 3,
    'Medical': 4, 'Other': 5, 'Technical Degree': 6})

df['MaritalStatus'] = df['MaritalStatus'].map({'Divorced': 1, 'Married': 2, 'Single': 3})
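
Integer codes give the categories an order (for example, Sales ranks above Research & Development), which a linear model treats as meaningful. One common alternative for unordered columns, not used in the original article, is one-hot encoding. A sketch with `pd.get_dummies` on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Department": ["Sales", "Human Resources", "Research & Development", "Sales"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
    "Age": [29, 41, 35, 31],
})

# drop_first=True drops one level per column to avoid perfectly collinear indicators.
encoded = pd.get_dummies(
    toy, columns=["Department", "MaritalStatus"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Each remaining indicator column is 1 when the row belongs to that category and 0 otherwise, so no artificial ranking is introduced.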

7. Build a logistic regression model to predict which employees are likely to attrite
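
The model code in this section uses `x_train`, `x_test`, `y_train`, and `y_test`, but the article does not show how they are built. A sketch of one plausible setup, assuming `Attrition` is already encoded as 0/1, only numeric columns are used as features, and the split is 80/20 and stratified. The helper name, test size, and random seed are assumptions rather than the author's settings, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features(df, target="Attrition", test_size=0.2, random_state=42):
    """Split numeric feature columns and the target into train and test sets."""
    # Keep numeric columns only; text columns would need encoding first.
    x = df.select_dtypes(include="number").drop(columns=[target])
    y = df[target]
    # stratify=y keeps the share of leavers similar in both sets.
    return train_test_split(
        x, y, test_size=test_size, random_state=random_state, stratify=y
    )

# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 61, size=100),
    "Education": rng.integers(1, 6, size=100),
    "EducationField": rng.choice(["Medical", "Marketing"], size=100),
    "Attrition": np.repeat([0, 1], [80, 20]),
})
x_train, x_test, y_train, y_test = split_features(toy)
print(x_train.shape, x_test.shape)
```

Stratifying matters here because leavers are the minority class; an unstratified split could leave the test set with very few of them.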

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# x_train, x_test, y_train, y_test are assumed to come from a train/test split of the encoded data
lr = LogisticRegression()
lr = lr.fit(x_train, y_train)

# check the accuracy on the training set
print('Accuracy =', lr.score(x_train, y_train)*100, '%')

# predict dependent variable
lr_y_pred = lr.predict(x_test)

# find probability
prob = lr.predict_proba(x_test)
print(prob)

# Accuracy score
print('Test Accuracy Score:', accuracy_score(y_test, lr_y_pred)*100, '%\n')
# Classification report
print('Classification Report\n', classification_report(y_test, lr_y_pred))
# Confusion matrix
print('Confusion Matrix\n', confusion_matrix(y_test, lr_y_pred))
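
Accuracy alone can flatter a model when one class dominates: a classifier that always predicts the majority class already scores that class's share of the test set. A hedged sketch of such a baseline check using scikit-learn's `DummyClassifier`; the labels are synthetic and the class mix is made up for illustration, not taken from the article's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    # DummyClassifier ignores the features, so a constant placeholder column is enough.
    placeholder_train = np.zeros((len(y_train), 1))
    placeholder_test = np.zeros((len(y_test), 1))
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(placeholder_train, y_train)
    return baseline.score(placeholder_test, y_test)

# Synthetic labels: 0 = stayed, 1 = left.
y_train_demo = np.array([0] * 84 + [1] * 16)
y_test_demo = np.array([0] * 21 + [1] * 4)
baseline_acc = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Baseline accuracy: {baseline_acc:.2f}")
```

Comparing `accuracy_score(y_test, lr_y_pred)` with this baseline shows how much the logistic regression adds beyond predicting that nobody leaves; the classification report and confusion matrix give the per-class view.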

Hope you liked the story. Follow me for more stories like this. Find my Kaggle notebook here.

Written by Dhruval Patel