Analyzing Employee Attrition with 88% Accuracy
Identifying the factors which influence the attrition of employees
What is Employee Attrition?
Employee attrition refers to a progressive decrease in employee numbers: employees are leaving faster than they are being hired. Attrition occurs when an employee retires, resigns, or is simply not replaced.
Employee attrition can happen for a variety of reasons, including dissatisfaction with perks or compensation structures, a lack of growth opportunities, or a poor working environment.
Employee attrition isn't inherently bad, but it can be troublesome because it often drains talent from the organization and the workforce as a whole.
This makes predicting employee attrition very valuable. We will cover attrition prediction, i.e., forecasting whether an employee will leave (or resign from) their current firm, using a logistic regression model.
Description
IBM is an American MNC operating in around 170 countries, with major business verticals in computing, software, and hardware. Attrition is a major risk to service-providing organizations, where trained and experienced people are the assets of the company. The organization would like to identify the factors that influence employee attrition.
Statistical Tasks
- Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn.
- Exploratory data analysis
- Find the age distribution of employees in IBM
- Explore attrition by age
- Explore data for employees who left
- Find out the distribution of employees in the education field
- Give a bar chart for the number of married and unmarried employees
- Build up a logistic regression model to predict which employees are likely to attrite
1. Import Libraries
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
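The task list mentions importing the attrition dataset, but the notebook never shows the load step that produces `df`. A minimal sketch: in practice you would read the Kaggle IBM HR Analytics CSV (the filename below is an assumption); the tiny stand-in frame lets the snippet run anywhere.

```python
import pandas as pd

# In practice (path/filename is an assumption based on the Kaggle dataset):
# df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Tiny stand-in frame with the columns used in this article,
# so the sketch runs without the CSV
df = pd.DataFrame({
    'Age': [29, 34, 31, 45, 24],
    'Attrition': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'Department': ['Sales', 'Research & Development', 'Sales',
                   'Human Resources', 'Research & Development'],
    'EducationField': ['Marketing', 'Life Sciences', 'Medical',
                       'Human Resources', 'Life Sciences'],
    'MaritalStatus': ['Single', 'Married', 'Single', 'Divorced', 'Single'],
})

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names used below
```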
2. Find the age distribution of employees in IBM
A histogram will be used to determine the age distribution. We first resize the figure, then plot the distribution with the seaborn library.
Insight: Employees between the ages of 34 and 35 are the most numerous.
plt.figure(figsize=(10,6), dpi=80)
sns.histplot(data=df, x='Age', bins=42, kde=True).set_title('Age Distribution of Employees');
3. Explore attrition by age
Now, to identify attrition by age, we filter for employees whose attrition is 'Yes' and group by age. Following that, we use a count plot to see how many employees at each age quit the organization.
Insight: Employees aged 29 and 31 quit IBM the most.
print(df[(df['Attrition'] == 'Yes')].groupby('Age')['Age'].count().sort_values(ascending=False))
plt.figure(figsize=(14,6), dpi=80)
sns.countplot(data=df, x='Age', hue='Attrition', order=df['Age'].value_counts().index, palette='seismic_r').set_title('Attrition by Age');
4. Explore data for employees who left
We’ll explore a few things here.
print(df.groupby('Attrition')['Attrition'].count())
plt.figure(figsize=(5,4), dpi=80)
sns.countplot(data=df, x='Attrition', palette='seismic_r');
Attrition by Department —
print(df[(df['Attrition'] == 'Yes')].groupby('Department')['Attrition'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Department', palette='cubehelix', order = df['Department'].value_counts().index).set_title('Attrition by Department');
Attrition by Age group —
agerange = []
for age in df["Age"]:
    if 18 <= age <= 24:
        agerange.append("18-24")
    elif 25 <= age <= 31:
        agerange.append("25-31")
    elif 32 <= age <= 38:
        agerange.append("32-38")
    elif 39 <= age <= 45:
        agerange.append("39-45")
    elif 46 <= age <= 52:
        agerange.append("46-52")
    elif 53 <= age <= 59:
        agerange.append("53-59")
    else:
        agerange.append("60-66")
df["AgeRange"] = agerange

print(df[(df['Attrition'] == 'Yes')].groupby('AgeRange')['AgeRange'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='AgeRange', palette='cubehelix').set_title('Attrition by Age Group');
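The manual loop above can be expressed more compactly with `pd.cut`, which bins values into right-inclusive intervals. A behaviorally equivalent sketch on a standalone series:

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 31, 32, 60, 66])

# Bin edges are right-inclusive: (17, 24], (24, 31], ..., (59, 66]
bins = [17, 24, 31, 38, 45, 52, 59, 66]
labels = ['18-24', '25-31', '32-38', '39-45', '46-52', '53-59', '60-66']

age_range = pd.cut(ages, bins=bins, labels=labels)
print(age_range.tolist())
```

On the real data, `df["AgeRange"] = pd.cut(df["Age"], bins=bins, labels=labels)` would replace the loop entirely.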
Attrition by Education —
print(df[(df['Attrition'] == 'Yes')].groupby('Education')['Attrition'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='Education', order=df['Education'].value_counts().index, palette='cubehelix').set_title('Attrition by Education');
Attrition by Environment Satisfaction —
print(df[(df['Attrition'] == 'Yes')].groupby('EnvironmentSatisfaction')['Attrition'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='EnvironmentSatisfaction', order=df['EnvironmentSatisfaction'].value_counts().index, palette='cubehelix').set_title('Attrition by Environment Satisfaction');
Attrition by Job Satisfaction —
print(df[(df['Attrition'] == 'Yes')].groupby('JobSatisfaction')['Attrition'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5),dpi=80)
sns.countplot(data=df[(df['Attrition'] == 'Yes')], x='JobSatisfaction',order=df['JobSatisfaction'].value_counts().index, palette='cubehelix');
Insight:
- 237 employees quit their jobs (133 from R&D, 92 from Sales, and 12 from Human Resources).
- Employees aged 25-31 (62 employees) left the most.
- Those who have completed a bachelor's degree are the most likely to leave (99 employees).
- Those with low environment satisfaction are more likely to leave (72 employees).
- Those who report high job satisfaction are, perhaps counterintuitively, the most likely to quit (73 employees).
5. Find out the distribution of employees in the education field
To identify the distribution of employees across education fields, we group the data by education field and plot a histogram.
plt.figure(figsize=(11,6), dpi=80)
sns.histplot(data=df, x='EducationField').set_title('Distribution of Employees by Education Field');
Insight: The majority of employees (606 employees) work in life sciences (education field).
6. Explore data for Marital Status
print(df[(df['Attrition'] == 'Yes')].groupby('MaritalStatus')['Attrition'].count().sort_values(ascending=False))
plt.figure(figsize=(8,5), dpi=80)
sns.countplot(data=df, x='MaritalStatus', hue='Attrition', order=df['MaritalStatus'].value_counts().index, palette='seismic_r').set_title('Attrition by Marital Status');
Insight: Single employees (120 employees) are the most likely to leave.
We have completed the statistical objectives. The dataset also contains a few categorical variables, and to build a logistic regression model we must convert them to numerical values. Here we simply substitute numerical codes for each category.
df['Attrition'].replace('Yes', 1, inplace=True)
df['Attrition'].replace('No', 0, inplace=True)

df['Department'].replace('Human Resources', 1, inplace=True)
df['Department'].replace('Research & Development', 2, inplace=True)
df['Department'].replace('Sales', 3, inplace=True)

df['EducationField'].replace('Human Resources', 1, inplace=True)
df['EducationField'].replace('Life Sciences', 2, inplace=True)
df['EducationField'].replace('Marketing', 3, inplace=True)
df['EducationField'].replace('Medical', 4, inplace=True)
df['EducationField'].replace('Other', 5, inplace=True)
df['EducationField'].replace('Technical Degree', 6, inplace=True)

df['MaritalStatus'].replace('Divorced', 1, inplace=True)
df['MaritalStatus'].replace('Married', 2, inplace=True)
df['MaritalStatus'].replace('Single', 3, inplace=True)
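The chain of `replace` calls above can be collapsed into one mapping dict per column via `Series.map`. A behaviorally equivalent sketch (the stand-in frame below is illustrative, not the real data):

```python
import pandas as pd

# Illustrative stand-in for two of the categorical columns
df = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'Yes'],
    'Department': ['Sales', 'Human Resources', 'Research & Development'],
})

# Same codes as the replace chain above
mappings = {
    'Attrition': {'Yes': 1, 'No': 0},
    'Department': {'Human Resources': 1, 'Research & Development': 2, 'Sales': 3},
}

for col, mapping in mappings.items():
    df[col] = df[col].map(mapping)

print(df['Attrition'].tolist())   # [1, 0, 1]
print(df['Department'].tolist())  # [3, 1, 2]
```

Keeping the codes in one dict makes the encoding easy to audit and to invert later when interpreting coefficients.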
7. Build a logistic regression model to predict which employees are likely to attrite
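The model code below calls `lr.fit(x_train, y_train)`, but the split that produces `x_train`, `x_test`, `y_train`, and `y_test` is not shown in the article. A minimal sketch using scikit-learn's `train_test_split`; the synthetic frame and the choice of feature columns are assumptions (the notebook's exact feature set is not shown):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the encoded df so the sketch runs standalone;
# in the notebook, the real encoded DataFrame would be used instead
df = pd.DataFrame({
    'Age': [29, 34, 31, 45, 24, 38, 52, 27],
    'Department': [3, 2, 3, 1, 2, 2, 1, 3],
    'EducationField': [3, 2, 4, 1, 2, 5, 6, 2],
    'MaritalStatus': [3, 2, 3, 1, 3, 2, 1, 3],
    'Attrition': [1, 0, 1, 0, 1, 0, 0, 1],
})

# Which columns go into x is an assumption
x = df.drop(columns=['Attrition'])
y = df['Attrition']

# Stratify so both classes appear in train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y)

print(x_train.shape, x_test.shape)
```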
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr = lr.fit(x_train, y_train)

# check the accuracy on the training set
print('Accuracy =', lr.score(x_train, y_train)*100, '%')

# predict the dependent variable
lr_y_pred = lr.predict(x_test)

# find class probabilities
prob = lr.predict_proba(x_test)
print(prob)

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Accuracy score
print('Test Accuracy Score:', accuracy_score(y_test, lr_y_pred)*100, '%\n')

# Classification report
print('Classification Report\n', classification_report(y_test, lr_y_pred))

# Confusion matrix
print('Confusion Matrix\n', confusion_matrix(y_test, lr_y_pred))
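As a sanity check, the accuracy score can be recomputed directly from the confusion matrix: correct predictions sit on the diagonal. The matrix below uses illustrative values, not the notebook's actual results:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = actual, columns = predicted
cm = np.array([[300, 10],
               [ 35, 23]])

# accuracy = (TN + TP) / total = diagonal sum / grand total
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy * 100, 2), '%')
```

Because attrition datasets are imbalanced (far more stays than quits), it is worth comparing accuracy against the majority-class baseline and checking recall on the minority class in the classification report.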
Hope you liked the story. Follow me for more stories like this. Find my Kaggle notebook here.