IBM Employee Attrition

Arnav Saxena
Published in CodeX
5 min read · Jan 18, 2023

Performing Exploratory Data Analysis and anticipating the factors that influence employee attrition in the organization.

Photo by Nick Fewings on Unsplash

In today’s competitive environment, employee attrition is a major concern for organizations across the globe, since companies spend millions of rupees or dollars just to retain their employees.

What is Employee Attrition?

Employee attrition is the process through which workers leave a company for any reason, voluntary or involuntary. In simple words, employee attrition refers to the progressive decrease in employee numbers, which implies that employees are leaving more quickly than they are being hired.

Employee Attrition can occur for various reasons like resignation, retirement, poor work environment, a lack of staff growth possibilities, and many more.

Attrition is an inevitable part of any business, and it is not always a bad thing. It can be problematic, however, since it typically results in a decline in talent within the workforce.

What is the Attrition Rate?

The attrition rate is the rate at which employees leave an organization over a specific time frame. By tracking attrition rates over time, a company can determine whether attrition is rising or falling, and a change in the rate can alert management to internal issues that may be contributing to employee departures. In this article we use logistic regression to predict employee attrition, that is, whether a given employee will leave (or resign from) the organization.
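As a rough illustration of the arithmetic, the sketch below assumes one common convention that the article itself does not state: departures in a period divided by the average headcount over that period. The function name and the figures are made up for the example.

```python
def attrition_rate(departures, headcount_start, headcount_end):
    """Return the attrition rate for one period, as a percentage.

    Assumed convention: departures / average headcount over the period.
    """
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return departures / average_headcount * 100


# Hypothetical figures: 15 departures while headcount went from 200 to 190
rate = attrition_rate(departures=15, headcount_start=200, headcount_end=190)
print(f"Attrition rate: {rate:.1f} %")
```

Computing this figure for each month or quarter gives the trend that management would track.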

Description:

IBM is an American MNC operating in around 170 countries with major business verticals such as computing, software, and hardware. Attrition is a major risk to service-providing organizations where trained and experienced people are the assets of the company. The organization would like to identify the factors which influence the attrition of employees.

Statistical Tasks:

  1. Import attrition dataset and import libraries such as pandas, matplotlib, numpy, and seaborn.
  2. Perform Exploratory Data Analysis.
  3. Find the age distribution of employees.
  4. Age Attrition of employees.
  5. Attrition of employees in the education field.
  6. Departmental Attrition of Employees.
  7. Explore data for Left employees.
  8. Explore data for marital status.
  9. Create a logistic regression model to predict which employees are likely to leave the organization.

→ 1) Import Libraries

# import libraries
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings('ignore')

→ 2) Exploratory Data Analysis

# load dataset
df = pd.read_csv('IBM Employee Attrition Data.csv')
df.describe(include='all')
df.info()
# checking for NaN values
df.isna().sum()

→ 3) Age distribution of employees.

We will use a count plot, which draws one bar per age, to show the age distribution. Adjust the figure size as needed.

Insight: The most common ages among employees are 34 and 35.

# Age distribution of employees
plt.figure(figsize=(10,6), dpi=100)
sns.countplot(data=df, x='Age')
plt.title('Age distribution of Employees')

→ 4) Age Attrition of Employees.

Next, we split the age counts by the Attrition column to see how attrition varies with age. A count plot with Attrition as the hue shows, for each age, how many employees left the company and how many stayed.

Insight: Attrition appears to be highest among employees aged 29 to 31.

# Age attrition of employees
plt.figure(figsize=(10,8), dpi=100)
sns.countplot(data=df, x='Age', hue='Attrition', palette='ocean')
plt.title('Age Attrition of Employees')
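Raw counts per age can be hard to compare because some ages have far more employees than others. One way to check an insight like this is to look at the share of leavers per age band. The sketch below uses a small made-up sample with the same column names as the dataset; the band edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up sample using the same column names as the article's dataset
sample = pd.DataFrame({
    "Age": [22, 25, 29, 30, 31, 35, 38, 42, 47, 55],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Group ages into bands: (17, 30], (30, 40], (40, 60]
bands = pd.cut(sample["Age"], bins=[17, 30, 40, 60],
               labels=["18-30", "31-40", "41-60"])

# True where the employee left; the mean of a boolean Series is a proportion
left = sample["Attrition"].eq("Yes")
rate_by_band = left.groupby(bands, observed=True).mean()
print(rate_by_band)
```

Running the same grouping on `df` would show whether the 29 to 31 peak reflects a higher attrition rate or simply a larger number of employees at those ages.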

→ 5) Attrition of employees in the education field.

# Attrition of employees by the education field
plt.figure(figsize=(8,5))
sns.countplot(data=df,x=df['EducationField'],
order=df['EducationField'].value_counts().sort_values(ascending=True).index,
palette = 'ocean').set_title('Attrition by Education Field')

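Insights: Life Sciences is the largest education field, with 606 employees. Since this plot counts all employees and not only those who left, it shows where most of the workforce sits rather than which field has the highest attrition; filter on Attrition == 'Yes' (as in the next step) to see departures by field.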
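To separate "largest field" from "field with the most attrition", a cross-tabulation of education field against attrition can help. This is a minimal sketch on made-up rows, not the article's results; only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names come from the article's dataset
sample = pd.DataFrame({
    "EducationField": ["Life Sciences", "Life Sciences", "Life Sciences",
                       "Life Sciences", "Medical", "Medical",
                       "Marketing", "Marketing"],
    "Attrition": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# Number of stayers and leavers in each field
counts = pd.crosstab(sample["EducationField"], sample["Attrition"])

# Same table as row-wise proportions, so the "Yes" column is the attrition rate
rates = pd.crosstab(sample["EducationField"], sample["Attrition"],
                    normalize="index")

print(counts)
print(rates["Yes"].sort_values(ascending=False))
```

In this toy sample the largest field (Life Sciences) has the lowest attrition rate, which is why counts and rates are worth reading side by side.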

→ 6) Departmental Attrition of Employees.

# Departmental Attrition of Employees
plt.figure(figsize=(10,6))
sns.countplot(data=df[(df['Attrition']=='Yes')], x = 'Department',palette='ocean',
order=df['Department'].value_counts().sort_values(ascending=True).index)
plt.title('Departmental Attrition of Employees')

Insights: The Research & Development department accounts for the largest number of employees who left IBM (133 employees).

→ 7) Explore data for employees who left.

print(df['Attrition'].value_counts())
plt.figure(figsize=(8,6),dpi = 100)
sns.countplot(data = df,x = df['Attrition'],
order = df['Attrition'].value_counts().sort_values(ascending=True).index,
palette = 'ocean').set_title('Attrition for Left Employees')

→ 8) Explore data for marital status.

# attrition split by marital status
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='MaritalStatus', hue='Attrition',
order=df['MaritalStatus'].value_counts().sort_values(ascending=True).index,
palette='ocean').set_title('Marital Status')

Insights: Single employees account for the largest number of departures (120 employees).

That completes the exploratory tasks. The dataset also contains a few categorical variables (Attrition, Department, EducationField, and MaritalStatus), which must be converted to numerical values before we can fit a logistic regression model. Here we simply replace each category with an integer code.

# map each category to an integer code and assign the result back;
# inplace=True on a selected column is not guaranteed to update df
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

df['Department'] = df['Department'].map({
    'Human Resources': 1,
    'Research & Development': 2,
    'Sales': 3,
})

df['EducationField'] = df['EducationField'].map({
    'Human Resources': 1,
    'Life Sciences': 2,
    'Marketing': 3,
    'Medical': 4,
    'Other': 5,
    'Technical Degree': 6,
})

df['MaritalStatus'] = df['MaritalStatus'].map({
    'Divorced': 1,
    'Married': 2,
    'Single': 3,
})
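Integer codes like these imply an ordering (for example, Sales = 3 being "more" than Human Resources = 1) that a linear model such as logistic regression will take literally. One alternative, shown here as a hedged sketch and not as the approach used in this article, is one-hot encoding with pandas' `get_dummies`. The sample rows are made up.

```python
import pandas as pd

# Made-up rows; column names match the article's dataset
sample = pd.DataFrame({
    "Age": [29, 41, 35],
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# One 0/1 indicator column per category; drop_first=True drops the first
# category of each variable so the indicators are not redundant
encoded = pd.get_dummies(
    sample,
    columns=["Department", "MaritalStatus"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
print(encoded)
```

The dropped category (here Human Resources and Divorced) becomes the reference level: a row with zeros in all indicator columns of a variable belongs to it.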

→ 9) Create a logistic regression model to predict which employees are likely to leave the organization.

# Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# separate the features (x) from the target (y)
x = df.drop(['Attrition'], axis=1)
y = df['Attrition']

# keep 80% of the rows for training and hold out 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=200)

# fit the model on the training set
lr = LogisticRegression()
lr.fit(x_train, y_train)

# check the accuracy on the training set
# (built-in round() works for both NumPy and plain Python floats)
print('Accuracy =', round(lr.score(x_train, y_train), 2)*100, '%')

# predict the dependent variable for the test set
y_pred = lr.predict(x_test)

# evaluate on the test set
print("Model accuracy:", round(accuracy_score(y_test, y_pred), 2)*100, '%')
print("*****************Classification report*****************\n", classification_report(y_test, y_pred))
print("*******************Confusion_matrix*********************\n", confusion_matrix(y_test, y_pred))
Accuracy = 84.0 %
Model accuracy: 87.0 %
*****************Classification report*****************
               precision    recall  f1-score   support

           0       0.87      0.99      0.93       253
           1       0.67      0.10      0.17        41

    accuracy                           0.87       294
   macro avg       0.77      0.54      0.55       294
weighted avg       0.84      0.87      0.82       294

*******************Confusion_matrix*********************
[[251   2]
 [ 37   4]]
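The headline accuracy hides how the model does on the employees who actually left. The sketch below recomputes the class 1 metrics directly from the confusion matrix printed above, using scikit-learn's layout (rows are actual classes, columns are predicted classes). The "always predict 0" baseline is added here for comparison and is not part of the original output.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[251, 2],
               [37, 4]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # of predicted leavers, how many really left
recall = tp / (tp + fn)            # of actual leavers, how many were caught
baseline = (tn + fp) / cm.sum()    # accuracy of always predicting "stays"

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"baseline  = {baseline:.2f}")
```

The model finds only 4 of the 41 employees who left (recall of about 0.10), and its 0.87 accuracy is barely above the 0.86 obtained by predicting that nobody leaves. For a problem where the goal is to spot likely leavers, recall on class 1 is the more informative number.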

Hope you liked the story. Follow me for more stories like this. Find my Kaggle notebook here.
