IBM Employee Attrition

Published in CodeX · 5 min read · Jan 18, 2023

Performing exploratory data analysis and identifying the factors that influence employee attrition in an organization.

In today’s competitive environment, employee attrition is a major concern for organizations across the globe, and companies spend millions of rupees or dollars on retaining their employees.

What is Employee Attrition?

Employee attrition is the process by which workers leave a company for any reason, voluntary or involuntary. Put simply, it is the gradual reduction in employee numbers that occurs when employees leave faster than they are replaced.

Employee attrition can occur for various reasons, such as resignation, retirement, a poor work environment, a lack of growth opportunities, and many more.

Attrition is an inevitable part of any business and is not always a bad thing. It becomes a problem when it results in a loss of talent within the workforce.

What is the Attrition Rate?

The attrition rate is the rate at which employees leave an organization over a specific time frame. Tracking it over time shows whether attrition is rising or falling, and a change in the rate can alert management to internal issues that may be contributing to departures. In this article, we use logistic regression to predict employee attrition, that is, whether an employee will leave (or resign from) the organization.
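As a rough illustration of the attrition rate, one common convention divides the number of departures in a period by the average headcount over that period. The article does not state a formula, so this convention is an assumption, and the function name and figures below are hypothetical.

```python
def attrition_rate(departures: int, headcount_start: int, headcount_end: int) -> float:
    """Return attrition for a period as a percentage of the average headcount."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return departures / average_headcount * 100


# Hypothetical figures: 1,000 employees at the start of the year,
# 960 at the end, and 120 departures during the year.
rate = attrition_rate(departures=120, headcount_start=1000, headcount_end=960)
print(f"Attrition rate: {rate:.1f}%")  # prints: Attrition rate: 12.2%
```

Comparing this figure across consecutive periods is what shows whether attrition is rising or falling.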

Description:

IBM is an American multinational corporation (MNC) operating in around 170 countries, with major business verticals such as computing, software, and hardware. Attrition is a major risk for service-providing organizations, where trained and experienced people are the company's main assets. The organization would like to identify the factors that influence employee attrition.

Statistical Tasks:

Import the attrition dataset and libraries such as pandas, matplotlib, numpy, and seaborn.
Perform exploratory data analysis.
Find the age distribution of employees.
Explore attrition by age.
Explore attrition by education field.
Explore attrition by department.
Explore data for employees who left.
Explore data by marital status.
Create a logistic regression model to predict which employees are likely to leave the organization.

→ 1) Import Libraries

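# import libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter/IPython magic: render plots inline (remove when running as a plain script)
%matplotlib inline

# silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')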

→ 2) Exploratory Data Analysis

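# load dataset
df = pd.read_csv('IBM Employee Attrition Data.csv')

# summary statistics for every column (numeric and categorical)
print(df.describe(include='all'))

# column dtypes and non-null counts
df.info()

# checking for NaN values
print(df.isna().sum())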
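Part of this exploration is finding out which columns hold text categories, since those need encoding before modelling. A minimal sketch of that check, using a tiny hypothetical sample in place of the real CSV:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the real dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "Department": ["Sales", "Research & Development",
                   "Research & Development", "Sales"],
})

# Non-numeric columns are the ones that will need encoding later.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct categories in each of those columns.
unique_counts = sample[categorical_cols].nunique()
print(unique_counts)
```

The same two lines can be run on df to list its categorical columns and how many levels each one has.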

→ 3) Age distribution of employees.

We use a count plot to show the age distribution, with one bar per age. Adjust figsize to resize the graph as needed.

Insight: The most common ages among employees are 34 and 35.

# Age distribution of employees
plt.figure(figsize=(10, 6), dpi=100)
sns.countplot(data=df, x='Age')
plt.title('Age distribution of Employees')
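With one bar per age the plot can get crowded, so grouping ages into bands is sometimes easier to read. The sketch below uses pd.cut for this; the ages, band edges, and labels are hypothetical choices, not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages; in the article these would come from df['Age'].
ages = pd.Series([22, 29, 31, 34, 35, 35, 41, 58], name="Age")

# Band edges are an illustrative choice. pd.cut intervals are
# right-inclusive by default, so 35 falls into the 26-35 band.
bins = [17, 25, 35, 45, 60]
labels = ["18-25", "26-35", "36-45", "46-60"]
age_band = pd.cut(ages, bins=bins, labels=labels)

# Count employees per band, keeping the bands in their natural order.
band_counts = age_band.value_counts().sort_index()
print(band_counts)
```

The resulting band column can be passed to sns.countplot in the same way as the raw Age column.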

→ 4) Age Attrition of Employees.

Next, we break the age distribution down by attrition status. A count plot with Attrition as the hue shows, for each age, how many employees left the company and how many stayed.

Insight: Employees who left IBM are concentrated between the ages of 29 and 31.

# Age attrition of employees
plt.figure(figsize=(10, 8), dpi=100)
sns.countplot(data=df, x='Age', hue='Attrition', palette='ocean')
plt.title('Age Attrition of Employees')
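A numeric summary can back up what the plot suggests. The sketch below compares the age of leavers and stayers with a groupby; the six rows are hypothetical, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows; in the article this would be df[['Age', 'Attrition']].
sample = pd.DataFrame({
    "Age": [28, 30, 31, 45, 36, 52],
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Group size and median age for each attrition group.
age_by_attrition = sample.groupby("Attrition")["Age"].agg(["count", "median"])
print(age_by_attrition)
```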

→ 5) Attrition of employees in the education field.

# Employees by education field (this plot counts all employees, not only those who left)
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='EducationField',
              order=df['EducationField'].value_counts().sort_values(ascending=True).index,
              palette='ocean').set_title('Employees by Education Field')

Insights: Life Sciences is the largest education field, with 606 employees. Because the plot counts every employee rather than only those who left, it shows where most of the workforce comes from rather than which field has the highest attrition.
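Raw counts depend heavily on how many employees each field has, so a share per category can be easier to compare. A minimal sketch using pd.crosstab with row normalization; the rows are hypothetical and chosen so that the largest field has the lowest attrition share.

```python
import pandas as pd

# Hypothetical rows; in the article this would be df itself.
sample = pd.DataFrame({
    "EducationField": ["Life Sciences", "Life Sciences", "Life Sciences",
                       "Life Sciences", "Medical", "Medical",
                       "Marketing", "Marketing"],
    "Attrition": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# normalize="index" turns each row into shares that sum to 1.
rates = pd.crosstab(sample["EducationField"], sample["Attrition"],
                    normalize="index")

# Share of employees who left, per field, highest first.
attrition_share = rates["Yes"].sort_values(ascending=False)
print(attrition_share)
```

The same pattern works for Department or MaritalStatus by swapping the first argument.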

→ 6) Departmental Attrition of Employees.

# Departmental attrition: count only the employees who left
plt.figure(figsize=(10, 6))
sns.countplot(data=df[df['Attrition'] == 'Yes'], x='Department', palette='ocean',
              order=df['Department'].value_counts().sort_values(ascending=True).index)
plt.title('Departmental Attrition of Employees')

Insights: The largest number of departures (133 employees) comes from the Research & Development department.

→ 7) Explore data for Left employees.

# number of employees who left ('Yes') versus stayed ('No')
print(df['Attrition'].value_counts())
plt.figure(figsize=(8, 6), dpi=100)
sns.countplot(data=df, x='Attrition',
              order=df['Attrition'].value_counts().sort_values(ascending=True).index,
              palette='ocean').set_title('Attrition for Left Employees')
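The split between leavers and stayers also sets a reference point for any model: always predicting the majority class already yields a certain accuracy. The sketch below computes that baseline. The class counts are taken from the test-set support column of the classification report later in this article (253 stayed, 41 left); the baseline idea itself is an addition, not part of the original analysis.

```python
import pandas as pd

# Class counts from the test-set support in the classification report.
attrition = pd.Series(["No"] * 253 + ["Yes"] * 41, name="Attrition")

# Share of each class instead of raw counts.
shares = attrition.value_counts(normalize=True)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = shares.max()
print(shares.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

A model is only informative about attrition to the extent that it improves on this baseline.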

→ 8) Explore data for Marital status.

# attrition by marital status
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='MaritalStatus', hue='Attrition',
              order=df['MaritalStatus'].value_counts().sort_values(ascending=True).index,
              palette='ocean').set_title('Marital Status')

Insights: Single employees account for the largest number of departures (120 employees).

With the exploratory tasks complete, we can prepare the data for modelling. The dataset contains a few categorical variables, which must be converted into numerical values before fitting a logistic regression model. Here we simply replace each category with an integer code.

# map each category to an integer code and assign the result back to the column;
# a chained inplace replace does not update df under pandas Copy-on-Write
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

df['Department'] = df['Department'].map({
    'Human Resources': 1,
    'Research & Development': 2,
    'Sales': 3,
})

df['EducationField'] = df['EducationField'].map({
    'Human Resources': 1,
    'Life Sciences': 2,
    'Marketing': 3,
    'Medical': 4,
    'Other': 5,
    'Technical Degree': 6,
})

df['MaritalStatus'] = df['MaritalStatus'].map({
    'Divorced': 1,
    'Married': 2,
    'Single': 3,
})
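Integer codes imply an order between categories (for example, Sales above Human Resources) that a linear model will treat as meaningful. One-hot encoding is a common alternative that avoids this. The sketch below is a variation on the article's approach, not the method used for the reported results, and the three rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as the dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# One indicator column per category; drop_first removes the first
# (alphabetical) level of each variable to avoid redundant columns.
encoded = pd.get_dummies(sample, columns=["Department", "MaritalStatus"],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The dropped level of each variable becomes the reference category against which the model's coefficients are interpreted.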

→ 9) Create a logistic regression model to predict which employees are likely to leave the organization.

# Model Building
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# features (x) and target (y)
x = df.drop(['Attrition'], axis=1)
y = df['Attrition']

# hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=200)

# fit the model on the training set
lr = LogisticRegression()
lr.fit(x_train, y_train)

# check the accuracy on the training set
print('Accuracy =', round(lr.score(x_train, y_train), 2) * 100, '%')

# predict dependent variable
y_pred = lr.predict(x_test)

# evaluate the predictions on the test set
print("Model accuracy:", round(accuracy_score(y_test, y_pred), 2) * 100, '%')
print("*****************Classification report*****************\n", classification_report(y_test, y_pred))
print("*******************Confusion_matrix*********************\n", confusion_matrix(y_test, y_pred))

Accuracy = 84.0 %
Model accuracy: 87.0 %
*****************Classification report*****************
               precision    recall  f1-score   support

           0       0.87      0.99      0.93       253
           1       0.67      0.10      0.17        41

    accuracy                           0.87       294
   macro avg       0.77      0.54      0.55       294
weighted avg       0.84      0.87      0.82       294

*******************Confusion_matrix*********************
 [[251   2]
 [ 37   4]]
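The headline accuracy hides how the model treats the employees who actually left. The sketch below recomputes the class 1 metrics by hand from the confusion matrix above, using scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix from the output above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 251, 2   # employees who stayed: predicted stay / predicted leave
fn, tp = 37, 4    # employees who left:   predicted stay / predicted leave

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted leavers who really left
recall = tp / (tp + fn)               # share of actual leavers the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.87 precision=0.67 recall=0.10 f1=0.17
```

These values match the classification report: the model identifies only 4 of the 41 employees who left, so the 87% accuracy comes mostly from correctly predicting the employees who stayed.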

Hope you liked the story. Follow me for more stories like this. Find my Kaggle notebook here.

Written by Arnav Saxena