Exploratory Data Analysis — Employee Attrition Rate

Abhilash Singh
Jul 6, 2020 · 11 min read

IBM EMPLOYEE ATTRITION DATA ANALYSIS

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null int64
7 EducationField 1470 non-null object
8 EnvironmentSatisfaction 1470 non-null int64
9 Gender 1470 non-null object
10 HourlyRate 1470 non-null int64
11 JobInvolvement 1470 non-null int64
12 JobLevel 1470 non-null int64
13 JobRole 1470 non-null object
14 JobSatisfaction 1470 non-null int64
15 MaritalStatus 1470 non-null object
16 MonthlyIncome 1470 non-null int64
17 MonthlyRate 1470 non-null int64
18 NumCompaniesWorked 1470 non-null int64
19 OverTime 1470 non-null object
20 PercentSalaryHike 1470 non-null int64
21 RelationshipSatisfaction 1470 non-null int64
22 StockOptionLevel 1470 non-null int64
23 TotalWorkingYears 1470 non-null int64
24 TrainingTimesLastYear 1470 non-null int64
25 WorkLifeBalance 1470 non-null int64
26 YearsAtCompany 1470 non-null int64
27 YearsInCurrentRole 1470 non-null int64
28 YearsSinceLastPromotion 1470 non-null int64
29 YearsWithCurrManager 1470 non-null int64
dtypes: int64(22), object(8)
memory usage: 344.7+ KB

DEFINING OUR METRIC: EMPLOYEE ATTRITION RATE

# Let's find out our dataset's naive attrition rate

print(len(df[df.Attrition == True]) / len(df) * 100)
= 16.12%
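The comparison with `True` above presupposes that `Attrition`, listed as `object` in the column summary, has already been converted to booleans. That step is not shown. Here is a minimal sketch of one way to do it, assuming the raw labels are the strings 'Yes' and 'No' (an assumption; check `df['Attrition'].unique()` first). The small `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the 'Yes'/'No' labels are an assumption.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# Map labels to booleans; an unexpected label becomes NaN, so fail loudly on it.
encoded = toy["Attrition"].map({"Yes": True, "No": False})
if encoded.isna().any():
    raise ValueError("unexpected Attrition label")
toy["Attrition"] = encoded.astype(bool)

# With a boolean column, the mean is the attrition rate.
rate_pct = toy["Attrition"].mean() * 100
print(round(rate_pct, 2))
```

Once the column is boolean, `df['Attrition'].mean()` and the `groupby(...).Attrition.mean()` calls used later all read directly as attrition rates.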

Let's use some hacker statistics (bootstrap resampling) to infer the attrition rate of the wider IBM population.

import numpy as np

# Sampling mean with confidence interval --- Defining our function
# a and b define the range (lower and upper percentiles) of our confidence interval
def conf_sample(data, a, b, func, size):
    # One bootstrap replicate of the statistic per iteration
    replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement, keeping the original sample size
        replicate = np.random.choice(data, len(data))
        replicates[i] = func(replicate)
    x, y = np.percentile(replicates, [a, b])
    return (x, y)

conf_sample(df['Attrition'], 2.5, 97.5, np.mean, 10000)
(0.14217687074829932, 0.17959183673469387)
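As a sanity check on the interval above, here is a self-contained sketch that reruns the same percentile bootstrap with a seeded generator and compares it with the textbook normal-approximation interval for a proportion. The function name `bootstrap_ci` and the `flags` array are illustrative assumptions: `flags` is synthetic 0/1 data built to match the 16.12% rate reported above, not the real dataset.

```python
import numpy as np

def bootstrap_ci(data, func=np.mean, lo=2.5, hi=97.5, size=10_000, seed=0):
    """Percentile bootstrap interval for func(data), using a seeded generator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement at the original sample size
        replicates[i] = func(rng.choice(data, size=len(data), replace=True))
    low, high = np.percentile(replicates, [lo, hi])
    return float(low), float(high)

# Synthetic 0/1 flags: 237 leavers out of 1470, i.e. about 16.12%
flags = np.zeros(1470)
flags[:237] = 1

boot_low, boot_high = bootstrap_ci(flags, size=2_000)

# Normal-approximation interval for a proportion, for comparison
p, n = flags.mean(), len(flags)
half_width = 1.96 * np.sqrt(p * (1 - p) / n)
print((boot_low, boot_high), (p - half_width, p + half_width))
```

Seeding the generator makes the interval reproducible from run to run. With a sample this large the two intervals should land close to each other.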

EXPLORATORY DATA ANALYSIS

Age

# Let's compare the empirical cumulative distributions (ECDFs) of age for two groups: attrited and non-attrited employees.

import matplotlib.pyplot as plt

# Defining a function for the ECDF

def ecdf(data):
    # y[i] is the proportion of observations less than or equal to x[i]
    y = np.arange(1, len(data) + 1) / len(data)
    x = np.sort(data)
    return x, y

# Plotting the ECDFs

x_yes, y_yes = ecdf(df[df['Attrition']==True].Age)
x_no, y_no = ecdf(df[df['Attrition']==False].Age)
plt.figure(figsize=(10,5))
plt.plot(x_yes, y_yes, linestyle = 'none', marker = '.', color = 'r')
plt.plot(x_no, y_no, linestyle = 'none', marker = '.', color = 'b')
plt.ylabel('PROPORTION')
plt.title('ECDFS')
plt.legend(['Yes','No'], title = 'Attrition')

plt.annotate('Higher Difference',
xy = (35, 0.5),
xytext = (45, 0.4),
arrowprops = {'arrowstyle':'->', 'color':'gray'})
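The 'Higher Difference' annotation is placed by eye. One way to put a number on it is the largest vertical gap between the two ECDFs, which is the two-sample Kolmogorov-Smirnov statistic. The sketch below is an illustration, not part of the original analysis: the helper names `ecdf_at` and `max_ecdf_gap` are hypothetical and the ages are synthetic.

```python
import numpy as np

def ecdf_at(sample, points):
    """Fraction of sample values less than or equal to each point."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, points, side="right") / len(sample)

def max_ecdf_gap(a, b):
    """Largest vertical distance between the ECDFs of a and b, and where it occurs."""
    grid = np.union1d(a, b)  # sorted unique values from both samples
    gaps = np.abs(ecdf_at(a, grid) - ecdf_at(b, grid))
    i = np.argmax(gaps)
    return grid[i], gaps[i]

# Synthetic ages, only to exercise the function
rng = np.random.default_rng(0)
ages_left = rng.normal(33, 9, 200).round()
ages_stayed = rng.normal(38, 9, 1200).round()

age, gap = max_ecdf_gap(ages_left, ages_stayed)
print(age, gap)
```

On the real data, the inputs would be the same two `Age` series passed to `ecdf` above, and the returned age could replace the hand-picked `xy` in `plt.annotate`.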

Gender

df.groupby('Gender').Attrition.mean()
Gender
Female 0.147959
Male 0.170068
Name: Attrition, dtype: float64

Business Travel

df.groupby('BusinessTravel')['Attrition'].mean()
BusinessTravel
Non-Travel 0.080000
Travel_Frequently 0.249097
Travel_Rarely 0.149569
Name: Attrition, dtype: float64

Department

df.groupby('Department')['Attrition'].mean()
Department
Human Resources 0.190476
Research & Development 0.138398
Sales 0.206278
Name: Attrition, dtype: float64
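A group mean on its own hides how many employees sit behind it, and a rate from a small group is noisier than one from a large group. Here is a small sketch that reports the rate together with the group size and the leaver count. The helper name `attrition_by` and the `toy` frame are assumptions for illustration, and `Attrition` is assumed to be boolean.

```python
import pandas as pd

def attrition_by(df, col):
    """Attrition rate per group, with group size and leaver count."""
    return (
        df.groupby(col)["Attrition"]
          .agg(rate="mean", n="size", leavers="sum")
          .sort_values("rate", ascending=False)
    )

# Made-up data: 4 Sales employees (2 leavers), 6 R&D employees (1 leaver)
toy = pd.DataFrame({
    "Department": ["Sales"] * 4 + ["R&D"] * 6,
    "Attrition": [True, True, False, False] + [True] + [False] * 5,
})

summary = attrition_by(toy, "Department")
print(summary)
```

The same helper would work for `Gender` and `BusinessTravel`, which keeps the three comparisons consistent.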

Education

What if we hypothesize that the higher attrition rates at some education levels and in some education fields are due to monthly rates?

df.groupby('EducationField').MonthlyRate.mean()
EducationField
Human Resources 14810.740741
Life Sciences 14530.132013
Marketing 14076.943396
Medical 14295.056034
Other 13270.780488
Technical Degree 14210.363636
Name: MonthlyRate, dtype: float64
df.groupby('Education').MonthlyRate.mean()
Education
1 15208.100000
2 14249.946809
3 14082.809441
4 14281.989950
5 14516.687500
Name: MonthlyRate, dtype: float64
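The group means above mix leavers and stayers together. A more direct check of the hypothesis is to compare the mean monthly rate of the two groups inside each field. The sketch below does that on a made-up `toy` frame; the `status` column and the `gap` column are illustrative additions, and the numbers are not from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to show the shape of the comparison
toy = pd.DataFrame({
    "EducationField": ["Medical", "Medical", "Medical",
                       "Marketing", "Marketing", "Marketing"],
    "Attrition": [True, False, False, True, True, False],
    "MonthlyRate": [9000, 15000, 14000, 12000, 10000, 16000],
})

# Readable labels instead of True/False column headers
toy["status"] = np.where(toy["Attrition"], "left", "stayed")

by_field = (
    toy.groupby(["EducationField", "status"])["MonthlyRate"]
       .mean()
       .unstack()
)
# Positive gap: stayers earn a higher monthly rate than leavers in that field
by_field["gap"] = by_field["stayed"] - by_field["left"]
print(by_field)
```

If the gap is close to zero in every field on the real data, monthly rate is unlikely to explain the differences in attrition.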

Satisfaction and Well-Being

import seaborn as sns

cols = ['JobInvolvement', 'JobSatisfaction', 'RelationshipSatisfaction', 'WorkLifeBalance', 'EnvironmentSatisfaction']
fig, ax = plt.subplots(len(cols), 1, figsize=(10, 8), constrained_layout=True)

for i, col in enumerate(cols):
    # x and y are passed by keyword; newer seaborn releases reject positional x/y
    sns.barplot(x=col, y='Attrition', data=df, ax=ax[i], ci=None)

Job Level and Job Role

cols = ['JobLevel', 'JobRole']
fig, ax = plt.subplots(len(cols),1, figsize= (17,8), constrained_layout=True)
for i, col in enumerate(cols):
    # x and y are passed by keyword; newer seaborn releases reject positional x/y
    sns.barplot(x=col, y='Attrition', data=df, ax=ax[i], ci=None)

Let's Uncover Some Other Insights

a) Distance From Home

Attrition_Y = df[df['Attrition']==True]
Attrition_N = df[df['Attrition']==False]
sns.kdeplot(Attrition_Y.DistanceFromHome)
sns.kdeplot(Attrition_N.DistanceFromHome)
plt.legend(('Yes', 'No'))
# Let's look at distance from home and attrition levels among various job roles.

df.groupby(['JobRole','Attrition']).DistanceFromHome.mean().unstack()
df[df['DistanceFromHome']>10].groupby('BusinessTravel').Attrition.mean()
BusinessTravel
Non-Travel 0.148936
Travel_Frequently 0.298851
Travel_Rarely 0.193548
Name: Attrition, dtype: float64

b) Monthly Income and Attrition

sns.kdeplot(Attrition_Y.MonthlyIncome)
sns.kdeplot(Attrition_N.MonthlyIncome)
plt.legend(('Yes', 'No'))
plt.figure(figsize=(20,8))
sns.boxplot(x='JobRole', y='MonthlyIncome', data=df)
plt.figure(figsize=(20,8))
sns.boxplot(x='JobRole', y='MonthlyIncome', hue='Gender', data=df)

c) Correlation among other variables

df_c = df.select_dtypes('int64')
plt.figure(figsize=(15,15))
sns.heatmap(df_c.corr(), annot = True, fmt = '.2f')
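If `Attrition` is stored as a boolean, `select_dtypes('int64')` leaves it out, so the heatmap shows how the numeric columns relate to each other but not to attrition. Here is a hedged sketch of ranking numeric columns by their correlation with attrition, using `DataFrame.corrwith`. The data and the rule that generates `Attrition` are synthetic, chosen only to make the example run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for two of the numeric columns
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, n),
    "DistanceFromHome": rng.integers(1, 30, n),
})
# Synthetic rule: the chance of leaving falls with age (not a finding from the data)
toy["Attrition"] = rng.random(n) < (0.4 - 0.006 * (toy["Age"] - 18))

# Numeric columns only (booleans are excluded), correlated with attrition as 0/1
numeric = toy.select_dtypes("number")
corr_with_attrition = (
    numeric.corrwith(toy["Attrition"].astype(int))
           .sort_values(key=np.abs, ascending=False)
)
print(corr_with_attrition)
```

Sorting by absolute value puts the strongest relationships first, whatever their sign. Correlation with a 0/1 column only picks up linear effects, so it is a first filter rather than a conclusion.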

FACTORS AFFECTING ATTRITION

Depending upon the business question at hand, there are numerous ways in which the dataset can be further manipulated. The findings above are interesting, and they could be strengthened with formal hypothesis testing and modelling, stating the appropriate assumptions wherever needed.
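As one example of the hypothesis testing mentioned here, a permutation test can check whether a difference in attrition rates between two groups could plausibly arise by chance. This stays in the same hacker-statistics style as the bootstrap earlier. It is a sketch under assumptions: the function name `perm_test_diff` is hypothetical, the two groups are synthetic 0/1 arrays, and the test is two-sided.

```python
import numpy as np

def perm_test_diff(a, b, size=10_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(size):
        # Shuffle the pooled data and split it back into groups of the original sizes
        perm = rng.permutation(pooled)
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / size

# Synthetic groups: 75 leavers out of 300 versus 150 leavers out of 1000
group_a = np.array([1] * 75 + [0] * 225)
group_b = np.array([1] * 150 + [0] * 850)

observed, p_value = perm_test_diff(group_a, group_b, size=2_000)
print(observed, p_value)
```

On the real data, `group_a` and `group_b` could be the `Attrition` values of, say, frequent travellers and everyone else. A small p-value says the gap is unlikely under random labelling, not that travel causes attrition.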

a) Age:

b) Satisfaction and Mental Well-being:

c) Educational Factors:

d) Job Related Factors:

e) Daily Commute:

f) Monthly Income:

Abhilash Singh

Written by

Data Analyst — Finding simplicity within technical jargons

The Startup
