Employee Attrition Analysis Using Machine Learning Methods

Dec 10, 2017

This project was developed as a final project for the Digital Academy, a course by Czechitas supported by Google.org.

By: Monika Podborská

Mentor: Miloš Minařík

Introduction

Would you guess that more data has been created since 2013 than in all of human history before that date? (source here). This fact took a while to sink in. We live in the age of information. Most of us cannot imagine our lives without a smartphone. Our data are harvested with every search query we perform on Google, every photo we post on Facebook and every payment at a local store where we use a loyalty card. Understanding data and what can be done with them is becoming an essential skill for everybody.

In this article I will discuss my final project, which was developed for the Digital Academy course by Czechitas. My previous blog posts (hopefully) brought some insight into what it was all about and what 28 girls (most of whom knew little to nothing about data analysis) learnt over the course of two months.

My motivation to enroll in the Digital Academy was to gain basic knowledge of data analysis, together with my urge to move to a more technical field of work. Even before the Academy I liked working in Excel, creating reports and searching for useful information in them. I also enjoyed looking for and finding bugs in the CRM application I use at work. But lately I realized it was not enough and decided to take another step in my career.

For the final project I chose to analyze employee attrition, as it is a big problem many companies are facing nowadays. Acquiring a new employee can be costly in both time and money, so some companies go to great lengths to keep their employees happy and satisfied. One way to keep employees from leaving is to analyze why people who left the company decided to do so, predict who could be leaving next and take preemptive action.

In my project I analyzed data from an HR dataset containing basic information about the employees (age, monthly income, years at company etc.) and whether they left or stayed with the company. The dataset is relatively small: 1,470 rows (one per employee) and 35 columns (attributes). Below this paragraph is a table showing the first five rows of the dataset. During the Academy I fell in love with the programming language Python, so I decided to include ‘some fun with Python’ in the project as well (I do not dare to call it programming, yet). The core of the project is the prediction of attrition by machine learning (ML) methods and a comparison of their results.

First five rows of the dataset (better view on Github)

Practical approach

Software I used for the final project

I had found the dataset I wanted to analyze even before the Academy started and had an idea of what kind of analysis I wanted to do. The final decision to try machine learning was made at the Meet Your Mentor evening. There I met my mentor Miloš Minařík from the company Safetica, who suggested that machine learning was a good fit for the type of data I had. I therefore started work on the project by reading about the basics of machine learning, because I had little to no knowledge in this field.

After the first SQL and Python lessons I started with the exploratory data analysis (EDA). The EDA consisted of taking a closer look at the data, deciding which attributes were not needed in the dataset, computing some descriptive statistics, looking at data distributions and running correlations. This part was essential for a good understanding of my data.
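For illustration, a minimal EDA sketch in Python with pandas could look like the following (the file name attrition.csv and the column name Attrition are placeholders, not the actual project files):

```python
import pandas as pd

# Load the HR dataset (the file name is illustrative, adjust to your copy)
df = pd.read_csv("attrition.csv")

# Shape and data types: expect 1470 rows and 35 columns
print(df.shape)
print(df.dtypes)

# Descriptive statistics for the numeric attributes
print(df.describe())

# Attributes with a single unique value carry no information and can be dropped
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)

# Distribution of the target attribute
print(df["Attrition"].value_counts(normalize=True))
```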

Our brains process pictures better than numbers or text, so I tried to turn every useful piece of information into a colorful representation. The following charts show the percentage of employees who left and who stayed in a given category of an attribute (for example, the attribute work travel frequency has the categories non-travel, travel-rarely and travel-frequently). I picked these three charts in particular because they represent the attributes which I thought would play a big part in any employee's decision whether to leave the company. You can see that there really is a difference in attrition levels between the groups.

Bar charts showing attrition levels for given attribute
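Charts like these can be put together with a few lines of pandas and matplotlib. A minimal sketch for the work travel chart, assuming columns named BusinessTravel and Attrition:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("attrition.csv")  # illustrative file name

# Share of leavers vs. stayers within each travel-frequency category, in percent
shares = (df.groupby("BusinessTravel")["Attrition"]
            .value_counts(normalize=True)
            .unstack() * 100)

shares.plot(kind="bar")
plt.ylabel("Employees [%]")
plt.title("Attrition by work travel frequency")
plt.tight_layout()
plt.show()
```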

The following Pearson correlation plot shows that there are strong correlations between some attributes (e.g. monthly income and job level, job level and total working years, total working years and monthly income), but our target attribute, attrition, correlates poorly with the other attributes, which tells us that correctly classifying leaving employees would not be a piece of cake. Attrition correlates weakly only with the attributes overtime and business travel; it is more likely to depend on a combination of attributes than on a single one.

Pearson correlation plot — script can be found here
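The linked script aside, a minimal equivalent sketch in pandas and matplotlib could be:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("attrition.csv")  # illustrative file name

# Pearson correlation over the numeric attributes only
corr = df.select_dtypes(include="number").corr(method="pearson")

fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.tight_layout()
plt.show()
```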

The next step was to implement the ML methods. For the ML part I used a tool called Weka 3.8, software developed by the Machine Learning Group at the University of Waikato, as it was recommended by my mentor. It is a very complex tool, but in my opinion suitable for machine learning beginners.

Firstly, the adequate methods were selected: Random Forest Classifier, Support Vector Machines and Naive Bayes Net. These methods are among the most used in ML and they all showed good results in the first screening. A Neural Network was also in play for the final result, serving as a voting mechanism: it was fed the results of the three methods and decided on the overall outcome.
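In scikit-learn terms this setup resembles stacking: the outputs of the base methods become the inputs of a small neural network. A hedged sketch of the idea (not the actual Weka configuration; GaussianNB stands in here for the Bayes net):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# The three base methods used in the project
base_methods = [
    ("random_forest", RandomForestClassifier(random_state=1)),
    ("svm", SVC(probability=True, random_state=1)),
    ("naive_bayes", GaussianNB()),
]

# The final estimator plays the role of the voting neural network:
# it learns from the base methods' outputs and decides the overall result
voter = StackingClassifier(
    estimators=base_methods,
    final_estimator=MLPClassifier(max_iter=1000, random_state=1),
)
# voter.fit(X_train, y_train); voter.predict(X_test)
```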

Secondly, the methods were optimized. A Cost Sensitive Classifier was used for all methods except Support Vector Machines. This classifier takes a cost matrix into account. The cost matrix tells the method that there will be a ‘punishment’ for misclassifications. The matrix distinguishes True Positives (correctly classified as yes), True Negatives (correctly classified as no), False Positives (classified as yes, but should be no) and False Negatives (classified as no, but should be yes). In the cost matrix we decide which of the misclassifications will be punished more, by changing the values in the matrix: the greater the value, the bigger the punishment. So, if a method classifies more data as False Positive/Negative than you would like, it is time to use the Cost Sensitive Classifier. The picture below shows the cost matrix used for each method. Otherwise the default settings were used for each method, as they proved to work best (Weka is a very good tool; running a method with default settings usually brings decent results).

Cost matrices for methods used
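To make the logic concrete: with a cost matrix of [[0, 1], [5, 0]], a false negative costs five times as much as a false positive, so the method would rather risk flagging a stayer than miss a leaver. Weka applies such a matrix directly; in scikit-learn the closest analogue is class weighting. A sketch with illustrative weights:

```python
from sklearn.ensemble import RandomForestClassifier

# Weighting the 'Yes' (leaver) class five times heavier is the scikit-learn
# analogue of raising the false-negative cost in Weka's cost matrix;
# the exact weights here are illustrative, not the values from the project
clf = RandomForestClassifier(class_weight={"No": 1, "Yes": 5}, random_state=1)
```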

As I stated in the Introduction, I really enjoy having fun with Python. Most of the visualizations were done in Python, because why choose the easy way when the purpose of the project is — among other things — to learn? I tried to do some basic ML there as well. My aim was to run all of the above-mentioned methods in Python too, but due to lack of time (and skill, to be honest), the only method I successfully finished was the Random Forest Classifier. Here is the script (including the script for the plot shown later in the Results & Discussion part):

Script for the Random Forest Classifier
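In outline, such a script could look like the following sketch (file and column names are again placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("attrition.csv")  # illustrative file name

# Encode the categorical attributes and separate the target
X = pd.get_dummies(df.drop(columns=["Attrition"]))
y = df["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```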

Results & Discussion

The aim of this project was to determine the reasons why employees leave the company and who might be leaving next.

The following plot shows the average results of ten separate runs of the Random Forest Classifier; the dots represent the average importance of every attribute. The top five ranked attributes are monthly income, age, distance from home, overtime and total working years. We can see that some of the attributes which seemed important after the exploratory data analysis — monthly income and overtime — really turned out to be significant for the classifier.

Feature importance of Random Forest Classifier
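A sketch of how the ten-run averaging could be done, under the same placeholder names as above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("attrition.csv")                    # illustrative file name
X = pd.get_dummies(df.drop(columns=["Attrition"]))   # encode categoricals
y = df["Attrition"]

# Average the feature importances over ten differently seeded forests
importances = []
for seed in range(10):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X, y)
    importances.append(clf.feature_importances_)
mean_importance = np.mean(importances, axis=0)

# Dot plot of the attributes sorted by average importance
order = np.argsort(mean_importance)
plt.scatter(mean_importance[order], np.arange(len(order)))
plt.yticks(np.arange(len(order)), X.columns[order])
plt.xlabel("Average importance")
plt.tight_layout()
plt.show()
```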

How did the methods succeed in classifying employees? Below you can find the accuracy and confusion matrix (similar logic to the cost matrix) for every method. None of the accuracies got past the magical threshold of 90 %, but they can be seen as pretty decent; the limit is partially caused by the dataset, which is relatively small in rows yet has many attributes. From the confusion matrices it can be seen that the biggest issue for all the methods was the occurrence of false negatives. Because the aim of the ML part was to detect which employees are leaving, false negatives are the least wanted outcome, which is why a cost matrix was used to ‘punish’ their misclassification. The confusion matrix of the neural network shows that out of 237 leavers, 90 were correctly classified, while 147 were classified incorrectly.

Accuracy and confusion matrices for the methods used
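To read the neural network numbers concretely:

```python
# Reading the neural network's confusion matrix for the leaver class
true_positives = 90     # leavers correctly classified as leaving
false_negatives = 147   # leavers missed, i.e. classified as staying
total_leavers = true_positives + false_negatives   # 237 leavers in total

recall = true_positives / total_leavers
print(f"Recall for the leaver class: {recall:.2f}")  # roughly 0.38
```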

Another comparison of the methods can be seen in the following Venn diagram. It shows all the employees that were classified as leavers. The part called actual attrition represents the actual leavers; the other parts are named after the methods used. The total number of leavers in the dataset was 237, out of which 92 were missed by all methods (classified as non-leavers) and 60 were correctly classified by all three methods. Two or more methods correctly classified 106 employees, but 47 employees were also classified incorrectly.

Venn diagram showing correctly assigned observations as well as number of false negatives of every method.
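A diagram like this can be drawn with the matplotlib-venn library. A minimal sketch, where each set would hold the indices of the employees a method classified as leavers (the indices below are dummies that only illustrate the call, not project data):

```python
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# Each set holds the row indices one method classified as leavers
rf_leavers  = {1, 2, 3, 5, 8}
svm_leavers = {2, 3, 5, 13}
nb_leavers  = {3, 5, 8, 21}

venn3([rf_leavers, svm_leavers, nb_leavers],
      set_labels=("Random Forest", "SVM", "Naive Bayes"))
plt.show()
```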

What is also interesting is the profile of an average leaver. This can be very useful, as it can help identify other possible leavers. The charts below represent the average leaver and the average employee; they only contain the attributes that showed a significant difference between the two.

Bar chart representing attributes of an average leaver — script can be found here
Bar chart representing attributes of an average employee — script can be found here
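Profiles like these boil down to a groupby over the target attribute. A minimal sketch with placeholder column names:

```python
import pandas as pd

df = pd.read_csv("attrition.csv")  # illustrative file name

# Average numeric attributes for leavers ("Yes") vs. stayers ("No");
# the column names follow the attributes discussed above and are assumptions
cols = ["Age", "MonthlyIncome", "StockOptionLevel"]
profile = df.groupby("Attrition")[cols].mean()
print(profile)
```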

The conclusion from the above analysis would be the following: the person most likely to leave is under 30, single and works overtime. Stock option level also determines the likelihood of employee attrition, as an employee with no option to purchase company stocks probably has a lower interest in its overall success than those who can. Money really plays a big role here, as the income of an average employee is almost double that of an average leaver. The most common job roles of leavers — Laboratory Technician, Sales Executive and Sales Representative — are the ones with the lowest job level.

What’s next?

The project is by no means finished; the methods should undergo further optimization to increase the accuracy (algorithm tuning, excluding attributes, trying other methods etc.). Another upgrade would be to use the Python Weka Wrapper, a Python library which lets you work with Weka directly from Python. Unfortunately, I found out about this option late in the project, and the wrapper is only a beta version that is not functioning properly on my computer, but after I defeat the wall of errors I will definitely dive into it, as it is a great opportunity to continue with the project while learning Python. The desired final version would load data from a database of the employees, perform the ML part and return the attribute importance and the profile of the average leaver.
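For a taste of what that workflow could look like, here is a hedged sketch following the python-weka-wrapper documentation (untested here, given the beta issues mentioned above; the ARFF file name is a placeholder):

```python
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random

jvm.start()

# Load the dataset in ARFF format and mark the last column as the class
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("attrition.arff")
data.class_is_last()

# Ten-fold cross-validation of a Weka Random Forest, driven from Python
cls = Classifier(classname="weka.classifiers.trees.RandomForest")
evaluation = Evaluation(data)
evaluation.crossvalidate_model(cls, data, 10, Random(1))
print(evaluation.summary())

jvm.stop()
```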

Digital Academy has exceeded my expectations (and I expected quite a lot), especially in the progress I have made over the period of two months. I started the Academy as a complete beginner in data analysis. Now I feel much more confident in my IT abilities: I learnt SQL basics, was able to put together a working ML method in Python and grasped the basic concepts of machine learning classification. In the near future I plan to broaden my knowledge in the field (focused on Python and SQL) and find a job where I can use these newly acquired skills.

I would like to thank my mentor, who introduced me to the ML software Weka, helped me understand the methods and supported me along the way. He is very good at explaining things simply, a quick thinker who always found a solution to a problem, an expert in his field, and on top of that he has a good sense of humor.

I would also like to thank Czechitas and all the mentors, lectors, partners and other people who were involved in the Digital Academy course and made this unique learning opportunity possible. The atmosphere was always friendly, supportive, motivating and so unique that I have never experienced anything like it anywhere else.

Work on the final project was hard and sometimes frustrating, and the Academy was not a piece of cake, especially when you want to attend all the classes (as a full-time employed person, say goodbye to 8 hours of sleep a night for two months), but I enjoyed every bit of it and would definitely recommend attending to all the girls who are still hesitating.

Sources:

Stack Overflow — very useful, has a solution for (almost) every problem you can face in Python

Matplotlib documentation — the Python library documentation I used the most

Anaconda — an important source of Python libraries

Weka documentation — documentation for the ML software I used

A lot of sources covering ML — this, this, this, this, this, this, this, this
