Employees don’t leave the company; they leave their managers

Nayana Kumari
Web Mining [IS688, Spring 2021]
8 min read · Apr 29, 2021

A ‘K-means’ clustering approach to identify attrition

Photo: Jeffery Hamilton/Getty Images

It is not pleasant to hear ‘I Quit,’ but companies can use evidence and pattern analysis to understand why it happens. Here is an analysis of employee churn, aimed at helping companies understand attrition patterns and predict who might leave in the future. A good workplace always looks for ways to improve employee retention and satisfaction.

Almost all companies face higher attrition at some point in their lifetime. While attrition is a regular phenomenon, and even helps an organization bring in talent from other companies without ramping up the workforce, various factors contribute to this churn. I will apply my learning of clustering and data-point distance measurements to identify a few critical factors that drive a high attrition rate.

For this analysis, I will use a dataset that contains employee attrition along with employee profiles.

The codebase for this analysis can be found here:

https://github.com/nt27web/WebMining-Clustering

Data Exploration

Let’s look at the data at hand. The data is in CSV format.

Python, with the following libraries, will be used for this analysis:

from IPython.display import display
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

pandas: for data extraction, cleansing, and manipulation

sklearn: for the k-means and related calculations

matplotlib: to plot the graphs of various datasets and calculations

Data Attributes:

First, I extracted the CSV file into a pandas DataFrame:

data = pd.read_csv('HR-Employee-Attrition.csv')

display(data.shape)

The shape of the dataset looks like this: (1470, 35), i.e., 1470 rows and 35 columns.

Let’s check for the null and empty values:

display(data.isnull().sum())

All the columns are fully populated, with no null or empty cells, which is good.

After checking some of the categorical attributes for their unique values and the statistics of the numerical attributes, I found the columns below appropriate for the analysis.

Columns: Age, DailyRate, and EducationField.

display(data['Age'].describe())
display(data['DailyRate'].describe())
display(data['EducationField'].unique())

Column Stats and values:

Along with the above columns, the columns below are distributed on a similar scale, which will help identify the areas of attrition.

Columns: YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.

display(data['YearsAtCompany'].describe())
display(data['YearsInCurrentRole'].describe())
display(data['YearsSinceLastPromotion'].describe())
display(data['YearsWithCurrManager'].describe())

Result:

Some fields are categorical with only two or three unique values, e.g., Gender and MaritalStatus. We will have to remove them from the dataset, as they will not help identify the clusters clearly.
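Checking the cardinality of such fields is a quick display call (Gender and MaritalStatus are columns in this dataset):

display(data['Gender'].unique())
display(data['MaritalStatus'].unique())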

So, I have kept the categorical column with the richest set of values: EducationField.

Since k-means does not work on categorical attributes, I will have to encode this field numerically for the analysis.

display(len(data))

Number of records: 1470

The final set of attributes chosen for the analysis:

‘Attrition’, ‘DailyRate’, ‘EducationField’, ‘YearsAtCompany’, ‘YearsInCurrentRole’, ‘YearsSinceLastPromotion’, ‘YearsWithCurrManager’

I created my dataset with the above-mentioned fields:

f_data = pd.DataFrame(data, columns=['Attrition', 'DailyRate', 'EducationField',
                                     'YearsAtCompany', 'YearsInCurrentRole',
                                     'YearsSinceLastPromotion', 'YearsWithCurrManager'])

Now let us check the dataset:

display(f_data.head())

I also filtered the dataset to the rows where Attrition is ‘Yes,’ since I am interested only in the records where the employee has left the company.

m_data = f_data[f_data['Attrition'] == 'Yes']
f_data = m_data.drop(['Attrition'], axis=1)
display(len(f_data))

Total number of records after filtering: 237

I will encode the categorical field EducationField, since all the columns need to be numeric vectors for k-means.

X = f_data
y = f_data['EducationField']
le = LabelEncoder()
X['EducationField'] = le.fit_transform(X['EducationField'])
y = le.transform(y)

Result:

I need to bring the features onto a common scale so that no single column dominates the k-means distance calculation, and so the scatter plots are easy to draw.

cols = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=cols)

Result:

As you may have noticed, the values in the “YearsXXX” columns (YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager) are now decimals, since every column has been rescaled to the same 0–1 range.

Before we find the k-means clusters, we need to supply a value of k: the number of clusters we are interested in. Several methods are available for finding the optimal k; the most frequently used is called the “elbow method.”

Elbow Method to find k-value

If we plot the number of clusters along the x-axis and the clustering error (the within-cluster sum of squared distances, or inertia) along the y-axis, we get a curve showing how the error falls as the number of clusters grows. This curve has a point where its trajectory changes: up to that point, the error declines steeply as clusters are added; beyond it, the decline flattens out. That point is called the ‘elbow,’ and we take the number of clusters at that point. We also test some values above and below that number to find the optimal k.

Let’s take a look at the elbow plot:
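The plot can be generated with a few lines, reusing the scaled X from above (the 1–10 range of k values is my choice for illustration):

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()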

As you can see from this graph, the optimum number is 2. We will also test the values 3, 4, and 5 and check the accuracy for each number of clusters (k).

k_means = KMeans(n_clusters=2, random_state=0)
y_k_means = k_means.fit_predict(X)
labels = k_means.labels_
# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f} %'.format((correct_labels * 100) / float(y.size)))
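The same check can be repeated for the other candidate values by wrapping it in a loop (a sketch, reusing the X and y defined above):

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, random_state=0)
    k_labels = km.fit_predict(X)
    correct = sum(y == k_labels)
    print('k=%d: %d out of %d matched (%.2f %%)' % (k, correct, y.size, correct * 100 / float(y.size)))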

When K=3

The accuracy is very low, so a k value of 3 is not optimal.

Let’s take a look at the scatter plot with YearsAtCompany on the x-axis and YearsWithCurrManager on the y-axis.

plt.scatter(X['YearsAtCompany'], X['YearsWithCurrManager'], c=y_k_means, cmap='rainbow')
plt.show()

Clearly, it’s tough to identify the clusters, and the number of errors is high. This also means that elements belonging to the same cluster are not located close together geometrically.

When K=4

This is a decent accuracy value considering the number of rows in the dataset, but we will validate it using the scatter plot again.

Like the earlier scatter plots, it’s tough to identify the clusters. So, we will continue to test other k values.

When K=5

The accuracy is lower than for k=4, and the scatter plot below suggests the same.

Finally, we will test with k = 2
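Refitting with two clusters, as in the first snippet, gives the labels used in the plots below:

k_means = KMeans(n_clusters=2, random_state=0)
y_k_means = k_means.fit_predict(X)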

Clearly, we found that the accuracy level is the highest among the values tested around the ‘elbow.’

Let us validate the same with a scatter plot using different columns on the x and y-axis.

X = DailyRate, Y = YearsAtCompany

As we can see, the data points are scattered all over, and we cannot conclusively identify which area has had higher attrition.
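Each of these plots follows the same pattern as the earlier one; for this pair of columns, for example:

plt.scatter(X['DailyRate'], X['YearsAtCompany'], c=y_k_means, cmap='rainbow')
plt.show()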

Let us validate the same with a scatter plot using different columns on the x and y-axis.

X = EducationField, Y = YearsAtCompany

As you can see, another inconclusive scatter plot without a clear segment/cluster.

Let us validate the same with a scatter plot using different columns on the x and y-axis.

X = YearsAtCompany, Y = YearsWithCurrManager

Clearly, k = 2 gives us the best result in terms of accuracy and clear clusters (represented by two distinct colors). The YearsAtCompany vs. YearsWithCurrManager plot also gives us reasonably clear segments.

As you can see, the first segment (the cluster in purple) represents employees who left the organization after a long tenure with the company, having worked under the same manager for a long period.

Employees who stayed with the company equally long but regularly worked under different managers have not left the company.

The second segment (the cluster in red) consists of employees who left the organization in their early years at the company, even though they had worked with their managers for fewer years or changed managers regularly.

People who stayed with the same manager for a longer time, even without a long overall tenure, have mostly stayed with the company.

Based on this analysis, an organization may identify areas of improvement as follows.

An organization may choose to focus on employees in their early days at the company and pay extra attention to whether they are growing and feel accomplished.

Similarly, employees who have been with the company for a long period might need a change in role, project, or manager.

Conclusions

This analysis takes a simplistic approach to clustering similar data points using k-means. K-means uses the Euclidean (geometric) distance between data points, visualized here on two-dimensional plots. This is a decent mechanism when the data points have many numerical attributes and the categorical attributes are easy to encode numerically. The analysis also uses the elbow method to identify the optimal k. Other methods are available for finding the optimal k (the silhouette coefficient, etc.), which can be used for larger datasets. There are also other clustering techniques, such as agglomerative clustering (which can work with cosine similarity). This analysis could be expanded using these methods, and it would be interesting to see how the results and clusters vary.
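As a sketch of those two extensions, reusing the scaled X from above (note that scikit-learn versions before 1.2 spell the AgglomerativeClustering distance parameter affinity rather than metric):

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# silhouette coefficient as an alternative way to choose k
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, random_state=0)
    k_labels = km.fit_predict(X)
    print('k=%d, silhouette: %.3f' % (k, silhouette_score(X, k_labels)))

# agglomerative clustering with cosine distance ('ward' linkage would require Euclidean)
agg = AgglomerativeClustering(n_clusters=2, metric='cosine', linkage='average')
agg_labels = agg.fit_predict(X)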

Limitations

The dataset had only about 1470 total records, of which 237 were records where employees had left the organization. A higher number of records would likely yield higher accuracy and more uniformly distributed values in the ‘YearsXXX’ columns. The dataset would then be more representative and would help achieve more comprehensive data segments or clusters.

With a larger dataset, we could also explore more clusters and try different attribute pairs in the scatter plots. This could give us more segments/groups of employees who have left the organization.

