HDSC August ’21 Premiere Project Presentation: Human Resources Analytics

A Project by Team Keras

Published in Hamoye Blog, Oct 8, 2021

HR analytics is changing the way human resources departments operate, leading to higher efficiency and better results overall. HR departments have used analytics for years; however, the collection, processing, and analysis of data have been largely manual. Given the nature of human resources dynamics and HR KPIs, this manual approach has been constraining. It is therefore surprising that HR departments discovered the utility of machine learning so late in the game. Data analysts can derive insights from the data, and data scientists and engineers can build predictive models on top of it. Machine learning can automate much of this work, typically with better accuracy and speed than manual analysis.

Aims and Objectives

The aim of this project is to build a machine learning model that predicts whether an employee should be promoted. The model should also be deployed so that it can make live predictions. This can be achieved by training the model on employees’ past and present performance records, along with demographic data.

Flow Process

Data Sourcing: Finding a near-perfect dataset for this project.
Data Preparation: Wrangling and cleaning the data and handling outliers, followed by exploratory data analysis to derive meaningful insights from the dataset.
Model Training: The cleaned data is fed into the model so that it can learn the patterns in the dataset.
Model Evaluation and Validation: The trained model is used to make predictions, and its performance is then evaluated and validated.
Model Deployment: The validated final model is deployed online so that anyone can make live predictions.

Data Source

We found a near-perfect dataset for this problem on Kaggle: https://www.kaggle.com/bhrt97/hr-analytics-classification.

Data Preparation

As expected, the data has two columns containing nulls: “education” and “previous_year_rating.” These columns were cleaned using simple, carefully reasoned rules. The plot below also makes it clear that no variable can be said to have outliers that matter for this project. The reasoning is illustrated by the boxplot of the “age” variable, which flags the ages between 55 and 60 as outliers. However, the retirement age in some countries is 60 years and not any younger; this is true for Nigeria, at least. So we cannot treat these ages as true outliers, and we therefore keep all of these observations.
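The cleaning and outlier check described above can be sketched roughly as follows. The article does not say how the nulls were filled, so the mode for “education” and the median for “previous_year_rating” are assumptions, as are the helper names and the toy values. The 1.5 × IQR rule is the usual boxplot whisker convention.

```python
from statistics import median, mode, quantiles


def fill_missing(rows, column, strategy):
    """Return copies of `rows` with None in `column` replaced.

    Assumed strategies (not stated in the article): "mode" for
    categorical columns, "median" for numeric ones.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def iqr_fences(values, k=1.5):
    """Lower and upper boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy records, invented for illustration only.
rows = [
    {"education": "Bachelor's", "previous_year_rating": 3},
    {"education": None, "previous_year_rating": None},
    {"education": "Master's & above", "previous_year_rating": 5},
    {"education": "Bachelor's", "previous_year_rating": 4},
]
cleaned = fill_missing(rows, "education", "mode")
cleaned = fill_missing(cleaned, "previous_year_rating", "median")

# Toy ages: the fences flag 58 even though it is a plausible working age.
ages = [25, 28, 30, 32, 33, 35, 36, 38, 40, 58]
low, high = iqr_fences(ages)
flagged = [age for age in ages if age < low or age > high]
```

In this toy run the rule flags the oldest employee, which mirrors the situation in the article: the statistical rule marks the value, but domain knowledge about retirement age argues for keeping it.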

The “countplot” below shows that the most populous department is “Sales & Marketing,” which houses most of the male employees, while most of the female employees are in the “Operations” department.

The “countplot” below shows that most of the promoted employees have Bachelor’s degrees. This result corroborates that the training dataset is a near-perfect dataset for this project.

Model Training, Evaluation, and Validation

To build the model, we used a correlation heatmap to inspect how much each variable contributes to the promotability of potential promotees.

The KPI is the major determining factor, with a contribution level of about 22%. However, there is a danger of multicollinearity because of the high correlation between “age” and “length_of_service”, and between “KPI” and “previous_year_rating”. Hence, the final set of valid features was reduced from 12 to 10.
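One way to screen for the kind of multicollinearity described above is to compute pairwise correlations and flag pairs above a cutoff. The sketch below is only an illustration: the 0.8 threshold, the helper names, the “other_feature” column, and the toy values are all assumptions, and the article does not state which two features were actually dropped.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold):
    """Column-name pairs whose absolute correlation exceeds `threshold`."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Toy values, invented for illustration only.
columns = {
    "age": [25, 30, 35, 40, 45],
    "length_of_service": [1, 3, 6, 9, 12],
    "other_feature": [2, 1, 3, 1, 2],  # hypothetical unrelated column
}
pairs = correlated_pairs(columns, threshold=0.8)  # 0.8 is an assumed cutoff

# Keep the first feature of each flagged pair and drop the second.
to_drop = {second for _, second in pairs}
kept = [name for name in columns if name not in to_drop]
```

Which member of a correlated pair to drop is a judgment call; here the second one is dropped purely for simplicity.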

A base CatBoost Classifier was then trained, giving a recall of 53%, which corresponds to 446 False Negatives (FN) and 260 False Positives (FP).
This result falls well short of our expectations. With a recall this low, the model would miss nearly half of the employees who merit promotion, which would adversely affect any company or firm adopting this solution.
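For reference, recall is the share of actual positives that the model finds, TP / (TP + FN). The short sketch below shows the relationship using the figures reported for the base model. The true-positive count is not given in the article; the value computed here is only an estimate implied by the rounded 53% recall, and the function names are hypothetical.

```python
def recall(true_positives, false_negatives):
    """Share of actual positives the model finds: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def implied_true_positives(recall_value, false_negatives):
    """Solve recall = TP / (TP + FN) for TP."""
    return recall_value * false_negatives / (1 - recall_value)


# Reported for the base model: recall of 53% with 446 false negatives.
# The TP count below is an estimate, not a figure from the article.
tp_estimate = round(implied_true_positives(0.53, 446))
check = recall(tp_estimate, 446)
```

Because recall ignores false positives entirely, it is usually read alongside the FP count, as the article does.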

Therefore, the dataset was resampled using oversampling and undersampling techniques, and this time a Random Forest Classifier was trained. This gave a recall of 65%, corresponding to 250 FN and 456 FP: a gain of 12 percentage points in recall, with about 44% fewer false negatives than the base model, at the cost of more false positives.
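The article does not name the specific resampling techniques used. As a minimal sketch of the general idea, the following applies plain random undersampling to the larger class and random oversampling (with duplicates) to the smaller one. The function name, the “is_promoted” label key, the target size, and the toy counts are all assumptions.

```python
import random


def rebalance(rows, label_key, target_per_class, seed=0):
    """Randomly resample each class to exactly `target_per_class` rows.

    Classes with more rows are undersampled without replacement; classes
    with fewer rows are oversampled by drawing duplicates with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= target_per_class:
            balanced.extend(rng.sample(members, target_per_class))
        else:
            extra = rng.choices(members, k=target_per_class - len(members))
            balanced.extend(members + extra)
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: 90 not promoted, 10 promoted (invented numbers).
rows = [{"is_promoted": 0}] * 90 + [{"is_promoted": 1}] * 10
balanced = rebalance(rows, "is_promoted", target_per_class=40)
counts = {
    label: sum(row["is_promoted"] == label for row in balanced)
    for label in (0, 1)
}
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class balance.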

Model Deployment

With a good model ready, Flask, a web framework written in Python, was used to deploy the model on Heroku in order to make live predictions. The app is hosted on Heroku at https://hr-analytics-classification.herokuapp.com, and the project is on GitHub at https://github.com/Oladimeji-Williams/HRAnalyticsClassification.
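A Flask prediction service of this kind can be quite small. The sketch below is not the team’s actual app: the “/predict” route, the JSON field names, and the placeholder `predict_promotion` rule are all hypothetical, and a real deployment would load the trained model at startup and call it instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_promotion(features):
    """Hypothetical stand-in for the trained classifier.

    Placeholder rule for illustration only; a real app would call the
    loaded model's predict method here.
    """
    meets_kpi = features.get("KPI", 0) == 1
    high_rating = features.get("previous_year_rating", 0) >= 4
    return int(meets_kpi and high_rating)


@app.route("/predict", methods=["POST"])
def predict():
    """Accept one employee's features as JSON and return a prediction."""
    features = request.get_json(silent=True)
    if not isinstance(features, dict):
        return jsonify(error="expected a JSON object of employee features"), 400
    return jsonify(promoted=predict_promotion(features))
```

Locally, an app like this can be started with the `flask run` command and exercised by sending a POST request with a JSON body to the route.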

Results

Generally, companies that use Key Performance Indicators (KPIs) as a metric for their employees take them seriously. This dataset contains a KPI variable, and as such, we expect that an employee who does not meet their target KPI should not be promoted, while an employee who meets their target KPI, along with other criteria, should be promoted. This is exactly what the model predicted.

Conclusion and Recommendation

The result of this analysis shows that machine learning can be used to determine whether a potential promotee should be promoted. The accuracy of the results depends largely on the quality of the dataset fed into the model. Therefore, to achieve better results, only correct data from the company’s records should be used to train the model.

Additionally, for faster operation, each employee’s data should be read directly from a database. This way, the system can generate a list of only those employees who should be promoted.
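The recommendation above could look something like the following sketch, which reads employee records from a database and returns only the predicted promotees. It uses an in-memory SQLite database for illustration; the table name, column names, sample rows, and the placeholder `predict_promotion` rule are all hypothetical stand-ins for a company database and the trained model.

```python
import sqlite3


def predict_promotion(kpi_met, previous_year_rating):
    """Hypothetical stand-in for the trained model's prediction."""
    return kpi_met == 1 and previous_year_rating >= 4


def employees_to_promote(connection):
    """Read every employee from the database and keep predicted promotees."""
    cursor = connection.execute(
        "SELECT employee_id, kpi_met, previous_year_rating "
        "FROM employees ORDER BY employee_id"
    )
    return [
        employee_id
        for employee_id, kpi_met, rating in cursor
        if predict_promotion(kpi_met, rating)
    ]


# In-memory database with invented rows, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, kpi_met INTEGER, previous_year_rating INTEGER)"
)
connection.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 0, 3), (3, 1, 4), (4, 1, 2)],
)
promotees = employees_to_promote(connection)
```

Reading in bulk like this avoids entering each employee’s details by hand, which is the speed-up the recommendation is after.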

Thank you for reading.

Team Members

Oladimeji Williams
Taiwo Olufunke Fashola
Oluyoyin Emmanuel
Nmeso Egwuekwe
Olajuwon Oyalude
Emmanuel Nnamaeka
Israel Okanlawon