Data to predict which employees are likely to leave

No Data Scientist is the Same — part 2

Jurriaan Nagelkerke
Cmotions
Mar 4, 2022

This article is part of our series on how different types of data scientists build similar models differently. No two people are the same, and neither are any two data scientists. The circumstances under which a data challenge must be handled also change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In this series we explore the approaches of our four data scientists: Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. Their task is to build a model that predicts whether employees of a company, STARDATAPEPS, will look for a new job. Given their distinct profiles, discussed in the first blog, you can already imagine that their approaches will be quite different.

In the previous blog we introduced our data science rock stars. In the next articles they will, all in their own way, predict which employees are most likely to leave the company.

But before we start this journey, let’s take a quick look at the information we have available for this quest. The source data is available here (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists). In this notebook, we’ll download it directly from Kaggle so you can reproduce everything we do in the next blogs. We will also prepare the data for later use.

Loading the data directly from Kaggle

To get the dataset from Kaggle you need your Kaggle user name and API token. Both are included in the kaggle.json file, which you can download from your Kaggle account page. See https://www.kaggle.com/docs/api for details.

Replace [YOUR_PERSONAL_KAGGLE_USER_NAME] and [YOUR_PERSONAL_KAGGLE_KEY] in the cell below with your own credentials and run it to download the data directly from Kaggle. Of course, you can also download the files directly from the Kaggle website.
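The download step could look roughly like the sketch below. It assumes the official kaggle Python package is installed and that credentials are passed through the KAGGLE_USERNAME and KAGGLE_KEY environment variables; the function name download_dataset and the target folder data are our own choices for illustration, not necessarily what the original notebook used.

```python
import os

# Placeholders from the text: replace with the values from your kaggle.json.
KAGGLE_USER_NAME = "[YOUR_PERSONAL_KAGGLE_USER_NAME]"
KAGGLE_KEY = "[YOUR_PERSONAL_KAGGLE_KEY]"

# The Kaggle API client reads credentials from these environment variables.
os.environ["KAGGLE_USERNAME"] = KAGGLE_USER_NAME
os.environ["KAGGLE_KEY"] = KAGGLE_KEY

# Dataset identifier taken from the Kaggle URL mentioned above.
DATASET = "arashnic/hr-analytics-job-change-of-data-scientists"


def download_dataset(dataset: str = DATASET, path: str = "data") -> None:
    """Download the dataset as a zip file into `path` (hypothetical helper)."""
    # Imported inside the function because importing the kaggle package
    # already tries to authenticate, so credentials must be set first.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(dataset, path=path, unzip=False)


# download_dataset()  # uncomment once your credentials are filled in
```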

Now that we have downloaded the data, we can unzip it and explore it.
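A minimal sketch of the unzip-and-load step, using only the standard library and pandas. The file name aug_train.csv and the zip file name in the commented usage line are assumptions about what the download contains; adjust them to the files you actually received.

```python
import zipfile
from pathlib import Path

import pandas as pd


def unzip_and_load(zip_path, csv_name="aug_train.csv", out_dir="data"):
    """Extract a downloaded zip file and read one CSV from it (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return pd.read_csv(out_dir / csv_name)


# Assumed zip name; check the name of the file in your download folder.
# df = unzip_and_load("data/hr-analytics-job-change-of-data-scientists.zip")
# df.info()
```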

Data preparation

We make a few changes to the original data to suit our needs.

target

The column target indicates whether or not a data professional is open to changing jobs. This is also what our heroes will be predicting. Let’s make it an integer before we go on.
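The conversion can be sketched as follows, assuming target is stored as 0.0/1.0 floats without missing values. The toy rows are made up for illustration and target_to_int is a hypothetical helper name.

```python
import pandas as pd


def target_to_int(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the target column cast to integer."""
    out = df.copy()
    out["target"] = out["target"].astype(int)
    return out


# Made-up rows, only to show the effect of the cast.
toy = pd.DataFrame({"enrollee_id": [1, 2, 3], "target": [1.0, 0.0, 1.0]})
toy = target_to_int(toy)
print(toy.dtypes)
```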

city

Unfortunately, the dataset does not include city names, only city codes. To make model results easier to interpret, we’ve added a city name. We couldn’t find a source that lists US cities by their City Development Index. Instead, we ranked the cities in our data by city development index and matched them, by rank, with the Innovation Index dataset from StatsAmerica, which ranks metropolitan areas by innovation index, a somewhat similar metric. The name we assign is therefore unlikely to be the actual city behind a city code. Nevertheless, we prefer a real name in our examples over a city id that is impossible to interpret.

Next, we download the statsamerica.org city data. We pin xlrd==1.2.0 to be able to load the xlsx data into a pandas dataframe, since xlrd 2.0 and later no longer read xlsx files.

Next, we add the city names by matching the rank of each city in our data with the rank of each metropolitan area in the statsamerica.org data.
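The rank-based matching can be sketched like this. The columns city and city_development_index come from the Kaggle data; the columns metro_name and innovation_index, the new column city_name, and all values in the toy frames are assumptions made up for illustration, not the real StatsAmerica layout.

```python
import pandas as pd


def add_city_names(df: pd.DataFrame, metro: pd.DataFrame) -> pd.DataFrame:
    """Attach a metro name to each city code by matching ranks (hypothetical helper)."""
    # One row per city code, ranked from highest to lowest development index.
    cities = (
        df[["city", "city_development_index"]]
        .drop_duplicates("city")
        .sort_values("city_development_index", ascending=False)
        .reset_index(drop=True)
    )
    cities["rank"] = cities.index + 1

    # Metro areas ranked from highest to lowest innovation index.
    metro_ranked = metro.sort_values("innovation_index", ascending=False).reset_index(drop=True)
    metro_ranked["rank"] = metro_ranked.index + 1

    # Match on rank, then bring the name back to every row of the original data.
    lookup = cities.merge(metro_ranked[["rank", "metro_name"]], on="rank", how="left")
    lookup = lookup[["city", "metro_name"]].rename(columns={"metro_name": "city_name"})
    return df.merge(lookup, on="city", how="left")


# Made-up toy data.
toy = pd.DataFrame({
    "city": ["city_103", "city_21", "city_103", "city_16"],
    "city_development_index": [0.92, 0.624, 0.92, 0.91],
})
metro = pd.DataFrame({
    "metro_name": ["Metro A", "Metro B", "Metro C"],
    "innovation_index": [110.0, 130.0, 90.0],
})
toy_named = add_city_names(toy, metro)
print(toy_named)
```

Cities without a matching rank in the metro data end up with a missing city_name, because both merges are left joins.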

indicator for relevent_experience

The data also contains a textual feature indicating whether an employee has relevant work experience. Let’s turn that into a dummy (indicator) field right away:
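A sketch of the indicator, assuming the text values are "Has relevent experience" and "No relevent experience" (check the actual labels with value_counts before relying on this). The new column name relevent_experience_ind is our own choice for illustration.

```python
import pandas as pd


def add_experience_indicator(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 0/1 column that flags rows with relevant experience."""
    out = df.copy()
    # Assumed label for the positive class; every other value maps to 0.
    has_experience = out["relevent_experience"] == "Has relevent experience"
    out["relevent_experience_ind"] = has_experience.astype(int)
    return out


# Made-up rows.
toy = pd.DataFrame({
    "relevent_experience": [
        "Has relevent experience",
        "No relevent experience",
        "Has relevent experience",
    ]
})
toy = add_experience_indicator(toy)
print(toy)
```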

experience_num

Another feature describes the length of work experience. This feature is currently textual because, besides numbers of years, it contains the values ‘<1’ and ‘>20’. To be able to use it as a numeric feature, we create a new feature, replacing ‘>20’ with 22 and ‘<1’ with 0.
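The replacement described above can be sketched as follows, assuming the source column is called experience and that missing values should stay missing. The helper name and toy values are made up for illustration.

```python
import pandas as pd


def add_experience_num(df: pd.DataFrame) -> pd.DataFrame:
    """Add experience_num: '<1' becomes 0, '>20' becomes 22, other values are parsed as numbers."""
    out = df.copy()
    as_text = out["experience"].replace({">20": "22", "<1": "0"})
    # Missing or unparseable values become NaN instead of raising an error.
    out["experience_num"] = pd.to_numeric(as_text, errors="coerce")
    return out


# Made-up rows, including one missing value.
toy = pd.DataFrame({"experience": ["<1", "5", ">20", None, "12"]})
toy = add_experience_num(toy)
print(toy)
```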

Check out our data and save

After these changes to the original dataset, the data we will work with in the next blogs looks like this:

This set has 19,158 records and various columns that hopefully can help to predict who will be open to a job change. Most of the information relates to the education the data scientists have completed and the company where they currently work.

Let’s save the data for later use.
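Saving could be as simple as the sketch below. The file name df_prepared.csv and the helper name are assumptions; writing without the index keeps the file free of an extra unnamed column when it is read back.

```python
import pandas as pd


def save_prepared(df: pd.DataFrame, path: str = "df_prepared.csv") -> str:
    """Write the prepared data to a CSV file and return the path (hypothetical helper)."""
    df.to_csv(path, index=False)
    return path


# save_prepared(df)  # df is the prepared dataframe from the steps above
```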

For your convenience, we’ve also uploaded this data to our Azure blob storage, so you can immediately load the prepared data in our later blogs. You will load the prepared data like this:
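A sketch of the loading step, assuming the prepared data is a CSV file. The blob storage URL is not reproduced in this text, so PREPARED_DATA_LOCATION is a placeholder you need to fill in; pandas can read from a URL as well as from a local path.

```python
import pandas as pd

# Placeholder: the actual blob storage URL is not given here.
PREPARED_DATA_LOCATION = "<URL-or-path-to-prepared-csv>"


def load_prepared(location: str = PREPARED_DATA_LOCATION) -> pd.DataFrame:
    """Read the prepared data from a URL or local path (hypothetical helper)."""
    return pd.read_csv(location)


# df = load_prepared()  # works once the placeholder is replaced
```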

Now that we have introduced you to our data science team and to the data, let the modeling begin!

Jurriaan Nagelkerke
Cmotions

Data Science and Advanced Analytics Consultant @ Cmotions