Data to predict which employees are likely to leave
No Data Scientist is the Same — part 2
This article is part of our series about how different types of data scientists build similar models differently. No human is the same, and therefore no data scientist is the same either. And the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we will explore the four different approaches of our data scientists — Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are presented with the task of building a model to predict whether employees of a company — STARDATAPEPS — will look for a new job or not. Based on their distinct profiles discussed in the first blog, you can already imagine that their approaches will be quite different.
In the previous blog we introduced our data science rock stars. In the next articles they will, all in their own way, predict which employees are most likely to leave the company.
But before we start this journey, let's take a quick look at what kind of information we actually have available for this quest. The source data is available here (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists). In this notebook, we'll download it directly from Kaggle so you can reproduce all we do in the next blogs. We will also do some preparation on this data.
Loading the data directly from Kaggle
To get the dataset from Kaggle you need your Kaggle username and API token. These are included in the kaggle.json file, which you can download from your Kaggle account page. See https://www.kaggle.com/docs/api for details.
!pip install -q kaggle
Replace [YOUR_PERSONAL_KAGGLE_USER_NAME] and [YOUR_PERSONAL_KAGGLE_KEY] in the cell below with your own credentials and run it to download the data directly from Kaggle. Of course, it is also possible to download the files directly from the Kaggle website.
%sh
export KAGGLE_USERNAME=[YOUR_PERSONAL_KAGGLE_USER_NAME]
export KAGGLE_KEY=[YOUR_PERSONAL_KAGGLE_KEY]
kaggle datasets download arashnic/hr-analytics-job-change-of-data-scientists
Now that we have downloaded the data, we can unzip it and explore it.
# import required modules
import zipfile
import pandas as pd

# open zipped dataset
with zipfile.ZipFile('/databricks/driver/hr-analytics-job-change-of-data-scientists.zip') as z:
    # open the csv file in the dataset
    with z.open("aug_train.csv") as f:
        # read the dataset
        df = pd.read_csv(f)

# display dataset
print(df.columns)
print(df.head())
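The zip-and-read pattern above can be tried without the Kaggle download. Here is a minimal, self-contained sketch using a tiny CSV we create ourselves (the file names and values are made up for illustration):

```python
import io
import zipfile
import pandas as pd

# build a tiny zip archive containing one csv file
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('train.csv', 'city,target\ncity_1,1\ncity_2,0\n')
buf.seek(0)

# read the csv straight from the archive, just like aug_train.csv above
with zipfile.ZipFile(buf) as z:
    with z.open('train.csv') as f:
        demo = pd.read_csv(f)

print(demo.shape)  # (2, 2)
```

The same `ZipFile.open` handle works as a file object for `pd.read_csv`, so there is no need to extract the archive to disk first.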
Data preparation
We apply some preparation steps to the original data to meet our needs.
target
The column target indicates whether or not a data professional is open to changing jobs. This is also what our heroes will be predicting. Let's make it an integer before we go on.
# Convert target to integer:
df['target'] = df['target'].astype('int32')
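Since this is the column our heroes will predict, it is also worth a quick look at the class balance after the cast. A minimal sketch on a made-up stand-in for the target column (the values here are invented, not from the dataset):

```python
import pandas as pd

# toy stand-in for the float-typed target column (values made up)
toy = pd.DataFrame({'target': [1.0, 0.0, 0.0, 1.0, 0.0]})
toy['target'] = toy['target'].astype('int32')

# share of employees open to a job change (1) vs. not (0)
balance = toy['target'].value_counts(normalize=True)
print(balance)
```

`value_counts(normalize=True)` returns the relative frequencies, which makes class imbalance immediately visible.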
city
Unfortunately, the city names are not included in this dataset. To be able to interpret model results, we've added city names ourselves. We couldn't find a source that provides the names of US cities by their City Development Index. Therefore, we've ranked the cities in the data on the City Development Index and matched them to the Innovation Index dataset provided by StatsAmerica, which ranks metropolitan areas by innovation index, a somewhat similar metric. It's unlikely that this maps each city code to its actual city name. Nevertheless, we prefer a real name in our examples over a city id that is impossible to interpret.
# create dataframe with cities and a rank based on the city development index
cities = df.groupby('city')['city_development_index'].mean().reset_index()
cities['rank'] = cities['city_development_index'].rank(ascending=False, method='first').astype(int)
cities.head()
Next, we download the statsamerica.org city data. We use xlrd==1.2.0 to be able to load the xlsx data into a pandas dataframe.
!pip install xlrd==1.2.0

# download innovation index values of metropolitan areas and prepare for matching to cities
citydata = pd.read_excel('http://www.statsamerica.org/ii2/docs/downloads/Metros.xlsx', sheet_name='Data')
citydata = citydata[citydata.code_description=='Headline Index'][['description','index value']].rename(columns={'description':'city name'})
citydata['rank'] = citydata['index value'].rank(ascending=False, method='first').astype(int)
citydata = citydata[['rank','city name']]
citydata.head()
Next, we add the city names based on the rank of the cities, both in our data as well as in the statsamerica.org data.
# create dataframe with city id and city name matched on index ranks and add city name to our data
city_name = cities.join(citydata.set_index('rank'), on='rank')[['city','city name']]
df = df.join(city_name.set_index('city'), on='city')
df[['city','city name','city_development_index']].head()
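The rank-based matching can be hard to follow at full scale. The same pattern on a tiny made-up example (city codes, scores, and names are all invented for illustration):

```python
import pandas as pd

# toy versions of the two tables: coded cities with a score, and named ranks
left = pd.DataFrame({'city': ['city_a', 'city_b', 'city_c'],
                     'score': [0.9, 0.5, 0.7]})
names = pd.DataFrame({'rank': [1, 2, 3],
                      'city name': ['Alpha', 'Beta', 'Gamma']})

# rank the coded cities on their score, then join the name table on that rank
left['rank'] = left['score'].rank(ascending=False, method='first').astype(int)
matched = left.join(names.set_index('rank'), on='rank')
print(matched[['city', 'city name']])
```

The highest-scoring code gets rank 1 and therefore the top-ranked name, and so on down the list; `method='first'` guarantees unique integer ranks even when scores tie, which is what makes the join one-to-one.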
indicator for relevent_experience
The data also contains a textual feature indicating whether an employee has relevant work experience. Let’s turn that into a dummy (indicator) field right away:
df['relevent_experience'].value_counts()
# Has relevent experience    13792
# No relevent experience      5366
# Name: relevent_experience, dtype: int64

# Convert relevent_experience to a dummy:
df['ind_relevent_experience'] = 0
df.loc[df['relevent_experience'] == 'Has relevent experience', 'ind_relevent_experience'] = 1
df.drop(columns=['relevent_experience'], inplace=True)
df['ind_relevent_experience'] = df['ind_relevent_experience'].astype('int32')
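The three-step conversion above can also be written as a single boolean comparison cast to int. A sketch on a few made-up rows (not from the dataset):

```python
import pandas as pd

# toy column with the same two levels as relevent_experience
toy = pd.DataFrame({'relevent_experience':
                    ['Has relevent experience', 'No relevent experience',
                     'Has relevent experience']})

# equivalent one-liner: the comparison yields booleans, the cast makes them 0/1
toy['ind_relevent_experience'] = (
    toy['relevent_experience'] == 'Has relevent experience'
).astype('int32')
print(toy['ind_relevent_experience'].tolist())  # [1, 0, 1]
```

Both approaches give the same dummy; the one-liner just avoids the intermediate zero column.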
experience_num
Another feature captures the length of working experience. This feature is currently textual, with special values '<1' and '>20'. To be able to use it as a numeric feature, we create a new feature, replacing '>20' with 22 and '<1' with 0.
# Creating a numeric version of experience:
df['experience_num'] = df.experience

# replacing special chars (like >, <) with numbers
df.experience_num.replace('>20','22',inplace=True)
df.experience_num.replace('<1','0',inplace=True)
df['experience_num'] = df['experience_num'].astype('float32')
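The replacement above can be checked on a tiny made-up Series before trusting it on the full column (the values here are invented for illustration):

```python
import pandas as pd

# toy experience column containing the two special values
exp = pd.Series(['<1', '5', '>20', '12'])

# map the open-ended categories to numbers, then cast to float
exp_num = exp.replace({'>20': '22', '<1': '0'}).astype('float32')
print(exp_num.tolist())  # [0.0, 5.0, 22.0, 12.0]
```

Passing a dict to `replace` handles both special values in one call; the cast would raise an error if any non-numeric string remained, which makes it a useful sanity check.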
Check out our data and save
After these changes to the original dataset, the data we will work with in the next blogs looks like this:
df.head()
df.shape
Out[18]: (19158, 16)
This set has 19,158 records and various columns that hopefully can help to predict who will be open to a job change. Most of the information relates to the education the data scientists have followed and the company where they currently work.
Let’s save the data for later use.
df.to_csv('df_prepared.csv')
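One thing to be aware of when saving: `to_csv` writes the row index by default, which reappears as an extra 'Unnamed: 0' column when the file is read back. A small sketch on a made-up frame (whether the prepared file here includes the index depends on how it was written):

```python
import io
import pandas as pd

# toy frame to show the csv round-trip (contents are made up)
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# default: the row index becomes an extra 'Unnamed: 0' column on re-read
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())  # ['Unnamed: 0', 'a', 'b']

# with index=False, the frame round-trips cleanly
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(no_index.columns.tolist())  # ['a', 'b']
```

Calling `to_csv()` without a path returns the csv text as a string, which is handy for quick round-trip checks like this.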
For your convenience, we've also uploaded this data to our Azure blob storage, so you can immediately load this prepared data in our later blogs. You will load the prepared data like this:
df_prep = pd.read_csv('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/df_prepared.csv')
df_prep.head()
Now that we introduced you to our data science team and to the data, let the modeling begin!
This article is part of our No Data Scientist Is The Same series. The full series is written by Anya Tonne, Jurriaan Nagelkerke, Karin Gruijs-Vodde and Tom Blanke. The series is also available on theanalyticslab.nl.
An overview of all articles on Medium within the series:
- Introducing our data science rock stars
- Data to predict which employees are likely to leave
- Good model by default using XGBoost
- Hyperparameter tuning for hyperaccurate XGBoost model
- Beat dirty data
- The case of high cardinality kerfuffles
- Guide to manage missing data
- Visualise the business value of predictive models
- No data scientist is the same!
Do you want to do this yourself? Please feel free to download the notebook from our GitLab page.