Exploring Azure ML Studio on Employee Promotion Dataset
I was exploring Azure ML Studio Classic and decided to try it out on a real data set. In this post I explain the approach I used for the Employee Promotion data set.
This data set is available here: https://datahack.analyticsvidhya.com/contest/wns-analytics-hackathon-2018-1/
Data Visualization
We can visualize the data set once it is dragged into the designer.
When we look at the distribution of the output variable, is_promoted, we see that only 8.5% of employees get promoted. This is an imbalanced classification problem, with one class in the majority and the other in the minority.
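To make the imbalance concrete, here is a small sketch in plain Python. The labels are made-up toy data chosen to match the 8.5% figure, not rows from the actual data set, and `class_shares` is a helper name I introduce only for illustration. It shows why plain accuracy is misleading here and why a metric like the F1-score is more informative.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of rows belonging to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy labels (made up for illustration): 17 promoted out of 200 rows = 8.5%.
labels = [1] * 17 + [0] * 183
shares = class_shares(labels)

# A model that always predicts "not promoted" is correct on every majority row,
# so its accuracy equals the majority share even though it finds no promotions.
always_zero_accuracy = shares[0]

print(shares[1], always_zero_accuracy)
```

A classifier that never predicts a promotion would still look about 91.5% accurate on such data, which says nothing about how well it finds the minority class.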
Looking at the data, we find 7 categorical features: Department, Region, Education, Gender, Recruitment_channel, KPIs_met >80% and awards_won?, and 5 numerical features: no_of_trainings, age, previous_year_rating, length_of_service and avg_training_score.
Next we remove the employee_id column from the data set, since each employee has a unique id and it carries no information useful for prediction. After that we split the data into train and test parts using the Split Data module.
We use 70% of the data for training and 30% for testing, with a random seed of 0 so that the split can be reproduced.
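Outside the designer, the same two steps can be sketched in plain Python. This is only an illustration under my own assumptions: the rows are toy data, `drop_column` and `split_rows` are hypothetical helper names, and a seed of 0 here will not select the same rows as the Split Data module does.

```python
import random

def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy rows (made up for illustration), not the real data set.
rows = [
    {"employee_id": i, "age": 25 + i % 20, "is_promoted": int(i % 12 == 0)}
    for i in range(100)
]

rows = drop_column(rows, "employee_id")
train, test = split_rows(rows, train_fraction=0.7, seed=0)
print(len(train), len(test))
```

Because the shuffle uses its own seeded generator, calling `split_rows` again with the same seed returns the same split.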
Data Cleaning and Feature Engineering
The numerical column previous_year_rating has missing values, so it was cleaned using the MICE (Multivariate Imputation by Chained Equations) option of the Clean Missing Data module.
We choose the columns to clean with the column selector. The minimum missing value ratio is the smallest fraction of missing values a column must have for the cleaning to be applied, and the maximum missing value ratio is the largest fraction of missing values it may have. Only the selected columns whose fraction of missing values lies between these two bounds are cleaned. Besides MICE, the module offers other cleaning modes, such as replacing missing values with the mean, median or mode.
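The two ratio settings are easier to see in code. The sketch below only mimics the column filtering, not the MICE imputation itself. The column values are toy data, the helper names are mine, and treating both bounds as inclusive is an assumption.

```python
def missing_ratio(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def columns_to_clean(columns, selected, min_ratio=0.0, max_ratio=1.0):
    """Keep the selected columns whose missing ratio lies within the bounds.

    Assumption: both bounds are inclusive.
    """
    return [
        name for name in selected
        if min_ratio <= missing_ratio(columns[name]) <= max_ratio
    ]

# Toy columns (made up for illustration).
columns = {
    "previous_year_rating": [3, None, 5, 4, None, 3, 2, 4, 5, 3],  # 20% missing
    "education": ["a", "b", None, "a", "a", "b", "b", "a", "b", "a"],  # 10% missing
    "age": [30, 41, 28, 35, 52, 33, 29, 45, 38, 27],  # nothing missing
}
selected = ["previous_year_rating", "education", "age"]

wide = columns_to_clean(columns, selected, min_ratio=0.05, max_ratio=1.0)
narrow = columns_to_clean(columns, selected, min_ratio=0.05, max_ratio=0.15)
print(wide, narrow)
```

With a minimum of 0.05 the complete age column is skipped, and lowering the maximum to 0.15 also excludes the column with 20% missing values.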
Next I created two new columns by binning the age and length_of_service features. I tried different bin edges for both, but the ones below gave better results.
The age column was binned into 3 groups: less than 30, 31 to 40, and greater than 40. Similarly, the length_of_service column was binned into less than 1 year, 2 to 6 years, and greater than 6 years.
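Here is a minimal sketch of the binning described above. The description leaves the boundary values (exactly 30 for age, exactly 1 year for service) open, so I assume edges at 30 and 40 for age and at 1 and 6 for service, with each edge belonging to the lower bin. The names `AGE_EDGES`, `SERVICE_EDGES` and `bin_index` are mine.

```python
from bisect import bisect_left

# Assumed bin edges; a value equal to an edge falls into the lower bin.
AGE_EDGES = [30, 40]       # bins: <=30, 31-40, >40
SERVICE_EDGES = [1, 6]     # bins: <=1, 2-6, >6

def bin_index(value, edges):
    """Return the 0-based bin for a value, given sorted upper edges."""
    return bisect_left(edges, value)

ages = [25, 30, 31, 40, 41]
services = [1, 2, 6, 7]

age_bins = [bin_index(a, AGE_EDGES) for a in ages]
service_bins = [bin_index(s, SERVICE_EDGES) for s in services]
print(age_bins, service_bins)
```

The resulting bin numbers are labels rather than quantities, which is why the binned columns are later marked as categorical.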
Then I cleaned the categorical column education using MICE, and marked the new binned age and length_of_service columns as categorical using the Edit Metadata module.
After this I created 2 new features from avg_training_score and previous_year_rating, on the reasoning that promoted employees are likely to be above-average performers.
After this the numerical columns were normalized using the Z-score transformation, i.e. standardization, which rescales each column to have a mean of 0 and a standard deviation of 1.
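As a rough sketch of what standardization does, the snippet below computes the mean and standard deviation on training values and reuses them for held-out values. The numbers are toy data, the function names are mine, and using the population standard deviation (`pstdev`) is my assumption, since I have not checked which variant the module uses.

```python
from statistics import mean, pstdev

def fit_zscore(values):
    """Learn the mean and (population) standard deviation of a column."""
    return mean(values), pstdev(values)

def apply_zscore(values, mu, sigma):
    """Rescale values with previously learned parameters."""
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

# Toy training scores (made up for illustration).
train_scores = [50, 60, 70, 80, 90]
mu, sigma = fit_zscore(train_scores)

train_z = apply_zscore(train_scores, mu, sigma)
# Held-out values are scaled with the training parameters, not their own.
test_z = apply_zscore([70, 100], mu, sigma)
print(train_z, test_z)
```

Reusing the training mean and standard deviation for new rows keeps the test data from influencing the preprocessing.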
Because this is an imbalanced classification problem, I tried SMOTE. After applying SMOTE, the share of the positive class, i.e. employees who are promoted, in the training data increased from 8.6% to 22%.
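The idea behind SMOTE is to create synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. Below is a simplified sketch of that idea, not the module's implementation. The points are toy data, the function names are mine, and the SMOTE percentage of 200 is an assumption: it is not stated above, but adding two synthetic rows per promoted employee is consistent with the share moving from 8.6% to about 22%.

```python
import random

def smote(minority, percentage=200, k=2, seed=0):
    """Create synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    per_point = percentage // 100  # synthetic points per original point
    synthetic = []
    for i, point in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the other minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(point, p)))
        neighbours = others[:k]
        for _ in range(per_point):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the line to the neighbour
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(point, neighbour))
            )
    return synthetic

def positive_share_after(share_before, percentage):
    """Positive-class share once the synthetic rows are added."""
    positives = share_before * (1 + percentage / 100)
    return positives / (positives + (1 - share_before))

# Toy minority points (made up for illustration).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, percentage=200)
share = positive_share_after(0.086, 200)
print(len(new_points), round(share, 3))
```

Oversampling like this should only be applied to the training part, so that the test data keeps the original class distribution.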
The last step was to one-hot encode the categorical values, which was done using the Convert to Indicator Values module. After the conversion, the gender column becomes 2 columns, one indicating whether the employee is male and the other whether the employee is female. These 2 one-hot encoded columns are highly correlated, since each is the complement of the other, so we drop one of them using the Select Columns in Dataset module.
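The encoding and the redundant column can be illustrated in a few lines of plain Python. The rows are toy data, the helper names are mine, and the `gender-f` / `gender-m` column names are illustrative rather than the exact names the module produces.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}-{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

# Toy rows (made up for illustration).
rows = [
    {"age": 35, "gender": "f"},
    {"age": 29, "gender": "m"},
    {"age": 41, "gender": "m"},
]

encoded = one_hot(rows, "gender")
# With two categories each indicator is the complement of the other,
# so one of them carries no extra information and can be dropped.
reduced = drop_column(encoded, "gender-f")
print(encoded[0], reduced[0])
```

In general, a column with n categories needs only n - 1 indicator columns to be represented without redundancy.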
Model Training
After trying the different models available in Microsoft Azure ML Studio, the Two-Class Boosted Decision Tree came out with the highest F1-score.
The Train Model, Score Model and Evaluate Model modules are used to train the model and check its score.
To use the test data, the transformations fitted on the training data are applied to the test data using the Apply Transformation module.
The model gave an F1-score of 0.554 on the test split that was set aside from the data set at the start.
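For reference, the F1-score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels and predictions, which are not the model's actual output, and `f1_score` here is my own small helper, not a library function.

```python
def f1_score(actual, predicted):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predictions (made up for illustration).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

score = f1_score(actual, predicted)
print(round(score, 3))
```

In this toy case precision is 2/3 and recall is 1/2, giving an F1-score of 4/7. Unlike accuracy, the score ignores true negatives, so it is not inflated by the large majority class.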
Here is the link to the full experiment in Azure ML Studio Classic: https://gallery.cortanaintelligence.com/Experiment/Employee-Promotion