Analysis Model of Employee Performance With Boosting Technique And BLSMOTE

Chandra Parashian N
7 min read · Oct 23, 2021



  1. Motivation of the author

Employee performance is an extremely important and interesting topic because of its proven benefit: a company wants its employees to work according to their skills in order to achieve good results. Without good results from all employees, the company's goals will be difficult to reach.

Good employee performance is one illustration of the quality of human resources and represents a person's success. Human resource quality includes factors such as critical thinking, curiosity, status, organization, and educational background.

Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data (T. Mitchell, 1997). It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.

In statistics and machine learning, ensembling is a technique that combines a set of individual learners (individual models) to obtain better predictive performance. Ensembling techniques generally fall into bagging, boosting, and stacking:

  • Bagging (bootstrap aggregating) involves having each model in the ensemble vote with equal weight.
  • Boosting is a sequential ensemble learning technique that converts weak base learners into a strong learner that performs better and has less bias.
  • Stacking is an ensemble technique that combines several machine learning algorithms via meta-learning.

Based on those explanations, we will analyze and predict employee performance using boosting.

2. Preparing the tools for analysis and model building

For model building and visualization we will use R version 4.1.1. The model applies XGBoost (Extreme Gradient Boosting) via the xgboost package, and for visualization we will plot the data with ggplot2, ggpubr, and ggcorrplot.

The BLSMOTE (Borderline-SMOTE) algorithm attempts to learn the borderline of each class, where borderline instances and the ones nearby are more likely to be misclassified than the ones far from the borderline (H. Han et al., 2005). To find the optimal k value, we will use factoextra and NbClust.
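Before going further, here is a minimal setup sketch of the packages used throughout this post. The original code does not name every package explicitly, so the assumption here is that BLSMOTE comes from smotefamily and dummy_cols from fastDummies.

# load the packages used for manipulation, visualization, oversampling, and modelling
library(dplyr)        # data manipulation (glimpse, %>%, summarise)
library(ggplot2)      # visualization
library(ggpubr)       # arranging plots
library(ggcorrplot)   # correlation plot
library(fastDummies)  # dummy_cols() for one-hot encoding (assumed package)
library(smotefamily)  # BLSMOTE() for borderline oversampling (assumed package)
library(factoextra)   # fviz_nbclust() for the optimal k
library(NbClust)      # alternative indices for the optimal k
library(xgboost)      # XGBoost model and xgb.DMatrix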

3. Metadata

The training and testing data contain 8,153 and 3,000 observations respectively. There are 21 predictor variables, and the remaining variable is the target. Here is the list of predictor variables:

  1. job_level : Category type
  2. job_duration_in_current_job_level : Numeric type
  3. person_level : Category type
  4. job_duration_in_current_person_level : Numeric type
  5. job_duration_in_current_branch : Numeric type
  6. Employee_type : Category type
  7. gender : Category type
  8. age : Numeric type
  9. marital_status_maried.Y.N : Binary type
  10. number_of_dependences : Numeric type
  11. Education_level : Category type
  12. GPA : Numeric type
  13. year_graduated : Numeric type
  14. job_duration_from_training: Numeric type
  15. branch_rotation: Numeric type
  16. job_rotation: Numeric type
  17. assign_of_otherposition: Numeric type
  18. annual.leaves: Numeric type
  19. sick_leaves: Numeric type
  20. Last_achievement_.: Numeric type
  21. Achievement_above_100._during3quartal: Numeric type

4. Describing the data types

glimpse(data_training)
glimpse(data_testing)
Structure of the training and testing data

Here is a display of the training and testing data after importing. Before visualizing the data, we first check whether each predictor has any missing values.

# count missing values in every column of the training and testing data
df_na_train <- data_training %>%
summarise(across(everything(), ~ sum(is.na(.))))
df_na_test <- data_testing %>%
summarise(across(everything(), ~ sum(is.na(.))))
# transpose so each column appears as a row with its NA count
t(df_na_train)
t(df_na_test)
Number of missing values per column in the training and testing data

Both Last_achievement_. and Achievement_above_100._during3quartal contain just one NaN each. For simplicity, the training and testing data will be merged, and the NaN values will be imputed with a measure of center (here, the mean).

Proportion of Binary Class

The target class in the training data is imbalanced, so the minority class needs to be oversampled; a quick check is sketched below.
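As a minimal sketch, the class proportion can be checked and plotted as follows; it assumes the target column is named Best.Performance, as in the modelling code later in this post.

# tabulate and plot the proportion of the binary target class
prop_target <- prop.table(table(data_training$Best.Performance))
prop_target
data.frame(class = names(prop_target), prop = as.numeric(prop_target)) %>%
ggplot(aes(x = class, y = prop, fill = class)) +
geom_col() +
labs(title = "Proportion of Binary Class", y = "proportion")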

5. Exploring the Data

Visualization of the numeric variables

From these density plots, here is what we observe (a sketch of how such plots can be drawn follows the list):

  • job duration in current job level and job duration in current person level seem to have similar distributions.
  • GPA, assign of other position, and sick leaves are the most likely to have many zero values. These variables may have several outliers.
  • year graduated and age appear to be skewed to the left.
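As a minimal sketch, density plots like these can be drawn with ggplot2; this assumes tidyr is available for reshaping, which the original code does not show.

# reshape the numeric predictors to long format and draw one density plot per variable
library(tidyr)
data_training %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.5) +
facet_wrap(~ variable, scales = "free")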

The statement in the second point needs further support, which the box plots below provide.

Box-Plot visualization

From the box plots, many predictors have outliers.

Density visualization for the 0-class and 1-class

From this visualization, I observe that for some predictors the maximum value in the 0-class is larger than the maximum value in the 1-class. Next, we will look at the categorical data.

Category Data

From the visualization of the categorical data, fortunately we do not see any category in the testing data that does not appear in the training data. Let us look at each target class.

Category Data for each target class

We can say that:

  • The proportion of JG04 in the 0-class is larger than the proportion of JG04 in the 1-class
  • For person level, the proportion of PG03 in the 0-class is larger than in the 1-class
  • The proportion of RM type A in the 0-class is larger than in the 1-class
  • Female employees are relatively less likely to be among the best performers
  • Married employees tend not to be among the best performers
  • Employees with education level 4 tend not to be among the best performers

From these points, we do not see a clear signal for whether an employee is a best performer or not. Hence we will transform these categorical variables into dummy variables (one-hot encoding).

The data also needs to be checked for multicollinearity. Before plotting that, the categorical data must first be transformed with one-hot encoding.

df <- rbind(data_training[, !(colnames(data_training) == "Best.Performance")], data_testing)
## impute the single NA in each achievement column with the column mean
df$Last_achievement_.[is.na(df$Last_achievement_.)] <- mean(df$Last_achievement_., na.rm = TRUE)
df$Achievement_above_100._during3quartal[is.na(df$Achievement_above_100._during3quartal)] <- mean(df$Achievement_above_100._during3quartal, na.rm = TRUE)
## number of rows in each original set, used later to split the merged data back
len_train <- nrow(data_training)
len_test <- nrow(data_testing)
## cat_train holds the categorical columns selected earlier (not shown here)
list_col <- colnames(cat_train[, !(colnames(cat_train) == "target")])
## one-hot encode every categorical column with dummy_cols
df_temp <- df
for (i in list_col) {
df_temp <- dummy_cols(df_temp, select_columns = i, remove_selected_columns = TRUE)
}
## split back into training (with the target re-attached) and testing sets
df_train <- cbind(df_temp[1:len_train, ], Best.Performance = data_training$Best.Performance)
df_test <- df_temp[(len_train + 1):(len_test + len_train), ]
## turn both into plain data frames
df_train <- data.frame(df_train)
df_test <- data.frame(df_test)
Correlation of the predictors

From this plot, we can see that some predictor variables indicate multicollinearity. We therefore need L2 regularization (the lambda parameter in XGBoost) to handle this problem.
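As a minimal sketch, the correlation plot can be produced with ggcorrplot from the one-hot-encoded training data built above.

# correlation matrix of the encoded predictors, excluding the target
corr_mat <- cor(df_train[, !(colnames(df_train) == "Best.Performance")])
ggcorrplot(corr_mat, type = "lower", tl.cex = 6)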

6. Preprocessing the data

Based on the proportion of the target class, we need to generate synthetic minority-class data with BLSMOTE. Before doing so, the k value must be determined.

## silhouette method to find the optimal number of clusters k
## (theme_black() is a custom ggplot2 theme defined elsewhere in the original post)
fviz_nbclust(df_train[, !(colnames(df_train) == "Best.Performance")], kmeans, method = 'silhouette') + theme_black()
Optimal number of clusters

According to the graph, the optimal value is reached at (2, 0.45), so the optimal k is 2. Now we can generate the synthetic data for the target class.
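Since NbClust was also listed among the tools, a minimal sketch of cross-checking the optimal k with its silhouette index is shown below; the column subset and index choice are assumptions.

# cross-check the optimal number of clusters with the silhouette index
nb <- NbClust(df_train[, !(colnames(df_train) == "Best.Performance")],
min.nc = 2, max.nc = 10, method = "kmeans", index = "silhouette")
nb$Best.nc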

## class proportion before oversampling
prop.table(table(df_train$Best.Performance))
## borderline SMOTE with K = 2 nearest neighbours
data_training_new <- BLSMOTE(df_train[, (!colnames(df_train) == "Best.Performance")], df_train$Best.Performance, K = 2)
data_training_new <- data_training_new$data
## class proportion after oversampling
prop.table(table(data_training_new$class))
Proportion of the target class before and after oversampling

7. Building the model

Before building the model, we make sure the data are converted to xgb.DMatrix, since xgb.train accepts only xgb.DMatrix data.

## make x_train, y_train, x_test, y_test
x_train <- data_training_new[, (!colnames(data_training_new) == "class")]
y_train <- as.numeric(data_training_new$class)
x_test <- df_test
## reference_data holds the true labels of the testing data (loaded separately)
y_test <- reference_data
## Modelling: xgboost expects xgb.DMatrix objects
x_train_xgb <- xgb.DMatrix(as.matrix(x_train), label = y_train)
x_test_xgb <- xgb.DMatrix(as.matrix(x_test), label = y_test)

Now that x_train_xgb and x_test_xgb have been obtained, we define the XGBoost hyperparameters and search for the best configuration.

## hyperparameters: DART booster with L2 (lambda) and L1 (alpha) regularization
params_xgb <- list(booster = "dart",
objective = "binary:logistic", eta = 0.3, gamma = 1, max_depth = 5,
min_child_weight = 2, subsample = 1, colsample_bytree = 1, lambda = 1.25, alpha = 0.75)
## 5-fold cross-validation with early stopping to find the best number of rounds
xgb_cv <- xgb.cv(params = params_xgb,
data = x_train_xgb, nrounds = 600, nfold = 5, showsd = TRUE,
early_stopping_rounds = 35, maximize = FALSE, metrics = c('auc'))
## train the final model with the best number of rounds from cross-validation
gb_dt <- xgb.train(params = params_xgb,
data = x_train_xgb,
nrounds = xgb_cv$best_iteration,
print_every_n = 2,
eval_metric = c('auc'),
watchlist = list(train = x_train_xgb, eval = x_test_xgb))

8. Evaluating the model

Now that our model has been built, we have to see how well it performs. We will evaluate it with a confusion matrix.
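The original post does not show this step explicitly, but a minimal sketch of obtaining the confusion matrix could look like the following, assuming the caret package and a 0.5 probability threshold.

# predict probabilities on the test set, threshold at 0.5, and build the confusion matrix
library(caret)
pred_prob <- predict(gb_dt, x_test_xgb)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
confusionMatrix(factor(pred_class, levels = c(0, 1)),
factor(y_test, levels = c(0, 1)),
positive = "1")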

Confusion matrix

From this image, here is the interpretation:

  • The accuracy and sensitivity seem relatively good and balanced.
  • On the other hand, the specificity is poor: it is only 19.26%, meaning only a small fraction of the actual negatives (true negatives out of true negatives plus false positives) are identified correctly.
  • Kappa's value is 0.0324, which indicates that the model's accuracy is barely better than what random assignment would achieve.
  • The prevalence indicates that 83.73% of the testing data belongs to the not-best-performance class.

From the gb_dt model, we can look at the variable importance, which tells us how much each predictor variable contributes to the model.

Local explanation summary
  • There are 12 variables that do not contribute to our model.
  • The last variable, job_level_JG05, does not give any contribution to our model.
  • The first variable, job_duration_in_current_branch, is the most influential for our model. Looking at the dots for this variable, its impact on the model appears relatively high, which means job_duration_in_current_branch plays an important role for both low and high values (see the sketch after this list).
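The local explanation summary above looks like a SHAP summary plot. A minimal sketch of producing one is shown below; the use of xgb.ggplot.shap.summary is an assumption about how the plot was made and requires a reasonably recent xgboost version.

# SHAP summary of the top 10 features, computed on the test data
xgb.ggplot.shap.summary(data = as.matrix(x_test), model = gb_dt, top_n = 10)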

We can also see the percentage contribution of each predictor to our model on the test data. We will pick only the top 10 predictors.
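A minimal sketch of extracting the gain-based variable importance and plotting the top 10 predictors with the xgboost helpers:

# gain-based importance of each predictor, then plot the top 10
importance_mat <- xgb.importance(feature_names = colnames(x_train), model = gb_dt)
xgb.plot.importance(importance_mat, top_n = 10)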

Percentage contribution of the predictors

9. Conclusion

Based on the explanation above, the accuracy of our model is 73.47%, but the specificity is too low. We can also see that of the 39 predictors produced by feature engineering, there are 10 important variables:

  • job_duration_in_current_branch
  • number_of_dependences
  • job_rotation
  • gender_F
  • last_achievement
  • annual_leave
  • education_level_level_3
  • branch_rotation
  • employee_type_RM_type_B
  • job_duration_in_current_job_level

Reference

  1. Mitchell, T. (1997). Machine Learning. New York: McGraw-Hill.
  2. Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the 2005 International Conference on Advances in Intelligent Computing.
