Analysis Model of Employee Performance With Boosting Technique And BLSMOTE

Chandra Parashian N
7 min read · Oct 23, 2021



  1. Motivation of the author

Employee performance is an extremely important and interesting topic because of its proven benefit: a company wants its employees to work according to their skills in order to achieve good results. Without good results from all employees, the company's goals will be difficult to reach.

Good employee performance is one illustration of the quality of human resources and represents a person's success. Human resource quality includes factors such as critical thinking, curiosity, status, organization, and educational background.

Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data (T. Mitchell, 1997). It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.

In statistics and machine learning, ensembling is a technique that combines a set of individual learners (individual models) to obtain better predictive performance. Ensembling techniques generally fall into bagging, boosting, and stacking:

  • Bagging (bootstrap aggregating) involves having each model in the ensemble vote with equal weight.
  • Boosting is a sequential ensemble learning technique that converts weak base learners into a strong learner that performs better and has less bias.
  • Stacking is an ensemble technique that combines several machine learning algorithms via meta-learning.

Based on those explanations, we will analyze and predict employee performance using boosting.

2. Preparing the tools for analysis and model building

For model building and visualization we will use R version 4.1.1. The model applies XGBoost (Extreme Gradient Boosting) via the xgboost package, and for visualization we will plot the data with ggplot2, ggpubr, and ggcorrplot.

The BLSMOTE (Borderline-SMOTE) algorithm attempts to learn the borderline of each class, where borderline instances and the ones nearby are more likely to be misclassified than the ones far from the borderline (H. Han et al., 2005). To find the optimal k value, we will use factoextra and NbClust.
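Before going further, here is a minimal setup sketch of the packages used throughout this post. The original code does not name every package explicitly, so the assumption here is that BLSMOTE comes from smotefamily and dummy_cols from fastDummies.

# load the packages used for manipulation, visualization, oversampling, and modelling
library(dplyr)        # data manipulation (glimpse, %>%, summarise)
library(ggplot2)      # visualization
library(ggpubr)       # arranging plots
library(ggcorrplot)   # correlation plot
library(fastDummies)  # dummy_cols() for one-hot encoding (assumed package)
library(smotefamily)  # BLSMOTE() for borderline oversampling (assumed package)
library(factoextra)   # fviz_nbclust() for the optimal k
library(NbClust)      # alternative indices for the optimal k
library(xgboost)      # XGBoost model and xgb.DMatrix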

3. Metadata

The training and testing data contain 8,153 and 3,000 observations respectively. There are 21 predictor variables, and the remaining variable is the target. Here is the list of predictor variables:

  1. job_level : Category type
  2. job_duration_in_current_job_level : Numeric type
  3. person_level : Category type
  4. job_duration_in_current_person_level : Numeric type
  5. job_duration_in_current_branch : Numeric type
  6. Employee_type : Category type
  7. gender : Category type
  8. age : Numeric type
  9. marital_status_maried.Y.N : Binary type
  10. number_of_dependences : Numeric type
  11. Education_level : Category type
  12. GPA : Numeric type
  13. year_graduated : Numeric type
  14. job_duration_from_training: Numeric type
  15. branch_rotation: Numeric type
  16. job_rotation: Numeric type
  17. assign_of_otherposition: Numeric type
  18. annual.leaves: Numeric type
  19. sick_leaves: Numeric type
  20. Last_achievement_.: Numeric type
  21. Achievement_above_100._during3quartal: Numeric type

4. Describing the data types

glimpse(data_training)
glimpse(data_testing)
Structure of the training and testing data

Here is a display of the training and testing data after importing. Before visualizing the data, we first check whether each predictor has any missing values.

# count missing values in every column of the training and testing data
df_na_train <- data_training %>%
summarise(across(everything(), ~ sum(is.na(.))))
df_na_test <- data_testing %>%
summarise(across(everything(), ~ sum(is.na(.))))
# transpose so each column appears as a row with its NA count
t(df_na_train)
t(df_na_test)
Number of missing values per column in the training and testing data

Both Last_achievement_. and Achievement_above_100._during3quartal contain just one NaN each. For simplicity, the training and testing data will be merged, and the NaN values will be imputed with a measure of center (here, the mean).

Proportion of Binary Class

The target class in the training data is imbalanced, so the minority class needs to be oversampled; a quick check is sketched below.
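As a minimal sketch, the class proportion can be checked and plotted as follows; it assumes the target column is named Best.Performance, as in the modelling code later in this post.

# tabulate and plot the proportion of the binary target class
prop_target <- prop.table(table(data_training$Best.Performance))
prop_target
data.frame(class = names(prop_target), prop = as.numeric(prop_target)) %>%
ggplot(aes(x = class, y = prop, fill = class)) +
geom_col() +
labs(title = "Proportion of Binary Class", y = "proportion")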

5. Exploring the Data

Visualization of the numeric variables

From these density plots, here is what we observe (a sketch of how such plots can be drawn follows the list):

  • job duration in current job level and job duration in current person level seem to have similar distributions.
  • GPA, assign of other position, and sick leaves are the most likely to have many zero values. These variables may have several outliers.
  • year graduated and age appear to be skewed to the left.
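As a minimal sketch, density plots like these can be drawn with ggplot2; this assumes tidyr is available for reshaping, which the original code does not show.

# reshape the numeric predictors to long format and draw one density plot per variable
library(tidyr)
data_training %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.5) +
facet_wrap(~ variable, scales = "free")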

The statement in the second point needs further support, which the box plots below provide.

Box-Plot visualization

From the box plots, many predictors have outliers.

Density visualization for the 0-class and 1-class

From this visualization, I observe that for some predictors the maximum value in the 0-class is larger than the maximum value in the 1-class. Next, we will look at the categorical data.

Category Data

From the visualization of the categorical data, fortunately we do not see any category in the testing data that does not appear in the training data. Let us look at each target class.

Category Data for each target class

We can say that:

  • The proportion of JG04 in the 0-class is larger than the proportion of JG04 in the 1-class
  • For person level, the proportion of PG03 in the 0-class is larger than in the 1-class
  • The proportion of RM type A in the 0-class is larger than in the 1-class
  • Female employees are relatively less likely to be among the best performers
  • Married employees tend not to be among the best performers
  • Employees with education level 4 tend not to be among the best performers

From these points, we do not see a clear signal for whether an employee is a best performer or not. Hence we will transform these categorical variables into dummy variables (one-hot encoding).

The data also needs to be checked for multicollinearity. Before plotting that, the categorical data must first be transformed with one-hot encoding.

df <- rbind(data_training[, !(colnames(data_training) == "Best.Performance")], data_testing)
## impute the single NA in each achievement column with the column mean
df$Last_achievement_.[is.na(df$Last_achievement_.)] <- mean(df$Last_achievement_., na.rm = TRUE)
df$Achievement_above_100._during3quartal[is.na(df$Achievement_above_100._during3quartal)] <- mean(df$Achievement_above_100._during3quartal, na.rm = TRUE)
## number of rows in each original set, used later to split the merged data back
len_train <- nrow(data_training)
len_test <- nrow(data_testing)
## cat_train holds the categorical columns selected earlier (not shown here)
list_col <- colnames(cat_train[, !(colnames(cat_train) == "target")])
## one-hot encode every categorical column with dummy_cols
df_temp <- df
for (i in list_col) {
df_temp <- dummy_cols(df_temp, select_columns = i, remove_selected_columns = TRUE)
}
## split back into training (with the target re-attached) and testing sets
df_train <- cbind(df_temp[1:len_train, ], Best.Performance = data_training$Best.Performance)
df_test <- df_temp[(len_train + 1):(len_test + len_train), ]
## turn both into plain data frames
df_train <- data.frame(df_train)
df_test <- data.frame(df_test)
Correlation of the predictors

From this plot, we can see that some predictor variables indicate multicollinearity. We therefore need L2 regularization (the lambda parameter in XGBoost) to handle this problem.
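As a minimal sketch, the correlation plot can be produced with ggcorrplot from the one-hot-encoded training data built above.

# correlation matrix of the encoded predictors, excluding the target
corr_mat <- cor(df_train[, !(colnames(df_train) == "Best.Performance")])
ggcorrplot(corr_mat, type = "lower", tl.cex = 6)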

6. Preprocessing the data

Based on the proportion of the target class, we need to generate synthetic minority-class data with BLSMOTE. Before doing so, the k value must be determined.

## silhouette method to find the optimal number of clusters k
## (theme_black() is a custom ggplot2 theme defined elsewhere in the original post)
fviz_nbclust(df_train[, !(colnames(df_train) == "Best.Performance")], kmeans, method = 'silhouette') + theme_black()
Optimal number of clusters

According to the graph, the optimal value is reached at (2, 0.45), so the optimal k is 2. Now we can generate the synthetic data for the target class.
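Since NbClust was also listed among the tools, a minimal sketch of cross-checking the optimal k with its silhouette index is shown below; the column subset and index choice are assumptions.

# cross-check the optimal number of clusters with the silhouette index
nb <- NbClust(df_train[, !(colnames(df_train) == "Best.Performance")],
min.nc = 2, max.nc = 10, method = "kmeans", index = "silhouette")
nb$Best.nc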

## class proportion before oversampling
prop.table(table(df_train$Best.Performance))
## borderline SMOTE with K = 2 nearest neighbours
data_training_new <- BLSMOTE(df_train[, (!colnames(df_train) == "Best.Performance")], df_train$Best.Performance, K = 2)
data_training_new <- data_training_new$data
## class proportion after oversampling
prop.table(table(data_training_new$class))
Proportion of the target class before and after oversampling

7. Building the model

Before building the model, we make sure the data are converted to xgb.DMatrix, since xgb.train accepts only xgb.DMatrix data.

## make x_train, y_train, x_test, y_test
x_train <- data_training_new[, (!colnames(data_training_new) == "class")]
y_train <- as.numeric(data_training_new$class)
x_test <- df_test
## reference_data holds the true labels of the testing data (loaded separately)
y_test <- reference_data
## Modelling: xgboost expects xgb.DMatrix objects
x_train_xgb <- xgb.DMatrix(as.matrix(x_train), label = y_train)
x_test_xgb <- xgb.DMatrix(as.matrix(x_test), label = y_test)

Now that x_train_xgb and x_test_xgb have been obtained, we define the XGBoost hyperparameters and search for the best configuration.

## hyperparameters: DART booster with L2 (lambda) and L1 (alpha) regularization
params_xgb <- list(booster = "dart",
objective = "binary:logistic", eta = 0.3, gamma = 1, max_depth = 5,
min_child_weight = 2, subsample = 1, colsample_bytree = 1, lambda = 1.25, alpha = 0.75)
## 5-fold cross-validation with early stopping to find the best number of rounds
xgb_cv <- xgb.cv(params = params_xgb,
data = x_train_xgb, nrounds = 600, nfold = 5, showsd = TRUE,
early_stopping_rounds = 35, maximize = FALSE, metrics = c('auc'))
## train the final model with the best number of rounds from cross-validation
gb_dt <- xgb.train(params = params_xgb,
data = x_train_xgb,
nrounds = xgb_cv$best_iteration,
print_every_n = 2,
eval_metric = c('auc'),
watchlist = list(train = x_train_xgb, eval = x_test_xgb))

8. Evaluating the model

Now that our model has been built, we have to see how well it performs. We will evaluate it with a confusion matrix.
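The original post does not show this step explicitly, but a minimal sketch of obtaining the confusion matrix could look like the following, assuming the caret package and a 0.5 probability threshold.

# predict probabilities on the test set, threshold at 0.5, and build the confusion matrix
library(caret)
pred_prob <- predict(gb_dt, x_test_xgb)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
confusionMatrix(factor(pred_class, levels = c(0, 1)),
factor(y_test, levels = c(0, 1)),
positive = "1")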

Confusion matrix

From this image, here is the interpretation:

  • The accuracy and sensitivity seem relatively good and balanced.
  • On the other hand, the specificity is poor: it is only 19.26%, meaning only a small fraction of the actual negatives (true negatives out of true negatives plus false positives) are identified correctly.
  • Kappa's value is 0.0324, which indicates that the model's accuracy is barely better than what random assignment would achieve.
  • The prevalence indicates that 83.73% of the testing data belongs to the not-best-performance class.

From the gb_dt model, we can look at the variable importance, which tells us how much each predictor variable contributes to the model.

Local explanation summary
  • There are 12 variables that do not contribute to our model.
  • The last variable, job_level_JG05, does not give any contribution to our model.
  • The first variable, job_duration_in_current_branch, is the most influential for our model. Looking at the dots for this variable, its impact on the model appears relatively high, which means job_duration_in_current_branch plays an important role for both low and high values (see the sketch after this list).
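The local explanation summary above looks like a SHAP summary plot. A minimal sketch of producing one is shown below; the use of xgb.ggplot.shap.summary is an assumption about how the plot was made and requires a reasonably recent xgboost version.

# SHAP summary of the top 10 features, computed on the test data
xgb.ggplot.shap.summary(data = as.matrix(x_test), model = gb_dt, top_n = 10)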

We can also see the percentage contribution of each predictor to our model on the test data. We will pick only the top 10 predictors.
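A minimal sketch of extracting the gain-based variable importance and plotting the top 10 predictors with the xgboost helpers:

# gain-based importance of each predictor, then plot the top 10
importance_mat <- xgb.importance(feature_names = colnames(x_train), model = gb_dt)
xgb.plot.importance(importance_mat, top_n = 10)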

Percentage contribution of the predictors

9. Conclusion

Based on the explanation above, the accuracy of our model is 73.47%, but the specificity is too low. We can also see that of the 39 predictors produced by feature engineering, there are 10 important variables:

  • job_duration_in_current_branch
  • number_of_dependences
  • job_rotation
  • gender_F
  • last_achievement
  • annual_leave
  • education_level_level_3
  • branch_rotation
  • employee_type_RM_type_B
  • job_duration_in_current_job_level

Reference

  1. Mitchell, T. (1997). Machine Learning. New York: McGraw-Hill.
  2. Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the 2005 International Conference on Advances in Intelligent Computing.
