Analysis Model of Employee Performance with Boosting Technique and BLSMOTE
Table of Contents
- Motivation
- Preparing the tools for analysis and model building
- Metadata
- Describing the data
- Exploring the data
- Preprocessing the data
- Building the model
- Evaluating the model
- Conclusion
1. Motivation
Employee performance is an extremely important and interesting topic because of its proven benefits: a company wants its employees to work according to their true skills in order to achieve good results. Without good results from all employees, success in reaching the company's goals will be difficult to achieve.
Good employee performance is one indication of the quality of human resources, and it represents a person's success. Human resource factors include critical thinking, curiosity, status, organization, and educational background.
Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data (T. Mitchell, 1997). It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.
In statistics and machine learning, an ensemble technique is a method of combining a set of learners (individual models) to obtain better predictive performance. The main ensemble techniques are bagging, boosting, and stacking.
- Bagging (bootstrap aggregating) trains each model in the ensemble on a bootstrap sample and has each model vote with equal weight.
- Boosting is a sequential ensemble learning technique that converts weak base learners into a strong learner that performs better and is less biased.
- Stacking is an ensemble technique that combines several machine learning algorithms via a meta-learner.
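As a toy illustration of the boosting idea only (a sketch, not the model built later in this article): each new weak learner is fitted to the residuals of the current ensemble, and its scaled prediction is added in. The data and variable names here are made up for illustration.

```r
# Toy gradient-boosting sketch with depth-1 trees (stumps) from rpart.
library(rpart)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

pred <- rep(mean(y), length(y))   # start from a constant prediction
eta  <- 0.1                       # learning rate

for (m in 1:50) {
  res   <- y - pred                                   # residuals of the current ensemble
  stump <- rpart(res ~ x, data = data.frame(x, res),
                 control = rpart.control(maxdepth = 1))
  pred  <- pred + eta * predict(stump)                # add the scaled weak learner
}

mean((y - pred)^2)   # training error shrinks as rounds are added
```

Each stump alone is a weak learner, but the running sum of their scaled predictions fits the signal increasingly well, which is exactly the sequential idea behind boosting.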
From those explanations, we will analyze and predict employee performance using boosting.
2. Preparing the tools for analysis and model building
For model building and visualization, we will use R version 4.1.1. The model applies XGBoost (Extreme Gradient Boosting) via the xgboost package, and the visualizations are plotted with ggplot2, ggpubr, and ggcorrplot.
The BLSMOTE (Borderline-SMOTE) algorithm attempts to learn the borderline of each class; these borderline instances and the ones nearby are more likely to be misclassified than the ones far from the borderline (H. Han et al., 2005). To find the optimal k value, we will use factoextra and NbClust.
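As a minimal setup sketch, the packages used in this analysis can be loaded as follows. This assumes they are already installed; smotefamily and fastDummies are my assumptions for where the BLSMOTE() and dummy_cols() functions used below come from.

```r
# Packages used throughout the analysis (install them first with install.packages()).
library(xgboost)      # XGBoost model and xgb.DMatrix
library(ggplot2)      # plots
library(ggpubr)       # arranging plots
library(ggcorrplot)   # correlation plot for the multicollinearity check
library(smotefamily)  # assumed source of BLSMOTE()
library(fastDummies)  # assumed source of dummy_cols()
library(factoextra)   # fviz_nbclust() for the optimal k
library(NbClust)
library(dplyr)        # pipes and summarise_all()
```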
3. Metadata
The training and testing data contain 8153 and 3000 observations, respectively. There are 21 predictor variables, and the remaining variable is the target. Here is the list of predictor variables:
- job_level: Category type
- job_duration_in_current_job_level: Numeric type
- person_level: Category type
- job_duration_in_current_person_level: Numeric type
- job_duration_in_current_branch: Numeric type
- Employee_type: Category type
- gender: Category type
- age: Numeric type
- marital_status_maried.Y.N: Binary type
- number_of_dependeces: Numeric type
- Education_level: Category type
- GPA: Numeric type
- year_graduated: Numeric type
- job_duration_from_training: Numeric type
- branch_rotation: Numeric type
- job_rotation: Numeric type
- assign_of_otherposition: Numeric type
- annual.leaves: Numeric type
- sick_leaves: Numeric type
- Last_achievement_.: Numeric type
- Achievemnet_above_100._during3quartal: Numeric type
4. Describing the data
glimpse(data_training)
glimpse(data_testing)
Here is a display of the training and testing data after importing. Before visualizing the data, we first make sure whether each predictor has missing values or not.
df_na_train <- data_training %>%
  summarise_all(funs(sum(is.na(.))))
df_na_test <- data_testing %>%
  summarise_all(funs(sum(is.na(.))))
t(df_na_train)
t(df_na_test)
Both Last_achievement_. and Achievement_above_100._during3quartal contain just one NaN each. For simplicity, the training and testing data will be merged, and the NaNs will be imputed with a measure of center (the mean).
The target class of the training data is imbalanced, so the minority class needs to be oversampled.
5. Exploring the data
From these density plots, here is what we observe:
- job_duration_in_current_job_level and job_duration_in_current_person_level seem to have similar distributions.
- GPA, assign_of_otherposition, and sick_leaves are the variables most likely to have many zero values; these variables may have several outliers.
- year_graduated and age seem to be skewed to the left.
The second point needs further clarification.
The box plots show that many predictors have outliers.
From this visualization, I suspect that for some predictors the maximum value in the 0-class is bigger than the maximum value in the 1-class. Next, we will look at the categorical data.
From the visualization of the categorical data, fortunately we do not see any category in the testing data that does not appear in the training data. Let's look at each target class.
We can say that:
- The proportion of JG04 in the 0-class is bigger than in the 1-class.
- For person level, the proportion of PG03 in the 0-class is bigger than in the 1-class.
- The proportion of RM type A in the 0-class is bigger than in the 1-class.
- Female employees are, relatively, less often best performers.
- Married employees are, relatively, less often best performers.
- Employees with education level 4 are, relatively, less often best performers.
From these points, no single category clearly determines whether an employee is a best performer or not. Hence we will transform these categorical variables into dummy variables (one-hot encoding).
The data also needs to be checked for multicollinearity. Before plotting that visualization, the categorical data must first be transformed with one-hot encoding.
df = rbind(data_training[, !(colnames(data_training) == "Best.Performance")], data_testing)
## imputation of NA values
df$Last_achievement_.[is.na(df$Last_achievement_.)] = mean(df$Last_achievement_., na.rm = T)
df$Achievement_above_100._during3quartal[is.na(df$Achievement_above_100._during3quartal)] = mean(df$Achievement_above_100._during3quartal, na.rm = T)
list_col = colnames(cat_train[, !(colnames(cat_train) == "target")])
# make dummy variables
df_temp = df
for (i in list_col) {
  df_temp <- dummy_cols(df_temp, select_columns = i, remove_selected_columns = TRUE)
}
df_train <- cbind(df_temp[1:len_train, ], Best.Performance = data_training$Best.Performance)
df_test <- df_temp[(len_train + 1):(len_test + len_train), ]
## turn to data frame
df_train = data.frame(df_train)
df_test = data.frame(df_test)
From this plot, we can see that some predictor variables indicate multicollinearity. We will therefore need L2 regularization to handle this problem.
6. Preprocessing the data
Based on the proportion of the target class, synthetic data needs to be generated with BLSMOTE. Before generating data with BLSMOTE, the k value (number of nearest neighbors) must be found.
fviz_nbclust(df_train[, !(colnames(df_train) == "Best.Performance")],
             kmeans, method = 'silhouette') + theme_black()
According to the optimal-k plot, the silhouette curve peaks at (2, 0.45), so the optimal k is 2. Now we can generate data for the minority target class.
prop.table(table(df_train$Best.Performance))
data_training_new<-BLSMOTE(df_train[,(!colnames(df_train)=="Best.Performance")],df_train$Best.Performance,K=2)
data_training_new<-data_training_new$data
prop.table(table(data_training_new$class))
7. Building the model
Before building the model, the data must be converted to xgb.DMatrix, since XGBoost's training functions accept only xgb.DMatrix data.
## make x_train, y_train, x_test, y_test
x_train <- data_training_new[, !(colnames(data_training_new) == "class")]
y_train <- as.numeric(data_training_new$class)
x_test <- df_test
y_test <- reference_data
## modelling: xgboost
x_train_xgb <- xgb.DMatrix(as.matrix(x_train), label = y_train)
x_test_xgb <- xgb.DMatrix(as.matrix(x_test), label = y_test)
Once x_train_xgb and x_test_xgb have been obtained, we define the XGBoost hyperparameters and use cross-validation to find the best number of boosting rounds.
# xgboost hyperparameters
params_xgb <- list(booster = "dart",
                   objective = "binary:logistic", eta = 0.3, gamma = 1,
                   max_depth = 5, min_child_weight = 2, subsample = 1,
                   colsample_bytree = 1, lambda = 1.25, alpha = 0.75)
xgb_cv <- xgb.cv(params = params_xgb,
                 data = x_train_xgb, nrounds = 600, nfold = 5, showsd = T,
                 early_stopping_rounds = 35, maximize = F, metrics = c('auc'))
gb_dt <- xgb.train(params = params_xgb,
                   data = x_train_xgb,
                   nrounds = xgb_cv$best_iteration,
                   print_every_n = 2,
                   eval_metric = c('auc'),
                   watchlist = list(train = x_train_xgb, eval = x_test_xgb))
8. Evaluating the model
Now that our model has been built, we have to see how well it performs. We will evaluate its performance with a confusion matrix.
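A minimal sketch of this evaluation step, continuing from the objects defined above. The 0.5 probability cut-off and the use of caret::confusionMatrix() are my assumptions, not stated in the original.

```r
# Sketch: score the test set and build the confusion matrix with caret.
library(caret)

pred_prob  <- predict(gb_dt, x_test_xgb)    # predicted probabilities
pred_class <- as.numeric(pred_prob > 0.5)   # assumed 0.5 cut-off

confusionMatrix(factor(pred_class, levels = c(0, 1)),
                factor(y_test, levels = c(0, 1)),
                positive = "1")
# Reports accuracy, kappa, sensitivity, specificity, and prevalence.
```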
From this output, the interpretation is:
- The accuracy and sensitivity seem relatively good and balanced.
- On the other hand, the specificity is poor: it is only 19.26%, meaning that the proportion of true negatives among all actual negatives (true negatives plus false positives) is small.
- Kappa is 0.0324, which indicates that the model's accuracy is close to what random assignment of the labels would achieve.
- The prevalence indicates that 83.73% of our testing data belongs to the worst-performance class.
From the gb_dt model, we can look at variable importance, which tells us how much each predictor variable contributes to the model.
- There are 12 variables that give no contribution to our model.
- The last variable, job_level_JG05, gives no contribution to our model at all.
- The first variable, job_duration_in_current_branch, is the most influential for our model. Looking at its dots, the impact on the model seems relatively high, which means that job_duration_in_current_branch plays an important role at both low and high values.
We can also see the percentage contribution of each predictor to the model based on the test data. We will only pick the top 10 predictors.
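A sketch of how such an importance table and plot can be produced with xgboost's built-in helpers (top_n = 10 mirrors the choice of 10 predictors above):

```r
# Importance by Gain, Cover, and Frequency for each feature of gb_dt.
importance <- xgb.importance(feature_names = colnames(x_train), model = gb_dt)
head(importance, 10)                         # top contributors
xgb.plot.importance(importance, top_n = 10)  # bar chart of the 10 best
```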
9. Conclusion
Based on the explanation above, the accuracy of our model is 73.47%, but the specificity is too low. We can also see that, out of the 39 predictors produced by feature engineering, there are 10 important variables:
- job_duration_in_current_branch
- number_of_dependences
- job_rotation
- gender_F
- last_achievement
- annual_leave
- education_level_level_3
- branch_rotation
- employee_type_RM_type_B
- job_duration_in_current_job_level
References
- Mitchell, T. (1997). Machine Learning. New York: McGraw Hill.
- Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the 2005 International Conference on Advances in Intelligent Computing.