Amazon Employee Access Challenge

shriram s
Published in Analytics Vidhya
14 min read, Jul 3, 2020

Table of Contents

  1. Introduction
  2. Usage of ML for this problem
  3. Data Overview
  4. Performance Metric
  5. Exploratory Data Analysis
  6. Feature Engineering
  7. Feature Selection
  8. Building ML models
  9. Comparison between models
  10. Introduction to CatBoost Model
  11. Deployment of ML models
  12. Further Improvements
  13. Code Reference: GitHub
  14. References

Introduction

When employees start work at a company, they need to obtain the access required to fulfill their role. This is often done manually: the employee raises a request, and a supervisor picks it up and grants the access by hand. The process is time-consuming and needs human intervention at most stages. The idea is to replace this manual process with a machine learning model trained on existing data that contains details such as the employee’s role and department name. Such a model would help to grant or revoke access automatically and reduce the human involvement required.

Usage of ML for this problem

We aim to develop a machine learning model that takes an employee’s access request as input, containing attributes such as role and department, and decides whether to grant access or not. This can be framed as a binary classification problem in which the model predicts one of two classes (approve/deny) as the outcome.

Data Overview

This dataset is available as part of the Kaggle competition Amazon.com - Employee Access Challenge. It consists of real historical data collected in 2010 and 2011 and comes in two files, train.csv and test.csv. train.csv contains 32,769 rows and test.csv contains 58,921 rows.

Importing the data using Pandas.
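The original import snippet is not reproduced here, so the following is only a minimal sketch of loading the data with pandas. The three rows are made-up stand-ins so that the example runs without the Kaggle files; with the real data you would pass the file path (for example "train.csv") to pd.read_csv instead.

```python
import io

import pandas as pd

# Made-up rows standing in for train.csv (same column layout as listed below).
# With the real file: train = pd.read_csv("train.csv")
sample_csv = io.StringIO(
    "ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,"
    "ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE\n"
    "1,101,11,1,10,100,1000,2000,3000,4000\n"
    "1,102,12,1,10,100,1001,2001,3000,4001\n"
    "0,103,11,2,20,200,1002,2002,3001,4002\n"
)

train = pd.read_csv(sample_csv)

print(train.shape)                     # number of rows and columns
print(train["ACTION"].value_counts())  # how many requests were approved vs. rejected
```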

List of columns given in the data:

ACTION: ACTION is 1 if the resource was approved, 0 if the resource was not

RESOURCE: An ID for each resource

MGR_ID: The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time

ROLE_ROLLUP_1: Company role grouping category id 1 (e.g. US Engineering)

ROLE_ROLLUP_2: Company role grouping category id 2 (e.g. US Retail)

ROLE_DEPTNAME: Company role department description (e.g. Retail)

ROLE_TITLE: Company role business title description (e.g. Senior Engineering Retail Manager)

ROLE_FAMILY_DESC: Company role family extended description (e.g. Retail Manager, Software Engineering)

ROLE_FAMILY: Company role family description (e.g. Retail Manager)

ROLE_CODE: Company role code; this code is unique to each role (e.g. Manager)

Performance Metric

The performance metric used for this problem is the AUC score.

What is the AUC score?

AUC is one of the most important evaluation metrics for checking a classification model’s performance. AUC, also called AUROC, stands for Area Under the Receiver Operating Characteristic curve.
Before diving into the AUC score, let’s understand a few concepts.

Confusion matrix
The confusion matrix is a specific table layout that is used to visualize the performance of an algorithm.

TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

Suppose you have two class labels, 1 (Positive) and 0 (Negative).
True Positive: Your prediction is Positive and the actual label (the ground truth) is also Positive.
True Negative: Your prediction is Negative and the actual label is also Negative.
False Positive: Your prediction is Positive but the actual label is Negative. This is also called a Type I error.
False Negative: Your prediction is Negative but the actual label is Positive. This is also called a Type II error.

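Ideally, the true positive rate (TPR = TP / (TP + FN)) and the true negative rate (TNR = TN / (TN + FP)) should be high, and the false positive rate (FPR = FP / (FP + TN)) and the false negative rate (FNR = FN / (FN + TP)) should be low for a good classification model.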
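To make the four counts and the rates derived from them concrete, here is a small sketch in plain Python. The labels and predictions are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Rates computed from the confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # share of actual positives predicted positive
        "TNR": tn / (tn + fp),  # share of actual negatives predicted negative
        "FPR": fp / (fp + tn),  # share of actual negatives predicted positive
        "FNR": fn / (fn + tp),  # share of actual positives predicted negative
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)           # 3 3 1 1
print(rates(tp, tn, fp, fn))
```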

A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots TPR on the y-axis against FPR on the x-axis. AUC, the area under this curve, measures separability: it tells how well the model can distinguish between the classes. Equivalently, it is the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative one. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
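One way to see what AUC measures is to compute it directly as the fraction of (positive, negative) pairs in which the positive example gets the higher score, counting ties as half. The sketch below does exactly that on made-up scores. It is quadratic in the number of samples, so it is meant for illustration rather than for large datasets.

```python
def auc_score(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```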

Exploratory Data Analysis (EDA)

Exploratory data analysis is the process of analyzing data sets to understand the distribution of data, obtain its main characteristics, and visualize the distributions of the dataset.

From the above plot, we can see that the given dataset is imbalanced: most of the requests are approved and only a few are rejected.

This plot suggests that most MGR_ID values lie in the range 0–5000 and that the density of approved requests is higher in this range.

The above plot shows a high spike, which suggests that a few ROLE_ROLLUP_2 values occur far more often than the others.

The above plot shows multiple spikes, which suggests that the majority of ROLE_FAMILY values come from these two ranges; here too, the density of approved requests is higher than that of rejected requests.

From this heatmap, we can conclude that most of the given features are uncorrelated with each other, since most of the values in the plot are zero.

Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning algorithms.

The feature engineering process involves the below-mentioned steps:

  • Brainstorming or testing features.
  • Deciding what features to create.
  • Checking how the features work with your model.
  • Improving your features if needed.
  • Going back to brainstorming/creating more features until the work is done.

Some of the feature engineering techniques used in this problem are:

  • One Hot Encoding
  • Frequency Encoding
  • Response Encoding / Target Encoding
  • SVD Encoding

One Hot Encoding

Performing One Hot Encoding

One Hot Encoding is a strategy where each category value is converted into a new column and assigned a value of 1 or 0 based on the presence or absence of that category. In the above image, for the 4th sample, the value 1 is given to only one column (octopus) and all other values are zero.
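As a minimal sketch of the idea in plain Python, using a made-up list of animal names: each distinct category becomes one column, and each sample has a 1 in the column of its own category. In practice a library encoder such as scikit-learn's OneHotEncoder would typically be used, since it returns a sparse matrix and copes better with thousands of columns.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


animals = ["cat", "dog", "cat", "octopus"]
categories, encoded = one_hot_encode(animals)

print(categories)  # ['cat', 'dog', 'octopus']
for animal, row in zip(animals, encoded):
    print(animal, row)
```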

Frequency Encoding

Frequency Encoding uses the relative frequency of each category as its encoded value. Say we have a total of 9 points, 4 of which belong to category A. The frequency-encoded value for category A is then 0.44 (4/9). If a category occurs more often than the others for a given feature, it gets a higher frequency-encoded value than they do.

For example, in the above image, A occurs 4 times out of 9 and gets the value 0.44 (4/9), B gets 0.33 (3/9), and so on.
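The 4/9 and 3/9 arithmetic can be reproduced with a few lines of Python. This is a sketch on a made-up column with the same counts as in the example; the function name is hypothetical.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    mapping = {category: count / total for category, count in counts.items()}
    return [mapping[value] for value in values], mapping


column = ["A", "B", "A", "C", "A", "B", "C", "A", "B"]  # A x4, B x3, C x2
encoded, mapping = frequency_encode(column)

for category in sorted(mapping):
    print(category, round(mapping[category], 2))
```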

Target Encoding

Target Encoding / Mean Encoding is a strategy where we use the target variable (the output class labels) to derive new features. As the name suggests, the encoding is the mean value of the target variable for each category, computed on the training data.
Target encoding can be achieved using the steps below:

  1. Choose a categorical variable.
  2. Group by the categorical variable and obtain the aggregated sum of the target variable.
  3. Group by the categorical variable and obtain the aggregated count of the target variable.
  4. Divide the result of step 2 by the result of step 3 and join it back to the training data.
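The steps above can be sketched in plain Python as follows. The role names and ACTION values are made up, and the function name is hypothetical. Note that a category never seen in the training data has no entry in the mapping; falling back to the global mean of the target is a common choice in that case.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Mean of the target per category, joined back onto each row."""
    sums = defaultdict(float)   # step 2: sum of the target per category
    counts = defaultdict(int)   # step 3: number of rows per category
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}  # step 4
    return [means[category] for category in categories], means


roles = ["eng", "eng", "retail", "eng", "retail"]
action = [1, 0, 1, 1, 1]

encoded, means = target_encode(roles, action)
print(means)
print(encoded)
```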

SVD Encoding

The Singular-Value Decomposition(SVD) is a matrix decomposition method for reducing a matrix (A) to its constituent parts to make certain subsequent matrix calculations simpler.

A = U . Sigma . V^T

A popular application of SVD is dimensionality reduction. Given a dataset with many columns, dimensionality reduction helps us reduce it to a smaller set of features that are most relevant to the prediction problem.

The result is a matrix with a lower rank that is said to approximate the original matrix. To do this, we perform an SVD on the original data and keep only the k largest singular values in Sigma, together with the corresponding columns of U and rows of V^T.

In this problem, we construct a matrix of co-occurrences for each pair of features. Each row corresponds to a unique value in feature A, while each column corresponds to a unique value in feature B. Each element is the count of rows where the value in A appears together with the value in B.
We then use singular value decomposition to find two smaller matrices whose product approximates the count matrix. The above steps can be achieved using sklearn’s TruncatedSVD.
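A rough sketch of this encoding for a single pair of features is shown below. To keep it short it uses NumPy's full SVD (np.linalg.svd) and truncates the result by hand, rather than sklearn's TruncatedSVD; the column values and function names are made up for illustration.

```python
import numpy as np


def cooccurrence_matrix(col_a, col_b):
    """Count how often each unique value of A appears together with each value of B."""
    a_values = sorted(set(col_a))
    b_values = sorted(set(col_b))
    a_index = {value: i for i, value in enumerate(a_values)}
    b_index = {value: j for j, value in enumerate(b_values)}
    counts = np.zeros((len(a_values), len(b_values)))
    for a, b in zip(col_a, col_b):
        counts[a_index[a], b_index[b]] += 1
    return a_values, b_values, counts


def svd_encode(col_a, col_b, k=1):
    """Encode each value of A by its top-k SVD components w.r.t. feature B."""
    a_values, _, counts = cooccurrence_matrix(col_a, col_b)
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    embedding = u[:, :k] * s[:k]  # one k-dimensional vector per unique value of A
    lookup = {value: embedding[i] for i, value in enumerate(a_values)}
    return np.array([lookup[a] for a in col_a])


resource = [10, 10, 20, 30, 30, 30]
manager = [1, 2, 1, 2, 2, 1]

features = svd_encode(resource, manager, k=1)
print(features.shape)  # (6, 1): one new numeric feature per row
```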

Feature Selection

One Hot Encoding the given data resulted in around 16k features. Having too many features is not always useful: irrelevant or less important features don’t improve the performance of the model, and they increase its training time.

Feature Selection is the process where you automatically or manually select those features which contribute most to your prediction variable. Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant features.

There are various ways to perform feature selection and one of the methods I have used is the Chi-squared method.

What is a Chi-squared value?

A chi-square test is used in statistics to test the independence of two events. Given data for two variables, we can obtain the observed count O and the expected count E for each combination of values. The chi-square statistic, the sum of (O - E)^2 / E over all combinations, measures how far the observed counts deviate from the counts expected under independence.

The higher the chi-square value, the more the feature depends on the response, and the better a candidate it is for model training.
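To show how the statistic behaves, the sketch below computes it for two made-up 2x2 contingency tables (rows: feature present or absent, columns: request approved or rejected). It only illustrates the statistic itself. In scikit-learn, sklearn.feature_selection.chi2 combined with SelectKBest is one way to apply it to a full feature matrix.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            statistic += (o - e) ** 2 / e
    return statistic


independent = [[20, 20], [30, 30]]  # approval rate is the same in both rows
dependent = [[35, 5], [15, 45]]     # approval rate differs strongly between rows

print(chi_square(independent))  # 0.0
print(chi_square(dependent))    # 37.5
```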

Building ML Models

Now that the data is ready and we have used some feature engineering techniques to create new features, let’s try various classification models and compare their performance.

A brief overview of the classification algorithms/models:

K-Nearest Neighbors (KNN)

The KNN algorithm works by computing the distances between a query point and all other points in the given dataset. The K points closest to the query are then chosen (the K nearest neighbors). For classification, a majority vote decides the class label (the most frequent label among the nearest neighbors). For regression, the prediction is the average of the nearest neighbors’ labels.
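The classification case can be sketched in a few lines of plain Python. The points and labels below are made up, Euclidean distance is assumed as the distance measure, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.2, 1.1)))  # 0
print(knn_predict(points, labels, (6.8, 7.0)))  # 1
```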

Support Vector Machines (SVM)

The objective of the support vector machine algorithm is to find a hyperplane in N-dimensional space (N is the number of features) that distinctly classifies the data points. Many possible hyperplanes could separate the two classes of data points. Our objective is to find the one with the maximum margin, i.e. the maximum distance to the nearest data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.

Logistic Regression

The goal of logistic regression is to find an optimal hyperplane (in N dimensions) that separates the positive points from the negative points. It models the log-odds of the positive class as a linear function of the features, so it works best when the classes are close to linearly separable.

Random Forest

Random Forests are trained via the bagging method. Bagging, or Bootstrap Aggregating, consists of randomly sampling subsets of the training data, fitting a model to each of these smaller data sets, and aggregating the predictions. In a Random Forest, each model is a Decision Tree trained on a different subset of the data. Because the trees are trained independently of each other, the training can be parallelized.
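The bagging part of this can be illustrated with a deliberately tiny sketch. Instead of full decision trees it uses one-feature threshold rules ("stumps") as the base models, and the data is made up. A real Random Forest additionally considers only a random subset of features at each split, which is not shown here.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-feature threshold rule: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in rows}):
        correct = sum(1 for x, y in rows if (x >= threshold) == (y == 1))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def fit_bagged_stumps(rows, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        models.append(fit_stump(sample))
    return models


def predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = [1 if x >= threshold else 0 for threshold in models]
    return Counter(votes).most_common(1)[0][0]


rows = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = fit_bagged_stumps(rows)

print(predict(models, 1), predict(models, 9))
```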

XGBoost (eXtreme Gradient BOOSTing)

Boosting takes a more iterative approach. It is still an ensemble technique, in that many models are combined to form the final one, but it combines them more cleverly. Rather than training all of the models in isolation from one another, boosting trains models in succession, with each new model trained to correct the errors made by the previous ones.
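The "correct the errors of the previous models" idea can be sketched with a toy gradient-boosting loop. This is not XGBoost itself: it assumes a squared-error regression setting, one numeric feature, single-split stumps as base models, and made-up data. Each round fits a stump to the current residuals and adds a damped version of it to the ensemble.

```python
def fit_regression_stump(xs, residuals):
    """Best single-split stump (by squared error) fitted to the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # errors of the ensemble so far
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(stump(x) for stump in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2, 3.9, 4.1]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(fit_boosted(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)  # the boosted ensemble has the lower training error
```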

Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process begins. Choosing the best hyperparameter is important because it helps to prevent your model from either overfitting or underfitting.

The two most used methods for hyperparameter tuning are Grid Search and Random Search. Here we use Random Search to find the best hyperparameters.

Random search differs from grid search in that we don't give a fixed set of values to explore for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter, from which values are randomly sampled.
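The mechanics of random search fit in a few lines. In the sketch below the search space (a single hyperparameter C drawn log-uniformly between 1e-3 and 1e3) and the objective are stand-ins chosen for illustration; in a real run the objective would be something like a cross-validated AUC.

```python
import math
import random


def random_search(objective, distributions, n_iter=20, seed=0):
    """Sample n_iter hyperparameter settings and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in distributions.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: C is drawn log-uniformly from [1e-3, 1e3].
distributions = {"C": lambda rng: 10 ** rng.uniform(-3, 3)}


def toy_objective(params):
    """Stand-in score that peaks at C = 1; replace with a real validation score."""
    return -abs(math.log10(params["C"]))


best_params, best_score = random_search(toy_objective, distributions, n_iter=50)
print(best_params, best_score)
```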

Comparison between Models

Now that we have created new features with several feature engineering techniques, it’s time to try them with various models to see which features are more important and which models perform better than the others.

Below is a snippet of code in which hyperparameter tuning is done for a logistic regression model. The model is trained on the response-encoded (target-encoded) data, and the class labels are predicted for the test data.
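The original snippet is not reproduced here, so what follows is only a sketch of how such tuning might look with scikit-learn's RandomizedSearchCV. It runs on synthetic, imbalanced data from make_classification rather than the encoded competition data, and the search range for C, the number of iterations, and the number of folds are assumptions, not the values used in the original experiment.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the target-encoded features (about 90% positive labels).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.1, 0.9], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # assumed search range
    n_iter=10,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Score the held-out split with the best model found by the search.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 4))
```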

Similarly, we can train different models on different sets of features and find out which combination gives the best performance.

The above screenshot shows the test AUC scores for different models that were trained using different sets of features obtained using various feature engineering techniques.

Comparing all the above models, One Hot Encoding looks to be a good feature engineering technique: almost all models give a good score with it, and Logistic Regression gives the highest score, 0.88167, among the models trained on One Hot Encoded features.

Response encoded features seem to perform decently for all the models, while SVD encoded features don’t perform well with KNN, SVM, and LR but give a decent AUC score with Random Forest and XGBoost.

Random Forest and XGBoost are interesting models here: when trained on the given raw data without any feature transforms, both give a good test AUC score (0.876 and 0.880 respectively). The performance of both models also improves (the test score increases to 0.8856) when we use Frequency encoded features.

Comparing all these scores, 0.88561 is the highest, obtained by the Random Forest model trained on the raw data plus frequency encoded features.

Introduction to CatBoost Model

CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting, and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.

Advantages of CatBoost Model

1. Great quality without parameter tuning

CatBoost’s default hyperparameters tend to give good results, so extensive hyperparameter tuning is often unnecessary.

2. Categorical features support

CatBoost is designed to work with categorical features directly, so we don't need to spend time converting them into numerical features before training the model.

3. Improved accuracy

CatBoost tends to give better accuracy than other boosting models, as it did on this problem.

4. Fast prediction

CatBoost is also designed for fast prediction once the model is trained.

Below is the code snippet for using the CatBoost Classifier on this problem with the given data.
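The original snippet is not reproduced here. As a hedged sketch of what training might look like, the helper functions below (hypothetical names) pass all nine ID columns to CatBoost as categorical features and track AUC on a validation set. The parameter values are assumptions, not the settings behind the reported scores, and train_df / valid_df are assumed to be pandas DataFrames with the columns listed in the Data Overview.

```python
FEATURES = [
    "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2", "ROLE_DEPTNAME",
    "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE",
]


def train_catboost(train_df, valid_df, target="ACTION"):
    """Fit a CatBoostClassifier, treating every feature column as categorical."""
    # Imported here so the rest of the module loads even without catboost installed.
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(eval_metric="AUC", random_seed=42, verbose=False)
    model.fit(
        train_df[FEATURES],
        train_df[target],
        cat_features=FEATURES,  # column names; CatBoost encodes them internally
        eval_set=(valid_df[FEATURES], valid_df[target]),
    )
    return model


def predict_approval_probability(model, df):
    """Probability of ACTION == 1 for each row of df."""
    return model.predict_proba(df[FEATURES])[:, 1]
```

A typical call would be model = train_catboost(train_part, valid_part), followed by predict_approval_probability(model, test_df) to produce the probabilities for a submission file.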

As you can see from the above, the CatBoost model gives the best test AUC score, 0.91446 (public score) / 0.90876 (private score), which is the highest among all the models compared.

How Good is this score?
This private score of 0.90876 places you in the top 5% of the private leaderboard of the Kaggle competition.

Deployment of ML models

The deployment of machine learning models is the process of making your models available in production environments, where they can provide predictions to other software systems. Only once models are deployed to production do they start adding value, which makes deployment a crucial step.

There are various ways to deploy an ML model into production; here I have used one of the simplest, with Flask on an AWS instance.

A Simple HTML page that gets the needed details from the user

Given below is simple code that gets the data from the HTML page, predicts the class label from the user data, and displays the status on another HTML page using Flask.
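The original Flask code is not reproduced here; the following is a minimal sketch of the same flow under several assumptions. The route name /predict, the helper names, and the plain-text responses are hypothetical (the original renders HTML pages), the form is assumed to have one field per feature column, and model is assumed to be any trained classifier with a predict method.

```python
FEATURES = [
    "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2", "ROLE_DEPTNAME",
    "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE",
]


def parse_form(form):
    """Turn the submitted form fields into an ordered list of integer IDs."""
    return [int(form[name]) for name in FEATURES]


def status_message(prediction):
    """Text shown to the user for a predicted class label."""
    return "Request approved" if prediction == 1 else "Request rejected"


def create_app(model):
    """Build the Flask app around an already trained model."""
    # Imported here so the helpers above can be used and tested without Flask.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        row = parse_form(request.form)
        prediction = int(model.predict([row])[0])
        return status_message(prediction)

    return app
```

With a trained model loaded, create_app(model).run(host="0.0.0.0", port=5000) would start the development server on the instance; a production setup would normally sit behind a proper WSGI server instead.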

HTML page that states whether the request is approved or rejected.

Further Improvements

You can try using other feature engineering techniques to come up with new features that improve the performance of the model.

You can also try ensemble techniques like cascading to combine multiple models to improve the performance of the model.

Code Reference

https://github.com/Shriram016/Amazon-Employee-access-challenge

Contact
Email Id: shri16ram@gmail.com
Linkedin: https://www.linkedin.com/in/shriram016/
Mobile: +91–7200681570

References

  1. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
  2. https://en.wikipedia.org/wiki/Sensitivity_and_specificity
  3. https://developers.google.com/machine-learning/crashcourse/classification/roc-and-auc
  4. https://en.wikipedia.org/wiki/Feature_engineering
  5. https://medium.com/analytics-vidhya/different-type-of-feature-engineering-encoding-techniques-for-categorical-variable-encoding-214363a016fb
  6. https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2
  7. https://www.kdnuggets.com/2017/10/random-forests-explained.html
  8. https://www.kaggle.com/dmitrylarko/notebooks
  9. https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
  10. https://www.appliedaicourse.com/
