How to Make A Complete Machine Learning Project?

A Machine Learning Project A-Z!

Adarsh Verma
Deep Data Science
11 min read · Mar 5, 2020


Photo by Stanley Dai on Unsplash

Machine learning is a trending field with the ability to solve many business problems effectively. A machine learning project starts with a question, or with the idea of seeking something. We find a data set (or integrate multiple data sets) accordingly, perform exploratory data analysis, and then apply machine learning algorithms to seek answers or make predictions. We can also draw meaningful inferences for real-world applications.

Note: This is a long article, please skip to the section which is more relevant to you.

Code: Download from GitHub. Click here

Dataset: Download from Kaggle. Click here

[Question/Problem Statement] In this project, we focus on the opinions of employees about the companies they have worked for or are still working for, and we examine how those opinions differ between employees who are still with the company and those who have left. Furthermore, these opinions can be used to categorize the employment status of the employees as a current employee (label 0) or former employee (label 1). I have used machine learning algorithms to find out whether employees are former or current employees of a company based on the feedback they gave. Please check the code to follow along.

Project Name: Employee status (former or current) classification based on company reviews given by employees

The project is divided into 4 main parts:

  • Data Pre-processing & Exploratory Analysis
  • Experiment 1: Predictive Modeling with Numeric Data only
  • Experiment 2: Predictive Modeling with Text Data only
  • Experiment 3: Predictive Modeling with Numeric & Text Data

This project is developed from scratch while focusing on every step of a machine learning project. It consists of the following steps:

A. Data Pre-processing

  • Class Imbalance
  • Feature Separation
  • Data Cleaning
  • Treatment of missing values
  • Feature Engineering
  • Text Pre-processing with NLP
  • Conversion of Categorical features into Numerical features
  • Data Standardization

B. Model Training & Selection Using K-fold Cross-Validation

  • K-fold cross validation
  • Train-test split
  • Model selection

Models used — Dummy Classifier (baseline); Logistic Regression; Linear Support Vector Classifier (SVC); Decision Tree Classifier; K Neighbors Classifier; Multinomial Naïve Bayes

Ensemble/Meta-modelling — AdaBoost, Bagging Classifier, Random Forest Classifier, Voting Classifier

C. Model Evaluation — Accuracy, Precision, Recall, F1-Score, AUC

D. Statistical Testing — Paired t-test

E. Fine Tuning the Model for Performance

Goal: The classification task we want to perform on our dataset is to distinguish current from former employees (job status) using the reviews and other features. Hence, given a feedback instance, the model will classify it as a former or a current employee of the company.

Have a look at the data:

A. Data Pre-processing

The dataset contains over 67k employee reviews for Google, Amazon, Facebook, Apple and Microsoft, web-scraped from Glassdoor. The original dataset contains 17 features covering both numeric data (ratings) and text data (feedback comments).

  • Class Imbalance — Class imbalance is a very common challenge in classification problems. Since we use ‘job_status’ as the target label, we need to check whether its classes are equally distributed in the dataset. Our dataset has 2 categories of ‘job_status’, namely ‘current_employee’ and ‘former_employee’, and we discovered that ‘current_employee’ has significantly more instances (~42k) than ‘former_employee’ (~24k). A gap of more than 15,000 instances between the two class labels may lead to a biased model. There are many solutions to this problem, but for our dataset simple under-sampling works just fine: we randomly chose 20k instances from each class and created a balanced data set (see the sketch after this list).
Class Imbalance (left) handling with Under-sampling; Balanced dataset (right)
  • Feature Separation: Our dataset is unusual in that it contains three categories of data — numeric, categorical and text. I chose this dataset because it is challenging to pre-process, but at the same time it offers more experience with data pre-processing and a wider variety of model building. Many machine learning algorithms work well on numeric data while others work better on text data, so the type of algorithm we choose depends on the nature of the data. The diversity of our dataset makes it appropriate to separate text and numeric data and run different experiments on them. For this step, the data was split vertically into text and numeric subsets for further exploration, processing, and predictive modeling.
  • Text pre-processing — Through data exploration we can extract key features from text such as word count, average word length, word frequency, TF-IDF, etc. Please follow the articles below for text data pre-processing, feature extraction, and the intuition behind those features.
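Below is a minimal sketch of the under-sampling step mentioned above, assuming a pandas DataFrame loaded from the review data with a ‘job_status’ column; the file name and random seed are placeholders, not the exact names used in the repository.

```python
import pandas as pd

# Minimal under-sampling sketch; 'employee_reviews.csv' and the 'job_status'
# column name are assumptions based on the description above.
df = pd.read_csv("employee_reviews.csv")

n_per_class = 20000  # keep 20k instances per class, as described above
balanced = (
    df.groupby("job_status", group_keys=False)
      .apply(lambda g: g.sample(n=n_per_class, random_state=42))
      .reset_index(drop=True)
)

print(balanced["job_status"].value_counts())
```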

Text and numeric data are pre-processed separately. The below sections perform data pre-processing on the numeric data.

Numeric & Categorical data pre-processing —

  • Imputing missing values and feature conversion to numeric data: Our dataset has ratings such as work-life balance and culture-and-values ratings given by current or former employees of different companies. We found missing values in these ratings. Missing values should be treated carefully, and the treatment method varies from dataset to dataset while keeping the context in mind. Since these ratings are ordinal in nature, we replaced the missing values with the neutral mid-scale value 2.5, which is neither too high nor too low, so it should not have a significant impact on our models and helps us avoid introducing bias.
  • We also discovered that our dataset contains categorical data, which we want to convert to numeric data. For instance, ‘company’ has 6 different categories, so we can use dummy encoding to map employees to their respective companies: the value is set to ‘1’ for the company the employee works or worked for, and ‘0’ for the rest. The figure below shows a snippet of this dummy encoding. For other categorical features such as ‘location’ and ‘job title’, we can replace each category with its relative frequency, which is numeric (a short sketch follows the figure below). The formula used for frequency encoding:

frequency(x) = (number of instances where feature A = x) / (total number of instances)

Dummy Encoding for ’Company’ feature
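A short sketch of the imputation, dummy encoding and frequency encoding described above; the column names (‘work_balance_stars’, ‘culture_values_stars’, ‘company’, ‘location’) are illustrative assumptions, not necessarily the exact names in the dataset.

```python
import pandas as pd

# Assumed rating column names; replace with the actual columns in the dataset.
rating_cols = ["work_balance_stars", "culture_values_stars"]
df[rating_cols] = df[rating_cols].apply(pd.to_numeric, errors="coerce")
df[rating_cols] = df[rating_cols].fillna(2.5)  # neutral mid-scale value for missing ratings

# Dummy (one-hot) encoding for the six companies
df = pd.get_dummies(df, columns=["company"], prefix="company")

# Frequency encoding for high-cardinality categoricals such as 'location'
location_freq = df["location"].value_counts(normalize=True)  # count(x) / total instances
df["location"] = df["location"].map(location_freq)
```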

Feature extraction — Numeric Data: Some features in our dataset contain values that can be separated into new features of their own. One such feature is ‘Date’, which we split into 3 new features: ‘Post Month’, ‘Post Year’ and ‘Post Day’. Another such column is ‘Job title’, which follows the format <Current/Former Employee> — <Job Title>, for example “Current Employee — Software Engineer” or “Current Employee — Manager”. Hence we separated job status and job title into two columns: job_status and designation.
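A sketch of this feature extraction with pandas; the column names ‘dates’ and ‘job_title’ and the plain-hyphen separator are assumptions about the raw data.

```python
import pandas as pd

# Split the date column into year / month / day features
dates = pd.to_datetime(df["dates"], errors="coerce")
df["post_year"] = dates.dt.year
df["post_month"] = dates.dt.month
df["post_day"] = dates.dt.day

# 'job_title' is assumed to look like "Current Employee - Software Engineer"
parts = df["job_title"].str.split("-", n=1, expand=True)
df["job_status"] = parts[0].str.strip()
df["designation"] = parts[1].str.strip()
```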

Target Variable Conversion — ‘job_status’ is the target, which is converted from categorical to numerical values by assigning 0 and 1. We found that using ‘job_status’ as our target label is the most appropriate choice for this dataset. Current Employee = 0, Former Employee = 1.
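In code this is a one-line mapping; the label strings are assumed to match the values produced by the job-title split above.

```python
# Encode the target as described above: current = 0, former = 1
df["job_status"] = df["job_status"].map({"Current Employee": 0, "Former Employee": 1})
```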

Data Standardization — Since our input space has values ranging roughly from 0 to 3,000 (helpful count), we decided to rescale the data to the range 0–1; otherwise variables would not contribute equally to the model and features with larger values might dominate. This can be easily done with Scikit-learn’s MinMaxScaler.
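A minimal scaling sketch; the list of numeric columns is only illustrative.

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale numeric inputs to [0, 1]; replace with the full list of numeric features
numeric_cols = ["helpful_count", "post_year", "post_month", "post_day"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```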

B. Model Training & Selection Using K-fold Cross-Validation

Please refer to the post below for train-test split, k-fold cross-validation, accuracy, precision, recall and F1-score, explained in a simplified way!

In this project, we used 10-fold cross-validation. The dataset was divided into 10 folds and the models were trained for 10 iterations; in each iteration 9 folds (90% of the data) were used for training and 1 fold (10%) for testing. Models used for cross-validation:

Dummy Classifier (baseline); Logistic Regression; Linear Support Vector Classifier (SVC); Decision Tree Classifier; K Neighbors Classifier; Multinomial Naïve Bayes

Ensemble/Meta-modelling — AdaBoost, Bagging Classifier, Random Forest Classifier, Voting Classifier
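A sketch of the 10-fold cross-validation loop over the base models, assuming X and y are the prepared feature matrix and the ‘job_status’ labels.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

models = {
    "Dummy (baseline)": DummyClassifier(strategy="stratified"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K Neighbors": KNeighborsClassifier(),
    "Multinomial NB": MultinomialNB(),  # needs non-negative features; min-max scaled data is fine
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```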

Experiment 1: Predictive Modeling for Numeric Data — For model construction, we implemented six families of models: a linear classifier, a tree-based model, a distance-based model, a rule-based model, a probabilistic model, and ensembles. The numeric dataset has 18 features, which became the inputs for the models, with ‘job_status’ as the target label. The sections below provide a brief description of each model and its performance on the numeric data.

Sample of Numeric Data

K-fold cross-validation on numeric data: The figure below shows the accuracy of the various machine learning models tested on our numeric data. The Decision Tree Classifier performed best with an accuracy of 73.7% over 10-fold cross-validation, while the Dummy Classifier performed worst with an accuracy of just 49.7%, much like random guessing. Based on this, we decided to apply ensembling on top of the Decision Tree Classifier, which we hoped would improve accuracy.

Accuracy performance of different ML models on Numeric Data

Ensemble: Ensembles are used to increase the performance of individual models, make them less prone to overfitting, and increase the diversity of learning styles. We used 10 decision trees for Random Forest, 20 decision trees for Bagging, and 10 decision trees for AdaBoost (a sketch of these configurations follows the figure below). The figure below shows the accuracy of the ensemble methods we used; the 3 ensemble models did not differ much in accuracy, all falling in the range 73%–75%.

Next, to check the precision, recall, F1-score and support, the dataset was split into 2 parts, with 80% as the training set and 20% as the testing set.

Accuracy performance of different ensemble models on Numeric Data
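A sketch of the three ensembles with the tree counts mentioned above (10 for Random Forest and AdaBoost, 20 for Bagging); by default scikit-learn’s Bagging and AdaBoost also use decision trees as the base learner, and X and y are assumed as before.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

ensembles = {
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=42),
    "Bagging": BaggingClassifier(n_estimators=20, random_state=42),    # decision-tree base learner by default
    "AdaBoost": AdaBoostClassifier(n_estimators=10, random_state=42),  # boosts shallow decision trees by default
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```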

A custom voting classifier was also used, which is essentially a meta-meta-model consisting of AdaBoost, the Bagging classifier and Random Forest. Voting was set to ‘hard’ so that votes are cast on labels rather than probabilities. The classifier was applied to the dataset (8/10 for training and 2/10 for testing) and its accuracy was measured. The figure below shows the accuracy, F1-score, support, precision and recall of the custom voting classifier.

Different statistical metric values of the custom Voting Classifier on Numeric Data
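A sketch of the 80/20 split and the hard-voting meta-model, reusing the three ensembles defined in the previous sketch; the class names passed to the report are assumptions matching the labels described earlier.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

voting = VotingClassifier(
    estimators=[("rf", ensembles["Random Forest"]),
                ("bag", ensembles["Bagging"]),
                ("ada", ensembles["AdaBoost"])],
    voting="hard",  # vote on predicted labels, not probabilities
)
voting.fit(X_train, y_train)

print(classification_report(y_test, voting.predict(X_test),
                            target_names=["current_employee", "former_employee"]))
```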

Across the four ensemble models, for the class label ‘current_employee’, AdaBoost has the highest precision (0.74), Random Forest has the highest recall (0.77), and both Random Forest and the custom voting classifier share the highest F1-score (0.75). For the class label ‘former_employee’, Random Forest has the highest precision (0.76), AdaBoost the highest recall (0.73), and the custom voting classifier the highest F1-score (0.74), with the rest at 0.73. On overall accuracy, the voting classifier performed best at 74%.

AUC (Area Under Curve) for Ensembles: Finally, for the numeric data, we calculated the AUC. The custom voting classifier has the highest AUC at 0.74146. Visually all of the models perform about the same, but the custom voting classifier is the winner here, with Random Forest the runner-up!

AUC of the four ensemble models on Numeric Data
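A sketch of the AUC computation, reusing the split and models from the previous sketches; probability-based AUC applies to the three ensembles, while the hard-voting model only exposes labels.

```python
from sklearn.metrics import roc_auc_score

for name, model in ensembles.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]  # probability of the 'former_employee' class
    print(f"{name}: AUC = {roc_auc_score(y_test, probs):.4f}")

# The hard-voting classifier provides no probabilities, so its AUC here is label-based
print("Voting: AUC =", roc_auc_score(y_test, voting.predict(X_test)))
```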

Experiment 2: Predictive Modeling for Text Data: The final features selected after text processing were the TF-IDF of the feedback, the average word length and the feedback word count, which together formed the input space with ‘job_status’ as the target label. On these features we ran four models: Logistic Regression, Linear SVC (Support Vector Classifier), Multinomial Naïve Bayes and Random Forest Classifier, with 5-fold cross-validation. The figure below shows the mean accuracy of the four models. Logistic Regression performed better than the other models; however, with text features alone we could not achieve high accuracy!

Accuracy performance of different machine learning models on Text Data
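A sketch of building the text-only feature space; the ‘feedback’ column name is an assumption, and the extra features are computed with simple approximations (word count from whitespace splitting, average word length from character counts).

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

feedback = df["feedback"].fillna("")  # assumed column holding the review text

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X_tfidf = tfidf.fit_transform(feedback)

word_count = feedback.str.split().str.len()
avg_word_length = feedback.str.replace(" ", "").str.len() / word_count.replace(0, 1)

X_text = hstack([
    X_tfidf,
    csr_matrix(word_count.to_numpy().reshape(-1, 1)),
    csr_matrix(avg_word_length.to_numpy().reshape(-1, 1)),
])

print(cross_val_score(LogisticRegression(max_iter=1000), X_text, y, cv=5).mean())
```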

Experiment 3: Predictive Modeling on Combined Data and Selected Models: For predictive modeling on the combined dataset (numeric + text features), we selected all the features of the numeric dataset and three features from the text dataset, namely ‘feedback_word_count’, ‘avg_word_length’ and ‘sentiment_score’. In total, the final input space contained 22 features across 40,000 instances with no class imbalance. The target label remained the ‘job_status’ feature.

The feature ‘sentiment_score’ was re-scaled to the range 0–1 since it contains negative values and Multinomial Naïve Bayes cannot handle negative values. Again, 10-fold cross-validation was applied to all the models while tracking the accuracy metric. We found that the Decision Tree Classifier performed better than the other models.
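The rescaling step as a short sketch; the ‘sentiment_score’ column name and the combined 22-feature matrix X_combined are assumptions.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Bring sentiment scores into [0, 1] so MultinomialNB never sees negative inputs
df["sentiment_score"] = MinMaxScaler().fit_transform(df[["sentiment_score"]])

# X_combined: assumed matrix of the 22 combined numeric + text features
print(cross_val_score(MultinomialNB(), X_combined, y, cv=10).mean())
```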

Again we built ensemble models with the Decision Tree Classifier as the base model and found that Random Forest performed best with an accuracy of ~74%. We discarded the other base models.

AUC (Area Under Curve) for ensemble models: We calculated the AUC and found that Random Forest has the highest AUC on our dataset, with a value of 0.7399.

ROC Curves of the four ensemble models on Combined Dataset

F. Statistical Testing

As we can see, there is not much difference between the accuracies of Random Forest and the Voting Classifier. Hence, to choose one final model we need a statistical test to determine whether the difference in accuracies is real or merely due to chance. As we have only one dataset, we need a test that compares multiple algorithms over multiple folds of the same dataset; a paired t-test is a suitable statistical test for this problem. At a significance level of 0.05, is there a statistically significant difference, or is the difference due to chance? Our null hypothesis is “there is no difference in the accuracies of Random Forest and the Voting Classifier”.
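A sketch of the paired t-test on per-fold accuracies; the Random Forest and voting models are the two finalists from the sketches above, evaluated on identical folds so the scores are paired.

```python
from scipy.stats import ttest_rel
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=42)  # identical folds for both models
rf_scores = cross_val_score(ensembles["Random Forest"], X, y, cv=cv, scoring="accuracy")
voting_scores = cross_val_score(voting, X, y, cv=cv, scoring="accuracy")

t_stat, p_value = ttest_rel(rf_scores, voting_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject the null hypothesis if p < 0.05
```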

With significance level α = 0.05, all the p-values we found were smaller than 0.05, so we reject the null hypothesis. That means there is a significant difference between the algorithms’ results, and we choose Random Forest as our final model for future predictions.

G. IMPLICATIONS AND FUTURE WORK

Employee satisfaction is one of the drivers of employee productivity, and it also keeps the attrition rate to a minimum. In this project, we explored the employee review dataset and tried to determine whether a given review was written by a former or a current employee. The classification between current and former employees based on their reviews was achieved with ~73% accuracy.

Further work can be done on top of this project. Data could be collected and integrated from different sources to make the predictions more accurate. It may also be possible to detect whether current employees intend to leave the company, which would help human resources improve employee satisfaction and retention. There are also many hidden insights in the data that can be mined and provided to companies to help them improve their policies for the betterment of employees.

Cheers! Shoot comments if need more clarification on any part :)

