Prediction Model for an Imbalanced Dataset

Amin Rashidi
Decisive Intentions
Nov 4, 2023

Brake testing is a time-consuming and costly operation for factories: the plant needs to shut down the robots to perform the test. It is therefore worth exploring whether the test results can be predicted from historical data. At the time of this writing, we approach this as a binary classification problem.

Data Preprocessing

The incoming data needs to be cleaned and wrangled before it can be fed to a machine learning model. In this section, the necessary steps are explained.

Feature Engineering

Brake test data is in the form of a time series: it records several features for each robot during the brake test over time. There are limited tools and methods suitable for working with time series, so it helps to transform the problem into a standard tabular classification problem, which is easier to solve. To do that, a moving window can be used: for example, average `torque_peak` over the past 5 days (excluding the current day) and assign the result to a new variable for the current day; see the table below for more clarity. Please note that we do this for each axis and each robot separately, so in the example below all other axes and robots are not shown.

The following metrics were calculated for all variables of interest over a 3-day and a 5-day time window: average, standard deviation, and correlation (of the variable with the timestamp): [‘avgMotTemp’, ‘observed_brake_torque’, ‘observed_static_friction’, ‘torque_const_move’, ‘torque_limit_cmd’, ‘torque_peak’] => [‘avgMotTemp_avg_5’, ‘avgMotTemp_stddev_5’, ‘avgMotTemp_corr_5’, ‘observed_brake_torque_avg_5’, …]. We added the 3-day window in addition to the 5-day one because only a few robots have enough history for a full 5-day window.
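To illustrate, here is a minimal sketch of how such rolling features could be computed with pandas. The grouping columns (`robot_id`, `axis`) and the assumption of one row per robot, axis, and test day are mine, not from the original pipeline:

# df is a pandas DataFrame with the raw test records (assumed)
cols = ['avgMotTemp', 'observed_brake_torque', 'observed_static_friction',
        'torque_const_move', 'torque_limit_cmd', 'torque_peak']

df = df.sort_values(['robot_id', 'axis', 'axis_test_date'])
for col in cols:
    # shift(1) excludes the current day; rolling(5) aggregates the 5 previous rows
    past = df.groupby(['robot_id', 'axis'])[col].shift(1)
    roll = past.groupby([df['robot_id'], df['axis']]).rolling(5)
    df[col + '_avg_5'] = roll.mean().reset_index(level=[0, 1], drop=True)
    df[col + '_stddev_5'] = roll.std().reset_index(level=[0, 1], drop=True)
    # the *_corr_5 feature would correlate the same 5-row window with test_timestamp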

Another variable that was added is `test_timestamp`, which is simply `axis_test_date` expressed in seconds.
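A one-line sketch of that conversion with pandas, assuming `axis_test_date` is (or can be parsed as) a datetime column:

import pandas as pd

# Unix timestamp in seconds (nanoseconds // 1e9); the exact reference epoch is an assumption
df['test_timestamp'] = pd.to_datetime(df['axis_test_date']).astype('int64') // 10**9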

Output variables

Dealing with binary output variables is easier and more efficient than working with multiclass ones. Furthermore, labels other than `BT_BRAKE_TORQUE_OK` and `BT_BRAKE_NOT_OK` are very rare, so we decided to fold the other labels into these two.
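A minimal sketch of the folding, assuming the target column is `overall_test_result` and that every rare label is mapped to `BT_BRAKE_NOT_OK` (the actual mapping may differ):

import numpy as np

df['overall_test_result'] = np.where(
    df['overall_test_result'] == 'BT_BRAKE_TORQUE_OK',
    'BT_BRAKE_TORQUE_OK',   # keep the passing label as-is
    'BT_BRAKE_NOT_OK',      # fold everything else into the failing label
)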

Imputation

An n-day time window can produce Null values when fewer than n rows fall inside the window. Furthermore, the correlation can be Null if one of the variables has zero variance. As a result, we need to impute the Null values:

  1. Consider a 3-day time window instead of a 5-day one, as most of the robots have no data for 5 consecutive days
  2. Fill N/A values with the mean over rows with the same robot type and overall_test_result (see the sketch after this list)
  3. Then fill the remaining N/A values with the mean over rows with the same overall_test_result
  4. Then drop the rows that still contain N/A values, as Scikit-Learn can’t handle N/As
  5. Remove all the robots without any recorded failure to make the dataset more balanced.
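A minimal sketch of steps 2–4 with pandas; the column name `robot_type` and the way `feature_cols` is built are assumptions:

# engineered feature columns (assumed naming from the feature engineering step)
feature_cols = [c for c in df.columns if c.rsplit('_', 1)[-1] in ('3', '5')]

# step 2: mean of rows sharing robot type and test result
df[feature_cols] = (df.groupby(['robot_type', 'overall_test_result'])[feature_cols]
                      .transform(lambda s: s.fillna(s.mean())))
# step 3: fall back to the mean per test result only
df[feature_cols] = (df.groupby('overall_test_result')[feature_cols]
                      .transform(lambda s: s.fillna(s.mean())))
# step 4: drop whatever is still missing
df = df.dropna(subset=feature_cols)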

Here is the result:

Analytics

Predicting brake test results is a classification problem, and because only 0.6 percent of the observations correspond to brake failures, we are dealing with a highly imbalanced dataset. Based on my initial investigations, Random Forest looks like a good choice for this problem, for the following reasons:

  • It’s highly parallelizable compared to boosting-based models
  • In scikit-learn, the Random Forest classifier exposes a large set of parameters, including ones for dealing with imbalanced datasets
    Note: As mentioned before, the dataset is highly imbalanced, as less than 0.1% of the observations are labeled ‘BT_BRAKE_NOT_OK’, so it makes sense to try some advanced methods, such as undersampling ‘BT_BRAKE_TORQUE_OK’. Spoiler alert: it turned out that Random Forest can manage an imbalanced dataset even without using any balancing techniques.
  • The ensembling nature of RF reduces the possibility of overfitting

Why would undersampling methods not work?

With only around 2000 failed brake tests, undersampling leaves us with a dataset too small to train a model on. Some of my colleagues did try undersampling and got a great F1-score of 0.92, but there is a major flaw in their approach: they tested the model on undersampled data, which is wrong because it does not represent real-world data. I reproduced their work and got a high F1-score of 0.96, but after testing it on the imbalanced test dataset, the F1-score dropped to 0.2.

The main reason for this phenomenon is that even if the model misclassifies only 0.5 percent of the passed tests as failing, the Precision score falls below 0.5, which results in a low F1-score.
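The back-of-the-envelope arithmetic, using the roughly 0.6% failure rate mentioned earlier; the 80% recall is an assumed, fairly optimistic figure:

# assumed: 0.6% of tests fail, the model catches 80% of them,
# and it mislabels 0.5% of the passed tests as failing
pos_rate, recall, fpr = 0.006, 0.80, 0.005

tp = pos_rate * recall          # true positives, as a share of all tests
fp = (1 - pos_rate) * fpr       # false positives, as a share of all tests
precision = tp / (tp + fp)      # ~0.49
f1 = 2 * precision * recall / (precision + recall)  # ~0.61
print(round(precision, 2), round(f1, 2))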

Even more sophisticated methods, such as the Instance Hardness Threshold, do not do better.

Trying oversampling methods

Oversampling means generating more observations of the minority class. Just duplicating the existing observations will not work, so we need more advanced methods such as the Synthetic Minority Oversampling Technique (SMOTE) [CBHK2002] and the Adaptive Synthetic (ADASYN) [HBGL2008] approach. Both perform perfectly in 5-fold cross-validation but poorly on unseen imbalanced data. Below is the precision-recall curve for a random forest model fitted on data oversampled with ADASYN; applying SMOTE results in similar performance.
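A minimal sketch with the imbalanced-learn package; the split variable names are assumptions, and the key point is that only the training split is oversampled:

from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.ensemble import RandomForestClassifier

# oversample the training data only; the test set stays imbalanced
X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
# X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
# evaluate on the untouched, imbalanced X_test / y_test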

Leveraging Ensemble Learning

The scikit-learn Random Forest supports three `class_weight` settings: ‘balanced’, ‘balanced_subsample’, and None (no weighting). We will see if these options help us build a better prediction model.

I tried a randomized search over a hyperparameter grid to get an idea of the range of hyperparameter values.

import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 600, num = 10)]
# Number of features to consider at every split
max_features = ['sqrt', 'log2', None]
# Maximum number of levels in tree
max_depth = [50, 100]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [10, 20, 40, 80]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 8]
# Method of selecting samples for training each tree
bootstrap = [True, False]
criterion = ['gini']
class_weight = ['balanced', 'balanced_subsample', None]
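These lists can then be collected into a grid and fed to a randomized search; a minimal sketch (the scoring choice, number of iterations, and split variables are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap,
               'criterion': criterion, 'class_weight': class_weight}

# scoring='f1' assumes the target is encoded as 0/1 with 1 = failed test;
# with string labels a custom scorer with pos_label would be needed
search = RandomizedSearchCV(RandomForestClassifier(random_state=42), random_grid,
                            n_iter=100, scoring='f1', cv=5, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)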

This is how the best-performing model trained with all the features looks:

Model with rank: 1
Mean validation score: 0.734 (std: 0.010)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

Model with rank: 2
Mean validation score: 0.734 (std: 0.013)
Parameters: {'random_state': 42, 'n_estimators': 377, 'min_samples_split': 40, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 3
Mean validation score: 0.715 (std: 0.015)
Parameters: {'random_state': 42, 'n_estimators': 211, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

The performance of the model is shown as a learning curve diagram. It helps us check the bias-variance tradeoff and track the model’s progress. The current diagram shows that we cannot do much better with the current model and data.
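Such a learning curve can be produced with scikit-learn; a minimal sketch, assuming `best_estimator` is the top-ranked model from the search and the same scoring caveat as above:

import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    best_estimator, X_train, y_train, cv=5, scoring='f1',
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1)
print(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1))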

Interestingly, “class_weight” is None in the top 3 best-performing models. So the best model is achieved when we ignore the imbalance problem altogether.

Feature Importance

Before talking about feature selection, it is interesting to look at the correlation matrix (for the 3-day time window) to see whether the variables are correlated.

Using the scikit-learn “SelectFromModel” function, we can find the most important features, i.e. those whose importance score is above the median of all scores. Please bear in mind that, because of the random nature of ensemble methods, the order of the list is not deterministic. I also ran this for the combination of the 3-day and 5-day time windows, but as expected, the first 5-day feature (‘observed_static_friction_stddev_5’) only appears in 16th place. Here is the correlation matrix of the top features for the 3-day time window.

Note: “torque_limit_cmd_stddev_3” was added later, based on a model with a 0.3 decision threshold. See the threshold tuning section.
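A minimal sketch of the median-threshold selection, assuming `X_train` is a DataFrame holding the engineered features:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# keep the features whose importance is above the median importance
selector = SelectFromModel(RandomForestClassifier(random_state=42), threshold='median')
selector.fit(X_train, y_train)
important_features = X_train.columns[selector.get_support()]
print(list(important_features))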

Fit the model to the data with selected features

I used the same hyperparameter grid as before to narrow the exhaustive search space. Here are the top results:

Model with rank: 1
Mean validation score: 0.744 (std: 0.010)
Parameters: {'random_state': 42, 'n_estimators': 377, 'min_samples_split': 40, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 2
Mean validation score: 0.743 (std: 0.013)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

Model with rank: 3
Mean validation score: 0.719 (std: 0.014)
Parameters: {'random_state': 42, 'n_estimators': 211, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

And here is the learning curve for the top-performing model. We improved the result slightly:

Threshold tuning

Considering that “class_weight” remained None in the top-performing models, it is worth checking whether changing the decision threshold improves the results. Here is the precision-recall curve:

Normally, in binary classification, scikit-learn uses 0.5 as the threshold to decide which class to predict. The curve above shows that the 0.5 threshold results in low recall; moving the threshold to 0.3 helps us achieve the highest F1-score possible.
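Applying a custom threshold just means thresholding the predicted probabilities instead of calling predict; a minimal sketch, assuming `clf` is the fitted forest and the labels are the two strings above:

import numpy as np

# column of the failure class in predict_proba
fail_idx = list(clf.classes_).index('BT_BRAKE_NOT_OK')
proba = clf.predict_proba(X_test)[:, fail_idx]
# predict a failure whenever its probability exceeds 0.3 instead of the default 0.5
y_pred = np.where(proba >= 0.3, 'BT_BRAKE_NOT_OK', 'BT_BRAKE_TORQUE_OK')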

By using the original hyperparameter grid and applying the 0.3 threshold, the F1-score increased to 0.751:

Model with rank: 1
Mean validation score: 0.751 (std: 0.015)
Parameters: {'random_state': 42, 'n_estimators': 400, 'min_samples_split': 40, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 2
Mean validation score: 0.750 (std: 0.014)
Parameters: {'random_state': 42, 'n_estimators': 366, 'min_samples_split': 40, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 3
Mean validation score: 0.750 (std: 0.016)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Exhaustive Hyperparameter search

After identifying the important features, the general ranges of the hyperparameter values, and the decision threshold, it’s time to run an exhaustive grid search:


param_grid = {'n_estimators': [395, 400, 410],
              'criterion': ['gini'],
              'max_features': ['sqrt'],
              'max_depth': [50, None],
              'class_weight': [None],
              'min_samples_split': [37, 38, 39],
              'min_samples_leaf': [2, 3],
              'bootstrap': [True],
              'random_state': [42],
              'n_jobs': [10]}
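A minimal sketch of feeding this grid into an exhaustive search; note that reproducing the scores below would also require a custom scorer that applies the 0.3 decision threshold, which is omitted here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(RandomForestClassifier(), param_grid,
                           scoring='f1', cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)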

Results:

Model with rank: 1
Mean validation score: 0.754 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 400, 'n_jobs': 10, 'random_state': 42}

Model with rank: 1
Mean validation score: 0.754 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 400, 'n_jobs': 10, 'random_state': 42}

Model with rank: 3
Mean validation score: 0.753 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 410, 'n_jobs': 10, 'random_state': 42}

Conclusion

Despite all our efforts, we could not reach an F1-score above 0.76. I believe that is because of the following reasons:

  • The data does not cover robots and dates consistently. That limits how much the model can learn, as it does not have enough data for a given robot to find the patterns.
  • Too many brake tests with inconsistent results are run on each robot per day, which makes building a classification model harder.
  • The low rate of brake test failures results in a highly imbalanced dataset that is hard to approach with common methods.

What should I do next?

The possible solutions to the mentioned problems could be:

  • Trying to clean the data by applying more domain knowledge, e.g. removing or merging multiple test results per day per robot into one observation
  • Acquiring more consistent brake tests from manufacturers
  • Extracting more variables and information from the brake test logs. Currently we use [‘avgMotTemp’, ‘observed_brake_torque’, ‘observed_static_friction’, ‘torque_const_move’, ‘torque_limit_cmd’, ‘torque_peak’], but there are many more available.
  • Trying new ways to deal with the time-series nature of the problem other than folding it with aggregate functions, e.g. recurrent neural networks

Originally published at http://aminrashidi.com.
