Predicting a Failure in the APS of a Scania Truck using Machine Learning — Part 1

Published in

Nerd For Tech

12 min read · May 6, 2021

Photo by Ricardo Gomez Angel on Unsplash

A truck has many subsystems. One of them is the Air Pressure System (APS), which is responsible for providing the necessary air pressure inside a truck. Modern heavy-haul trucks use air brakes instead of the hydraulic brakes used in lighter vehicles. These brakes need a constant supply of pressurized air to stay disengaged so that the vehicle can keep moving. If this supply is compromised for any reason, the brakes will no longer disengage and the truck will come to a stop. When this happens, the owners have to send a repair vehicle to diagnose and repair the malfunction. However, the APS runs throughout the truck and the trailer and consists of a large number of pipes that supply this air, which makes it very hard to tell whether a given problem is related to the APS or not. This costs fleet owners a lot of time and money that could be saved.

Thus, a machine learning system that can predict whether a fault in the truck is due to the APS would be very helpful to the people concerned, as it can reduce downtime and the overall money spent on breakdowns.

The end-to-end solution to this problem will be published in two parts. In the first part, we will discuss the basics of the problem, such as the problem statement, the constraints and the performance metrics, and perform the EDA and data preprocessing of the dataset. In the second part, we will train actual machine learning models on the processed data and then select the best model, which will be deployed on an AWS EC2 instance as a web app made using Streamlit. Read part 2 here.

Problem Formulation for ML

This problem can be treated as a simple binary classification problem where the positive class means that the problem in the truck is due to a fault in the APS, while the negative class means otherwise. Thus, given a new datapoint referring to a particular breakdown, we need to classify whether the breakdown is a result of a malfunction in the APS.

Business Constraints

Latency: The time taken to make predictions after getting the data must be fairly low to avoid any unnecessary increase in the maintenance time and cost.
Cost of misclassification: The cost of misclassification is very high, especially for a positive-class datapoint wrongly classified as negative, as a missed APS fault can lead to a complete breakdown of the truck and incur serious costs. Any misclassification also wastes the repairmen's time, since they will trust the model's classification and search for problems in the wrong places.

Dataset Description

The dataset was provided as part of the IDA-2016 Industrial Challenge. It contains readings taken from Scania trucks in their daily use, collected and provided by Scania themselves. The names of all the features are anonymized for proprietary reasons.

The dataset for this case study can be found here:

https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The dataset is divided into two parts, a train set and a test set. The train set contains 60,000 rows while the test set contains 16,000 rows. There are 171 columns in the dataset, one of them being the class label, which leaves 170 features for each data point. The features are a combination of plain numerical features and features constructed from histogram bins, so all of them are numerical in nature. Just by looking at the train set, one can tell that it is highly imbalanced, with 59,000 data points belonging to the negative class and only 1,000 to the positive class. Another thing to note is that the data points contain a lot of missing values. Some features even have more than half of their values missing, and a lot of data points in both the train and test sets contain at least one missing value.

Performance Metrics

The following performance metrics are used:

Misclassification Cost: This metric was provided as part of the competition. It is given by:

Misclassification Cost = (cost_1 x FP) + (cost_2 x FN)
where cost_1 = 10, cost_2 = 500, FP is the number of false positives and FN is the number of false negatives.

This metric represents the cost that the fleet owners will incur as a result of misclassifications made by the model. Here, cost_1 is the cost of an unnecessary check done by a mechanic at the workshop, while cost_2 is the cost of missing a faulty truck, which may cause a breakdown in the future.
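As a minimal sketch, the cost formula above can be written as a small helper. The function name and the `"pos"`/`"neg"` label strings are assumptions made for illustration; only the two cost values come from the competition definition.

```python
def misclassification_cost(y_true, y_pred, cost_fp=10, cost_fn=500, positive="pos"):
    """Total cost = cost_fp * (false positives) + cost_fn * (false negatives)."""
    # a false positive is a healthy APS flagged as faulty (unnecessary check)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    # a false negative is a faulty APS that the model missed
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return cost_fp * fp + cost_fn * fn


# toy example with one false positive and one false negative
y_true = ["neg", "neg", "pos", "pos", "neg"]
y_pred = ["neg", "pos", "pos", "neg", "neg"]
cost = misclassification_cost(y_true, y_pred)
print(cost)  # 10 * 1 + 500 * 1 = 510
```

Because a false negative is 50 times as expensive as a false positive, a model tuned for this metric will usually accept many extra false alarms to avoid a single missed fault.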

Confusion, Precision and Recall matrices: These matrices tell us about the performance of the models on each class, which is very important given that the dataset is so heavily skewed towards the negative class.
F-1 Score: The F-1 score summarizes the precision and recall of a model in a single number, which makes it slightly easier to interpret than the matrices. We will use the macro-averaged F-1 score because it averages the F-1 scores of both classes with equal weight and hence reflects performance on the rare positive class as much as on the negative class.
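To see why the macro average suits an imbalanced dataset, here is a small pure-Python sketch. The helper names and the toy labels are assumptions for illustration, not the code used in this case study. It scores a model that always predicts the negative class.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F-1 score when `label` is treated as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels=("neg", "pos")):
    """Unweighted mean of the per-class F-1 scores."""
    return sum(per_class_scores(y_true, y_pred, lab)[2] for lab in labels) / len(labels)


# a "model" that always predicts the majority class
y_true = ["neg"] * 98 + ["pos"] * 2
y_pred = ["neg"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, round(score, 3))  # accuracy is 0.98, macro F-1 is about 0.495
```

Accuracy looks excellent here, while the macro-averaged F-1 score drops to roughly 0.5 because the F-1 score of the positive class is 0.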

Getting Started

First, let's start with importing the libraries and loading the data.

importing libraries and loading the data
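The original snippet is not reproduced here, so the following is only a sketch of how the data could be loaded with pandas. It assumes that the CSV files mark missing values with the literal string `na` and that the downloaded files start with descriptive text lines that must be skipped (the default of 20 is an assumption; check your copy of the file). The function name `load_aps` is hypothetical.

```python
import io

import pandas as pd


def load_aps(path_or_buffer, skiprows=20):
    """Read an APS CSV file, treating the string 'na' as a missing value."""
    return pd.read_csv(path_or_buffer, skiprows=skiprows, na_values="na")


# tiny in-memory stand-in for the real file, so the sketch runs without a download
sample = io.StringIO("class,aa_000,ab_000\nneg,76698,na\npos,33058,0\n")
demo = load_aps(sample, skiprows=0)
print(demo.shape)         # (2, 3)
print(demo.isna().sum())  # ab_000 has one missing value
```

With the real files, the call would look like `train = load_aps("path/to/train_file.csv")`, where the path is a placeholder.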

Exploratory Data Analysis

First, we are going to look at the basic properties of the complete dataset, such as the class distribution, before moving on to the actual EDA on the features themselves.

First, let’s see the class distribution of the data:

Class-Distribution of data.

As we can see, out of the 60,000 datapoints in the dataset, 59,000 (~98.3%) belong to the negative class while only 1,000 (~1.7%) belong to the positive class. The dataset is therefore very highly imbalanced, and we will need to use techniques that counter this imbalance; otherwise we will get very bad performance on the positive class.

Now, let's see how many features are numerical and how many of them are histogram-based features:

Feature distribution in the dataset.

We can see that there are 7 histograms, each with 10 bins. This gives 70 histogram features out of the 170 features, while the remaining 100 features are plain numerical features.
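One way to arrive at such a split, sketched under the assumption that all bins of a histogram share a column-name prefix (for example `ag_000` to `ag_009`), is to group the column names by prefix. The function name and the toy column list are hypothetical.

```python
from collections import defaultdict


def split_feature_types(columns, bins_per_histogram=10):
    """Split column names into (histogram_features, numerical_features) by shared prefix."""
    groups = defaultdict(list)
    for col in columns:
        prefix = col.rpartition("_")[0]  # 'ag_003' -> 'ag'
        groups[prefix].append(col)

    histogram, numerical = [], []
    for cols in groups.values():
        # a prefix that appears once per bin is treated as a histogram
        if len(cols) == bins_per_histogram:
            histogram.extend(cols)
        else:
            numerical.extend(cols)
    return histogram, numerical


# toy column list: two plain features and one 10-bin histogram
columns = ["aa_000", "ab_000"] + [f"ag_{i:03d}" for i in range(10)]
hist_cols, num_cols = split_feature_types(columns)
print(len(hist_cols), num_cols)  # 10 ['aa_000', 'ab_000']
```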

Now, let’s analyze the missing values in the dataset:

Just by looking at the data, we realized that there are quite a lot of missing values in the dataset. Let us now quantify this and see exactly how many values are missing, both for each data point and for each feature.

2. Missing values analysis for each feature:

First, let us see how many features have no missing values and their names.

Features with no missing values.

As we can see, there is only one feature with no missing values, i.e. aa_000.

Now, let us do some further analysis by getting the minimum and the maximum number of missing values in any feature and the distribution of the number of missing values in a feature.

Among the features that do have missing values, the minimum count is 167 for feature bt_000 and the maximum is 49264 for feature br_000, which is about 82% of the 60,000 train datapoints. Because of such large values, the mean gets shifted to about 5000 missing values per feature, which is roughly 7 times the median value of 688. This might indicate that there are some features with an absurdly large number of missing values.
Further analysis confirms this assumption: only 28 features have more missing values than the average, and most of them have at least 9550 missing values out of the 60000 train datapoints (about 16%).
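The per-feature statistics quoted above can be computed along these lines. This is a sketch on a toy frame; the function name is hypothetical, and the toy columns only borrow the feature names mentioned in the text.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Summarize the number of missing values per feature."""
    counts = df.isna().sum()  # missing values in each column
    return {
        "complete_features": list(counts[counts == 0].index),
        "min": int(counts[counts > 0].min()),  # smallest non-zero count
        "max": int(counts.max()),
        "mean": float(counts.mean()),
        "median": float(counts.median()),
        "above_mean": list(counts[counts > counts.mean()].index),
    }


toy = pd.DataFrame({
    "aa_000": [1.0, 2.0, 3.0, 4.0],
    "bt_000": [1.0, np.nan, 3.0, 4.0],
    "br_000": [np.nan, np.nan, np.nan, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

A mean that sits well above the median, as in the real data, is a quick hint that a few features carry most of the missing values.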

Now, let us see the actual number of missing values for each feature by plotting a graph.

The number of missing values in each feature.

Most of the features have less than 5% missing values, some have between 5% and 25%, and a few have between 25% and 50%. However, some features have more than 70% missing values. We will base our imputation strategy on these percentages.

3. EDA on the Features

Now, let us do the EDA on the actual features. Since there are two types of features, i.e., numerical and histogram features, we are going to perform EDA separately on each type. However, with 100 numerical features and 70 histogram features, EDA on all of them is not practical. Therefore, we are going to select the top 15 features of each type and perform EDA on them. We will use the feature importances given by RandomForestClassifier to pick the important features.

Function for feature selection:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_important_features(data, labels, top_x=15, verbose=10):
    '''
        This function uses random forests to get feature importances of the features and returns top_x important features
        and their feature importances.
        Returns:
            tuple of (features, importances).
    '''
    # training a random forest
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, verbose=verbose, random_state=42)
    rf.fit(data, labels)

    # get the feature importances
    feat_imp = rf.feature_importances_
    imp_ind = np.argsort(feat_imp)[::-1]  # indices in decreasing order of importance
    top_ft = data.columns[imp_ind][:top_x]  # use top_x instead of a hard-coded 15
    top_imp = feat_imp[imp_ind][:top_x]

    return top_ft, top_imp

Function to perform univariate analysis on the given features:

import matplotlib.pyplot as plt
import seaborn as sns

def univariate_analysis(features):
    """
        This function takes a list of features and performs univariate analysis on them by plotting CDF, PDF, Boxplots and
        printing mean, standard deviation and median for that feature.
        It relies on the notebook-level dataframes train (which holds the 'class' column) and train_eda.
    """
    for ft in features:
        print('---UNIVARIATE ANALYSIS OF', ft, '---')
        values0 = train_eda[train['class']=='neg'][ft]
        values1 = train_eda[train['class']=='pos'][ft]
        # describe() gives the mean, standard deviation and median (50%) for each class
        desc_neg = values0.describe()
        desc_pos = values1.describe()
        print("FOR NEGATIVE CLASS:- 1. Mean:", round(desc_neg['mean'], 3), '2. Standard Deviation:', round(desc_neg['std'], 3), '3. Median:', round(desc_neg['50%'], 3))
        print("FOR POSITIVE CLASS:- 1. Mean:", round(desc_pos['mean'], 3), '2. Standard Deviation:', round(desc_pos['std'], 3), '3. Median:', round(desc_pos['50%'], 3))

        # plots: PDF, CDF and boxplot side by side
        fig, ax = plt.subplots(ncols=3, figsize=(18,6))
        # fill=True is the current name of the deprecated shade=True (seaborn 0.11+)
        sns.kdeplot(values0, ax=ax[0], fill=True, label='Neg')
        sns.kdeplot(values1, ax=ax[0], fill=True, label='Pos')
        sns.kdeplot(values0, ax=ax[1], cumulative=True, label='Neg')
        sns.kdeplot(values1, ax=ax[1], cumulative=True, label='Pos')
        # group the boxplot by the 'class' column (the original referenced an undefined Y)
        sns.boxplot(data=train, x='class', y=ft, ax=ax[2])
        ax[0].set_title('PDF of '+ft)
        ax[1].set_title('CDF of '+ft)
        ax[2].set_title('Boxplot of '+ft)
        ax[0].legend()
        ax[1].legend()
        plt.show()
        print('-'*100)

EDA on Numerical Features:

First, let’s select the top 15 numerical features.

Now that we have got the top 15 features, let's do a univariate analysis on them.

There is a general trend among most of the 15 features. The positive class values are much more spread out than the negative class values. The negative class has a comparatively smaller spread, as shown by its dense PDF and very steep CDF for most of the features.
The means, standard deviations and medians all follow this trend: for these features, all three values are around 10, 100 or sometimes even 1000 or more times larger for the positive class than for the negative class.
Features ci_000, aq_000, bj_000, ck_000, aa_000, dn_000, cq_000, ap_000, by_000, bx_000, bt_000, bb_000 have separated IQRs for the positive class and negative class as shown in the boxplot. The CDFs and PDFs are also much more spread out for the positive class as compared to the negative class.
For features ai_000, al_000 and am_0, more than 95% of the datapoints have a value equal to or very close to 0 for the negative class.

Now, let us see the correlation of the selected features with the class label.

Correlation analysis of important features wrt class labels.

Features ci_000, aq_000, bj_000, ck_000, aa_000, dn_000, cq_000, ap_000, by_000, bx_000, bt_000, bb_000 have a higher correlation, around +0.5, which is consistent with the positive class datapoints having larger values than the negative class datapoints.
The features am_0, al_000 and ai_000, which had most of their values around 0 in the univariate analysis, have a smaller correlation of less than 0.4. Feature ai_000 has the lowest correlation among all the features, around 0.11–0.13.
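The post does not say which correlation coefficient was used. As a sketch, assuming a Pearson correlation against a 0/1 encoding of the class label, the computation could look like this. The function name and the toy data are hypothetical.

```python
import pandas as pd


def correlation_with_label(features, labels, positive="pos"):
    """Pearson correlation of every feature column with the 0/1 encoded class label."""
    y = (labels == positive).astype(int)  # 'pos' -> 1, anything else -> 0
    return features.corrwith(y).sort_values(ascending=False)


toy_X = pd.DataFrame({
    "f_up": [0, 1, 2, 10, 12],   # larger for the positive class
    "f_flat": [5, 1, 4, 2, 3],   # no clear relation to the class
})
toy_y = pd.Series(["neg", "neg", "neg", "pos", "pos"])

corr = correlation_with_label(toy_X, toy_y)
print(corr)
```

A feature whose values are mostly 0 for both classes, like ai_000 in the real data, will tend to score low here even if its few nonzero values are informative, which is one reason to follow up with bivariate plots.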

Let's get some more insights for the feature ai_000 by performing bivariate analysis on it with the other top features.

Bivariate analysis of feature ai_000

For most of the datapoints, the feature ai_000 has a value equal to 0. There are some datapoints that have a nonzero value for both classes.
This feature can be used with other important features for prediction. The other features are trying to make a separation between the two classes on their own axis as it can be seen that for most of the plots the positive class (orange) points have higher values in the x-axis as compared to negative class points.

EDA on histogram features

First, let's select the top 15 features for the histogram features.

Feature selection from histogram features

Now that we have got the top 15 features, let’s do a univariate analysis on them.

Univariate analysis on histogram features

The same trend is followed in the case of histogram features as well. The datapoints of the positive class, in general, have larger means, medians and standard deviations and a much higher spread compared to the negative class datapoints.
Features ag_002, ag_001, cn_000, cn_001, az_000 and ay_005 have more than 95% of their values comparatively very small for the negative class.
Furthermore, features ag_002, ag_003, ag_001, cn_000, cn_001 and ay_005 have a median value of 0, showing that at least 50% of their values are equal to 0 for the negative class.
From the importances plot, we can see that most of the selected features have an importance of 0.03 or less, while only two features, ag_002 and ee_005, have importances of more than 0.06 and 0.05 respectively.

Now, let us see the correlation of the selected features with the class label.

Correlation analysis of the histogram features wrt class labels.

Features ee_005, ag_003, ba_000, cn_001, ee_000, ba_003, cs_004 and ba_004 have a correlation of more than or equal to 0.4.
ee_005 has the highest correlation coefficient value of around 0.49.
ag_001 and az_000 have the lowest correlation coefficient value of around 0.18 and 0.2 respectively.

Since ag_001 and az_000 have the least correlation, let's do a bivariate analysis of both these features using scatterplots and see if using two features improves anything. First, let us do the bivariate analysis of ag_001.

Most of the values for ag_001 are relatively small for all the datapoints.
The larger values of this feature are all for the positive class.

Now, it’s time for doing the same for the feature az_000.

Bivariate Analysis for feature az_000

The feature az_000 has a lot of intersection for both the classes.
For this feature too, most of the values are small.

Multivariate Analysis of all the features.

For multivariate analysis, we are going to perform t-SNE and plot the results.

t-SNE plot for all the features

From the t-SNE plot, we can see that the negative class points are scattered everywhere, although they form well-defined clusters.
The positive class datapoints are also in clusters, but these clusters overlap quite a lot with the negative class clusters, as we can see in the plot.
In 2D there is a lot of mixing, but given that the points form clearly visible clusters, there is a good chance that the clusters of the two classes can be separated in higher dimensions.
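For reference, a 2D t-SNE embedding can be produced with scikit-learn roughly as follows. This is a sketch on synthetic data. Standardizing the features first, the perplexity value and the helper name are all assumptions, not settings taken from this case study.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def tsne_embed(X, perplexity=30, random_state=42):
    """Project the rows of X to 2D with t-SNE after standardizing each feature."""
    X_scaled = StandardScaler().fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_scaled)


# synthetic stand-in: 40 "negative" points and 20 shifted "positive" points
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(6, 1, (20, 5))])
toy_labels = np.array([0] * 40 + [1] * 20)

embedding = tsne_embed(toy, perplexity=10)  # perplexity must be below the sample count
print(embedding.shape)  # (60, 2)
```

The two columns of `embedding` can then be drawn as a scatterplot colored by `toy_labels`. t-SNE does not accept missing values, so on the real data it has to run on an imputed copy.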

Missing data imputation

For imputing missing data, we are going to use the following strategy:

For features with less than 5% missing values, we will use mean imputation.
For features with missing values between 5 and 15%, we will use median imputation because of its robustness to outliers.
For features with missing values between 15 and 70%, we will use model-based imputation. For this, we are going to use two different techniques:

MICE: Multiple Imputation by Chained Equations is a robust, informative method of dealing with missing data in datasets. The procedure imputes missing data in a dataset through an iterative series of predictive models. To learn more about MICE, refer to this.
KNN Based Imputation

For features with >70% missing values, we will drop them entirely from the dataset.
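The rules above can be sketched as a helper that buckets features by their fraction of missing values and returns a nested list of four feature-name lists. The order of the lists, the handling of the exact boundary values and all function names are assumptions made for illustration. The fill values are learned on the train set only, so the same values can be reused on the test set.

```python
import numpy as np
import pandas as pd


def build_strategy_list(df):
    """Return [mean_features, median_features, model_features, drop_features]."""
    frac = df.isna().mean()  # fraction of missing values per column
    mean_ft = list(frac[frac < 0.05].index)
    median_ft = list(frac[(frac >= 0.05) & (frac < 0.15)].index)
    model_ft = list(frac[(frac >= 0.15) & (frac <= 0.70)].index)
    drop_ft = list(frac[frac > 0.70].index)
    return [mean_ft, median_ft, model_ft, drop_ft]


def fit_fill_values(train, mean_ft, median_ft):
    """Learn per-column fill values (mean or median) from the train set only."""
    return pd.concat([train[mean_ft].mean(), train[median_ft].median()])


# toy frame with 0%, 10%, 40% and 80% missing values
n = 20
toy = pd.DataFrame({name: np.arange(n, dtype=float) for name in ["few", "some", "many", "most"]})
toy.loc[:1, "some"] = np.nan   # .loc slices are inclusive: rows 0-1
toy.loc[:7, "many"] = np.nan   # rows 0-7
toy.loc[:15, "most"] = np.nan  # rows 0-15

strategy_list = build_strategy_list(toy)
mean_ft, median_ft, model_ft, drop_ft = strategy_list
fill_values = fit_fill_values(toy, mean_ft, median_ft)

# drop the sparsest features, then fill the mean/median columns;
# the model-based columns are left untouched for a later step
imputed = toy.drop(columns=drop_ft).fillna(fill_values)
print(strategy_list)
```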

Let us perform missing value imputation now. First, we are going to drop the features with >70% missing values, perform mean imputation and median imputation.

Note: strategy_list is just a nested list containing 4 different lists containing feature names based on the strategy defined above.

Now, let us perform MICE imputation and KNN-based imputation on the features having 15–70% of missing values, impute all the data and then save the thus formed dataframe and the objects.
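A sketch of the two model-based imputers with scikit-learn is shown below, on synthetic data. `IterativeImputer` is scikit-learn's MICE-inspired imputer and `KNNImputer` is its nearest-neighbour imputer. Whether these exact classes and parameters were used in this case study is an assumption. Both imputers are fitted on the train set and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, needed to expose IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer


def make_toy(n, rng):
    """Synthetic data where the last column depends on the first and has ~30% missing values."""
    X = rng.normal(size=(n, 4))
    X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=n)
    X[rng.random(n) < 0.3, 3] = np.nan
    return X


rng = np.random.default_rng(0)
train_X, test_X = make_toy(200, rng), make_toy(50, rng)

mice = IterativeImputer(max_iter=10, random_state=42)  # chained regression models
knn = KNNImputer(n_neighbors=5)                        # average of the 5 nearest rows

# fit on train only, then apply the fitted imputers to the test set
train_mice = mice.fit_transform(train_X)
train_knn = knn.fit_transform(train_X)
test_mice = mice.transform(test_X)
test_knn = knn.transform(test_X)
print(np.isnan(train_mice).sum(), np.isnan(test_knn).sum())  # 0 0
```

The fitted `mice` and `knn` objects are the kind of objects that would be saved to disk (for example with `pickle` or `joblib`) so that new data can be imputed later without refitting.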

We then load those objects from the disk, perform imputation on the test dataset and save them.

Data Normalization

Now that we have all the data imputed successfully and all the features constructed, it's time to perform feature scaling. For feature scaling, we will use normalization, and we are going to normalize both the train and test datasets.

Normalizing the datasets
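The post does not spell out which normalization was used. As a sketch, assuming min-max scaling to the [0, 1] range with the minimum and maximum learned on the train set only, the step could look like this. The helper names and the toy frames are hypothetical.

```python
import pandas as pd


def fit_min_max(train):
    """Learn the per-column minimum and maximum from the train set."""
    return train.min(), train.max()


def apply_min_max(df, col_min, col_max):
    """Scale columns using the train statistics; constant columns map to 0."""
    span = (col_max - col_min).replace(0, 1)  # avoid division by zero
    return (df - col_min) / span


train_toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 2.0, 2.0]})
test_toy = pd.DataFrame({"a": [2.5, 20.0], "b": [2.0, 3.0]})

col_min, col_max = fit_min_max(train_toy)
train_scaled = apply_min_max(train_toy, col_min, col_max)
test_scaled = apply_min_max(test_toy, col_min, col_max)
print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Reusing the train statistics on the test set keeps information from the test data out of the preprocessing. scikit-learn's `MinMaxScaler` implements the same idea with `fit` and `transform`.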

This concludes part 1 of this two-part blog. In the second part, we are going to discuss the modelling and the deployment of the best model on an AWS EC2 instance, and finally conclude this solution. Read part 2 here.