APS Failure at Scania Trucks Data Set: Predicting if a Truck needs Servicing

Subhan De · Published in Analytics Vidhya · 5 min read · Sep 13, 2019

In this article we will build a classifier to predict whether a Scania truck needs servicing, using the APS Failure at Scania Trucks Data Set.

Outline:

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that is utilized in various functions of a truck, such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures in components not related to the APS. The data is a subset of all available data, selected by experts.

Our goal is to minimize the cost associated with:

1) Unnecessary checks done by a mechanic ($10 each)

2) Missing a faulty truck, which may cause a breakdown ($500 each)

Objective :- Our main objective is to correctly predict whether a truck needs servicing while minimizing the total cost of service.
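To make the objective concrete, here is a minimal sketch of the cost metric we will minimize, assuming a 0/1 label encoding where 1 means an APS failure (the function name aps_cost is our own, not from the repo):

```python
import numpy as np

# Challenge cost metric: $10 per unnecessary check (false positive),
# $500 per missed faulty truck (false negative).
def aps_cost(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # unnecessary checks
    fn = np.sum((y_pred == 0) & (y_true == 1))  # missed failures
    return 10 * fp + 500 * fn

# Example: 3 unnecessary checks and 1 missed failure -> 3*10 + 1*500 = $530.
print(aps_cost([0, 0, 0, 1, 1], [1, 1, 1, 0, 1]))  # 530
```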

Dataset Details :

The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

  • Number of Attributes : 171
  • Attribute Information: The attribute names have been anonymized for proprietary reasons. The data consists of both single numerical counters and histograms made up of bins with different conditions. Typically the histograms have open-ended conditions at each end. For example, if we are measuring the ambient temperature ‘T’, the histogram could be defined with 4 bins where:
  • bin 1 collects values for temperature T < -20
  • bin 2 collects values for temperature T >= -20 and T < 0
  • bin 3 collects values for temperature T >= 0 and T < 20
  • bin 4 collects values for temperature T >= 20
  • The attributes are as follows: class, then anonymized operational data. The operational data have an identifier and a bin id, like ‘Identifier_Bin’. In total there are 171 attributes, of which 7 are histogram variables. Missing values are denoted by ‘na’ (handled in the loading sketch after this list).
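Since missing values are the literal string ‘na’, it helps to tell pandas about that when loading. A minimal loading sketch (the file name follows the Kaggle version of the dataset; the exact name may differ):

```python
import pandas as pd

# Treat the literal string 'na' as a missing value while reading.
train = pd.read_csv("aps_failure_training_set.csv", na_values="na")

# First column is the class label: 'pos' = APS component failure.
y = (train["class"] == "pos").astype(int)
X = train.drop(columns=["class"])
```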

GitHub Repo : https://github.com/subhande/APS-Failure-at-Scania-Trucks-Data-Set

Brief Overview Of Project :

Here we will build a classifier that minimizes the servicing cost. In the first step we will impute missing values using mean, median, and mode imputation. In the second step we will train Logistic Regression, Linear SVM, Random Forest, and XGBoost (a GBDT implementation) to minimize the cost. In the third step we will oversample the data and train Random Forest and XGBoost on it to reduce the servicing cost further.

This is a binary classification problem.

Preparing Data For Model :

From the EDA, we know that this is a highly imbalanced dataset with lots of missing values; almost every column has some. We will impute the missing values, and various techniques are available to do so.

Here we will use the three most common imputation techniques: mean, median, and most frequent (mode). We will be using sklearn's SimpleImputer to fill in the missing values.
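A minimal sketch of the three imputers, assuming X was loaded as above and X_test was loaded from the test CSV the same way:

```python
from sklearn.impute import SimpleImputer

# Fit each imputer on the training features only, then reuse the learned
# statistics on the test features to avoid leakage.
imputed = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputed[strategy] = (imputer.fit_transform(X), imputer.transform(X_test))
```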

Here we will use standardized data for Logistic Regression and Linear SVM, and non-standardized data for Random Forest and XGBoost (tree-based methods depend only on the relative order of feature values, not their scale).
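For the linear models, standardization can be layered on top of the imputed features; a sketch, continuing from the dictionary built above:

```python
from sklearn.preprocessing import StandardScaler

# Scale only for Logistic Regression / Linear SVM; tree ensembles split on
# feature thresholds, so only the relative order of values matters to them.
X_med, X_med_test = imputed["median"]
scaler = StandardScaler()
X_med_std = scaler.fit_transform(X_med)
X_med_test_std = scaler.transform(X_med_test)
```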

ML Models :

We will train Logistic Regression, Linear SVM, Random Forest, and XGBoost on the mean-, median-, and mode-imputed data. We will then oversample each of the three imputed datasets and train Random Forest and XGBoost on the oversampled data, i.e. 4×3 + 2×3 = 18 models in total. We will be using the three pipelines below to train the models. We have divided the training data into two parts, train and CV; the CV part is used for probability calibration.
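A sketch of that split, assuming an 80/20 ratio (the repo's exact ratio may differ); stratifying keeps the rare positive class represented in both parts:

```python
from sklearn.model_selection import train_test_split

# Hold out a CV part for threshold calibration; stratify on the labels so
# the ~1.7% positive class appears in both parts.
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X_med, y, test_size=0.2, stratify=y, random_state=42
)
```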

Here we have used Logistic Regression, Linear SVM, Random Forest, and XGBoost. We observed that the tree-based methods, Random Forest and XGBoost, work better, which matches the common experience that they perform well on tabular data.

XGBoost on Median Imputed Data:-

Here we are using the median-imputed data since we have lots of missing values. We do hyperparameter tuning using RandomizedSearchCV with 10-fold cross-validation.
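A sketch of the tuning step; the search space and scoring metric here are illustrative assumptions, not the repo's exact settings:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative hyperparameter distributions.
param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions=param_dist,
    n_iter=10,
    cv=10,              # 10-fold cross-validation, as described above
    scoring="roc_auc",  # assumed metric; the repo may score differently
    random_state=42,
)
search.fit(X_tr, y_tr)
model = search.best_estimator_
```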

Train Cost : 25000 || Test Cost : 52680

Now we will calibrate the probability threshold using the validation data to reduce the cost further. We use precision_recall_curve to get candidate thresholds, evaluate the cost at each one, and select the threshold that gives the minimum cost.
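A sketch of that threshold search, reusing aps_cost from earlier (y_test is assumed loaded from the test CSV the same way as the training labels):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Candidate thresholds come from the CV part's predicted probabilities.
probs_cv = model.predict_proba(X_cv)[:, 1]
_, _, thresholds = precision_recall_curve(y_cv, probs_cv)

# Keep the threshold with the lowest service cost on the CV part.
costs = [aps_cost(y_cv, (probs_cv >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]

# Apply the calibrated threshold to the test set.
probs_test = model.predict_proba(X_med_test)[:, 1]
print(aps_cost(y_test, (probs_test >= best_t).astype(int)))
```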

Train Cost : 15570 || Test Cost : 11140

After probability calibration we have reduced the cost significantly.

XGBoost on Oversampled Median Imputed Data:-

Here we are again using the median-imputed data, this time oversampled with SMOTE. We do hyperparameter tuning using RandomizedSearchCV with 10-fold cross-validation.
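A sketch of the oversampling step with imbalanced-learn's SMOTE, applied only to the train part so that CV and test costs are still measured on the true class distribution:

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class examples on the train part only.
X_tr_bal, y_tr_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
```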

Train Cost : 19420 || Test Cost : 27860

As before, we calibrate the probability threshold on the validation data, selecting the threshold that gives the minimum cost.

Train Cost : 13740 || Test Cost : 8660

After probability calibration we have reduced the cost significantly.

Performance Comparison Of All The Models

Imbalanced Data :-

Balanced Data (Oversampled) :-

  • The XGBoost model trained on median-imputed data oversampled with SMOTE performed best.
  • Median-imputed data performed better than the other imputation strategies.
  • XGBoost and Random Forest performed better than the other models.
  • The lowest cost on test data is now $8,660, which is a huge improvement.
  • Refer to Github Repo for detailed code.

Acknowledgements :-

  1. https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
  2. https://ida2016.blogs.dsv.su.se/?page_id=1387
  3. https://www.appliedaicourse.com/
  4. https://wallpapercave.com/scania-trucks-wallpapers
