APS Failure at Scania Trucks

Published in

The Startup

28 min read · Jun 29, 2019

Introduction

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that is utilized in various functions in a truck, such as braking and gear changes. The positive class consists of failures of a specific component of the APS. The negative class consists of trucks with failures of components not related to the APS. The data is a subset of all available data, selected by experts.

The dataset was released by Scania CV AB on the UCI Machine Learning Repository. The challenge was to predict the failure of Scania Air Pressure System (APS) in trucks to enable preventive maintenance and thereby reduce the maintenance costs. The dataset is anonymized and contains binned values due to proprietary reasons.

Our goal is to minimize the costs associated with:

Unnecessary checks done by a mechanic. ($10)
Missing a faulty truck, which may cause a breakdown in the future. ($500)

The main objective, then, is to predict failures from the sensor readings so that the total cost of these two kinds of errors is minimized.

Exploring the Dataset

The training set contains 60000 examples in total, of which 59000 belong to the negative class and 1000 to the positive class. The test set contains 16000 examples.

Number of Attributes: 171
Attribute Information: The attribute names have been anonymized for proprietary reasons. The data consists of both single numerical counters and histograms made up of bins with different conditions. Typically the histograms have open-ended conditions at each end. For example, if we are measuring the ambient temperature ‘T’, the histogram could be defined with 4 bins where:
bin 1 collects values for temperature T < -20
bin 2 collects values for temperature T >= -20 and T < 0
bin 3 collects values for temperature T >= 0 and T < 20
bin 4 collects values for temperature T >= 20
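The open-ended binning described above can be sketched in a few lines of Python. The helper name `temperature_bin` and the sample readings are illustrative assumptions, not part of the dataset; the last bin is taken to include T = 20 so that every reading falls into exactly one bin.

```python
from collections import Counter

def temperature_bin(t):
    """Return the 1-based bin index for temperature t (hypothetical helper)."""
    if t < -20:
        return 1          # open-ended lower bin
    if t < 0:
        return 2
    if t < 20:
        return 3
    return 4              # open-ended upper bin, assumed to include T == 20

readings = [-35.0, -20.0, -5.5, 0.0, 12.3, 20.0, 41.0]  # made-up sample values
histogram = Counter(temperature_bin(t) for t in readings)
counts = [histogram.get(b, 0) for b in (1, 2, 3, 4)]
print(counts)
```

Each histogram feature in the dataset would then correspond to one such vector of bin counts for a truck.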

Let us look at the dataset by loading and displaying the training data:

Let's Perform Exploratory Data Analysis on the Dataset

From the EDA, it is clear that the dataset is highly imbalanced. Next, let us check whether the train dataset has any null values. The train dataset has a lot of missing values, encoded as the string ‘na’. Let us replace every ‘na’ with numpy ‘NaN’ and then draw bar plots of the number of missing values in each column. We observe that most columns have missing values; if we removed every column with missing values, we would be left with too little information to train a reliable model. Hence we remove only the columns where more than 70% of the values are missing, i.e. columns with more than 42k missing values in the train dataset. This reduces the number of features from 171 to 160 columns.
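As a minimal sketch of this step, the snippet below parses ‘na’ as NaN while reading and drops the columns whose missing fraction exceeds 70%. The inline four-row table and its column names are made-up placeholders standing in for the real training file, and passing `na_values='na'` to `read_csv` is one possible way to do the replacement, not necessarily the one used here.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the training file; the real data has 171 columns.
raw = io.StringIO(
    "class,aa_000,ab_000,ac_000\n"
    "neg,10,na,5\n"
    "neg,na,na,7\n"
    "pos,30,na,na\n"
    "neg,40,2,9\n"
)

# Treat the literal string 'na' as a missing value (NaN) while parsing.
df = pd.read_csv(raw, na_values='na')
features = df.drop(columns='class')

# Fraction of missing values per column.
missing_fraction = features.isna().mean()

# Keep only columns whose missing fraction does not exceed the 70% threshold.
threshold = 0.70
kept = features.loc[:, missing_fraction <= threshold]

print(missing_fraction.to_dict())
print(list(kept.columns))
```

The same list of kept columns can then be reused to select columns from the test data.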

Even after removing these columns, we still have columns where more than 30% of the values are missing, and these need to be imputed. Various imputation techniques are available; the most common are mean, median and most frequent. We will use all three to impute the missing values in the train data. They can be readily implemented using sklearn's impute module.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation using median (verbose is omitted: newer scikit-learn versions removed it)
impute_median = SimpleImputer(missing_values=np.nan, strategy='median', copy=True)
train_imputed_median = pd.DataFrame(impute_median.fit_transform(train), columns=train.columns)
train_imputed_median.to_csv("Train_imputed_median")

# Imputation using mean
impute_mean = SimpleImputer(missing_values=np.nan, strategy='mean', copy=True)
train_imputed_mean = pd.DataFrame(impute_mean.fit_transform(train), columns=train.columns)
train_imputed_mean.to_csv("Train_imputed_mean")

# Imputation using most frequent
impute_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent', copy=True)
train_imputed_most_frequent = pd.DataFrame(impute_most_frequent.fit_transform(train), columns=train.columns)
train_imputed_most_frequent.to_csv("Train_imputed_most_frequent")

sklearn provides the SimpleImputer class, which can impute the missing values using the three techniques mentioned above. In SimpleImputer, just setting the strategy to 'mean', 'median' or 'most_frequent' selects the corresponding imputation technique.

Having performed the Exploratory Data Analysis on the train dataset, a similar approach is performed on the test dataset.

It can be noted that the test dataset too is highly imbalanced, with the negative class dominating the positive class. The test dataset also has a large number of missing values, and we remove from it the same columns that we removed from the train dataset. The missing values that remain are filled using SimpleImputer.
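The text does not say whether the imputer was refitted on the test data. One common option, shown here purely as an assumption, is to learn the fill values from the train data and reuse them on the test data. The two small frames are made-up stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up miniature train/test frames standing in for the real data.
train_df = pd.DataFrame({'f1': [1.0, np.nan, 3.0, 100.0],
                         'f2': [np.nan, 2.0, 2.0, 4.0]})
test_df = pd.DataFrame({'f1': [np.nan, 5.0],
                        'f2': [1.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Learn the per-column medians from the training data only ...
train_filled = pd.DataFrame(imputer.fit_transform(train_df), columns=train_df.columns)
# ... and reuse those same medians to fill the gaps in the test data.
test_filled = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)

print(imputer.statistics_)
print(test_filled)
```

Reusing the train statistics keeps the test set from influencing any value the model is trained or evaluated with.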

Machine Learning Models

Since the dataset is highly imbalanced, the negative class is undersampled and the positive class is upsampled, so that we have a balanced dataset. To upsample the positive class, we used the Synthetic Minority Oversampling Technique (SMOTE), a widely used upsampling technique that can be implemented using the imblearn library. It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.
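To make the interpolation idea concrete, here is a small NumPy sketch of SMOTE-style sampling. It is not the imblearn implementation: the function name `smote_like_samples` is hypothetical, the data is made up, and it draws a single random gap per synthetic point rather than one per attribute.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))       # pick a minority point
        neigh = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                         # random position in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neigh] - X_minority[base])
    return synthetic

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # made-up minority points
new_points = smote_like_samples(X_min, n_new=5)
print(new_points.shape)
```

Every synthetic point lies on the line segment between a minority instance and one of its neighbours, which is why the new samples are plausible variations rather than exact copies.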

The following ML models are tried with median, mean and most frequent imputation to find the minimum cost.

Logistic Regression
XgBoost Classifier
Random Forest Classifier

Note: Before feeding the data to any of the machine learning models mentioned above, it is necessary to perform column standardization. We used the StandardScaler class from sklearn to standardize the columns of our train and test data.
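A minimal sketch of column standardization with StandardScaler follows. The arrays are made-up, and fitting the scaler on the train data before applying it to the test data is an assumption about the workflow; the names `train_std` and `test_std` mirror those used in the code further down.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up values
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(X_train)  # learn mean/std per column, then scale
test_std = scaler.transform(X_test)        # scale test data with the train statistics

print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std)
```

After scaling, each train column has zero mean and unit standard deviation, which matters most for Logistic Regression; tree-based models are largely insensitive to it.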

The error metric used is the F1 score.
Total cost = FN * 500 + FP * 10
where FN (false negatives) is also called Type 2 error and FP (false positives) is called Type 1 error. The challenge is to find a model that minimizes the total cost. The cost can be reduced by reducing FN, so we can tune the ML model for better recall, but this could increase the number of FP. To reduce the FP we would then need to tune our models for better precision, but that in turn would increase FN.
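The cost formula is simple enough to check by hand. The sketch below wraps it in a hypothetical helper `total_cost` and applies it to two (FN, FP) pairs reported later in this article for Logistic Regression, before and after threshold tuning.

```python
COST_FP = 10    # unnecessary check by a mechanic
COST_FN = 500   # missed faulty truck

def total_cost(fn, fp):
    """Total cost = FN * 500 + FP * 10."""
    return fn * COST_FN + fp * COST_FP

before = total_cost(fn=34, fp=356)   # default threshold
after = total_cost(fn=24, fp=657)    # tuned threshold
break_even = COST_FN // COST_FP      # number of FP that cost as much as one FN

print(before, after, before - after, break_even)
```

Trading 10 fewer false negatives for 301 more false positives still lowers the cost, because one missed failure is as expensive as 50 unnecessary checks.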

F1 is the harmonic mean of precision and recall, so it summarizes both in a single number. A good F1 score means that you have low false positives and low false negatives.
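For reference, precision, recall and F1 can be computed directly from the confusion-matrix counts. The counts below are made-up and only illustrate the arithmetic; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=30, fn=10)  # illustrative counts
print(round(p, 3), round(r, 3), round(f1, 3))
```

Note that F1 treats a false positive and a false negative symmetrically, whereas the cost function weights them 1 to 50, so the model with the best F1 is not guaranteed to be the cheapest one.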

1. Machine Learning Models using Median Impute

a) Logistic Regression:

Let us first try a Logistic Regression model. Below is the code that is used to predict the cost of maintenance. Hyperparameter tuning is done using GridSearchCV. In Logistic Regression, C (the inverse of the regularization strength) is the hyperparameter; we tried different values for C together with both the L1 and L2 regularizers.

# Defining the LR model and performing the hyperparameter tuning using grid search
# weights = np.linspace(0.05, 0.95, 20)
params = {'C': [10**-4, 10**-3, 10**-2, 10**-1, 1, 10**1, 10**2, 10**3],
          'penalty': ['l1', 'l2']  # ,'class_weight': [{0: x, 1: 1.0-x} for x in weights]
         }
# Note: the 'l1' penalty needs a solver that supports it (e.g. solver='liblinear');
# the default solver in recent scikit-learn versions supports only 'l2'.
clf = LogisticRegression(n_jobs=-1, random_state=42)
clf.fit(train_std, y_train)
model = GridSearchCV(estimator=clf, cv=2, n_jobs=-1, param_grid=params, scoring='f1', verbose=2)
model.fit(train_std, y_train)
print("Best estimator is", model.best_params_)

After doing hyperparameter tuning, the best parameters that we obtained were C = 1000 and L2 regularizer.
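The grid search above selects parameters by F1. A possible variation, not what this article does, is to score the search directly on the cost function. The sketch below shows how that could look with `make_scorer`; the helper `aps_cost`, the synthetic data and the small grid are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

def aps_cost(y_true, y_pred):
    """Total cost = FP * 10 + FN * 500 (hypothetical helper)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 10 * fp + 500 * fn

# greater_is_better=False makes GridSearchCV minimise the cost (scores are negated).
cost_scorer = make_scorer(aps_cost, greater_is_better=False)

# Synthetic stand-in data; not the Scania dataset.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 1, 100]},
    scoring=cost_scorer,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Because the scorer is negated internally, `-search.best_score_` is the mean validation cost of the selected parameters.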

# model fitting using the best parameter.
clf = LogisticRegression(n_jobs= -1,random_state=42,C= 1000,penalty= 'l2')
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

After fitting the Logistic Regression model with the best parameter, we observed that the FN were 34 and FP were 356 and hence the total cost we obtained was 20560.

If we can reduce the number of FN, we can drastically reduce the cost, since a single FN costs as much as 50 FP (total cost = FN * 500 + FP * 10). One way of doing this is to adjust the probability threshold of the fitted model. The threshold that gives the least cost is chosen by validation: the code below refits the model on 10 random stratified 70/30 splits of the training data and records precision, recall and cost at every candidate threshold. In this process, we end up finding the best trade-off between FN and FP.

# Cross Validation to find the best threshold.
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})
    
# Plot of recall, precision v/s threshold and cost v/s threshold
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))  # one-element tuple; ('costs') alone is just a string
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision — Recall Trade-off for different values of Probability Threshold for Logistic Regression Model

The plot above shows the trade-off between precision and recall for different values of the probability threshold, obtained using cross validation. At a threshold of 0.2, precision is above roughly 95% and recall is around 98%. We want our recall to be close to 100% while keeping precision high, so that FP and FN together give the lowest cost.
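Instead of reading the threshold off the plot, the lowest-cost threshold can also be picked programmatically. The following is a minimal sketch under that assumption: `best_threshold` is a hypothetical helper, and the labels and probabilities are made-up validation values.

```python
import numpy as np

def best_threshold(y_true, proba, candidates):
    """Return (threshold, cost) with the lowest FP*10 + FN*500 on the given data."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best = None
    for t in candidates:
        y_pred = proba > t                          # same rule as in the code above
        fp = int(np.sum(y_pred & (y_true == 0)))    # predicted positive, actually negative
        fn = int(np.sum(~y_pred & (y_true == 1)))   # predicted negative, actually positive
        cost = fp * 10 + fn * 500
        if best is None or cost < best[1]:
            best = (float(t), cost)
    return best

y_val = [0, 0, 0, 0, 1, 1]                      # made-up validation labels
p_val = [0.05, 0.15, 0.30, 0.60, 0.22, 0.90]    # made-up predicted probabilities
threshold, cost = best_threshold(y_val, p_val, candidates=[0.1, 0.2, 0.3, 0.5])
print(threshold, cost)
```

In this toy data, lowering the threshold to 0.2 catches the positive at 0.22 at the price of two false positives, which is far cheaper than missing it.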

Once the optimal threshold was decided, we used predict_proba to get the predicted probability of the positive class, labeled every example above the threshold as positive, and read FN and FP from the confusion matrix. The cost reduced from 20560 to 18570, with FN as 24 and FP as 657.

# model fitting and finding the cost with best threshold.
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.2
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

b) XgBoost:

Next we will apply the XgBoost classifier. Below is the code that is used to predict the cost of maintenance. Hyperparameter tuning is done using GridSearchCV. For XgBoost, the maximum depth and the number of estimators are the hyperparameters we tune.

# model fitting and hyper parameter tuning to find the best parameter.
x_cfl=XGBClassifier()
prams={
    #'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10],
    #'colsample_bytree':[0.1,0.3,0.5,1],
   # 'subsample':[0.1,0.3,0.5,1]
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

The best parameters obtained from hyperparameter tuning were max_depth = 10 and n_estimators = 2000.

#model fitting with the best parameter.
clf = XGBClassifier(max_depth= 10,  n_estimators= 2000,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

The total cost for the XgBoost model trained with the best parameters was 18790, with 34 FN and 179 FP. This is lower than the 20560 obtained by Logistic Regression at its default threshold. We can still reduce the cost of the XgBoost model by adjusting the probability threshold, thus setting a trade-off between precision and recall as we did for logistic regression.

The code below finds the optimal threshold, so that the cost is kept to a minimum by balancing FN and FP. As in the case of Logistic Regression, we used 10 random stratified 70/30 splits of the training data for finding the optimal threshold.

#Cv to find the best threshold to minimize the total cost.
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision — Recall Trade-off for different values of Probability Threshold for XgBoost

At a threshold of 0.25, precision is above roughly 95% and recall is around 98%. Once the optimal threshold was decided, we used predict_proba with that threshold to obtain the predictions, and from the confusion matrix we found the FN and FP. The cost reduced from 18790 to 14940, with FN as 25 and FP as 244.

#model fitting and predicting using the best threshold.
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.25
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

c) Random Forest

Having tried Logistic Regression and XgBoost, we next tried Random Forest. GridSearchCV was used for hyperparameter tuning. The best parameters found for Random Forest were a maximum depth of 10 and 1000 estimators.

# model fitting and hyperparameter tuning
x_cfl=RandomForestClassifier()
prams={
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10] }
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

Next we fitted the Random Forest with the best parameters.

# model fitting using the best parameter and predicting the cost
clf = RandomForestClassifier(n_estimators= 1000 , max_depth=10,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

Total cost for Random forest with best parameters was 11,210 with FN as 14 and FP as 421. We can still further reduce the cost by adjusting the probability threshold of the model. By doing so we are basically adjusting the trade-off between precision and recall as we did in the case of Logistic Regression and XgBoost.

In the code below, we search for the optimal value of the threshold using 10 random stratified 70/30 splits of the training data.

# CV to find the best threshold.
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision — Recall Trade-off for different values of Probability Threshold for Random Forest Model

At a threshold of 0.25, precision is above roughly 95% and recall is around 98%. We want our recall to be close to 100% while keeping precision high. We then made predictions using this optimal threshold and found that the FN and FP were 4 and 792 respectively. Hence the total cost reduced from 11,210 to 9,920.

clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.25
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

So far, with median imputation, we got the least cost of 9,920 with Random Forest trained with the best parameters and using an optimal threshold of 0.25.
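As a quick consistency check, the costs for this section can be recomputed from the reported FN and FP counts after threshold tuning. The dictionary below simply restates those figures; the XgBoost FN count is taken as 25, the value consistent with its reported cost of 14940.

```python
# (FN, FP) pairs reported above for this section after threshold tuning
results = {
    'Logistic Regression (threshold 0.2)': (24, 657),
    'XgBoost (threshold 0.25)': (25, 244),   # FN = 25 matches the reported cost of 14940
    'Random Forest (threshold 0.25)': (4, 792),
}

costs = {name: fn * 500 + fp * 10 for name, (fn, fp) in results.items()}
best_model = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cost}")
print("Lowest cost:", best_model)
```

Random Forest accepts the most false positives of the three but misses the fewest failures, which is what the 1 to 50 cost ratio rewards.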

In the next sections, we will follow the same procedure for finding the least cost using mean imputation, followed by most frequent imputation.

2. Machine Learning Models using Mean Impute

a) Logistic Regression

For Logistic Regression, hyperparameter tuning is done using GridSearchCV by varying the values of C and trying both the L1 and L2 regularizers.

# model fitting and hyper parameter tuning using gridsearch
params = {'C' : [
                10**-4,10**-3,10**-2,10**-1,1,10**1,10**2,10**3],
          'penalty': ['l1', 'l2']
         }
clf = LogisticRegression(n_jobs= -1,random_state=42)
clf.fit(train_std,y_train)
model = GridSearchCV(estimator=clf,cv = 2,n_jobs= -1,param_grid=params,scoring='f1',verbose= 2,)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

The best parameters obtained for logistic regression with mean imputation were C = 1000 and the L2 regularizer. With these parameters, the Logistic Regression model gave a cost of 23680, with FP as 368 and FN as 40. Below is the code that we used to train the model using the best parameters.

#model fitting using the best parameter.
clf = LogisticRegression(n_jobs= -1,random_state=42,C= 1000,penalty= 'l2')
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

Having obtained a cost of 23680 using logistic regression with the best parameters, we can further reduce the cost by adjusting the probability threshold of the model, as we did in the cases above.

In the code below, we used 10 random stratified 70/30 splits of the training data to find the optimal threshold.

# CV for determining the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

From the above plot we observe that at a threshold of 0.2, precision is above roughly 95% and recall is around 98%. We want our recall to be close to 100% while keeping precision high. Hence we take 0.2 as the optimal threshold.

We next obtained the predicted probabilities using the predict_proba method of the model and applied 0.2 as the threshold. We found that the number of FP was 730 and FN was 25. Thus the cost was reduced from 23,680 to 19,800.

clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.20
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

b) XgBoost

For XgBoost we chose to try different values for max depth and number of estimators. GridSearchCV was used to find the best values for both.

x_cfl=XGBClassifier(n_jobs = -1)
prams={
    'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10],
    
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

After performing hyperparameter tuning, we found the best max depth to be 5 and the best number of estimators to be 500 for the XgBoost classifier.

clf = XGBClassifier(max_depth= 5,n_estimators= 500,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

For XgBoost classifier we obtained a cost of 19260 with FN as 35 and FP as 176. By adjusting the probability threshold we can still further reduce the cost as we did for other models mentioned above.

# CV to determine the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision — Recall Trade-off for different values of Probability Threshold for XgBoost Classifier

From the plot we observe that at a threshold of 0.2, precision is above roughly 95% and recall is around 98%. Therefore, for the XgBoost classifier, 0.2 would be the optimal threshold.

After thresholding the probabilities returned by predict_proba, we found that the cost of the XgBoost classifier reduced to 15,570, with FN as 26 and FP as 257.

XgBoost with mean imputation gave a lower cost than the logistic regression classifier with mean imputation.

# model fitting and prediction using the best threshold
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.20
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

c) Random Forest

Next, we tried Random Forest with mean imputation. We chose max depth and number of estimators as the hyperparameters to tune, using grid search.

#model fitting and hyper parameter tuning to find the best parameters
x_cfl=RandomForestClassifier()
prams={
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10]
    
    
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

We got max depth as 10 and number of estimators as 2000 for random forest as the best parameters.

#model fitting using best parameter
clf = RandomForestClassifier(n_estimators= 2000 , max_depth=10,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

The Random Forest classifier with the best parameters gave us a cost of 11680, with 15 FN and 418 FP. As we did for the other models, we next adjusted the probability threshold using 10 random stratified 70/30 splits of the training data.

#CV to determine the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision — Recall Trade-off for different values of Probability Threshold for Random Forest Classifier

At a threshold of 0.25, precision is above roughly 95% and recall is around 98%. Hence the optimal threshold for random forest with mean imputation was taken to be 0.25.

Next we made predictions using this optimal threshold and found the cost to be 10670, with FN as 6 and FP as 767.

clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.25
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

For mean imputation, we see that Random Forest outperforms XgBoost and Logistic Regression in terms of achieving the least cost.

3. Machine Learning Models using Most Frequent Impute

a) Logistic Regression

For most frequent imputation, we first start with logistic regression classifier. Hyperparameter tuning is done using Gridsearch by taking different values of C and choosing between L1 and L2 regularizer.

# model fitting and hyper parameter tuning using gridsearch
params = {'C' : [
                10**-4,10**-3,10**-2,10**-1,1,10**1,10**2,10**3],
          'penalty': ['l1', 'l2']
         }
clf = LogisticRegression(n_jobs= -1,random_state=42)
clf.fit(train_std,y_train)
model = GridSearchCV(estimator=clf,cv = 2,n_jobs= -1,param_grid=params,scoring='f1',verbose= 2,)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

C = 1000 and the L2 regularizer were found to be the best parameters for logistic regression. After training the logistic regression classifier with the best parameters, we found the cost to be 21010, with FN as 35 and FP as 351.

# model fitting using best parameter
clf = LogisticRegression(n_jobs= -1,random_state=42,C= 1000,penalty= 'l2')
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

As with the other models, we adjusted the probability threshold of the classifier, using 10 random stratified 70/30 splits of the training data to find the optimal threshold.

#CV for determining the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')
    
plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
plt.legend(('costs'))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

Plot of Precision-Recall Trade-off for different values of Probability Threshold for Logistic Regression Classifier

At a threshold of 0.2, we observe that precision is about 95% and recall is around 98%. Therefore we choose 0.2 as our optimal threshold.

Next we predicted the probabilities for each class using the predict_proba method of the logistic regression classifier and, applying the 0.2 threshold, found the cost to be 17150 with FN as 21 and FP as 665.

# model fitting and predicting using the best threshold
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.20
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)
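
The threshold above was read off the precision and recall curves. Since the objective is the total cost, another option is to pick the candidate threshold with the lowest cost on held-out data. The sketch below illustrates the idea in plain Python; the function names and the toy labels and probabilities are made up for illustration and are not part of the original notebook.

```python
def cost_of_predictions(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Cost of a set of predictions: 10 per false positive, 500 per false negative."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * fp_cost + fn * fn_cost


def cheapest_threshold(y_true, scores, candidates):
    """Return (threshold, cost) for the candidate with the lowest total cost."""
    best = None
    for threshold in candidates:
        y_pred = [1 if s > threshold else 0 for s in scores]
        cost = cost_of_predictions(y_true, y_pred)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy validation labels and predicted probabilities (hypothetical values).
y_val = [0, 0, 0, 0, 1, 1, 0, 1]
p_val = [0.05, 0.10, 0.30, 0.45, 0.25, 0.60, 0.15, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
print(cheapest_threshold(y_val, p_val, candidates))
```

In the notebook setting, y_val and p_val would correspond to y_ts and pred from one of the validation splits, and the chosen threshold could be averaged over the splits.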

b) XgBoost

For XgBoost, hyperparameter tuning was done using grid search to find the best values for max depth and number of estimators.

# model fitting and hyperparameter tuning using gridsearch
x_cfl=XGBClassifier()

prams={
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10]
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

We found the best parameters for the XgBoost classifier to be a max depth of 10 and 1000 estimators. Using these values, XgBoost gave a cost of 18860 with FN as 33 and FP as 236.

# model fitting using the best parameter
clf = XGBClassifier(max_depth= 10, n_estimators= 1000,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

We still can reduce the cost by adjusting the probability threshold as we did for other models.

#CV to determine the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')

plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
# one-element tuple; a bare string would be split into single-letter labels
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

At threshold 0.25, we observe that precision is about 95% and recall is around 98%. Hence for XgBoost with most frequent imputation we take 0.25 as the optimal threshold.

Next we predicted the probability of each class using predict_proba and applied the optimal threshold, which gave a cost of 15,310 with FN as 24 and FP as 331.

# model fitting and predicting using the best threshold
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.25
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

c) Random Forest

For Random Forest, we tuned the model using grid search to find the best values for max depth and number of estimators from a set of candidate values.

# model fitting and hyperparameter tuning using gridsearch
x_cfl=RandomForestClassifier()

prams={
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10]
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

For Random Forest, we found the best parameters to be max depth = 10 and number of estimators = 1000. Random forest with best parameters gave a cost of 10330 with FP as 433 and FN as 12.

# model fitting using the best parameter
clf = RandomForestClassifier(n_estimators= 1000 , max_depth=10,n_jobs= -1)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

As we did for the other models, we will try to further reduce the cost for random forest by adjusting the probability threshold of the classifier.

#CV for determining the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')

plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
# one-element tuple; a bare string would be split into single-letter labels
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

At threshold 0.2, we observe that precision is about 95% and recall is around 98%. Next we predicted the probabilities for each class, applied the optimal threshold of 0.2, and got a cost of 10310 with FN as 3 and FP as 881.

#model fitting and predicting using the best threshold
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.20
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

We notice that for most frequent imputation as well, random forest outperforms the other two classifiers in terms of achieving the minimum cost.

4. Feature Engineering

Let us try some feature engineering to reduce the total cost. Across the machine learning models above and the three imputation techniques, Random Forest with median imputation provides the least cost. Therefore we will perform feature engineering on the train data with median imputation, and we will use the Random Forest model, since it gave the best performance in terms of the least cost.

We tried two methods of feature engineering. In the first method we used median imputation to fill the missing data and created new features called missing indicators, which are True where a value is missing in the dataset and False otherwise. In the second method, we used Principal Component Analysis to reduce the dimensions of the dataset, keeping the components with the maximum explained variance.

sklearn provides the class MissingIndicator, which we used to create the indicators for missing values in the train and test data as shown in the code below.

# Feature engineering using missing value indicator 
#Train data
missing_impute = MissingIndicator()
miss = missing_impute.fit_transform(train)
train_miss_indi = pd.DataFrame(miss)
#Test data
miss = missing_impute.transform(test)
test_miss_indi = pd.DataFrame(miss)
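
To make the first method concrete, here is a minimal plain-Python sketch of median imputation combined with missing indicators. It is an illustration only: the helper name and the toy rows are hypothetical, None stands in for a missing value, and every column is assumed to have at least one observed value. Like MissingIndicator with its default settings, it creates indicator columns only for columns that actually contain missing values.

```python
from statistics import median


def impute_with_indicator(rows):
    """Median-impute None entries column by column and build indicator columns.

    Returns (imputed_rows, indicator_rows). An indicator is True where the
    original value was missing. Only columns with at least one missing value
    get an indicator column.
    """
    n_cols = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    missing_cols = [j for j in range(n_cols) if any(r[j] is None for r in rows)]
    imputed = [[medians[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]
    indicators = [[r[j] is None for j in missing_cols] for r in rows]
    return imputed, indicators


# Toy data with two missing readings (hypothetical values).
toy_rows = [[1.0, None, 3.0],
            [2.0, 5.0, None],
            [4.0, 7.0, 9.0]]
imputed, indicators = impute_with_indicator(toy_rows)
print(imputed)
print(indicators)
```

The indicator columns would then be concatenated with the imputed features before resampling and standardization.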

After creating the new features, we upsampled the minority class using SMOTE technique and downsampled the majority class. Then we standardized the train and the test data.
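
The core idea of SMOTE is that a synthetic minority example is placed on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python; the function name and the toy points are hypothetical, and it is not the implementation used by any particular library.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the chosen sample (squared Euclidean distance)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    u = rng.random()  # interpolation weight in [0, 1)
    return [a + u * (b - a) for a, b in zip(base, neighbour)]


# Four minority-class points at the corners of the unit square (toy data).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority_points, rng=random.Random(42))
print(synthetic)
```

Because the new point is an interpolation, it always lies between two existing minority samples rather than duplicating one of them.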

After standardizing the data, we performed hyperparameter tuning of the random forest classifier using grid search and found the best parameters to be max depth = 10 and number of estimators = 100.

The code below fits the random forest with these best parameters.

# model fitting with best parameters
clf = RandomForestClassifier(n_estimators= 100 , max_depth=10)
clf.fit(train_std,y_train)
y_pred = clf.predict(test_std)
con_mat =confusion_matrix (y_test, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

We trained the random forest with max depth as 10 and number of estimators as 100 and found the cost to be 12260 with FN as 16 and FP as 426. As we did for the other models, we will fine-tune the random forest classifier by adjusting its probability threshold, using 10 rounds of cross validation (repeated stratified 70/30 splits) to get the optimal probability threshold, as in the code below.

# CV to determine the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')

plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
# one-element tuple; a bare string would be split into single-letter labels
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

At threshold 0.2, we observe that precision is about 95% and recall is around 98%. Once the optimal threshold was decided, we predicted the probabilities of each class using predict_proba and applied the threshold.

clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.20
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

The total cost we got now was 10420.

In the second method of feature engineering, we perform dimensionality reduction using PCA.

Again we will use the train and test data with median imputation, since it performed well compared to mean and most frequent imputation.

Before performing dimensionality reduction, we computed the Spearman correlation between the features and plotted it as a heatmap. The corr method of a pandas dataframe returns the correlation between the features in the dataframe; passing method='spearman' gives the Spearman correlation, and the seaborn heatmap function plots it. Below is the code to get the Spearman correlation between the features and to plot the heatmap.

corelation_matrix = train.corr(method='spearman')
sns.heatmap(data = corelation_matrix,xticklabels=train.columns,yticklabels=train.columns)

Heatmap showing the Spearman correlation between the features.

From the above plot we observe that many features are correlated with each other. If the features were uncorrelated, the diagonal elements of the heatmap would be equal to 1 and all the other elements would be close to zero. What we observe instead is that most of the elements in the heatmap are close to 1, hence we can conclude that most of the features are correlated with each other.
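
For readers unfamiliar with it, the Spearman correlation is the Pearson correlation computed on the ranks of the values, so it measures how monotonic a relationship is rather than how linear. The plain-Python sketch below illustrates this for two lists; the function names are hypothetical, ties receive their average rank, and neither input is assumed to be constant (a constant input would lead to a division by zero).

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A nonlinear but strictly increasing relationship still scores 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

This is why the rank-based measure suits counter-like features whose relationships are monotonic but not necessarily linear.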

In many machine learning algorithms, we prefer features that are not strongly correlated with each other, since correlated features carry redundant information. One way to remove correlation between features is to perform dimensionality reduction. PCA is one of the most widely used techniques to perform dimensionality reduction.

In the code below we have performed PCA.

#Performing standardization and PCA for dimensionality reduction
start = datetime.now()
std = StandardScaler()
train_std = std.fit_transform(train)
train_pca = PCA(n_components= 160,random_state=42)
train_pca.fit_transform(train_std)
print("Time required to run this cell", datetime.now() - start)

From the plot of explained variance, we observe that the first 90 components give an explained variance of 97%. Therefore we keep the top 90 components for the train data and the same for the test data.
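
The number of components needed for a target explained variance can also be computed rather than read from a plot. The sketch below is a minimal illustration: the helper name and the toy ratios are hypothetical, and in the notebook the ratios would come from the explained_variance_ratio_ attribute of the fitted PCA object.

```python
def components_for_variance(ratios, target=0.97):
    """Smallest number of leading components whose ratios sum to at least target."""
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return None  # target not reached with the available components


# Made-up ratios for illustration; in practice they would come from
# train_pca.explained_variance_ratio_ after fitting.
toy_ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125]
print(components_for_variance(toy_ratios, target=0.9))
```

With the toy ratios, four components are needed to reach 90% of the variance.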

After selecting the top 90 features, we performed median imputation on them. After median imputation we upsampled the minority class using the SMOTE technique and downsampled the majority class. Next we standardized the train and test data.

After standardization, we performed hyper parameter tuning to get the best parameters for random forest.

# model fitting and hyperparameter tuning using gridsearch
x_cfl=RandomForestClassifier()

prams={
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10]
}
model=GridSearchCV(x_cfl,param_grid=prams,verbose=10,n_jobs=-1,scoring='f1',cv=5)
model.fit(train_std,y_train)
print("Best estimator is", model.best_params_)

We got max depth as 10 and number of estimators as 500. Next we trained the random forest with these parameters and obtained a cost of 11,100 with FN as 11 and FP as 560.

As with the other models, we will further reduce the cost by adjusting the probability threshold, using 10 rounds of cross validation (repeated stratified 70/30 splits) with the code below.

#CV to determine the best threshold
trail = 10
plot = []
for x in range(0,trail):
    train, test, y_tr, y_ts = train_test_split(train_std, y_train, stratify = y_train, train_size = 0.7)
    clf.fit(train,y_tr)
    pred = clf.predict_proba(test)[:,1]
    precision, recall, thresholds = precision_recall_curve(y_ts, pred)
    thresholds = np.append(thresholds,1)
      
    costs = []
    for threshold in thresholds:
        y_pred_thres = pred > threshold
        c = confusion_matrix(y_ts,y_pred_thres)
        cost = c[0,1] * 10 + c[1,0] * 500
        costs.append(cost)
        
    plot.append({'threshold': thresholds, 'precision':precision,'recall': recall, 'costs':costs})

plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
for x in plot:
    plt.plot(x['threshold'],x['precision'],'r')
    plt.plot(x['threshold'],x['recall'],'g')

plt.legend(('precision','recall'))
plt.xlabel('Threshold')
plt.ylabel("Precision/Recall")

plt.subplot(1,3,3)
for x in plot:
    plt.plot(x['threshold'],x['costs'],'y')
# one-element tuple; a bare string would be split into single-letter labels
plt.legend(('costs',))
plt.xlabel('Threshold')
plt.ylabel("cost")
plt.show()

At threshold 0.3, we observe that precision is about 95% and recall is around 98%. Once the optimal threshold was decided, we predicted the probabilities of each class using predict_proba and applied the threshold.

# model fitting and predicting using the best threshold
clf.fit(train_std,y_train)
y_pred_prob = clf.predict_proba(test_std)[:,1]  > 0.30
con_mat =confusion_matrix (y_test, y_pred_prob)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

The total cost we obtained now was 11480 with FN as 3 and FP as 998.

Results

We can conclude that the Random Forest model provides the least cost for all three imputation techniques. Random Forest with the median imputation technique provides the least cost after adjusting its probability threshold.

For binary classifiers, the threshold is set to 0.5 by default. By lowering the threshold, it is possible to reduce FN at the cost of an increase in FP. Since the total cost is equal to 500 * FN + 10 * FP, a false negative costs as much as 50 false positives, so a minimal cost requires a low FN. This is achieved by choosing a proper precision-recall tradeoff.

The tables below show the total cost, FN and FP for each of the models mentioned.

Table showing the total cost for different models with median Imputation
+---------------+----+-----+------------+
|     Model     | FN |  FP | Total Cost |
+---------------+----+-----+------------+
|       LR      | 24 | 657 |   18570    |
|    XgBoost    | 25 | 244 |   14940    |
| Random Forest | 4  | 792 |    9920    |
+---------------+----+-----+------------+
=====================================================================================================================
Table showing the total cost for different models with mean Imputation
+---------------+----+-----+------------+
|     Model     | FN |  FP | Total Cost |
+---------------+----+-----+------------+
|       LR      | 25 | 730 |   19800    |
|    XgBoost    | 26 | 257 |   15570    |
| Random Forest | 6  | 767 |   10670    |
+---------------+----+-----+------------+
=====================================================================================================================
Table showing the total cost for different models with most frequent Imputation
+---------------+----+-----+------------+
|     Model     | FN |  FP | Total Cost |
+---------------+----+-----+------------+
|       LR      | 21 | 665 |   17150    |
|    XgBoost    | 24 | 331 |   15310    |
| Random Forest | 3  | 881 |   10310    |
+---------------+----+-----+------------+
=====================================================================================================================
Table showing the total cost for Random Forest model with feature engineered features and median Imputation
+---------------+----+-----+------------+
|     Model     | FN |  FP | Total Cost |
+---------------+----+-----+------------+
| Random Forest | 3  | 892 |   10420    |
+---------------+----+-----+------------+
=====================================================================================================================
Table showing the total cost for Random Forest model with reduced dimensions and median Imputation
+---------------+----+-----+------------+
|     Model     | FN |  FP | Total Cost |
+---------------+----+-----+------------+
| Random Forest | 3  | 998 |   11480    |
+---------------+----+-----+------------+
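
The totals in the tables follow directly from the cost definition, and the same two costs explain why thresholds well below 0.5 pay off. The sketch below recomputes the median-imputation rows and derives the break-even probability; the function names are hypothetical, and the break-even value assumes well-calibrated probabilities, which resampling with SMOTE generally does not preserve, so it is a guide rather than a prescription.

```python
def total_cost(fn, fp, fn_cost=500, fp_cost=10):
    """Total cost as defined in the challenge: 500 per FN plus 10 per FP."""
    return fn_cost * fn + fp_cost * fp


def break_even_threshold(fn_cost=500, fp_cost=10):
    """Probability above which flagging a truck is cheaper in expectation.

    Flagging costs (1 - p) * fp_cost in expectation, not flagging costs
    p * fn_cost, so flagging wins when p > fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)


# (FN, FP) pairs taken from the median-imputation table.
for name, fn, fp in [("LR", 24, 657), ("XgBoost", 25, 244), ("Random Forest", 4, 792)]:
    print(name, total_cost(fn, fp))
print(round(break_even_threshold(), 4))
```

The break-even probability is 10 / 510, roughly 0.02, which is consistent with the observation that lowering the threshold from 0.5 reduced the total cost for every model.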

Random forest works well because it is a collection of decision trees whose results are aggregated into one final result. Random Forests have the ability to limit overfitting without substantially increasing error due to bias. One way Random Forests reduce variance is by training each tree on a different sample of the data.
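
The two ingredients mentioned here, sampling and aggregation, can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the function names are hypothetical, each tree is assumed to see a bootstrap sample (rows drawn with replacement), and aggregation is shown as a simple majority vote, whereas scikit-learn's forest averages the predicted class probabilities of its trees.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement, as one tree's training sample."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def majority_vote(votes):
    """Aggregate the 0/1 predictions of several trees into one final label."""
    return 1 if sum(votes) * 2 > len(votes) else 0


rng = random.Random(0)
sample = bootstrap_indices(1000, rng)
print(len(sample), len(set(sample)))  # sample size and number of distinct rows
print(majority_vote([1, 0, 1]))
```

Because each bootstrap sample leaves out some rows and repeats others, the trees make different errors, and aggregating them smooths those errors out.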

We can try CatBoost classifiers to see how they perform in terms of getting the least cost. One can also explore other imputation techniques like soft SVD, KNN imputation, hot deck, cold deck and stochastic regression techniques.

Refer to my GitHub account for the detailed code.

Written by mrunal sawant