Machines can fail at any stage because of mechanical problems, so we need to predict these failures as early as possible. To do that, we build a predictive model.
Mapping the Real-World Problem to an ML Problem
- Type of ML problem — Classification problem
- The dataset contains 30 columns: 29 features and 1 class label (the target variable). The target is the failure column, which takes two values, 0 and 1, so I treat this as a binary classification problem.
Exploratory Data Analysis (EDA)
- The value counts of the target variable clearly show that the dataset is imbalanced.
- First I did univariate and bivariate analysis, i.e. plotting the PDF, the CDF, and scatter plots of the features against the target variable. I could not draw any conclusion from these plots because there are only two classes and they overlap heavily in every plot.
- I plotted the Pearson correlation heat map of the features and found that most features are only weakly correlated with one another, except for pairs such as voltmean_3h and voltmean_24h, pressuremean_3h and pressuremean_24h, etc.
- I did univariate feature reduction, i.e. removing the features that have a low correlation with the target variable (failure). I set a threshold value, and any feature whose correlation with the target is below that threshold is removed.
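The correlation-based steps above can be sketched roughly as follows. This is only an illustration on synthetic data, not the original dataset: the column values, the threshold of 0.2, and the use of the absolute correlation are all assumptions made for the example. Only the column names voltmean_3h, voltmean_24h, pressuremean_3h, and failure come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values, not the author's dataset).
rng = np.random.default_rng(0)
n = 1000
failure = (rng.random(n) < 0.05).astype(int)  # roughly 5% failures: imbalanced target

df = pd.DataFrame({
    "voltmean_3h": rng.normal(170, 5, n) + 10 * failure,  # shifted when failure == 1
    "pressuremean_3h": rng.normal(100, 4, n),             # unrelated to the target
    "failure": failure,
})
# A 24h mean that closely tracks the 3h mean, to mimic a highly correlated pair.
df["voltmean_24h"] = df["voltmean_3h"] + rng.normal(0, 1, n)

# Share of each class; a heavily skewed split indicates an imbalanced dataset.
class_share = df["failure"].value_counts(normalize=True)

# Pearson correlation matrix (this is what a heat map would display).
corr = df.corr(method="pearson")

# Univariate feature reduction: keep features whose absolute correlation with
# the target reaches the threshold. THRESHOLD is an assumed value.
THRESHOLD = 0.2
target_corr = corr["failure"].drop("failure").abs()
kept = target_corr[target_corr >= THRESHOLD].index.tolist()
dropped = target_corr[target_corr < THRESHOLD].index.tolist()

print(class_share)
print("kept:", kept)
print("dropped:", dropped)
```

Using the absolute value treats strongly negative correlations as informative too. Whether the original analysis did the same is not stated, so treat that as a design choice of this sketch.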
Feature Engineering, Performance Metrics and Model Training
- The training dataset is imbalanced, so algorithms like k-nearest neighbors (KNN) and logistic regression will not work well. KNN is especially affected because it relies on the nearest neighbors: a model trained on this dataset will predict mostly failure=0 on the test data, because that class dominates the training set.
- Because of this problem I chose ensemble models: the Random Forest classifier and the XGBoost classifier. Another advantage of bagging and boosting algorithms is that they help reduce overfitting, i.e. both the train error and the test error are low. Along with training the models I did hyperparameter tuning using RandomizedSearchCV.
- After training the models I computed feature importances for both RF and XGBoost and found the top 18 features out of 26. I then created a new dataset with these features and trained the models again. Both models ranked almost the same features as most important.
- I chose the confusion matrix, precision, and recall as performance metrics. The confusion matrix shows the true positives, true negatives, false positives, and false negatives, so we can see how many data points in the test data are predicted correctly and how many incorrectly. RF correctly predicts 99% of the values.
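The training steps above can be sketched roughly as below, using scikit-learn's Random Forest only. The data is synthetic (generated with make_classification), and the parameter grid, the number of search iterations, the recall scoring, and the train/test split are assumptions for illustration, not the settings used in the original work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data with 26 features (a stand-in, not the real dataset).
X, y = make_classification(n_samples=2000, n_features=26, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hyperparameter tuning with randomized search (the grid values are assumptions).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, scoring="recall", cv=3, random_state=0)
search.fit(X_train, y_train)

# Rank features by importance and keep the indices of the top 18.
importances = search.best_estimator_.feature_importances_
top_idx = np.argsort(importances)[::-1][:18]

# Retrain on the reduced feature set with the best parameters found.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X_train[:, top_idx], y_train)
y_pred = final_model.predict(X_test[:, top_idx])

# Confusion matrix entries plus precision and recall for the failure class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

On an imbalanced dataset, the share of correct predictions can be high even when many failures are missed, which is why precision and recall for the failure class are reported alongside the confusion matrix.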
We know that the dataset is imbalanced, but I did not want to change the actual dataset, so I first trained the model on the actual dataset without any sampling. After that, I applied the Synthetic Minority Oversampling Technique (SMOTE). After training the model on the oversampled data, I found that the actual dataset gives better performance than the sampled dataset.
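To make the idea behind SMOTE concrete, here is a simplified NumPy sketch: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours. This is a hand-written illustration, not the implementation used in the original work (libraries such as imbalanced-learn provide a full version). The function name smote_sketch, the value of k, and the toy data are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours.
    Requires k < len(X_min)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of each minority sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position in [0, 1) along the segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy data: 200 majority rows (failure=0) and 10 minority rows (failure=1).
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(200, 2))
X_minor = rng.normal(3.0, 0.5, size=(10, 2))

# Oversample the minority class until both classes have 200 rows.
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))
print(X_balanced.shape, np.bincount(y_balanced))
```

Oversampling of this kind is normally applied to the training split only, so that the test data still reflects the real class distribution.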
I chose the Random Forest classifier trained on the original dataset to produce the final predictions.