Vijay Anaparthi
Aug 2 · 3 min read

Machines can fail at any stage because of mechanical problems, so we want to predict these failures as early as possible. That calls for predictive modelling.

Mapping the Real-World Problem to an ML Problem

  1. Type of ML problem — Classification problem
  2. The data set contains 30 columns: 29 features and 1 class label (the target variable). The target is the failure column, which takes two values, 0 and 1, so I treat this as a binary classification problem.

Exploratory Data Analysis (EDA)

  1. The value counts of the target variable clearly show that the dataset is imbalanced.
  2. I first did univariate and bivariate analysis, i.e. plotting the PDF, the CDF and scatter plots of the features against the target variable. I could not draw any conclusion from these plots because there are only two classes and they overlap heavily in every plot.
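The imbalance check in the first point can be sketched as follows. This is a minimal illustration, not the code used for the post: the `class_balance` helper and the toy labels (95 healthy rows, 5 failures) are hypothetical, and only the column name `failure` comes from the dataset described above.

```python
from collections import Counter


def class_balance(labels):
    """Return {class: (count, fraction)} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in counts.items()}


# Hypothetical toy labels standing in for the real `failure` column.
failure = [0] * 95 + [1] * 5

balance = class_balance(failure)
for cls, (n, frac) in sorted(balance.items()):
    print(f"failure={cls}: {n} rows ({frac:.1%})")
```

With a pandas DataFrame the same information is usually read off with `value_counts()` on the target column.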


  1. I plotted the Pearson correlation heat map of the features and found that the features are only weakly correlated with one another, except for pairs such as voltmean_3h and voltmean_24h, pressuremean_3h and pressuremean_24h, etc.
  2. I did univariate feature reduction, i.e. I removed the features that have a low correlation with the target variable (failure). I set a threshold, and any feature whose correlation with the target falls below that threshold is removed.
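The threshold-based reduction in the second point could look roughly like the sketch below. The helper names, the toy values, the `sensor_noise` column and the threshold of 0.1 are all assumptions made for illustration; the post does not state the threshold that was actually used. Pearson correlation is written out by hand so the example needs no extra libraries.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold):
    """Keep features whose absolute correlation with the target reaches the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: one informative column and one uninformative column.
columns = {
    "voltmean_3h": [170, 168, 171, 190, 169, 192],
    "sensor_noise": [1, 2, 1, 2, 2, 1],
}
failure = [0, 0, 0, 1, 0, 1]

kept = select_features(columns, failure, threshold=0.1)
print(kept)
```

Using the absolute value keeps strongly negative correlations as well, which is a design choice of this sketch rather than something the post specifies.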

Feature Engineering, Performance metrics and Model training

  1. The training dataset is imbalanced, so algorithms like k-nearest neighbors (KNN) and logistic regression will not work well. KNN is especially affected because it depends on the nearest neighbors: a model trained on this dataset will predict mostly failure=0 on the test data, because that class dominates the training set.
  2. Because of this problem I chose ensemble models: a Random Forest classifier and an XGBoost classifier. One more advantage of bagging and boosting algorithms is that they can help reduce overfitting, i.e. both train error and test error are low. Along with training the models I did hyperparameter tuning using RandomizedSearchCV.
  3. After training, I computed feature importances for both RF and XGBoost, picked the top 18 features out of 26, created a new dataset with these features and trained the models again. Both models ranked almost the same features as most important.
  4. I chose the confusion matrix, precision and recall as performance metrics. The confusion matrix shows the true positives, true negatives, false positives (FP) and false negatives (FN), so we can see how many data points in the test data are predicted correctly and how many incorrectly. RF predicts 99% of the values correctly.
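The steps above can be sketched with scikit-learn as follows. This is a rough illustration under stated assumptions, not the original code: the data is synthetic (`make_classification` with 26 features and about 5% positives), and the search space, `n_iter`, `cv` and `scoring="recall"` are hypothetical choices, since the post does not list the tuned parameters. Only the Random Forest path is shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real data: 26 features, roughly 5% failures.
X, y = make_classification(
    n_samples=1000, n_features=26, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space (an assumption, not taken from the post).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="recall", random_state=0,
)
search.fit(X_train, y_train)
model = search.best_estimator_

# Rank features by impurity-based importance and keep the top 18.
importances = model.feature_importances_
ranked = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)
top_features = ranked[:18]

# Retrain on the reduced feature set with the best parameters found.
reduced = RandomForestClassifier(**search.best_params_, random_state=0)
reduced.fit(X_train[:, top_features], y_train)
y_pred = reduced.predict(X_test[:, top_features])

# Confusion matrix cells plus precision and recall for the failure class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On imbalanced data the share of correct predictions can be high even when most failures are missed, which is why the FN count and recall are worth reading alongside it.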


We know the dataset is imbalanced, but I did not want to change the actual dataset, so I first trained the model on the original data without any sampling. After that I applied the synthetic minority oversampling technique (SMOTE). After training the model on the oversampled data, I found that the original dataset gives better performance than the sampled one.
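For readers unfamiliar with SMOTE, the core idea is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified, hypothetical version of that idea in plain Python; the function name, the toy points and `k=3` are assumptions, and it is not the implementation used for the post. In practice a library such as imbalanced-learn (its `SMOTE` class) would normally be used instead.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class (failure=1) points with two features each.
failures = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]

new_points = smote_sketch(failures, n_new=6)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two real failure records, so the oversampled set stays inside the region the minority class already occupies.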


I chose the Random Forest classifier trained on the original dataset to produce the final predictions.
