Predicting Hazardous Seismic Bumps Part III: improving model performance for imbalanced datasets
A newbie's evaluation of machine learning classification models and data augmentation to support better results, using Scikit-learn and XGBoost.
We are immensely grateful to Nabanita Roy for pointing out this very interesting dataset. Her previous work formed the base on which we were able to build; you can have a look at it here (Part I and Part II)!
Luciana Azubuike was our fantastic mentor in this #WaiLEARN Project and we hope you get to learn as much as we did in this article.
Based on the data exploration and performance analysis carried out by Nabanita, the performance of the previous models was very low on the recall metric, with just a 0.06 score for the best performing model. In other words, out of all the actual positive (hazardous) shifts, only 6% were predicted as positive. Our focus was on improving this and on building trust in the model through explainability.
1. Premise
Predicting Positive here means predicting “the possibility of hazardous situation occurrence, where an appropriate supervision service can reduce a risk of rockburst (e.g. by distressing shooting) or withdraw workers from the threatened area. Good prediction of increased seismic activity is therefore a matter of great practical importance.”
We wouldn’t want to send miners into a mine with a substandard model!
On the other hand, Precision for the best performing model is only 0.67, which means that out of all the shifts predicted as hazardous, (1 − 0.67) = 33% were actually low risk. That is not an insignificant number of shifts where miners may have been told to stay home, and thus lower productivity.
Our shared implementation notebook can be found here!
2. Our Contribution
Pre-processing the Data Further
- Seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a — lack of hazard, b — low hazard, c — high hazard, d — danger state);
- Seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
- Shift: information about the type of shift (W — coal-getting, N — preparation shift);
The encoding of these attributes was implemented by a simple mapping of the known values, as seen below, based on the obvious order:
- a — lack of hazard < b — low hazard < c — high hazard < d — danger state
This encoding allowed for easier interpretability, as we will see later. It was also relevant for plotting the correlation matrix, since non-numeric attributes are completely ignored when computing correlations.
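As a minimal sketch of this mapping (assuming the column names from the original UCI seismic-bumps dataset and an illustrative file name; the exact implementation is in the notebook):

```python
import pandas as pd

# Ordinal mapping based on the obvious order a < b < c < d,
# plus the binary shift type. Column and file names are illustrative.
hazard_order = {"a": 0, "b": 1, "c": 2, "d": 3}
shift_order = {"W": 0, "N": 1}

df = pd.read_csv("seismic_bumps.csv")
df["seismic_enc"] = df["seismic"].map(hazard_order)
df["seismoacoustic_enc"] = df["seismoacoustic"].map(hazard_order)
df["shift_enc"] = df["shift"].map(shift_order)
```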
Exploratory Data Analysis: Checking the Correlations between Features and the Target
To depict the correlation of the final features with the target (Class), a heatmap was plotted (shown in Figure 1):
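A sketch of how such a heatmap can be produced with seaborn (assuming the encoded dataframe df from above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over numeric columns only; the ordinal encoding above
# lets the categorical features take part in it.
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Feature correlations, including the target class")
plt.show()
```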
Augment data using Synthetic Minority Oversampling Technique (SMOTE)
The original data has many more Class 0 than Class 1 points, which is obvious in the data visualisations and is likely to impact the performance of all models. As suggested by Nabanita Roy, we tried the Synthetic Minority Oversampling Technique (SMOTE), as seen in the code snippet below.
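In essence, the snippet does the following (a sketch assuming a feature matrix X and target y, and using imbalanced-learn's SMOTE; the split parameters are illustrative):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training set, so the test set
# keeps the real (imbalanced) class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
```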
Interestingly, for discrete features like nbumps (the number of seismic bumps recorded within the previous shift), some of the new rows have non-integer values!
Model Test Results
We tested the following scikit-learn and XGBoost models with the raw data and the SMOTE-augmented data, while mainly optimising for Recall (a minimal training sketch follows the list):
- sklearn.neighbors: KNeighborsClassifier (KNC)
- sklearn.ensemble:
- RandomForestClassifier (RFC)
- AdaBoostClassifier (AdaBoost)
- sklearn.svm: SVC
- sklearn.tree: DecisionTreeClassifier (DecisionTree)
- xgboost: XGBClassifier (XGBoost)
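A minimal sketch of the training loop (the hyperparameters shown are illustrative; the SMOTE-augmented training set and raw test set come from the split above):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import recall_score, f1_score

models = {
    "KNC": KNeighborsClassifier(),
    "RFC": RandomForestClassifier(max_leaf_nodes=10, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "XGBoost": XGBClassifier(max_depth=3, eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train_smote, y_train_smote)   # train on SMOTE-augmented data
    y_pred = model.predict(X_test)            # evaluate on the untouched raw test set
    print(f"{name}: recall={recall_score(y_test, y_pred):.2f}, "
          f"f1={f1_score(y_test, y_pred):.2f}")
```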
And the results….
The results contained in the table above are ordered by F1 score, so as to strike a good compromise between Recall and Precision; however, if Recall were the most crucial criterion, the order would change slightly.
All the top performance results are based on the SMOTE-augmented data, confirming the value of augmenting data for imbalanced datasets.
Interestingly, models with strict restrictions (max_leaf_nodes=10, max_depth=3) worked best. This can be understood intuitively: with such a small dataset, a less constrained model can easily overfit.
Note that the test set was quite small, as the full raw dataset itself is quite small: 2,584 rows × 16 features. We also tested with the raw data to make sure the SMOTE process didn't introduce any bias. So, for some of the top performing results:
- The RFC SMOTE (max_leaf_nodes=10) model has the top Recall score, 20 / (20 + 14) = 0.59, with confusion matrix:
[[398 85]
 [ 14 20]]
- The XGBoost SMOTE model's Recall score is 16 / (16 + 18) = 0.47, with confusion matrix:
[[424 59]
 [ 18 16]]
Quite a small difference in absolute numbers, though important in terms of safety!
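For reference, the recall values above can be read straight off a model's confusion matrix (a sketch using scikit-learn, reusing the fitted models from the loop above):

```python
from sklearn.metrics import confusion_matrix

# Recall = true positives / (true positives + false negatives);
# e.g. 20 / (20 + 14) ≈ 0.59 reported for the RFC SMOTE model above.
y_pred_rfc = models["RFC"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rfc).ravel()
print(confusion_matrix(y_test, y_pred_rfc))
print("recall:", tp / (tp + fn))
```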
Note that with the previous one-hot encoders, we got slightly worse results:
Explainability
Some models expose feature importances, which are great both for informing feature selection and for checking for data leakage, as well as for getting feedback from and building trust with the customer.
In this case we'll have a look at the top performing models with feature importance support, as well as the best simple Decision Tree (a sketch of how to extract these importances follows the list):
- nbumps (the number of seismic bumps recorded within the previous shift) is understandably a good predictor of hazard in the current shift, as is nbumps2 (the number of seismic bumps in energy range [10²,10³] registered within the previous shift).
- Scaled energy (the total energy of seismic bumps registered within the previous shift) is highly correlated with the nbumpsN features, so it may be worth removing it from the features.
- Shift_enc_0 (the type of shift: W / 0 — coal-getting, N / 1 — preparation shift) is of concern: could a preparation shift be correlated with a hazardous shift because the engineers thought it was risky? Could there be some data leakage here?
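A sketch of extracting and plotting the importances from the fitted random forest (the same attribute exists on XGBClassifier; column names are assumed to match the encoded dataframe):

```python
import pandas as pd

# feature_importances_ is exposed by tree-based models such as
# RandomForestClassifier and XGBClassifier once they are fitted.
rfc = models["RFC"]
importances = pd.Series(rfc.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(figsize=(8, 6), title="RFC feature importances")
```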
The DecisionTree model with SMOTE data and max_depth=3 has the 2nd best F1 score, and its visualisation also makes the role of the encoded Shift value very clear:
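A sketch of that visualisation with scikit-learn's plot_tree (the class names shown are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualise the depth-3 decision tree trained on the SMOTE data.
dtc = models["DecisionTree"]
plt.figure(figsize=(16, 8))
plot_tree(dtc, feature_names=list(X_train.columns),
          class_names=["non-hazardous", "hazardous"], filled=True)
plt.show()
```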
Our Conclusion / What We Have Learnt
- Training any model on the augmented dataset (SMOTE) improves the model performance significantly.
- XGBoost, which is built on shallow decision trees, wins by a small margin.
- The RandomForestClassifier and DecisionTreeClassifier perform well, although not with the GridSearchCV-optimised version! It seems that optimising against the SMOTE dataset can be counter-productive, and choosing values like max_leaf_nodes=10, intuitively adapted to a small dataset, works better in this scenario.
- It might be good practice to remove highly correlated features when performing feature selection.
- Some models, like random forest and XGBoost, expose feature importances, which are great both for informing feature selection and for checking for data leakage, as well as for getting feedback and trust from the customer.
- In this case, it is recommended to look at the top performing models with feature importance support, as well as the best simple Decision Tree.
- From plotting the feature importances to get more information for feature selection: nbumps (the number of seismic bumps recorded within the previous shift) is understandably a good predictor of hazard in the current shift, as is nbumps2 (the number of seismic bumps in energy range [10²,10³] registered within the previous shift).
Future Work
Going forward, we recommend further improving the obtained results by doing the following:
- Testing more ensemble machine learning models.
- Investigating further to validate a number of assumptions about which metric is most important, and about why there seems to be a correlation between the shift type and the hazard classification (could that be data leakage?). There is a need to talk directly to the users.
Authors
Catherine Lalanne, Heejin Yoon, Luciana Azubuike
References
- Introduction to Scikit-Learn: Understanding Classification Models for Supervised Machine Learning
- To Understand Model Performance Metrics
- To understand the feature data (data types, missing values, outliers, etc.): see Data Exploration and Preparation
- Visualization (boxplots, scatter plots, correlation matrix, etc)- see Data Exploration and Preparation
GitHub Link to the Notebook
Women in AI (WAI) is a nonprofit organisation on a mission to increase female representation and participation in AI. The first global community of women in AI, spanning 90+ countries with 28 ambassadors, WAI is working to embrace diversity and to empower, inspire and educate the next generations.
Find out more about the Irish Chapter of Women in AI on Linkedin.