NHTSA Fatality Analysis (Bagging, Boosting, Voting) Classification Models

Yamen Shabankabakibou · Published in Analytics Vidhya · 5 min read · Sep 13, 2020


Image source: https://www.nhtsa.gov/equipment/takata-recall-spotlight

"In statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known." — Wikipedia
It’s the data scientist’s job to determine the right features to use in training the model, choose the model that best fits the data, and finally draw conclusions based on the most important metrics.

About the Project:

This is a study I made on the data provided by the National Highway Traffic Safety Administration (NHTSA) about the fatalities that happened in 2018, with the goal of predicting whether a drunk driver was involved in a car accident.
One of the most important goals of this project is to build the model using Amazon Web Services (AWS): running the Jupyter notebooks and storing the PostgreSQL database on an Elastic Compute Cloud (EC2) instance (read my previous blog).
I start with a simple Exploratory Data Analysis (EDA), go through label encoding, data cleaning, and dealing with an imbalanced target column using oversampling methods, and finally look for the best model.
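A minimal sketch of the label-encoding part of that pipeline (the column names below are placeholders, not the actual NHTSA fields):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame for the joined NHTSA data; the columns here are
# illustrative only, not the real FARS feature names.
df = pd.DataFrame({
    "weather": ["Clear", "Rain", "Clear", "Snow"],
    "light_condition": ["Daylight", "Dark", "Daylight", "Dark"],
})

# Label-encode every categorical (object) column in place,
# keeping the fitted encoders so the mapping can be inverted later.
encoders = {}
for col in df.select_dtypes(include="object").columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df.head())
```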

Step by Step:

1- Start with finding a better data source than Kaggle:

If I had depended only on Kaggle’s dataset, I would have been left with old data that goes back to 2015.
After doing some research I found the real source of the data, NHTSA, and got more recent data (2018). And because the data by itself is not enough, I also had to find its documentation on the same site. BAAM (done).

2- Load data into PostgreSQL and create joined views:

After a general overview of the data, and based on the diagram on the right, I decided to choose 3 of the 27 available tables (Accident, Vehicle, Person); thanks to the data documentation I was able to easily determine which features to use.
Because of the presence of one-to-many relations between the tables, I needed a way to aggregate features into one row per accident before feeding them to the model.
I had to remove the features that were not aggregatable. No features from the Person table were aggregatable, but I calculated the count of males and females in each car. From the Vehicle table, I decided to take the features shown in the picture with suitable aggregation methods. Finally, I created the joined views, as sketched below.
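A minimal sketch of what one such joined, aggregated view could look like, assuming a SQLAlchemy connection to the PostgreSQL database on the EC2 instance; the table and column names only loosely follow the FARS naming and are assumptions, not the exact schema:

```python
from sqlalchemy import create_engine, text

# Placeholder connection string; the database lives on the EC2 instance.
engine = create_engine("postgresql://user:password@localhost:5432/nhtsa")

# Aggregate the one-to-many Person and Vehicle rows up to one row per
# accident (st_case) before joining, so the counts are not inflated
# by the join fan-out.
create_view = text("""
CREATE OR REPLACE VIEW accident_features AS
SELECT a.st_case,
       a.month,
       a.drunk_dr,
       p.male_count,
       p.female_count,
       v.max_travel_speed
FROM accident a
JOIN (
    SELECT st_case,
           COUNT(*) FILTER (WHERE sex = 1) AS male_count,
           COUNT(*) FILTER (WHERE sex = 2) AS female_count
    FROM person
    GROUP BY st_case
) p ON p.st_case = a.st_case
JOIN (
    SELECT st_case,
           MAX(trav_sp) AS max_travel_speed
    FROM vehicle
    GROUP BY st_case
) v ON v.st_case = a.st_case
""")

with engine.begin() as conn:
    conn.execute(create_view)
```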

3- Exploratory Data Analysis(EDA):

  • Exploring the Target Variable:
    After taking a look into the target variable, I found it contains the values (0, 1, 2, 3, 4), which represent the number of drunk people involved, so to use this column as the target variable I had to map values larger than one to one (imbalanced, hah); see the sketch after this list.
  • Exploring the Accidents in Terms of Months:
    The number of accidents that occurred due to drinking rises towards the summer, which seems quite reasonable (party all night).
  • Exploring Accidents Using Geo Maps:
    Looking at the graph on the left, you can see the distribution of accident locations across the US states on New Year’s Day, whereas the graph on the right shows the count of accidents across all states through 2018 (Texas and Florida have the most accidents).
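A minimal sketch of the target mapping mentioned in the first bullet, assuming the joined view from step 2 has been loaded into a DataFrame (the column name drunk_dr is an assumption):

```python
import pandas as pd

# Assumes the `engine` from the earlier sketch; load the joined view.
df = pd.read_sql("SELECT * FROM accident_features", engine)

# The target column holds the number of drunk drivers involved (0-4);
# map every value larger than one down to one to get a binary label.
df["drunk_involved"] = df["drunk_dr"].clip(upper=1)

# Inspect how imbalanced the binary target is.
print(df["drunk_involved"].value_counts(normalize=True))
```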

4- Imbalanced Label Sampling Methods:

  • Training Without Changing:
    First, I looked for the model that gives the best results without using any class-imbalance methods (the Random Forest Classifier was the winner).
    After cross-validation, I got an accuracy of 0.78 (a sketch of this comparison across the methods below follows the list).
  • Training With Random Over Sampler:
  • Training With SMOTE:
  • Training With ADASYN:
  • Training With SMOTETOMEK:
  • Training With SMOTEENN:
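A sketch of how this comparison could be wired up with imbalanced-learn pipelines, so resampling is applied only to the training folds during cross-validation; the feature/target split builds on the DataFrame from the EDA sketch, and all hyperparameters are left at their defaults:

```python
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Features and binary target, assuming the df from the previous sketch.
X = df.drop(columns=["drunk_involved", "drunk_dr", "st_case"])
y = df["drunk_involved"]

samplers = {
    "no resampling": None,
    "RandomOverSampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "SMOTETomek": SMOTETomek(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42),
}

for name, sampler in samplers.items():
    steps = [("clf", RandomForestClassifier(random_state=42))]
    if sampler is not None:
        # The sampler runs only on the training folds inside cross_val_score.
        steps.insert(0, ("sampler", sampler))
    pipe = Pipeline(steps)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```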

5- Training Using Voting:

I used the 5 most successful models from before:

  1. Random Forest Classifier.
  2. GradientBoostingClassifier.
  3. AdaBoostClassifier.
  4. XGBClassifier.
  5. ExtraTreesClassifier.

With weights=[5,4,3,2,1]
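A minimal sketch of that ensemble using scikit-learn’s VotingClassifier; the soft-voting setting, the train/test split, and the default hyperparameters are assumptions here, and only the estimator list and the weights come from the steps above:

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X and y are assumed to come from the earlier feature/target split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The five strongest models from the previous step, combined with
# weighted (soft) voting.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
        ("xgb", XGBClassifier(random_state=42)),
        ("et", ExtraTreesClassifier(random_state=42)),
    ],
    voting="soft",
    weights=[5, 4, 3, 2, 1],
)

voting.fit(X_train, y_train)
print(voting.score(X_test, y_test))
```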

Final Results:

After comparing the cross-validation results of the Random Forest Classifier with the different oversampling methods, I found that SMOTEENN was the best method for me!
