A complete guide to predicting the severity of an accident

Girish Ewoorkar
12 min read · Dec 15, 2018


This blog explores how severe an accident can be. It is especially meant for beginners who find it difficult to decide what to do in a data science project. Since this is my first project, I have attempted to make it precise and as simple as possible.

Steps to follow:

  1. Data exploration (EDA): looking at categorical and continuous feature summaries and making inferences about the data.
  2. Data cleaning and removing duplicates: imputing missing values, checking for outliers, and removing duplicates.
  3. Selection of important features: selecting the necessary features using domain knowledge and removing redundant features.
  4. Visualization with univariate and multivariate analysis.
  5. Scaling, if necessary.
  6. Algorithm selection based on the objective and the problem.

OBJECTIVE:

The first and foremost task is specifying the objective of our project, and one should always stick to that objective while doing the analysis. In my case study, the objective is "to predict the severity of new accidents and reduce the assistance time of emergency services by providing recommended resources".

print(d.shape) returns (138031, 42), which represents (number of rows, number of columns).

print(d.columns): descriptions of the important features are given in the discussion below.

d["severity"].value_counts(): for the class label "severity", this code checks whether the data is balanced or imbalanced, so that each case can be handled appropriately during the modeling phase.

1 → MINOR/OTHER INJURY, 2 → SERIOUS INJURY, 3 → FATAL

d.describe(): to check the overall description, such as mean, min, max and standard deviation. From this, if we are fortunate, we may spot some errors or outliers.

There are 40 columns in the output, so executing the above code gives this summary for all 40 of them.

From the above table we found an error in the speed zone: a value of -2, which is not possible, so we will remove or handle it during error handling.

From the mean of the longitude and latitude, we can see where the data was collected:

it belongs to AUSTRALIA

DATA EXPLORATION (EDA):

We try to understand the relationship between the features and the class label, the domain knowledge behind the data, and which features are important for classification. At this stage it may not give a clear picture, due to duplicates and errors, but it is enough to get a rough idea.

Because there are so many features, I did multivariate analysis first (multiple features w.r.t. "severity") and then univariate analysis (a single feature w.r.t. "severity").

import matplotlib.pyplot as plt
import seaborn as sns

plt.close()
sns.set_style("whitegrid")
sns.pairplot(d, hue="SEVERITY",
             vars=["SPEED_ZONE", "TOTAL_PERSONS", "PASSENGERVEHICLE", "HEAVYVEHICLE",
                   "ALCOHOL_RELATED", "MOTORCYCLE", "DAY_OF_WEEK", "ACCIDENT_TYPE"],
             size=2)  # 'size' was renamed to 'height' in newer seaborn versions
plt.show()
This is called a pair plot (a scatter-plot matrix); it plots the pairwise relationships between all the listed features.

The outlined part is the same as the non-outlined part, since the matrix is symmetric. Although there seems to be no clear linear separation between the three classes, some pairs, such as "SPEED_ZONE vs PASSENGERVEHICLE", do a comparatively better job.

plt.close()
sns.set_style("whitegrid")
sns.pairplot(d, hue="SEVERITY",
             vars=["SPEED_ZONE", "TIME", "PASSENGERVEHICLE", "MALES",
                   "ALCOHOL_RELATED", "MOTORCYCLE", "FEMALES", "ACCIDENT_TYPE"],
             size=2)
plt.show()
This is a pair plot between a different set of features, together with some of the earlier ones.

With respect to "TIME", many features do a comparatively better job of classifying.

Let's check in 3-D, for just one case, how it looks.

import mpl_toolkits.mplot3d  # registers the '3d' projection

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(d['TIME'], d['HEAVYVEHICLE'], d['SPEED_ZONE'],
           c=d['SEVERITY'].map({1: 'blue', 2: 'red', 3: 'green'}), marker='o')
ax.set_xlabel('TIME')
ax.set_ylabel('HEAVYVEHICLE')
ax.set_zlabel('SPEED_ZONE')
plt.show()
A 3-D scatter plot between three features.

It is still unclear and difficult to interpret.

UNIVARIATE ANALYSIS:

sns.FacetGrid(d, hue="SEVERITY", size=5) \
.map(sns.distplot, "TIME") \
.add_legend();
plt.show();

ACCIDENT_TYPE:
1 — REAR_END
2 — OUT_OF_CONTROL(on a straight road)
3 — OTHER_OPPOSITE_DIRECTION
4 — HEAD_ON
5 — MANOEUVERING
6 — ADJACENT_DIRECTION INTERSECT
7 — SIDE_SWIPE(lane change)
8 — OUT_OF_CONTROL(on a curve)
9 — OVERTAKING

“only some features of univariate analysis have been shown for reference”

I have covered EDA only briefly; to know more, refer to these links.

REMOVING DUPLICATES AND WRONG DATA:

d[d.duplicated(['ACCIDENT_DATE','TIME','LONGITUDE','LATITUDE','SPEED_ZONE'], keep=False)] \
    .sort_values(['ACCIDENT_DATE','TIME','LONGITUDE','LATITUDE','SPEED_ZONE'],
                 ascending=[True,True,True,True,True], axis=0)
(only a subset of the columns is shown)

From the above table I observed that the same accident can have two different "ACCIDENT_NO" values while all other features, including "ACCIDENT_DATE, TIME, LONGITUDE, LATITUDE", are identical; hence these are considered duplicates. So I kept the row whose "ACCIDENT_NO" is applicable to that year and eliminated the rest (e.g. T20100001723 - 1/1/2010 {kept}, T21100001723 - 1/1/2010 {removed}).

d.drop_duplicates(subset={'ACCIDENT_DATE','TIME','LONGITUDE','LATITUDE','SPEED_ZONE'}, keep='first', inplace=True)
print(d.shape)

After removing the duplicates, the number of rows and columns is (69014, 42).

d[d.duplicated(['ACCIDENT_DATE','TIME','LONGITUDE','LATITUDE'], keep=False)] \
    .sort_values(['ACCIDENT_DATE','TIME','LONGITUDE','LATITUDE'],
                 ascending=[True,True,True,True], axis=0)

Here the "SPEED_ZONE" values differ, so I removed all such rows, because it cannot be decided which speed-zone value is correct for that place and time (sketched below).
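A minimal sketch of that removal: after the earlier drop_duplicates, any rows that still share a date, time and location must disagree on "SPEED_ZONE", so the whole group can be dropped.

# Rows that still share ACCIDENT_DATE/TIME/LONGITUDE/LATITUDE after the earlier
# de-duplication differ only in SPEED_ZONE, so drop all of them.
conflict = d.duplicated(['ACCIDENT_DATE', 'TIME', 'LONGITUDE', 'LATITUDE'], keep=False)
d = d[~conflict]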

d[(d['HIT_RUN_FLAG']!=0) & (d['DRIVER']==0)]

In a hit-and-run case there has to be a driver who drives the vehicle, but in this dataset some cases were found where the "DRIVER" value (the number of drivers) is 0 and yet "HIT_RUN_FLAG" is set, so such rows are removed (see the sketch below).
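One way that check could be turned into the actual removal, a sketch that simply keeps the rows which do not violate the rule:

# Keep only rows where a hit-and-run record also has at least one driver recorded.
d = d[~((d['HIT_RUN_FLAG'] != 0) & (d['DRIVER'] == 0))]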

1) d[d['MALES'] + d['FEMALES'] != d['TOTAL_PERSONS']]
2) d[d['NO_OF_VEHICLES'] < d['DRIVER']]
# the rows satisfying the above conditions should be removed

Similarly, we remove rows where the sum of "MALES" and "FEMALES" involved is not equal to "TOTAL_PERSONS" involved, because an individual is either male or female (it is assumed the "OTHER" gender is not recorded) [see code]. Also, the number of drivers cannot be greater than the number of vehicles involved, while number of vehicles ≥ number of drivers is valid, because an accident may happen between a moving vehicle and a parked one (without a driver). So such rows are removed, as sketched below.
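Both consistency checks can be applied in one go; a minimal sketch under the assumptions stated above:

# Drop rows where MALES + FEMALES does not add up to TOTAL_PERSONS,
# or where more drivers are recorded than vehicles involved.
d = d[d['MALES'] + d['FEMALES'] == d['TOTAL_PERSONS']]
d = d[d['NO_OF_VEHICLES'] >= d['DRIVER']]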

SELECTING IMPORTANT FEATURES:

REMOVE FEATURES WITH VERY FEW VALUES:

This is problem specific. For example, in cancer prediction the percentage of the population having cancer is small, so such a feature cannot be removed, because it is important. But since the severity of an accident does not depend on features whose values cover not even 1% or 2% of the total data, they can be removed.

The counts of these features are too low, and they are not important:
d.drop(columns=['STAT_DIV_NAME','UNKNOWN_','PED_CYCLIST_13_18','PED_CYCLIST_5_12','PILLION','OLD_PEDESTRIAN'], axis=1, inplace=True)

Removing features that are not important:

d.drop(columns=['ACCIDENT_NO','POLICE_ATTEND','NONINJURED'],axis=1,inplace=True)

POLICE_ATTEND: generally, the police reach the site after the accident, so we remove it.

NONINJURED: when considering severity, we discuss people's death or injury, not the non-injured, since everyone other than the persons involved in the accident is non-injured; hence we remove it.

Removing features whose information is already provided by another, closely related feature:

import numpy as np

corr = d.corr(method="spearman")
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in newer NumPy; plain bool works
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
CORRELATION MATRIX

The correlation matrix can be used to check for highly correlated features: only one feature of such a pair is needed for classification, the other is redundant and has the same relation with "SEVERITY".

Example: "MOTORIST" and "MOTORCYCLE". Since the motorist is the one who rides the motorcycle, one of the two features is redundant, but which one? To decide, we rely on domain knowledge, where the weight or type of the vehicle matters more than who rides it; hence we remove the motorist feature.

Their correlation is very high.

In the above code, Spearman correlation (click on the link to know more) is used to check the correlation.
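As an illustration (not part of the original analysis), the highly correlated pairs could also be listed programmatically; the 0.9 threshold here is an assumption chosen purely for the example.

# Sketch: list feature pairs whose absolute Spearman correlation exceeds 0.9,
# reusing the corr matrix computed above.
high_corr = (corr.abs()
                 .where(np.triu(np.ones(corr.shape), k=1).astype(bool))
                 .stack())
print(high_corr[high_corr > 0.9].sort_values(ascending=False))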

Note: I have removed the features "inj_or_fatal", "fatality", "serious-injury" and "other-injury". You might wonder why, since these are very important features for classifying the severity. They are removed because my class label "severity" has 3 categories, 1, 2 and 3 (1: MINOR/OTHER INJURY, 2: SERIOUS INJURY, 3: FATALITY), which indirectly represent these features. I checked their relation with the severity, and it showed that all of the serious injuries belong to category 2.

Here it can be seen that each of these features roughly represents one category of the class label "SEVERITY", hence they are removed.

Along with them, a couple of features that are not required, because their sub-features already describe them well, are also removed. For example, instead of "DRIVER" we keep the features that define the drivers' age, such as "YOUNG_DRIVER" and "OLD_DRIVER".

Instead of "NO_OF_VEHICLES", we can keep the types of vehicles, i.e. features like "HEAVYVEHICLE", "PASSENGERVEHICLE", etc. A sketch of the corresponding drop follows.
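This sketch assumes the column spellings listed below; the dataset's actual names may differ, so only the columns that exist are dropped.

# Drop leakage features that indirectly encode the class label, plus parent
# features that are covered by their sub-features. Column names are assumed.
leakage_or_redundant = ['INJ_OR_FATAL', 'FATALITY', 'SERIOUSINJURY', 'OTHERINJURY',
                        'MOTORIST', 'DRIVER', 'NO_OF_VEHICLES']
d.drop(columns=[c for c in leakage_or_redundant if c in d.columns], inplace=True)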

REMOVING NULL VALUES:

This again depends on the given data. It is not always good practice to remove null values, because data is costly and we may end up losing most of it. Missing values can instead be handled differently, for example by substituting the mean (be careful of outliers, because the mean is sensitive to them), the median, or a relevant value derived from other features.

d.isnull().sum()
d.dropna(how='any',subset=['DAY_OF_WEEK','LONGITUDE','LATITUDE','LGA_NAME'],inplace=True)
The count represents the number of null values in that column.

Here I have removed the rows which have null values, since they are less than 2% of the whole data, so there is no major loss of data.

Sometimes a column will have an invalid data type, which then needs to be converted to the relevant data type.

The day-of-week values {1,2,3,4,5,6,7} are integers, not floats, and it is actually a categorical feature; the correct way is to encode it numerically using popular methods like the "one hot encoding" technique. Since my original data is already numeric (given as flag values), I kept it as it is. (A small sketch of both fixes follows, for illustration only.)
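For illustration only, since the numeric flags are kept here, the two fixes could look like this (pd.get_dummies is just one example of the encoding mentioned above):

import pandas as pd

# Cast DAY_OF_WEEK from float to int, and (optionally) one-hot encode it.
d['DAY_OF_WEEK'] = d['DAY_OF_WEEK'].astype(int)
day_dummies = pd.get_dummies(d['DAY_OF_WEEK'], prefix='DAY_OF_WEEK')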

HANDLING ERRORS:

It again depends on the scenario: you can either remove the erroneous rows or treat them, based on how many errors there are.

Since "-1" occurs in many rows, they cannot simply be removed and need to be treated. Let's use some domain knowledge about what the speed zone depends on: we have "LONGITUDE" and "LATITUDE" in our dataset, and for a given pair of these two features the speed zone at that location should be unique.

1) f = d.loc[d['SPEED_ZONE'] < 2][['LONGITUDE','LATITUDE']]
2) j = 0
3) for i in f.values:
4)     t = d.loc[(d['LONGITUDE'] == i[0]) & (d['LATITUDE'] == i[1])]['SPEED_ZONE'].value_counts().index.values[0]
5)     d.at[f.index[j], 'SPEED_ZONE'] = t
6)     j = j + 1

in the above code:-

line 1: take the longitude and latitude of all rows where the speed zone is less than 2

line 3: iterate through each value in "f", i.e. each pair of {longitude, latitude}

(a sample of the {longitude, latitude} pairs in f)

line 4: take the most frequently occurring speed-zone value for that pair of longitude and latitude

Here 50 has occurred 3 times, hence we replace -1, 2 or 1 with 50 at that longitude and latitude. Similarly, 60 will be substituted at its longitude and latitude in place of -1, 2 or 1, as done in line 5 (the substitution part).

But still, some rows do not have any other value (see the values circled in brown); these cannot be replaced, because no other speed zone has been recorded at that longitude and latitude, so those rows have to be removed (sketched below).
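A minimal sketch of that last removal, mirroring the filter above: any speed zone still below 2 after the substitution loop is treated as unrecoverable.

# Drop rows whose SPEED_ZONE could not be imputed from other records
# at the same longitude/latitude.
d = d[~(d['SPEED_ZONE'] < 2)]
print(d.shape)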

Let's check, after cleaning the data and selecting the important features, which features remain and how many rows are left.

d.shape is now (59685, 22).

SCALING FEATURES:

Standardization is a re-scaling technique that centers the distribution of the data on the value 0 and scales the standard deviation to the value 1; we use it because we want to get rid of differences in scale between the features.

taken from Wikipedia
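A sketch of how the scaling could be done with scikit-learn's StandardScaler; which columns are scaled, and the reordering so that "SEVERITY" stays last, are assumptions made so that the frame matches the scaled_df used in the splitting code later.

from sklearn.preprocessing import StandardScaler

# Standardize the feature columns to mean 0 and standard deviation 1, leaving the
# class label and the date/time columns untouched for the later time-based split.
feature_cols = [c for c in d.columns if c not in ('SEVERITY', 'ACCIDENT_DATE', 'TIME')]
scaled_df = d.copy()
scaled_df[feature_cols] = StandardScaler().fit_transform(d[feature_cols])
# Keep SEVERITY as the last column so scaled_df.iloc[:, 0:-1] (used later) selects only the inputs.
scaled_df = scaled_df[[c for c in scaled_df.columns if c != 'SEVERITY'] + ['SEVERITY']]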

ALGORITHM:

An important phase of any machine learning project is choosing which algorithm to apply. In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the input data given to it and then uses this learning to classify new observations.

Here we have the types of classification algorithms in Machine Learning:

  1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier
  2. Support Vector Machines
  3. Decision Trees
  4. Boosted Trees
  5. Random Forest
  6. Neural Networks
  7. Nearest Neighbor

I have used "K-Nearest Neighbors", since it does a very good job with numerical data. In the given data there are many numerical features and only a few categorical ones (given as flags, or encodable).

Now let's see how to split the data…

year wise data

It can be seen from the above table that the feature values decrease steadily until 2014 and increase again from 2014 onwards, so we will do "time-based splitting" (click on the link to know more). This cross-validation object is a variation of KFold (click on the link): in the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set.

Note: since I have to split first by "ACCIDENT_DATE" and then by "TIME", I have written my own version of time-based splitting (shown further below).
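For reference, scikit-learn's built-in TimeSeriesSplit implements the KFold variation described above; a minimal sketch, assuming the dataframe has already been sorted chronologically:

from sklearn.model_selection import TimeSeriesSplit

# Each split trains on the first k folds and tests on the (k+1)th fold,
# so the test data is always chronologically after the training data.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(scaled_df):
    print(len(train_idx), len(test_idx))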

Cross-Validation

Cross-validation (CV) is a popular technique for tuning hyperparameters (here, the k value) and producing robust measurements of model performance. Two popular techniques are k-fold cross-validation and hold-out cross-validation.

First, we split the dataset into a subset called the training set and another subset called the test set. If any parameters need to be tuned, we further split the training set into a training subset and a validation set. The model is trained on the training subset, and the parameters that minimize the error on the validation set are chosen. Then the model is retrained on the full training set using the chosen parameters, and the error on the test set is recorded.

Why is it different with time-dependent data?

With time series data, particular care must be taken in splitting the data in order to prevent data leakage. In order to accurately simulate the “real world forecasting environment, in which we stand in the present and forecast the future” (Tashman 2000), the forecaster must withhold all data about events that occur chronologically after the events used for fitting the model.

from math import ceil
import numpy as np
import pandas as pd

scaled_df.sort_values(['ACCIDENT_DATE', 'TIME'], ascending=[True, True], axis=0, inplace=True)

def train_test_split_sorted(X, y, test_size, dates):
    """Split X and y chronologically: the last `test_size` fraction (by date) becomes the test set."""
    n_test = ceil(test_size * len(X))
    sorted_index = [x for _, x in sorted(zip(np.array(dates), np.arange(0, len(dates))), key=lambda pair: pair[0])]
    train_idx = sorted_index[:-n_test]
    test_idx = sorted_index[-n_test:]
    if isinstance(X, (pd.Series, pd.DataFrame)):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
    else:
        X_train = X[train_idx]
        X_test = X[test_idx]
    if isinstance(y, (pd.Series, pd.DataFrame)):
        y_train = y.iloc[train_idx]
        y_test = y.iloc[test_idx]
    else:
        y_train = y[train_idx]
        y_test = y[test_idx]
    return X_train, X_test, y_train, y_test

X_traind, X_testd, y_traind, y_testd = train_test_split_sorted(scaled_df.iloc[:, 0:-1], scaled_df["SEVERITY"], 0.4, scaled_df["ACCIDENT_DATE"])
X_train, X_cv, y_train, y_cv = train_test_split_sorted(X_traind, y_traind, 0.5, X_traind["ACCIDENT_DATE"])
This is done after data standardization.

K-FOLD CROSS VALIDATION

In k-fold cross-validation (click on the link for more info), the original sample is randomly partitioned into k equally sized sub-samples. Of the k sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining k − 1 sub-samples are used as training data. The cross-validation process is then repeated k times, with each of the k sub-samples used exactly once as the validation data. The k results can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
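For illustration only (it is not the split actually used here, since this project splits by time), standard k-fold cross-validation in scikit-learn looks roughly like this; the 10 folds and the k=5 neighbors are placeholder values.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

# 10-fold CV on the training data; the mean of the 10 scores summarizes performance.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train,
                         cv=KFold(n_splits=10), scoring='neg_log_loss')
print(scores.mean())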

choosing optimal k-value

We find it using a graph of misclassification error vs. the k value; the point where the error is lowest is taken, roughly, as the optimal k.

I have shown the graph only for the first 50 values; if we extend it up to around 800, the curve starts increasing after 500.

The minimum is found at k = 500, so we will use k = 500 for the nearest-neighbor model. (The curve itself can be produced with a loop like the one sketched below.)
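A sketch of how such a curve could be produced; the grid of candidate k values and the use of log loss on the cross-validation split are assumptions consistent with the rest of the post.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt

# Train one model per candidate k, record the CV log loss, then plot error vs. k.
k_values = list(range(100, 801, 100))  # assumed grid of candidate k values
cv_errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance').fit(X_train, y_train)
    cv_errors.append(log_loss(y_cv, knn.predict_proba(X_cv)))

plt.plot(k_values, cv_errors, marker='o')
plt.xlabel('k')
plt.ylabel('CV log loss')
plt.show()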

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

knn_optimal = KNeighborsClassifier(n_neighbors=500, weights='distance')
# fitting the model
knn_optimal.fit(X_train, y_train)
# predict the class probabilities
pred = knn_optimal.predict_proba(X_testd)
# evaluate with log loss
acc = log_loss(y_testd, pred)
print('\nThe log_loss of the knn classifier for k = %d is %f' % (500, acc))
One of the printed values is for the test data, the other for the cross-validation data.

Note: we predict the probability of a given data point belonging to each class, so the best metric to measure the quality of the predicted probabilities is "log loss" (click on it for more info), whose value ranges over [0, infinity). The result is considered good if it is close to 0.

Note: the obtained value is 0.67, which is not bad considering that our dataset is not clearly linearly separable, and it can be improved using feature-engineering techniques. Also, since the given dataset is imbalanced, it has to be dealt with differently to get more accurate results, using techniques like over-sampling, under-sampling, SMOTE, etc. (which are out of the scope of this blog; kindly refer to other resources).

Hope this blog is helpful !!

Finally, thanks to my friends Karthik Iyer and Anshul Gupta for reviewing this blog.

Special thanks to Applied AI Course in helping me to understand the concepts.

