Analysis of California Traffic Collisions

Ariana Gordon
Published in DSInsights · 9 min read · Feb 21, 2021

Written by Royi Razi, Ariana Gordon, Eyal Hashimshoni, & Eran Perelman

Credit: Eyal Hashimshoni

Traffic collisions are an everyday event in the United States. An open-source dataset covering traffic collisions in the state of California between January 2001 and mid-October 2020 is available from Kaggle. We used this dataset to predict the severity of injury resulting from a traffic collision. An accurate prediction of injury severity would allow first responders to better prioritize the urgency of their response and may even help identify the main contributory risk factors that determine injury severity.

Exploring and Preprocessing the Data

One of the primary challenges in working with this dataset is that it is severely imbalanced. We addressed this challenge by selecting a subset of the data, essentially under-sampling the over-represented classes, to ensure we could build our models on a balanced dataset. After loading these subset files into a dataframe, our dataframe contained 543,230 samples and 116 features.
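A minimal sketch of this kind of class-based under-sampling (the file and column names here are illustrative, not our exact code):

import pandas as pd

# Illustrative under-sampling sketch: draw the same number of rows
# from each target class so the training data is balanced.
# The file name and column name are assumptions, not the original code.
df = pd.read_csv("collisions.csv")

n_per_class = df["victim_degree_of_injury"].value_counts().min()
df = (df.groupby("victim_degree_of_injury", group_keys=False)
        .apply(lambda g: g.sample(n=n_per_class, random_state=42)))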

As with most data exploration, one of the first steps we took was to examine the head of the dataframe.

df.head()

As the head of the dataframe shows, our data contains a significant number of null values. Most models cannot be trained on data containing null values, so this aspect of the data required extensive further exploration.

df.isna().sum()

In our examination of the data dictionary, we found that certain features contained null values where a negative response or specific category should have been recorded. As such, we spent a great deal of time examining each feature individually before deciding how to handle the null values for that particular feature. We also decided to drop some features and rows from the dataframe due to the high number of null values contained within those particular features or rows.
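The per-feature decisions varied, but the general patterns looked something like the sketch below (the specific columns, fill values, and thresholds are illustrative):

# Illustrative null-handling sketch; actual decisions were made
# feature by feature after consulting the data dictionary.

# Some nulls actually encode a negative response or known category:
df["party_drug_physical"] = df["party_drug_physical"].fillna("not applicable")

# Features dominated by nulls were dropped outright:
null_frac = df.isna().mean()
df = df.drop(columns=null_frac[null_frac > 0.5].index)

# Rows missing values in key columns were dropped as well:
df = df.dropna(subset=["party_age", "victim_age"])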

The dataframe also contained a number of categorical features with far too many categories to apply one-hot encoding in an efficient and effective manner. For these features, we performed category reduction by examining each value individually and grouping like values into a single category.
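As a sketch, category reduction amounted to an explicit mapping from raw values to grouped ones (the mapping below is illustrative, not our full grouping):

# Illustrative category-reduction sketch: collapse many raw values
# into a handful of grouped categories via an explicit mapping.
vehicle_groups = {
    "sedan": "passenger car", "coupe": "passenger car",
    "pickup": "truck", "semi": "truck",
    "scooter": "motorcycle", "moped": "motorcycle",
}
df["statewide_vehicle_type"] = (df["statewide_vehicle_type"]
                                .map(vehicle_groups)
                                .fillna("other"))  # catch-all for unmapped values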

After condensing these features, some of the more interesting features were plotted. The plots of these features’ distributions are shown below.

# Candidates for plotting - features:
import matplotlib.pyplot as plt
import seaborn as sns

f, axes = plt.subplots(4, 2, figsize=(12, 16))
sns.distplot(X_train_corr['party_age'], color='g', bins=30,
             ax=axes[0, 0]).set_title("party age - distplot")
sns.distplot(X_train_corr['victim_age'], color='g', bins=30,
             ax=axes[0, 1]).set_title("victim age - distplot")
sns.countplot(ax=axes[1, 0], x=y_train, data=X_train_corr,
              palette="magma",
              order=list(range(8))).set_title("victim_degree_of_injury")
sns.countplot(ax=axes[1, 1], x=X_train_corr["party_race"],
              data=X_train_corr,
              palette="magma").set_title("party_race")
sns.countplot(ax=axes[2, 0], y=X_train_corr["type_of_collision"],
              data=X_train_corr,
              palette="magma").set_title("Type of collision")
sns.countplot(ax=axes[2, 1], y=X_train_corr["motor_vehicle_involved_with"],
              data=X_train_corr,
              palette="magma").set_title("Motor vehicle involved with")
sns.kdeplot(df['collision_time'], color='r', ax=axes[3, 0],
            shade=True).set_title("Collision time for all accidents")
sns.countplot(ax=axes[3, 1], x=X_train_corr["weather_1"],
              data=X_train_corr,
              palette="magma").set_title("weather at time of accident")
plt.yscale("log")  # log scale applies to the current (last) subplot
plt.tight_layout()
plt.show()

For the numerical features, one of our primary concerns was handling outliers appropriately. While analyzing the features individually, we had flagged those that were candidates for outlier reduction and plotted each as a boxplot for a visual representation of the outliers.

# Outlier boxplots
import itertools

potential_outlier_features = ['injured_victims', 'party_count',
                              'party_number', 'vehicle_year',
                              'party_age', 'victim_age']
f, axes = plt.subplots(3, 2, figsize=(10, 5))
locs = list(itertools.product(range(3), range(2)))  # (row, col) positions
for loc, feat in zip(locs, potential_outlier_features):
    sns.boxplot(x=feat, data=X_train, ax=axes[loc], color="dodgerblue")
    axes[loc].set_xlabel(feat, fontsize=16)
plt.tight_layout()

We removed outliers from these features based on the following thresholds:

# Remove outliers
X_train = X_train[X_train['injured_victims'] <= 40]
X_train = X_train[X_train['party_count'] <= 30]
X_train = X_train[X_train['party_number'] <= 30]
X_train = X_train[X_train['vehicle_year'] >= 1920]
X_train = X_train[X_train['party_age'] <= 100]
X_train = X_train[X_train['victim_age'] <= 100]

We also examined the feature correlations using a correlation matrix but determined none of our remaining numerical features were correlated to a degree that would support removal.

# Correlation plot:
import numpy as np

f, axes = plt.subplots(figsize=(12, 12))
corr = X_train.corr()
mask = np.tril(np.ones_like(corr, dtype=bool))  # hide the lower triangle
sns.heatmap(corr, annot=True, fmt=".2f", mask=mask, square=True,
            cmap="bwr_r", vmin=-1, vmax=1, center=0)
plt.title('Correlation between selected numerical features', size=15)
plt.show()

After transforming the remaining categorical features using dummy encoding, our dataframe contained 213 features. We thought it prudent to perform chi-squared feature selection on the categorical features, keeping the top 60 features.

# Keep top k categorical features
from sklearn.feature_selection import SelectKBest, chi2

k = 60
fs = SelectKBest(score_func=chi2, k=k)
fs.fit(train[cat_w_dummies], y_train)
train_fs = fs.transform(train[cat_w_dummies])
test_fs = fs.transform(test[cat_w_dummies])

# Plot features by importance, sorting names and scores together
# so each bar is labelled with the feature it belongs to
order = np.argsort(fs.scores_)[::-1][:k]
fig, ax = plt.subplots(figsize=(18, 35))
sns.barplot(y=np.array(cat_w_dummies)[order], x=fs.scores_[order])
plt.title("Chi-Square Scores for the Different Features", fontsize=30)
plt.ylabel("Feature", fontsize=30)
plt.xlabel("Chi-Square Score", fontsize=30)
plt.xticks(fontsize=30)
plt.yticks(fontsize=45)
plt.show()

All of this data exploration and preprocessing led us to the final train and test dataframes containing 72 features, to which we applied standard scaling.
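The scaling followed the usual scikit-learn pattern of fitting on the training set only (a brief sketch):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)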

Classifying the Target

Originally, the target feature comprised 8 categories. Since several of these categories were repetitive or ambiguous, we redefined our target to contain 4 categories:

0 - No injury     (originally category 0)
1 - Mild injury   (originally categories 6 & 7)
2 - Severe injury (originally categories 2 & 5)
3 - Killed        (originally category 1)
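A dictionary-based remapping reproduces this table (a sketch; the table above doesn't list original categories 3 and 4, which were presumably handled during earlier preprocessing):

# Collapse the original 8 injury categories into 4 (see table above).
# Original categories 3 and 4 are not listed in the table and are
# assumed to have been handled earlier in preprocessing.
injury_map = {0: 0, 6: 1, 7: 1, 2: 2, 5: 2, 1: 3}
y_train = y_train.map(injury_map)
y_test = y_test.map(injury_map)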

Creating the Model

We built multiple models for this project, starting with a simple baseline model and progressing in complexity, striving to find the model with the best performance. The models we designed included logistic regression, random forest, SVM, k-nearest neighbors, CatBoost, XGBoost, and a neural network. We will discuss the results of three of these models.

Base Model: Logistic Regression

Our base model was a simple logistic regression. With this model, the F1 score diminished as the class number increased, meaning that the model predicted no injury better than mild injury, mild injury better than severe injury, and severe injury better than casualties.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, plot_confusion_matrix

log_model = LogisticRegression(solver="lbfgs", max_iter=2000)
log_model.fit(X_train_scaled, y_train)
y_pred_logr = log_model.predict(X_test_scaled)

f, axes = plt.subplots(figsize=(6, 6))
plot_confusion_matrix(log_model, X_test_scaled, y_test, ax=axes,
                      cmap='YlOrBr', values_format='.0f')

Confusion Matrix for Logistic Regression Model

print(classification_report(y_test, y_pred_logr))

Classification Report for Logistic Regression Model

Random Forest Classifier

We also created a random forest classifier and used a randomized search to obtain the best model parameters. The random forest classifier performed better than the base model overall, with an average precision score of 0.76. Precision was relatively high for no injury, mild injury, and casualties, but the model struggled to predict severe injury precisely. Additionally, the recall rates for mild injury and casualties were much lower than we would be comfortable putting into production.
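The search itself looked roughly like the sketch below (the parameter distributions shown are illustrative assumptions, not our exact search space); the block that follows trains the forest with the parameters the search settled on.

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative randomized search over common random forest
# hyperparameters; the actual distributions are not shown here.
param_dist = {
    "n_estimators": randint(50, 300),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
    "max_depth": randint(3, 20),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=20, cv=3, scoring="f1_weighted",
                            n_jobs=-1, random_state=42)
search.fit(X_train_scaled, y_train)
print(search.best_params_)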

parameters = {'n_estimators': 100,
              'min_samples_split': 6,
              'min_samples_leaf': 5,
              'max_depth': 11}
clf_RF = RandomForestClassifier(**parameters)
clf_RF.fit(X_train_scaled, y_train)
y_pred_RF = clf_RF.predict(X_test_scaled)

Confusion Matrix for Random Forest Classifier

Classification Report for Random Forest Classifier

XGBoost

The XGBoost model was our best performer. It achieved an average weighted precision of 0.81, an average weighted recall of 0.79, and an average F1 score of 0.78. It predicted no injury, mild injury, and casualties with high precision, but struggled to predict severe injury precisely. Additionally, the recall scores for no injury and mild injury were very high, but the recall score for casualties was quite low.

import xgboost as xgb

dtrain = xgb.DMatrix(data=X_train_scaled, label=y_train)
dtest = xgb.DMatrix(data=X_test_scaled)
params = {'max_depth': 3,
          'objective': 'multi:softmax',  # predicts class labels directly
          'num_class': 4}
bst = xgb.train(params, dtrain)
pred_xgb = bst.predict(dtest)

Confusion Matrix for XGBoost Model

Classification Report for XGBoost Model

Interpreting the Results

As seen in the confusion matrices below, all of the models we evaluated had trouble differentiating between severe injury (category 2) and casualty (category 3). We believe this error is acceptable: if the model predicts a casualty as a severe injury, first responders will make that call a priority. The reverse error would not be acceptable, as it could withhold care from those who need it most. In classifying casualties as severe injuries, we may actually see fewer casualties, as first responders could arrive in time to stabilize these victims and potentially save their lives. The image below shows clearly that XGBoost outperformed the other models.

The difficulty our models had in precisely predicting victims with severe injuries is further illustrated in the chart below. Again, this is not too concerning, because the lack of precision stems from victims who were listed as casualties being classified as severely injured. The worst-case scenario arising from this misclassification is that an ambulance is dispatched to the scene to give care that is no longer needed.

The XGBoost model made its classification decisions based on the following feature importances:

As you can see, the number of injured victims was the primary feature used in determining the classification outcome. We debated removing this feature because we did not want to essentially tell the model its target, but decided to include it for two reasons. First, although it does give the model a clue as to whether injuries exist, we are trying to predict the severity of injuries, and it provides no gauge of that scale. Second, when someone calls emergency services after an accident, the operator can be expected to ask the caller whether anyone is injured or appears injured, and if so, how many. Since this information could be collected before first responders arrive at the scene, we felt it appropriate, even indispensable, to include it in the model.
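For reference, the importance plot can be regenerated straight from the trained booster with xgboost's built-in helper:

# Plot the top features by importance from the trained booster.
fig, ax = plt.subplots(figsize=(8, 10))
xgb.plot_importance(bst, max_num_features=20, ax=ax)
plt.tight_layout()
plt.show()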

While we cannot make assertions as to why the model prioritized the features the way it did, we do have some conjectures. For example, vehicle year may be a factor because, as the years progress, more safety features are integrated into the majority of vehicles in use. Alternatively, new laws and regulations, such as those restricting cell phone use while driving, may have been implemented over this period. And when we evaluate the feature describing victim ejection, we can clearly see that a much higher percentage of people who used seatbelts were uninjured compared with those who did not.
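A quick crosstab makes this kind of comparison easy (the column names here are assumptions about the dataset's schema):

# Row-normalized crosstab: share of each injury outcome per
# safety-equipment value. Column names are assumed, not verified.
pd.crosstab(df["victim_safety_equipment_1"],
            df["victim_degree_of_injury"],
            normalize="index")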

Additionally, although the fallacy persists that men are better drivers than women, the data shows that 76 percent of fatalities were male.
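A figure like this comes from a simple breakdown of fatalities by sex, along these lines (assuming the raw coding where category 1 means killed and a victim_sex column):

# Share of fatalities by sex; the column name and the raw target
# coding (1 == killed) are assumptions about the dataset.
fatal = df[df["victim_degree_of_injury"] == 1]
print(fatal["victim_sex"].value_counts(normalize=True))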

One revelation obtained by observing the difference between casualty rates on weekdays as opposed to weekends may hint at the involvement of sleep deprivation and/or alcohol consumption on fatal collisions. As can be seen in the plot below, there is a large spike in the number of casualties on weekends, in the early morning hours when bars and clubs are closing.
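A plot along these lines can be built by splitting fatal collisions into weekday and weekend groups by hour (the date/time column names and formats are assumptions):

# Fatal collisions by hour, weekday vs. weekend. Column names and
# the time format are assumptions about the dataset's schema.
fatal = df[df["victim_degree_of_injury"] == 1].copy()
fatal["hour"] = pd.to_datetime(fatal["collision_time"],
                               format="%H:%M:%S").dt.hour
fatal["weekend"] = pd.to_datetime(fatal["collision_date"]).dt.dayofweek >= 5

sns.kdeplot(data=fatal, x="hour", hue="weekend",
            common_norm=False, fill=True)
plt.title("Fatal collisions by hour: weekday vs. weekend")
plt.show()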

It is also interesting to note that there were far fewer casualties in the Los Angeles area during February of 2019 than in any other month that year.

2019 Casualties Heatmap by Month

Conclusion

After performing full exploratory data analysis, preprocessing, and modeling to examine motor vehicle collisions in the state of California, we were able to find many interesting contributory components, as noted above. We encourage those who found this article interesting to perform their own examinations and share their conclusions with us.

Full code for this project can be accessed on GitHub.
