Classification of Whether the Car Accident Is Day-Time or Night-Time

BERFİN SARIOĞLU
Published in The Startup
5 min read · Sep 8, 2020

(Traffic Fatalities in USA (2015))

In this project, my goal was to build a model with supervised learning techniques and to improve my database skills by storing the data in PostgreSQL and doing part of the analysis there. I chose a dataset on traffic fatalities, released by the National Highway Traffic Safety Administration (NHTSA).

Here is the Kaggle link for this dataset: https://www.kaggle.com/nhtsa/2015-traffic-fatalities

Step By Step

1) Create a PostgreSQL Database from the Command Line and Connect It to Jupyter (Anaconda)

There are 17 tables in this dataset. I will explain how to create a database from the command line using the “damage” table as an example. Note that I use Windows 7 and run the commands in Anaconda Prompt.

a) Go to the folder where you installed PostgreSQL and open its “bin” folder. Copy this path.

For me, this path is: C:\Program Files\PostgreSQL\13\bin

b) Open the command line and go to the path we copied above.

(base) C:\Users\Hp> cd C:\Program Files\PostgreSQL\13\bin

c) Run this command:

(base) C:\Program Files\PostgreSQL\13\bin>psql -U postgres

If you set a password while installing PostgreSQL, you will be asked for it when you run this command. After you enter the password, the “postgres=#” prompt means you are in the right place.

d) Now, we can create our database. Let the name of our database be “deneme”.

Run these commands:

postgres=# CREATE DATABASE deneme;
CREATE DATABASE
postgres=# \connect deneme;
You are now connected to database "deneme" as user "postgres".
deneme=#

Now that we are in our “deneme” database, we can create our tables. As I said, although there are 17 tables in my dataset, I’ll only walk through the “damage” table to show how it is done. You need to repeat the next step for each table.

e) Create the table and copy the CSV file into it with these commands:

deneme=# CREATE TABLE damage ( STATE numeric, ST_CASE numeric, VEH_NO numeric, MDAREAS numeric);
CREATE TABLE
deneme=# \COPY damage FROM 'C:\Users\Hp\Desktop\traffic\damage.csv' DELIMITER ',' CSV HEADER;
COPY 192730

Congratulations! Now you have the “damage” table.
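
Since the same two commands have to be repeated for each of the 17 tables, they can be generated from the CSV headers. The following is only a minimal sketch, not the method used in this project: the helper name `build_table_commands` is hypothetical, and it assumes every column can be declared numeric, as in the “damage” table, which may not hold for the other tables.

```python
import csv
import io
from pathlib import PureWindowsPath


def build_table_commands(csv_path, header_line):
    """Return the CREATE TABLE statement and the psql copy command for one CSV.

    Assumption: every column is declared numeric, as in the "damage" example.
    """
    # Parse the header row with the csv module so quoted names are handled.
    columns = next(csv.reader(io.StringIO(header_line)))
    table = PureWindowsPath(csv_path).stem.lower()  # damage.csv -> damage
    column_defs = ", ".join(f"{name.strip()} numeric" for name in columns)
    create = f"CREATE TABLE {table} ({column_defs});"
    copy = f"\\copy {table} FROM '{csv_path}' DELIMITER ',' CSV HEADER;"
    return create, copy


create_sql, copy_cmd = build_table_commands(
    r"C:\Users\Hp\Desktop\traffic\damage.csv",
    "STATE,ST_CASE,VEH_NO,MDAREAS",
)
print(create_sql)
print(copy_cmd)
```

The two printed lines can then be pasted at the “deneme=#” prompt. Columns that hold text would need a different type than numeric.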

f) Get pandas and postgres to work together!

import psycopg2 as pg
import pandas as pd
import pandas.io.sql as pd_sql

# Postgres info to connect
connection_args = {
    'host': 'localhost',   # connect to our local PostgreSQL server
    'dbname': 'deneme',    # database that we are connecting to
    'port': 5432,          # default PostgreSQL port
    'user': 'postgres',
    'password': 'sifre',   # replace with your own password
}
connection = pg.connect(**connection_args)

Now we can start to write queries, join tables etc.
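
A query result can be pulled straight into a DataFrame with `pd.read_sql`. The sketch below is an illustration under assumptions: it uses an in-memory SQLite database as a stand-in so that it runs without a PostgreSQL server, and all rows are made up. With the psycopg2 `connection` created above, the same `pd.read_sql(query, connection)` call should apply.

```python
import sqlite3

import pandas as pd

# Stand-in for the PostgreSQL connection (assumption: SQLite in memory).
demo_connection = sqlite3.connect(":memory:")
demo_connection.executescript(
    """
    CREATE TABLE damage (state numeric, st_case numeric, veh_no numeric, mdareas numeric);
    CREATE TABLE vehicle (st_case numeric, veh_no numeric, mod_year numeric);
    INSERT INTO damage VALUES (1, 10001, 1, 12), (1, 10001, 1, 1), (6, 60002, 2, 6);
    INSERT INTO vehicle VALUES (10001, 1, 2003), (60002, 2, 1998);
    """
)

# Join the two tables on the case and vehicle identifiers.
query = """
    SELECT d.state, d.st_case, d.veh_no, d.mdareas, v.mod_year
    FROM damage AS d
    JOIN vehicle AS v
      ON d.st_case = v.st_case AND d.veh_no = v.veh_no
    ORDER BY d.st_case, d.mdareas;
"""
joined = pd.read_sql(query, demo_connection)
print(joined)
```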

2) Feature Selection

After connecting to our database, we examined each table in detail.

Our target variable is whether a car accident in the USA happened in the day-time or at night-time. We created a datetime column and, based on the hour in that column, a target column with the values 0 (night-time) and 1 (day-time). Since the target is reasonably balanced (class 0 (night): 17373, class 1 (day): 14793), we did not need to apply methods such as oversampling.
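
The construction of the target can be sketched as follows. The article does not state which hours count as day-time, so the 06:00 to 17:59 window, the column names `timestamp` and `is_daytime`, and the sample rows are all assumptions for illustration.

```python
import pandas as pd

# Made-up sample rows; the real data come from the Accident table.
accidents = pd.DataFrame(
    {"timestamp": pd.to_datetime([
        "2015-01-01 02:15", "2015-01-01 09:40",
        "2015-01-02 17:59", "2015-01-02 18:00", "2015-01-03 23:05",
    ])}
)

# Assumed day-time window: 06:00 up to and including 17:59.
DAY_START_HOUR, DAY_END_HOUR = 6, 18
hours = accidents["timestamp"].dt.hour
accidents["is_daytime"] = ((hours >= DAY_START_HOUR) & (hours < DAY_END_HOUR)).astype(int)

# Class shares show whether the target is balanced enough to skip resampling.
class_share = accidents["is_daytime"].value_counts(normalize=True)
print(accidents)
print(class_share)
```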

We selected 14 features from these tables: Accident, Person, Distract, Cevent, Vision, Maneuver, Vehicle.

3) Feature Engineering

a) The drunk_dr feature had 4 values: 0, 1, 2 and 3. This is because each case can involve more than one person, so more than one of them can be drunk. What matters for our model is whether anyone was drunk, not how many people were. Following that logic, we recoded 2 and 3 as 1.

b) We created dummy variables for these features: day_week, month, drunk_dr, weather.
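
Steps a) and b) can be sketched in pandas as below. The sample rows are made up, and `clip` and `get_dummies` are one possible way to do this, not necessarily the calls used in the project.

```python
import pandas as pd

# Made-up sample rows using feature names from the article.
features = pd.DataFrame(
    {
        "drunk_dr": [0, 1, 2, 3, 0],
        "day_week": [1, 6, 6, 7, 2],
        "weather": [1, 1, 2, 10, 1],
    }
)

# Collapse the count of drunk people into a yes/no flag: 2 and 3 become 1.
features["drunk_dr"] = features["drunk_dr"].clip(upper=1)

# One-hot encode the categorical columns (dummy variables).
encoded = pd.get_dummies(features, columns=["day_week", "drunk_dr", "weather"])
print(encoded.columns.tolist())
```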

c) The car model year can’t be later than 2006 because the dataset is from 2005. We replaced the values above 2006 with the median of the mod_year variable. After that, we created 7 groups:

<= 1990 = 0

> 1990 & <= 1992 = 1

> 1992 & <= 1994 = 2 etc.

d) There were age values above 130. We replaced them with the median of the age column. After that, we created 9 groups:

<= 11 = 0

> 11 & <= 18 = 1

> 18 & <= 22 = 2 etc.
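
The cleaning and grouping in steps c) and d) follow the same pattern, which can be sketched with one helper. This is an illustration under assumptions: `clean_and_bin` is a hypothetical name, the sample values are made up, and only the bin edges listed above are used, so everything beyond the last listed edge falls into a single final group instead of the full 7 or 9 groups.

```python
import numpy as np
import pandas as pd


def clean_and_bin(values, max_valid, inner_edges):
    """Replace values above max_valid with the median, then bin them.

    Bins are right-closed, so edges [11, 18, 22] give
    <=11 -> 0, (11, 18] -> 1, (18, 22] -> 2, >22 -> 3.
    """
    values = pd.Series(values, dtype="float64")
    median = values.median()
    cleaned = values.where(values <= max_valid, median)
    edges = [-np.inf, *inner_edges, np.inf]
    return pd.cut(cleaned, bins=edges, labels=False).astype(int)


ages = [5, 11, 12, 18, 20, 40, 998, 999]
age_group = clean_and_bin(ages, max_valid=130, inner_edges=[11, 18, 22])
print(age_group.tolist())

years = [1985, 1991, 1993, 2001, 9999]
year_group = clean_and_bin(years, max_valid=2006, inner_edges=[1990, 1992, 1994])
print(year_group.tolist())
```

Note that the median here is computed before the invalid values are replaced, which is one reading of the description above.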

e) We applied label encoding to the state, mdrdstrd and mdrmanav features.
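
Label encoding replaces each distinct value with an integer. The sketch below uses pandas category codes as a stand-in, since the article does not say which tool was used; scikit-learn's `LabelEncoder` produces the same kind of mapping. The sample values are made up.

```python
import pandas as pd

# Made-up sample values for illustration.
states = pd.Series(["Texas", "California", "Texas", "Alabama"], name="state")

# Category codes assign 0..n-1 to the sorted unique values.
as_category = states.astype("category")
state_encoded = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))
print(state_encoded.tolist())
print(mapping)
```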

4) EDA

Using the plotly library, we created a choropleth map. It shows the number of cases for each state in the USA in 2015. The code for this map is as follows:

from plotly.offline import iplot

# mapt: DataFrame with state abbreviations in 'states' and case counts in 'st_case'
data = [
    # Choropleth layer: colors each state by its number of cases
    dict(
        type='choropleth',
        locations=mapt['states'],
        locationmode='USA-states',
        z=mapt['st_case'],
        marker=dict(line=dict(color='rgb(255, 255, 255)', width=2)),
    ),
    # Text layer: prints the state abbreviation on top of each state
    dict(
        type='scattergeo',
        locations=mapt['states'],
        locationmode='USA-states',
        text=mapt['states'],
        mode='text',
        textfont=dict(family='sans serif', size=13, color='Orange'),
    ),
]
layout = dict(
    title='Density of Traffic Accidents per State in USA(2015)',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa'),
        countrycolor='rgb(255, 255, 255)',
        showlakes=True,
        lakecolor='rgb(255, 255, 255)',
    ),
)
figure = dict(data=data, layout=layout)
iplot(figure)
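
The map code expects a DataFrame `mapt` with two-letter state abbreviations in `states` (the 'USA-states' location mode requires them) and case counts in `st_case`. How `mapt` was built is not shown, so the following is only one possible sketch: the rows are made up and the `FIPS_TO_ABBR` lookup is a hypothetical, partial mapping.

```python
import pandas as pd

# Made-up rows; state codes follow FIPS (1 = Alabama, 6 = California, 48 = Texas).
accident = pd.DataFrame(
    {"state": [48, 48, 6, 1, 6, 48], "st_case": [1, 2, 3, 4, 5, 6]}
)
# Partial, illustrative lookup from FIPS code to two-letter abbreviation.
FIPS_TO_ABBR = {1: "AL", 6: "CA", 48: "TX"}

# Count the cases per state abbreviation.
mapt = (
    accident.assign(states=accident["state"].map(FIPS_TO_ABBR))
    .groupby("states", as_index=False)["st_case"]
    .count()
)
print(mapt)
```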

We see that the number of cases increases after 12 a.m., and that Friday is the worst day of the week.

For drunk driving and the number of accidents by state, Texas and California take the lead. Drunk driving and the number of accidents appear to be strongly correlated.

The data also show that most accidents happen when the weather is clear.

5) Applying Supervised Learning Models

We applied 5 different models: CatBoost Classifier, KNeighbors Classifier, Random Forest Classifier, Decision Tree Classifier and Logistic Regression. For each model, we searched for the best parameters with GridSearchCV. Its cross-validation helps us keep overfitting under control.
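
A grid search for one of these models can be sketched as below. This is not the project's actual setup: the data are synthetic, and the parameter grid, the 5 folds and the 25% test split are hypothetical choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 14 selected features and the binary target.
X, y = make_classification(n_samples=400, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; shallower trees and larger leaves limit overfitting.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 10, 30]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Score the refitted best model on data the search never saw.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.4f}")
print(f"Held-out accuracy: {test_accuracy:.4f}")
```

A large gap between the cross-validated and the held-out accuracy would suggest that the chosen parameters still overfit.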

For predicting whether a car accident happens at night-time or in the day-time, the best model is the CatBoost Classifier. When we look at the feature importances for this classifier, we see that the age and drunk features are really important.

CatBoostClassifier: Accuracy: 68.5421%

KNeighborsClassifier: Accuracy: 54.9891%

RandomForestClassifier: Accuracy: 67.5474%

DecisionTreeClassifier: Accuracy: 66.5527%

LogisticRegression: Accuracy: 65.8067%
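
To put these accuracies in context, they can be compared with the accuracy of always predicting the larger class. The sketch below uses only the class counts and accuracies reported in this article. The comparison is approximate, because the baseline is computed on the full dataset while the accuracies were presumably measured on a held-out split.

```python
# Class counts and accuracies as reported in the article.
night_count, day_count = 17373, 14793
accuracies = {
    "CatBoostClassifier": 68.5421,
    "KNeighborsClassifier": 54.9891,
    "RandomForestClassifier": 67.5474,
    "DecisionTreeClassifier": 66.5527,
    "LogisticRegression": 65.8067,
}

# Accuracy of always predicting the larger class (night-time), in percent.
baseline = 100 * max(night_count, day_count) / (night_count + day_count)
print(f"Majority-class baseline: {baseline:.2f}%")

# List the models from best to worst with their margin over the baseline.
for name, accuracy in sorted(accuracies.items(), key=lambda item: -item[1]):
    print(f"{name}: {accuracy - baseline:+.2f} points over baseline")
```

The baseline comes to about 54.01%, so the KNeighbors Classifier barely improves on it, while the CatBoost Classifier gains roughly 14.5 percentage points.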

Feature Importance For CatBoost Classifier

Thanks for reading!

Github repo : https://github.com/berfinsarioglu/TrafficFatalities_Classification

https://www.linkedin.com/in/berfin-sarioglu/
