Classification of Whether the Car Accident Is Day-Time or Night-Time

BERFİN SARIOĞLU
Sep 8, 2020 · 5 min read

(Figure: Traffic Fatalities in the USA, 2015)

This is our third project in the data science bootcamp where we receive training under the leadership of Zekeriya Besiroglu. We developed this project together with Adnan Kilic. Our goal was to create a model using supervised learning techniques and to improve our database skills by storing the data in PostgreSQL and doing some of our analysis there. We selected a dataset on traffic fatalities, released by the National Highway Traffic Safety Administration (NHTSA).

Here is the Kaggle link for this dataset: https://www.kaggle.com/nhtsa/2015-traffic-fatalities

Step By Step

There are 17 tables in this dataset. I will explain how to create a database from the command line, using the “damage” table as the example. Note that I use Windows 7 and the Anaconda Prompt for the command-line commands.

a) Go to the path where you installed PostgreSQL and open the “bin” folder. Copy this path.

For me, this path is C:\Program Files\PostgreSQL\13\bin

b) Open the command line and go to the path we copied above.

(base) C:\Users\Hp> cd C:\Program Files\PostgreSQL\13\bin

c) Type this command:

(base) C:\Program Files\PostgreSQL\13\bin>psql -U postgres

If you set a password while installing PostgreSQL, this command will prompt you for it. After entering the password, if you see “postgres=#”, you are in the right place.

d) Now we can create our database. Let’s name our database “deneme”.

Type these commands:

postgres=# CREATE DATABASE deneme;
CREATE DATABASE
postgres=# \connect deneme;
You are now connected to database “deneme” as user “postgres”.
deneme=#

Now that we are in our database “deneme”, we can create our tables. As I said, although there are 17 tables in my dataset, I’ll only walk through the “damage” table to show how it is done. You need to repeat the next steps for each table.

e) Create the table and copy the CSV file into it with these commands:

deneme=# CREATE TABLE damage ( STATE numeric, ST_CASE numeric, VEH_NO numeric, MDAREAS numeric);
CREATE TABLE
deneme=# \COPY damage FROM 'C:\Users\Hp\Desktop\traffic\damage.csv' DELIMITER ',' CSV HEADER;
COPY 192730

Congratulations! Now you have the “damage” table.

f) Get pandas and postgres to work together!

import psycopg2 as pg
import pandas as pd
import pandas.io.sql as pd_sql

# Postgres connection info
connection_args = {
    'host': 'localhost',  # connect to our local Postgres server
    'dbname': 'deneme',   # database we are connecting to
    'port': 5432          # default Postgres port
}
connection = pg.connect(dbname='deneme', user='postgres', password='sifre')

Now we can start writing queries, joining tables, etc.

2) Feature Selection

After connecting to our database, we examined each table in detail.

Our target variable is whether a car accident in the USA happened in the day-time or at night. We created a datetime column and, according to the hour in that column, built our target column with values 0 (night-time) and 1 (day-time). Since the target variable is balanced (class 0 (night): 17373, class 1 (day): 14793), we did not need to apply methods such as oversampling.
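Deriving the target from the hour can be sketched as follows; the column name and the 06:00–18:00 daytime window are assumptions for illustration, not necessarily the exact rule used in the project.

```python
import pandas as pd

# Sample hours of day for a few accidents (illustrative values)
accidents = pd.DataFrame({'hour': [2, 7, 13, 21, 23, 10]})

# Flag hours from 06:00 up to (but not including) 18:00 as day-time
accidents['day_time'] = accidents['hour'].between(6, 17).astype(int)
print(accidents['day_time'].tolist())  # 1 = day-time, 0 = night-time
```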

We selected 14 features from these tables: Accident, Person, Distract, Cevent, Vision, Manuever, Vehicle.

3) Feature Engineering

a) The drunk_dr feature had 4 values: 0, 1, 2 and 3. This is because each case can involve more than one person, so there can be more than one drunk person. What matters for our model is whether any drunk people were involved, not how many. With that logic, we transformed 2 and 3 into 1.
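Collapsing the count into a binary flag can be sketched like this; the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative drunk-driver counts per case
df = pd.DataFrame({'drunk_dr': [0, 1, 2, 3, 0, 2]})

# Any positive count becomes 1: "was at least one driver drunk?"
df['drunk_dr'] = (df['drunk_dr'] > 0).astype(int)
print(df['drunk_dr'].tolist())  # → [0, 1, 1, 1, 0, 1]
```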

b) We created dummy variables for these features: day_week, month, drunk_dr, weather.
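With pandas this is a one-liner via get_dummies; the sample values below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Small illustrative frame with two of the categorical features
df = pd.DataFrame({'day_week': [1, 5, 7],
                   'weather': ['clear', 'rain', 'clear']})

# One indicator column per category value
dummies = pd.get_dummies(df, columns=['day_week', 'weather'])
print(sorted(dummies.columns))
```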

c) The car model year can’t have values later than 2016 because the dataset is from 2015. We replaced the values greater than 2016 with the median of the mod_year variable. After that we created 7 groups:

<= 1990 = 0

> 1990 & <= 1992 = 1

> 1992 & <= 1994 = 2 etc.
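This kind of binning can be done with pd.cut; only the first three cut points are given in the text, so the remaining bin edges here are assumptions that follow the same pattern.

```python
import numpy as np
import pandas as pd

# Illustrative model years
mod_year = pd.Series([1985, 1991, 1993, 1999, 2004, 2015])

# Bin edges: <=1990 is group 0, (1990,1992] is group 1, and so on;
# edges past 1994 are hypothetical, continuing the pattern
bins = [-np.inf, 1990, 1992, 1994, 1998, 2002, 2008, np.inf]
groups = pd.cut(mod_year, bins=bins, labels=range(7)).astype(int)
print(groups.tolist())
```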

d) There were age values greater than 130. We replaced them with the median of the age column. After that we created 9 groups:

<= 11 = 0

> 11 & <= 18 = 1

> 18 & <= 22 = 2 etc.
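The age groups follow the same cut-based approach; only the first three cut points are stated above, so the later bin edges below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative ages
age = pd.Series([5, 15, 20, 30, 45, 70])

# <=11 is group 0, (11,18] is group 1, (18,22] is group 2;
# the remaining edges are hypothetical, continuing the pattern
bins = [-np.inf, 11, 18, 22, 30, 40, 50, 60, 70, np.inf]
groups = pd.cut(age, bins=bins, labels=range(9)).astype(int)
print(groups.tolist())
```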

e) We applied label encoding to the state, mdrdstrd and mdrmanav features.
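A minimal label-encoding sketch with scikit-learn; the sample state codes are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative state values
states = ['TX', 'CA', 'TX', 'NY']

# LabelEncoder assigns an integer to each class, sorted alphabetically:
# CA -> 0, NY -> 1, TX -> 2
encoder = LabelEncoder()
encoded = encoder.fit_transform(states)
print(list(encoded))
```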

4) EDA

Using the plotly library, we created a choropleth map. It shows the number of cases for each US state in 2015. The code for this map is as follows:

data = [dict(
    type='choropleth',
    locations=mapt['states'],
    locationmode='USA-states',
    z=mapt['st_case'],
    marker=dict(
        line=dict(
            color='rgb(255, 255, 255)',
            width=2)
    )
),
    dict(
        type='scattergeo',
        locations=mapt['states'],
        locationmode='USA-states',
        text=mapt['states'],
        mode='text',
        textfont=dict(
            family='sans serif',
            size=13,
            color='Orange'
        ))]
layout = dict(
    title='Density of Traffic Accidents per State in USA (2015)',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa'),
        countrycolor='rgb(255, 255, 255)',
        showlakes=True,
        lakecolor='rgb(255, 255, 255)')
)
figure = dict(data=data, layout=layout)
iplot(figure)

We see that the number of cases increases after 12 a.m., and by weekday, Friday is the worst.

By state, Texas and California take the lead in both drunk driving and number of accidents. Drunk driving and the number of accidents appear to be strongly correlated.

The weather plot shows that most accidents happen when the weather is clear.

5) Applying Supervised Learning Models

We applied 5 different models: CatBoost Classifier, KNeighbors Classifier, Random Forest Classifier, Decision Tree Classifier and Logistic Regression. For each model we found the best parameters with GridSearchCV, which helps us deal with overfitting.
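The parameter search can be sketched as below for one of the models; the synthetic data and the grid values are illustrative, not the ones used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cross-validated search over a small, illustrative parameter grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [50, 100], 'max_depth': [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # held-out accuracy
```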

For predicting whether a car accident happens at night or in the day-time, the best model is the CatBoost Classifier. When we look at feature importance for this classifier, we see that the age and drunk features are really important.

CatBoostClassifier: Accuracy: 68.5421%

KNeighborsClassifier: Accuracy: 54.9891%

RandomForestClassifier: Accuracy: 67.5474%

DecisionTreeClassifier: Accuracy: 66.5527%

LogisticRegression: Accuracy: 65.8067%

Feature Importance For CatBoost Classifier

Thanks for reading!

https://www.linkedin.com/in/berfin-sarioglu/
