Mapping Traffic Accidents in Metro Manila
Part 1: Cleaning and Visualizing the Data
I got my hands on a dataset on traffic accidents in Metro Manila early this year, and decided to enter the realm of data science with a simple cleaning and visualization project. The primary goal: supplement the dataset’s human-readable location data with geographic coordinates.
A Quick Background
Back in March, I attended a hackathon with some students from UP Diliman. Following the main event, the organizers issued a bonus challenge: they provided a dataset of traffic accident records in Metro Manila, and tasked the teams with cleaning and geocoding it so that it could be visualized on a map. I wrote a Python program to geocode the records (it geocoded about 10,000 entries), but unfortunately my submission didn’t win.
With some free time on my hands and an interest in the field of data science, I decided to revisit my program for a second attempt. I think I have enough data to present some interesting findings on traffic accidents in the Metro Manila area.
About the Dataset
The dataset contains traffic accident records in the Metro Manila area between 2005 and 2015, split across several CSV files, one per year (YYYY.csv).
Within a CSV file, rows contain recorded accidents, and columns contain attributes of the accidents, such as date and time, location, severity, etc. Each file holds several tens of thousands of records; 2015.csv, for example, contains over 96,000 rows.
Libraries and Tools
I used a couple of third-party Python libraries and web apps to help me clean, geocode, and visualize the dataset.
- Libpostal: A C library that normalizes street addresses, even international ones. I used Libpostal to normalize the location data provided by the dataset.
- Geocoder: A Python geocoding library. I used Geocoder to geocode the normalized location string into latitudes and longitudes.
- Carto: A powerful web app for viewing location data. I used Carto to visualize and analyze the geocoded dataset.
The general algorithm of my data cleaning and geocoding program is as follows:
- Preprocess the dataset in Excel. Replace odd letters (ñ with n, …) and rename some subdivisions (Talon I / Talon 1 to Talon Dos…).
- Open the dataset in Python.
- For each row: check if there is sufficient info to clean and geocode. Some rows don’t contain enough location information to geocode accurately, or do not contain information about the time the accident occurred. Skip the rows with insufficient information.
- For rows with sufficient info, use Libpostal to standardize the provided location data into a string.
- Pass the standardized location string to Geocoder. Acquire latitude and longitude (if any).
- Save the entries depending on Geocoder’s result.
- Entries that were successfully geocoded go in one CSV file.
- Entries that could not be geocoded go in another CSV file.
By saving the geocoded and un-geocoded rows in separate files, I can attempt to tweak my program to process the un-geocoded rows at a later time.
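The loop above can be sketched in Python. The `normalize` and `geocode` callables below are hypothetical stand-ins for the Libpostal and Geocoder calls, and the column names (`location`, `time`, `lat`, `lng`) are assumptions, since the dataset’s exact schema isn’t reproduced here.

```python
import csv

def has_sufficient_info(row):
    # Hypothetical check: skip rows missing location or time data.
    return bool(row.get("location")) and bool(row.get("time"))

def process(in_path, ok_path, failed_path, normalize, geocode):
    """Split rows into geocoded and un-geocoded CSV files.

    `normalize` stands in for Libpostal's address normalization, and
    `geocode` for a Geocoder lookup returning (lat, lng) or None.
    """
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))

    geocoded, failed = [], []
    for row in rows:
        if not has_sufficient_info(row):
            continue  # not enough info to clean and geocode
        latlng = geocode(normalize(row["location"]))
        if latlng:
            row["lat"], row["lng"] = latlng
            geocoded.append(row)
        else:
            failed.append(row)

    # Write successes and failures to separate files for later retries.
    for path, bucket in ((ok_path, geocoded), (failed_path, failed)):
        if bucket:
            with open(path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=bucket[0].keys())
                writer.writeheader()
                writer.writerows(bucket)
    return len(geocoded), len(failed)
```

Keeping the geocoding function as a parameter also makes the pipeline easy to test with a stub before spending real API quota.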
Benefits of my Data Cleaning Program
My data cleaning program supplements the original dataset in a couple of ways that allow users to gain additional insights.
Geocoded Locations: The biggest change to the dataset is the addition of geographic coordinates to the included human-readable location data. The additional info allows the dataset to be visualized on a map, opening doors for location-based insights.
Age Groups: The dataset included a column for ages, but it was not well formatted for graphing. My program parsed the age column of each accident and grouped the ages of the people involved (if any) into five-year buckets, with a final bucket for ages 66 and over.
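A minimal sketch of that bucketing step, under the assumptions that the buckets run 1–5, 6–10, …, 61–65, then 66+, and that the age field arrives as a delimited string (the original column format isn’t shown):

```python
def age_bucket(age):
    """Map a positive age to a five-year bucket label, capped at '66+'."""
    if age >= 66:
        return "66+"
    low = ((age - 1) // 5) * 5 + 1  # 1-5, 6-10, ..., 61-65
    return f"{low}-{low + 4}"

def bucket_ages(raw):
    """Parse a delimited age field (e.g. '23;70') into bucket labels,
    silently dropping tokens that aren't positive integers."""
    tokens = (s.strip() for s in raw.replace(";", ",").split(","))
    return [age_bucket(int(t)) for t in tokens if t.isdigit() and int(t) > 0]
```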
Minor Improvements: Some columns that were previously blank now contain default values (such as 0, “None”, or “N/A”).
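The default-filling pass might look like the following; the column names and default values here are illustrative guesses, not the dataset’s actual schema.

```python
# Hypothetical defaults; the real dataset's columns and defaults differ.
DEFAULTS = {"killed": "0", "injured": "0", "weather": "None", "remarks": "N/A"}

def fill_defaults(row, defaults=DEFAULTS):
    """Return a copy of the row with blank cells replaced by defaults."""
    return {col: (val if val.strip() else defaults.get(col, val))
            for col, val in row.items()}
```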
At the time of writing, my program has successfully processed 15,000+ records from 2015.csv. Google Maps limits free API calls to 2,500 a day, so I’ll still need over a month to process the remaining records at my current rate. However, I think there’s enough processed data to make some interesting visualizations.
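As a back-of-the-envelope check of that estimate (using the approximate figures above):

```python
total_rows = 96_000   # approximate rows in 2015.csv
processed = 15_000    # records geocoded so far
daily_quota = 2_500   # free Google Maps geocoding requests per day

days_left = (total_rows - processed) / daily_quota  # 32.4 days: just over a month
```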
DISCLAIMER: Please take these results with a grain of salt. I’ll discuss the possible sources for error in the next section. Now, onto the results!
Accidents are marked on an orange-to-red scale: red points mean more accidents happened around that point, while orange points mean fewer (but at least one). All visualizations were done with Carto.
The dataset contained traffic accidents that occurred primarily in the Metro Manila area (at least among the records I processed), and the visualization confirms this: Metro Manila is teeming with recorded accidents compared to surrounding regions, such as Antipolo to the east.
Of the traffic accidents processed so far, most of them seem to take place in northern / middle Metro Manila. Several main highways, such as C-4 and R-7, go through Metro Manila, bringing in hundreds of thousands of vehicles each day. It’s possible to trace the outlines of those highways in the picture above.
Besides main highways and avenues, traffic accidents also took place on the smaller roads that branch out into residential and business districts. For example, the financial districts of Makati and Fort Bonifacio (south of the Pasig River in the picture above) are dotted with traffic accidents.
Many roads that permeate the business and residential districts are narrow, crowded, and precarious to navigate through. For example, a closer look at the Ayala Triangle area in Makati (pictured above) reveals many accidents in the smaller roads.
Besides northern Metro Manila, some of the traffic accidents processed took place in southern Metro Manila, such as Alabang, Las Pinas, and Paranaque. It seems that the majority of accidents in the southern area are clustered around the main roads, such as Alabang-Zapote Road and Sucat Road.
Alabang-Zapote Road is not elevated, so many streets leading into residential subdivisions intersect with it directly. From my experience driving along Alabang-Zapote Road, many of these intersections don’t have traffic lights or round-the-clock traffic enforcers, leaving drivers with the risky responsibility of merging and turning on their own judgment.
Areas for Improvement
As mentioned above, there’s a fair bit of room for error, stemming from multiple sources.
Dataset: The initial dataset is inconsistent in several areas, particularly in its location data. Often, the location data provided was not granular enough to be useful. In addition, some locations referenced in the dataset may no longer exist.
Libpostal: At the time of writing, Libpostal has limited support for Philippine addresses. Philippine-specific terms such as barangays were converted to rough equivalents (“districts”, “subdivisions”, etc.), but the conversion and normalization were not always correct.
Geocoder: Depending on the quality of the location data from the initial dataset or the quality of the normalized address from Libpostal, Geocoder could return incorrect or inaccurate geographic coordinates.
I cleaned and geocoded the 2015 portion of a dataset on traffic accidents in the Metro Manila area. I used Libpostal to normalize location data, and Geocoder to turn the normalized locations into geographic coordinates. Finally, I used Carto to visualize the geocoded dataset.
The results so far aren’t particularly surprising. Metro Manila teems with traffic accidents, both along major highways and on the smaller roads that branch out from them, and the northern half of the metro suffered more recorded accidents than the south.
This was a simple but challenging and rewarding data clean-up and visualization project. I hope to finish geocoding the 2015 dataset, move on to the earlier years, and produce more posts on interesting insights and trends in Metro Manila traffic accidents.
In the meantime, I guess I should be more careful on my daily commute.
Thank you for reading! For questions or comments, please feel free to contact me on LinkedIn.