CMU Heinz Capstone Project — Traffic Flow Estimation Using Vehicle Telematics Data

Ryan Lingo
Published in 99P Labs · 15 min read · Jan 4, 2023

Written by Chia En Lee (Natalie), George Saito, Chuchu Wu, Joey Wang, Wei Xiao, Xi Yan, Jingbo Zhang

Carnegie Mellon University Heinz College Capstone Project

Introduction

“Vision Zero” is a road traffic safety initiative first implemented in Sweden in the 1990s. Its goal is to create a transportation system in which no one is killed or seriously injured in collisions or traffic accidents. The premise behind Vision Zero is that traffic accidents are largely the result of human error and, given advances in technology, can be prevented through the design of safer roads, vehicles, and traffic systems. Traffic safety, on this view, rests on the principle that human life is more important than anything else and that it is the responsibility of designers and policymakers to create a transportation system that prioritizes the safety and well-being of all road users (Belin et al., 2012).

This project explores the possibility of leveraging vehicle telematics data to predict real-time traffic density, flow, and count in Columbus, Ohio. Traffic flow estimation using the vehicle telematics data provided by 99P Labs can offer valuable insights into the movement of vehicles on our roads and highways. By visualizing the estimates, stakeholders can better understand traffic patterns at the road-segment level, which can be used to improve artificial intelligence technologies and to develop decision strategies that make current transportation systems safer.

[1] Belin, M.-Å., Tillgren, P., & Vedung, E. (2012). Vision Zero — a road safety policy innovation. International Journal of Injury Control and Safety Promotion, 19(2), 171–179. https://doi.org/10.1080/17457300.2011.635213

Value Impact

The value of AI technology that successfully mitigates the risk of collisions and accidents is intuitive. According to the Association for Safe International Road Travel, about 1.35 million people die in road accidents each year, an average of about 3,700 lives lost on the roads every day (Association for Safe International Road Travel, 2022). In the ideal case, achieving “Vision Zero” by 2050 would save those 3,700 lives every day.

Moreover, counting traffic is vital for local and national governments to make informed decisions about mobility, infrastructure, and taxation. This information is used to better understand the present and prepare for the future.

In the US, local governments need traffic data to report to the Highway Performance Monitoring System (HPMS), a national report that each state must submit to the Federal Highway Administration. Stakeholders use traffic count data to make informed decisions about road maintenance, for example resurfacing or improving highways. Yet there are only about 500 counting stations across the entire US (Otonomo, 2022).

If vehicle telematics can be used to estimate traffic counts and traffic flow, there will be far more traffic count data to support decision-making, and at lower cost.

[2] Road Safety Facts. (n.d.). Association for Safe International Road Travel. Retrieved December 10, 2022, from https://www.asirt.org/safe-travel/road-safety-facts/

[3] How are Traffic Counts Actually Measured? (2022, January 11). Otonomo. https://otonomo.io/blog/traffic-count/

Project Team and Objective

We are a group of interdisciplinary software engineering and research students who are about to graduate from Heinz College at Carnegie Mellon University. Our areas of expertise include software engineering, web development, data engineering, machine learning, geospatial data analytics, and visualization. The objective of this project is to:

Learn how to match GIS data with vehicle telematics data, incorporate related socioeconomic data (from the Census Bureau), and combine it with probe-vehicle trajectory data to train a model that better predicts traffic counts.

Our Solutions: Map-Matching, Machine Learning, and Dashboard Visualization

In this project, we break down our approach and implementation into a three-step process.

  1. First, we preprocess the telematics data. We then snap the GPS points (provided by our client, 99P Labs) onto the road network, using the Fast Map Matching algorithm to recover the most likely trajectory.
  2. Second, we build a machine learning model to estimate traffic flow from roadway characteristics and socioeconomic factors.
  3. Third, we present the results in an interactive web dashboard where stakeholders can explore the prediction for each road segment. Our results suggest that telematics data can be used to provide timely estimates of traffic flow.

Project Timeline

Phase 1: Literature Review

Phase 1 marked the beginning of our project. Using backward induction, we identified the steps needed to finish it: map matching, machine learning, and dashboard development. Our work in this phase emphasized a high-level review of the literature on these topics. The research deepened our understanding of the concepts and methodology, paving the way for our development and implementation in the next phase.

Phase 2: Three-step development and Implementation

In this phase, the project team split into three groups, since the methodology and technology for map matching, machine learning, and dashboard development are inherently different. We leveraged each team member’s specialty in the development and deployment of each task. Throughout this phase, we regularly consulted the project advisor and the clients to align expectations and goals.

Phase 3: Deliverables

In the final phase, each group wrapped up its work with detailed documentation of the methodology used, which is presented below. We gave a final presentation on the project and submitted a project report, along with code and documentation, as our deliverables.

Literature Review

Map Matching Algorithms

The first question to answer is how to map vehicle trajectories and the necessary GIS data onto roadway segments. The map-matching algorithm is an integral part of the 99P Labs project. Many map-matching algorithms are available, and it is not obvious which one best suits this project. This literature review summarizes different kinds of map-matching algorithms, along with their strengths and shortcomings, drawing on previous academic papers and reports.

According to Quddus et al. (2007), map-matching algorithms fall into four main categories. Geometric algorithms consider only the shape of the links, not their connections or intersections; they are quick and easy to implement but sensitive to the number of data points collected and to road density. Topological algorithms add the connectivity and continuity of the links, which is their main advantage, but their insensitivity to outliers can make vehicle-heading calculations inaccurate, and dissimilarity among connecting roads can cause underperformance at junctions. Probabilistic algorithms, built on probability and statistics, define a confidence region and an error region; they solve many of the problems of the first two categories but are harder to understand. Finally, advanced map-matching algorithms integrate the three types above, eliminating many of their drawbacks; researchers have proposed many sub-categories for different purposes (Quddus, Ochieng, & Noland, 2007). A detailed summary of the four groups of map-matching algorithms and the strengths and shortcomings of their sub-categories is shown in figure 2.

Liang Li et al.’s research has more implications for the 99P Labs project’s integrity check at later stages: they developed an enhanced RAIM model that addresses the integrity problems of the traditional RAIM model by improving fault detection performance with a variable false alarm rate (VFAR), a non-Gaussian distribution, and a sigma inflation algorithm (Li, Quddus, & Zhao, 2013).

Ron Dalumpines and Darren M. Scott’s paper is the most conducive to actual model building and Python scripting: they introduced a post-processing GIS map-matching platform and detailed the five steps for implementing the algorithm in Python (Dalumpines & Scott, 2011). A flow chart of the five steps, created and summarized from their paper, appears in figure 3.

Although Wonhee Cho and Eunmi Choi’s research was based on NoSQL’s HBase and Hadoop, the algorithms they discuss are very useful to the 99P Labs project, such as the point-to-curve global method using Fréchet distance and the Hidden Markov Model (HMM) (Cho & Choi, 2017).
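To make the HMM idea concrete, the sketch below decodes the most likely sequence of road segments for a series of GPS fixes with the Viterbi algorithm. It is a minimal toy, not FMM's or Cho and Choi's implementation: the two "segments," the Gaussian emission model, and the stay/switch transition probabilities are all assumptions chosen for illustration.

```python
import math

def viterbi(observations, states, emit, trans):
    """Return the most likely hidden-state (road-segment) sequence."""
    V = [{s: emit(observations[0], s) for s in states}]  # path probabilities
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_paths = {}
        for s in states:
            # best predecessor state for s
            prev, p = max(((q, V[-2][q] * trans(q, s)) for q in states),
                          key=lambda t: t[1])
            V[-1][s] = p * emit(obs, s)
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best = max(states, key=lambda s: V[-1][s])
    return paths[best]

# Toy example: two parallel road segments at y = 0 ("A") and y = 1 ("B").
SEGMENT_Y = {"A": 0.0, "B": 1.0}

def emission(point, seg):
    # Likelihood of the GPS fix if the vehicle is on `seg` (Gaussian GPS noise).
    return math.exp(-((point[1] - SEGMENT_Y[seg]) ** 2) / (2 * 0.3 ** 2))

def transition(prev_seg, seg):
    # Vehicles tend to stay on the same segment between consecutive fixes.
    return 0.8 if prev_seg == seg else 0.2

points = [(0, 0.1), (1, 0.05), (2, 0.9)]
print(viterbi(points, ["A", "B"], emission, transition))  # ['A', 'A', 'B']
```

The transition term is what distinguishes this from naive nearest-segment snapping: the final fix near segment B is only accepted because its emission likelihood outweighs the switching penalty.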

The choice of map-matching algorithm for the 99P Labs project may depend on the data collected from sensors, the quality of the spatial data, the techniques to be used, validation considerations, integrity, feasibility of implementation, and so on.

Figure 2. Pros and Cons of Different Map-Matching Algorithms
Figure 3. Five Steps to Implement Ron Dalumpines and Darren M. Scott’s GIS Map-Matching Algorithm (Ron Dalumpines, Darren M. Scott, 2011)

[4] Quddus, M. A., Ochieng, W. Y., & Noland, R. B. (2007). Current map-matching algorithms for transport applications: State-of-the-art and future research directions. Transportation Research Part C, 312–328.

[5] Cho, W., & Choi, E. (2017). A basis of spatial big data analysis with map-matching systems. Cluster Computing, 20, 2177–2192.

[6] Dalumpines, R., & Scott, D. M. (2011). GIS-based map-matching: Development and demonstration of a postprocessing map-matching algorithm for transportation research. In S. R. Geertman (Ed.), Advancing Geoinformation Science for a Changing World (pp. 101–119). Berlin, Heidelberg: Springer.

[7] Li, L., Quddus, M., & Zhao, L. (2013). High accuracy tightly-coupled integrity monitoring algorithm for map-matching. Transportation Research Part C, 13–26.

Machine Learning

We found several approaches to the machine learning methodology in the literature; their pros and cons are shown below:

Figure 4. Machine Learning Literature Review

Formula Based Approach

  • E.g., AADT = 24-hour traffic count from a short-term count station × adjustment factors
  • Adjustment factors include daily, monthly, and seasonal components
  • Disadvantage: can only be used on roadway segments that have short-term count locations
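The factor approach reduces to a simple product. A minimal sketch, where the count and factor values are hypothetical rather than drawn from any published factor table:

```python
def estimate_aadt(short_count_24h, daily_factor, seasonal_factor):
    """Expand a 24-hour short-term count into an annual average daily
    traffic (AADT) estimate by applying adjustment factors."""
    return short_count_24h * daily_factor * seasonal_factor

# Hypothetical example: a 12,000-vehicle weekday count, adjusted for
# day-of-week (1.05) and season (0.92).
print(round(estimate_aadt(12_000, 1.05, 0.92)))  # 11592
```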

Linear regression that utilizes roadway characteristics and socioeconomic factors

  • E.g., model (R² = 0.82): AADT = −5625 + 8493 FCLASS (functional classification, e.g., collector road, minor arterial, or principal arterial) + 219 LANE (lane count of the roadway segment) − 1.16 POPBUFF (population) − 0.58 NONRETAILEMBUFF (all other employment) + 11.55 RETAILEMBUFF (retail employment)
  • Advantage: can accurately estimate AADT on the desired roadways
  • Disadvantages: census data is only updated every five years, and other types of data may be time-consuming to collect
  • For rural roads: principal component analysis/clustering + regression
Figure 5. Socioeconomic Factors for Linear Regression
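The fitted equation above can be applied directly once the segment attributes are known. The sketch below encodes the published coefficients; the input values in the example are hypothetical, chosen only to show the arithmetic:

```python
# Coefficients of the reviewed linear model (R^2 = 0.82)
COEFS = {
    "intercept": -5625,
    "FCLASS": 8493,            # functional classification code
    "LANE": 219,               # lane count of the roadway segment
    "POPBUFF": -1.16,          # population within the buffer
    "NONRETAILEMBUFF": -0.58,  # all-other employment within the buffer
    "RETAILEMBUFF": 11.55,     # retail employment within the buffer
}

def predict_aadt(fclass, lanes, pop, nonretail_emp, retail_emp):
    """Evaluate the linear AADT model for one roadway segment."""
    return (COEFS["intercept"]
            + COEFS["FCLASS"] * fclass
            + COEFS["LANE"] * lanes
            + COEFS["POPBUFF"] * pop
            + COEFS["NONRETAILEMBUFF"] * nonretail_emp
            + COEFS["RETAILEMBUFF"] * retail_emp)

# Hypothetical segment: FCLASS=2, 4 lanes, buffer population 1,000,
# 500 non-retail jobs, 200 retail jobs.
print(round(predict_aadt(2, 4, 1000, 500, 200)))  # 13097
```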

Neural Network

  • Advantage: does not require the automatic traffic recorders (ATRs) to be grouped
  • Disadvantages: artificial neural network models proved less accurate and too complex to interpret (a black box)
Table 1. Variables, Interpretation, and Data Sources

Data Sources

Data processing is necessary to convert the telematics data into a form suitable for deploying the map-matching algorithm at later stages. The telematics data was provided by email by the Developer Advocate of 99P Labs.

The folder contains host data, rvbsm data, spat data, and an annotated schema file. From the schema file, we learned that the host data contains the vehicle’s telematics readings at 0.5-second intervals: every 0.5 s, a valid telematics device records critical vehicle information, including the following features:

Table 2. Important Features of Host Data
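The half-second cadence means each second of driving yields two records, which often need aggregating before downstream use. A minimal sketch with made-up rows (the real column names live in the schema file summarized in Table 2):

```python
from collections import defaultdict

# Hypothetical host-data rows: (timestamp in seconds, speed in m/s),
# recorded every 0.5 s as described above.
rows = [(0.0, 10.0), (0.5, 10.4), (1.0, 10.8), (1.5, 11.0)]

def mean_speed_per_second(rows):
    """Average the half-second readings that fall in each whole second."""
    buckets = defaultdict(list)
    for t, speed in rows:
        buckets[int(t)].append(speed)
    return {sec: sum(v) / len(v) for sec, v in sorted(buckets.items())}

print(mean_speed_per_second(rows))  # {0: 10.2, 1: 10.9}
```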

Methodology

Map-Matching Methodology

The Fast Map Matching (FMM) algorithm is an open-source map-matching framework published on GitHub (https://github.com/cyang-kth/fmm), aimed at matching noisy GPS coordinate data to road networks (Cyang-Kth). It is versatile in many respects: the C++ implementation ensures high performance; the Python API makes it user-friendly; it scales up easily; it accepts multiple data formats; and it offers hexagon-based matching accuracy.

To make this map-matching algorithm suitable for our specific case, we made several changes, mainly to our shapefiles:

  1. The shapefile available to us did not match the format used in the algorithm’s demo: the units of the GPS coordinates differed. We therefore converted the map units in our shapefile to match the units of the GPS coordinates.
  2. We generated and added a new attribute (LinkID) to the trajectory dataset.
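The unit conversion in step 1 amounts to reprojecting the shapefile. In practice a GIS library does this (e.g. geopandas' `to_crs` or pyproj), but for one common case, spherical Web Mercator meters to WGS84 degrees, the underlying math is short. Note that EPSG:3857 here is an illustrative assumption, not necessarily our shapefile's actual coordinate system:

```python
import math

R = 6378137.0  # spherical Web Mercator earth radius, meters

def mercator_to_wgs84(x, y):
    """Convert EPSG:3857 (meters) to EPSG:4326 (lon/lat degrees)."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat

def wgs84_to_mercator(lon, lat):
    """Inverse conversion: degrees back to Web Mercator meters."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Round trip through a point near Columbus, OH
x, y = wgs84_to_mercator(-83.0, 40.0)
lon, lat = mercator_to_wgs84(x, y)
print(round(lon, 6), round(lat, 6))  # -83.0 40.0
```

For real pipelines, `GeoDataFrame.to_crs` (geopandas) or a pyproj `Transformer` handles ellipsoidal datums correctly and is preferable to hand-rolled math.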

Our map-matching workflow includes:

  1. Load GPS data
  2. Load network and graph
  3. Compute the UBODT: the UBODT is the result of FMM’s precomputation and is used when the algorithm predicts trajectories. For each node, the UBODT stores all nodes within a certain distance (300 meters), which improves the running speed of prediction.
  4. Configure the Fast Map Matching model (input, output)
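The UBODT precomputation in step 3 can be sketched as a bounded Dijkstra search from every node, keeping only destinations within the distance threshold. This is a conceptual toy, not FMM's actual C++ implementation:

```python
import heapq
from collections import defaultdict

def build_ubodt(edges, max_dist=300.0):
    """For each node, precompute shortest-path distances to every node
    reachable within max_dist meters: an upper-bounded origin-destination
    table, the idea behind FMM's UBODT."""
    graph = defaultdict(list)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        nodes.update((u, v))
    table = {}
    for src in nodes:
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in graph[u]:
                nd = d + w
                if nd <= max_dist and nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        table[src] = dist
    return table

# Toy network: chained road nodes with edge lengths in meters
table = build_ubodt([("a", "b", 100), ("b", "c", 150), ("c", "d", 200)])
print(table["a"])  # {'a': 0.0, 'b': 100.0, 'c': 250.0}  ('d' is 450 m away)
```

Paying this cost once up front is what lets the matcher look up candidate routes between consecutive GPS points quickly instead of running a shortest-path search per point.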

After completing map matching, we output the snapped data and hand it off to the machine learning stage. An example of the input and output is shown below:

Figure 6. Input and output for Map Matching

[8] Cyang-Kth. (n.d.). Cyang-KTH/FMM: Fast map matching, an open-source framework in C++. GitHub. Retrieved December 11, 2022, from https://github.com/cyang-kth/fmm

Map Matching Limitations

  1. The trajectory predictions depend on the input network and on the result of precomputation. If the network is updated, we need to rerun the predictions for all trajectories, which is time-consuming and computationally expensive.
  2. GPS coordinates cannot reflect altitude, which may make the predicted trajectories imprecise.

Map Matching Potential Improvement

  1. Tune the parameters for better accuracy.
  2. Try other map-matching algorithms in search of better solutions.

Machine Learning Methodology

Data Source and Processing

The visualizations throughout the dashboard were created using simulation data provided by the Mobility Data Analytics Center (MAC) at Carnegie Mellon University, combined with census data collected from the U.S. Census Bureau and telematics data from 99P Labs. The simulation approximates an average day for the Columbus transportation network. Simulation results are estimated at intervals over a period from 5 am to 11 am and include baseline indicators such as car and truck inflow, travel time, free-flow travel time, speed, and public transit passenger inflow.

Figure 6. Machine Learning Data Source

Mapping

We map our data through the process shown in the following diagram:

Figure 7. Machine Learning Data Matching Schema

Machine Learning Pipeline

The steps we take for the machine learning pipeline are:

First, all accessible features are introduced as our explanatory variables. Second, the whole dataset is split into a training set (80%) and a testing set (20%).

Next, we searched for the model that gives us the lowest MSE. The first model we tried was Ridge regression. Since the penalty term is a hyperparameter, we started with a large number of potential values, and we visualized the results to make the selection of the penalty term more intuitive.

Figure 7. Ridge Regression result

From the above graph, we can confidently conclude that the best alpha lies between 10^-2 and 10^2. In the left-hand graph, alphas between 10^-2 and 10^2 give meaningful results, because we do not want alphas so large that they penalize all coefficients to 0. In the right-hand graph, alphas between 10^-2 and 10^2 give the lowest errors.

Now we can fit Ridge regression with cross-validation over the candidate alphas and print out the related values.

Figure 8. RidgeCV Training and Result
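To illustrate what the cross-validated alpha search is doing, here is a closed-form ridge fit with a simple k-fold search over candidate alphas. The data below is synthetic, standing in for our roadway/census feature matrix, so the chosen alpha is not the project's actual hyperparameter:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X'X + alpha*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def pick_alpha(X, y, alphas, k=5):
    """Return the alpha with the lowest k-fold cross-validated MSE."""
    folds = np.array_split(np.arange(len(y)), k)
    best_alpha, best_mse = None, np.inf
    for alpha in alphas:
        mse = 0.0
        for fold in folds:
            train = np.ones(len(y), dtype=bool)
            train[fold] = False
            w = ridge_fit(X[train], y[train], alpha)
            mse += np.mean((X[fold] @ w - y[fold]) ** 2)
        mse /= k
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.5, 0.5]) + 0.1 * rng.normal(size=200)
alphas = np.logspace(-2, 2, 9)  # the 10^-2 .. 10^2 range discussed above
print(pick_alpha(X, y, alphas))
```

In practice, `sklearn.linear_model.RidgeCV` (and `LassoCV` for the Lasso counterpart) automates this same search.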

We also tried the Lasso model, following the same process as for Ridge. To keep this report concise, the repeated steps are omitted. Below are the visualization for the Lasso model and the result of Lasso regression with cross-validation.

Figure 9. LASSO Regression result
Figure 10. LASSO Penalty Term Results

Fourth, we evaluated the best Ridge model and the best Lasso model on our testing dataset.

Figure 8. Test MSE for both models

Based on the test MSE, we conclude that Ridge performs slightly better than Lasso. We also found that some variables were not helpful in predicting counts because their coefficients were 0, so we deleted these four variables and updated our feature set.

Count Prediction

Equation

Time-unrelated parameters

For time-independent variables such as the number of lanes, area type, width, and turn lanes, we refer back to historical road data. Links can also easily be related to census block information, so we included the census data mapping in this step.

Time-related parameters

We have only two time-related variables. One is the count data preprocessed by the map-matching group: the original format was one count per row, with a timestamp and a link ID, which we grouped by hour. The second variable is the hour itself. Since we are dealing with only half a circle (morning counts only), we adopted circular regression using the sine of the hour divided by 12.

Figure 9. Time variable circular function
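The time encoding described above is a one-line transform. The exact functional form here is our reading of "sine of the hour divided by 12," so treat it as a sketch:

```python
import math

def hour_feature(hour):
    """Map hour-of-day onto a half circle: sin(pi * hour / 12).
    Morning hours (5-11) land on the arc around the peak at hour 6,
    instead of being treated as an unbounded linear trend."""
    return math.sin(math.pi * hour / 12)

for h in (5, 8, 11):
    print(h, round(hour_feature(h), 3))
```

The encoded feature then enters the regression like any other column, with the fitted coefficient scaling the sine wave to the observed morning count profile.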

Prediction

We take the coefficient results from the machine learning model, apply them to both the time-independent and the time-related variables, and generate the predicted counts.

Machine Learning Limitations

Limited area

The model was built on the real count data provided by the Mobility Data Analytics Center (MAC) and on census data. With these data from Columbus, we can glimpse the relationships between traffic flow and the relevant roadway data. However, because of the limited study area, we would need more information and training data to adapt the model to other cities.

Census data level

We attempted to gather census data from the Census Bureau to further support the modeling. Unfortunately, some of the geographic data needed for this calculation is sparse. Median income data is not available at our preferred geographic level, Census Block Groups, and had to be downloaded at the Census Tract level instead. For compatibility and ease of use, we adopted the Tract level for all census data.

Scalability

One of the most time-consuming parts is data preprocessing. Resource-wise, the required data is scattered across several sources and needs manual collection and verification. Technology-wise, since the roadway data and the census data do not share a common geographic identifier, we currently cannot join them with a simple table join; instead, we map the two datasets with QGIS’s “join attributes by location” function. While this operation is workable for this project, it is not ideal when there are multiple locations and several data versions. The roadway data format must also be followed strictly to minimize errors.
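Conceptually, QGIS's "join attributes by location" is a point-in-polygon test repeated per feature. A minimal ray-casting sketch with toy geometries (the tract id is hypothetical; real data would go through geopandas `sjoin` or QGIS itself):

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray cast rightward."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def join_by_location(points, tracts):
    """Tag each point with the id of the tract polygon containing it."""
    out = []
    for pt in points:
        tract = next((tid for tid, poly in tracts.items()
                      if point_in_polygon(pt, poly)), None)
        out.append((pt, tract))
    return out

tract = [(0, 0), (2, 0), (2, 2), (0, 2)]  # toy census-tract polygon
print(join_by_location([(1, 1), (3, 3)], {"39049-0001": tract}))
# [((1, 1), '39049-0001'), ((3, 3), None)]
```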

Machine Learning Potential Improvement

  1. Automate public data retrieving
  2. Integrate roadway data source and format
  3. Implement roadway data simulation to expand the usability
  4. Increase observation area both in terms of roadway data and counts

Dashboard Methodology

After obtaining the output data from the machine learning section, we developed a dashboard built upon the dashboard constructed by last year’s 99P Lab team to visualize the traffic flow prediction results.

The dashboard is a Django project, with Django + Python on the backend and Bootstrap + JavaScript + HTML + CSS + Mapbox on the frontend.

These are the major parts of the traffic dashboard:

  1. A traffic flow visualization page. Once the user selects a time period, this page displays the traffic flow within that period.
  2. A traffic trajectory visualization page. Within the time period selected by the user, this page displays the trajectories passing through a link when the user clicks on it.
  3. The previous sustainability dashboard, integrated into the current traffic flow estimation web application.
Figure 10(a). Dashboard showing the 12 pm trajectories that pass through a selected link
Figure 10(b). Dashboard showing the traffic flow of links

Django Web Application Launch Instructions

Mapbox Implementation Details

Data Sets and Format Choice

  • Data files are called tilesets in Mapbox
  • Data format: CSV, or GeoJSON with all the features and geometry data
  • < 300 MB: upload the CSV to Mapbox Studio
  • > 300 MB: push the GeoJSON to a tileset via the Mapbox CLI
Figure 11. Tilesets on Mapbox Studio

Tileset Display Settings

  • Set map center and zoom level in recipe
Figure 12. Tileset recipe on Mapbox

Frontend Structure

JS part:

  • Import Mapbox
  • Load map
  • Add layer with data source
  • Set slider for the hour filtering
  • Set filtering for clicking a certain link

HTML part:

  • Set Map
  • Set Legend
  • Set Slider

Web Dashboard Potential Improvement

Integrate larger datasets and develop a backend database that interacts with the Mapbox visualization through API tools. That way, all the historical traffic flow and trajectory data would be stored and managed so it can be retrieved at any time.

Recommendations and Conclusions

At present, all of our work is based in Columbus, Ohio. In the future, we envision expanding the coverage to new geographic areas and scenarios. Building pipelines for data collection and cleaning would make machine learning more efficient and model refinement more effective. The visualization framework developed for this project could easily be reworked to display current data for other cities. With some additional fine-tuning, the tool could be expanded to include comparisons with policy intervention scenarios.

Acknowledgment

We would like to thank Professor Sean Qian for his guidance and support throughout this project. We would also like to thank Stan Caldwell and Karen Lightman at Carnegie Mellon University.
