Decoding Travel Times: Exploring Telematics Data Dynamics

Published in 99P Labs · 8 min read · May 8, 2024

Project Summary

We, the 99P capstone team of the MTDA program at The Ohio State University, built a model that compares real-life trip times to the ideal times projected by the Google Distance Matrix API and predicts whether a given trip will run over or under its estimate based on geography, time of day and local weather. The model’s underwhelming accuracy caused us to rethink our assumptions about the relationship between traffic congestion and drive time.

Who We Are

We are a capstone team of three people (Evan, Qamar and Wajihah) from the Master of Translational Data Analytics (MTDA) program at The Ohio State University. This master’s program is mainly for adults, most of whom are already working with data in some capacity, who want to transition to a career in data analytics. The “translational” in the name captures the program’s emphasis on the full problem-solving pipeline, from developing a research question or problem statement, to analysis and modeling, to presenting results. We have coursework not just in statistics, programming and machine learning, but also in data governance, UI/UX and visual design.

The capstone is a year-long project, divided into four sprints, exploring and analyzing real-world data in conjunction with a sponsor. 99P Labs was a great sponsor to work with because they showed us a lot of trust and flexibility in the problem we chose to tackle and the data we used. On the flip side, however, there were times when the open-endedness itself felt like a challenge to overcome.

How It Started

99P came to us with a deceptively simple prompt: start with telematics data, and, using your knowledge of data analytics, say something about “traffic.”

Telematics, a portmanteau of telecommunications and informatics, generally refers to the collection of data on vehicle usage using either a dedicated device connected to the vehicle’s internal computer system or indirectly through a device like a GPS tracker or even a cell phone. “Traffic” can have multiple meanings, not all of them related to cars and trucks. The definitions of traffic that are relevant to our project, according to Merriam-Webster, are:

The vehicles, pedestrians, ships, or planes moving along a route.
Congestion of vehicles (e.g. “stuck in traffic”).
The movement (as of vehicles or pedestrians) through an area or along a route.

We held off on choosing a more specific focus for our project until we could do some EDA (Exploratory Data Analysis) on our dataset, just to see what type of questions the data might lend itself to answering. Initially, 99P provided us with data taken from The Ohio State University’s logistics fleet. However, we soon realized that there was an issue with this dataset.

Telematics data records typically consist of a series of “snapshots” giving a vehicle’s position at a certain moment in time, and maybe other data such as speed, acceleration, brake depression, etc. If a record is taken every couple of seconds, it’s possible to very accurately reconstruct the route a given vehicle took.

However, the OSU fleet vehicles only reported position every two minutes. This would be low fidelity for any set of vehicles, but for delivery vehicles making frequent turns and stops it was especially bad. Reconstructing anything these cars and trucks did would be basically impossible.
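To illustrate why the reporting interval matters, here is a minimal sketch that estimates path length from position snapshots and compares a densely sampled route against the same route with only its endpoints kept. The coordinates are made up for illustration, and `haversine_m` and `path_length_m` are hypothetical helper names, not part of any dataset or library mentioned here.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def path_length_m(snapshots):
    """Sum the distances between consecutive snapshots."""
    return sum(haversine_m(p, q) for p, q in zip(snapshots, snapshots[1:]))


# Hypothetical L-shaped route: drive north, then turn east.
dense = [
    (40.000, -83.000),
    (40.005, -83.000),
    (40.010, -83.000),
    (40.010, -82.995),
    (40.010, -82.990),
]
# The same trip if only the first and last positions had been recorded.
sparse = [dense[0], dense[-1]]

print(round(path_length_m(dense)), "m from dense snapshots")
print(round(path_length_m(sparse)), "m from sparse snapshots")
```

With only the endpoints, the turn disappears and the route collapses into a straight line, so the estimated distance shrinks. A delivery vehicle that turns and stops several times between two-minute reports loses that detail in the same way.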

So, 99P then gave us a sample of data from the V2X (Vehicle-to-Everything) study, a part of the Smart Columbus initiative. Vehicles in the study are the personal vehicles of employees in the Columbus, Ohio metropolitan region. These vehicle records had the amazing fidelity of twice per second, which was so much data it was actually hard to get enough processing power just to open the file.

The Timeline

The switch to the V2X dataset came about halfway through our first semester, which put us in a time crunch: we were originally expected to have EDA and a data dictionary completed by then and be ready for our first sprint. Instead, we spent our first sprint redoing our EDA with the new dataset, still without a specific research question.

Here’s a quick summary of how our sprints went:

Sprint One: We hastened to redo our EDA but didn’t find anything inherent in the data that looked like a dependent variable or something with meaningful classification potential.

Sprint Two: We tried what we thought were our best ideas but wound up at a bunch of dead ends. This was a “crisis” moment for our team.

Sprint Three: Following Ryan’s suggestion to make use of a distance matrix API, we finally found our dependent variable and identified a specific research question! Now to quickly redo all of our EDA and run some analysis.

Sprint Four: Each of us built a machine learning model based on the work we’d been doing so far, with the goal of seeing which method (K Nearest Neighbors, Decision Tree or Random Forest) produced the best results.

Developing a Research Question

It took time to develop a research question when we didn’t have an obvious dependent variable. We could calculate the duration of each trip, but with so many unique start and end coordinates it was not possible to meaningfully compare duration trip to trip.

This changed when we made use of the Google Distance Matrix API to generate projected times for each trip. These projected times were based on coordinates, not on time of day, so they would not incorporate the effects of any predictable patterns of traffic congestion. By comparing actual durations with projections, we came up with a new attribute, “duration delta.” This attribute could be directly compared between trips, and what’s more, we believed that through the duration delta, we could indirectly infer the presence of congestion. Trips that ran longer than projected would suggest more cars on the road, while trips that ran shorter would indicate relatively sparsely traveled roads. Thus, we set off on analysis and modeling.
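As a minimal sketch of how a duration delta could be computed, the snippet below reads the projected travel time from a Distance Matrix style JSON response and subtracts it from the observed trip duration. The function names, the sample response and the sign convention (actual minus projected) are assumptions made for illustration, not our project code. No network call is made.

```python
from datetime import datetime


def projected_seconds(response: dict) -> int:
    """Read the projected travel time (seconds) for a single origin/destination pair."""
    element = response["rows"][0]["elements"][0]
    if element.get("status") != "OK":
        raise ValueError(f"no route found: {element.get('status')}")
    return element["duration"]["value"]


def duration_delta(start: datetime, end: datetime, projected_s: float) -> float:
    """Actual minus projected duration in seconds (assumed sign convention).

    A positive value means the trip took longer than projected.
    """
    actual_s = (end - start).total_seconds()
    return actual_s - projected_s


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "status": "OK",
    "rows": [
        {"elements": [{"status": "OK", "duration": {"value": 900, "text": "15 mins"}}]}
    ],
}

delta = duration_delta(
    datetime(2022, 3, 1, 8, 0, 0),
    datetime(2022, 3, 1, 8, 18, 30),
    projected_seconds(sample_response),
)
print(delta)  # 210.0 (the trip ran 3.5 minutes over the projection)
```

Because the delta is measured against each trip’s own projection, it can be compared between trips with completely different start and end points.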

Exploring the V2X Dataset

Our dataset consisted of telematics records taken from the personal vehicles of employees in the Columbus, Ohio metropolitan area. The sample we studied included 2,741 trips taken between January 2 and October 1, 2022.

One of our assumptions was that duration delta would be higher during the typical “rush hour” period. Imagine our surprise when we checked the distribution of trips in our dataset by time of day:

We expected the volume of trips to peak around 9 am and 5 pm, with perhaps an additional lesser peak during the changeover between second and third shift (midnight to 1 am) because plants operate 24 hours a day. Instead, we saw that trip volume peaked around 2 am, steadily fell until it sank to virtually nothing at 10 am, then gradually rose again until the 2 am peak. This completely contradicted what we thought of as typical traffic patterns. Without knowing more about the vehicles’ intended uses, we struggle to think of an explanation for this.
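A distribution like the one described above can be tabulated with a simple count of trips per starting hour. The sketch below uses three invented timestamps and a hypothetical `trips_by_hour` helper. It assumes the timestamps are already expressed in local time.

```python
from collections import Counter
from datetime import datetime


def trips_by_hour(start_times):
    """Return a 24-element list: number of trips starting in each hour of the day."""
    counts = Counter(t.hour for t in start_times)
    return [counts.get(hour, 0) for hour in range(24)]


# Invented trip start times (assumed to be local time).
sample_starts = [
    datetime(2022, 1, 2, 2, 15),
    datetime(2022, 1, 2, 2, 40),
    datetime(2022, 1, 3, 17, 5),
]

hourly = trips_by_hour(sample_starts)
peak_hour = max(range(24), key=lambda hour: hourly[hour])
print(hourly[2], hourly[17], peak_hour)  # 2 1 2
```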

Another aspect we explored was the geographic distribution of trips. To better classify our trips, we grouped all our start/end points into five “zones” around Columbus. The final result looked like this:

Traffic was definitely biased toward the northern “outer belt,” not surprising for employees who work at facilities to the northwest of Columbus, but certainly not representative of Columbus traffic patterns overall. The hard-to-explain part was how few trips there were to or from Zone 0. We would expect many of these trips if the cars were primarily used for work-related commuting. However, only about 26% of trips ended in Zone 0, and no trips on record started from that zone.

Project Methods and Results

In our capstone project, we tackled the challenge of comparing real-world trip times to ideal projections using telematics data from the V2X study. Here’s how we approached it:

1. Feature Creation: To compare trips with varying start and end points, we created the “duration delta,” measuring the difference between actual and expected trip durations. This allowed us to analyze how much trips deviated from their projected times, providing insights into traffic conditions and congestion.

2. Weather Analysis: We integrated historical weather data into our analysis, identifying conditions that significantly impacted trip duration delta. Snowy, overcast conditions, for instance, led to the highest duration delta, highlighting the influence of weather on travel times.

3. Zone Identification: Using K-Means clustering, we categorized trip start and end locations into five geographical zones. This approach helped us understand spatial patterns, concentration areas, and unique trip characteristics associated with each zone, aiding in transportation planning and route optimization.

4. Data Filtering: We addressed data challenges such as round trips and obfuscated coordinates. Filtering out round trips and handling obfuscated data improved the quality of our analysis and models.

5. Modeling: Initially aiming for continuous prediction, we shifted to building classifiers to predict whether trips would be over- or under-estimated (“late” classification). Models like K Nearest Neighbors, Decision Tree, and Random Forest were explored, each with varying accuracies in predicting trip duration deviations.

Through these methods, we navigated complexities in the dataset, identified influential factors like weather and geographic zones, and developed models to understand and predict trip duration dynamics.
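The zone identification step can be sketched with a plain version of the K-Means procedure (Lloyd’s algorithm) in standard-library Python. This is an illustrative stand-in, not the library implementation used in the project. It treats latitude and longitude as flat coordinates, which is an assumption that is only reasonable at city scale. The six points and the choice of two clusters are made up to keep the example small, and `kmeans` and `squared_distance` are hypothetical names.

```python
import random


def squared_distance(p, q):
    """Squared planar distance between two (lat, lon) tuples."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                updated.append(centroids[c])  # keep an empty cluster where it was
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return labels, centroids


# Invented trip endpoints forming two loose groups.
endpoints = [
    (40.10, -83.15), (40.11, -83.16), (40.09, -83.14),
    (39.95, -82.99), (39.96, -82.98), (39.94, -83.00),
]
labels, centroids = kmeans(endpoints, k=2)
print(labels)
```

Each trip’s start and end point then carries a zone label that a classifier can use as a categorical feature.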

Model Results and Conclusions

Despite our efforts, none of the predictive models we tested exceeded 60% accuracy. Our best model, the Decision Tree, achieved 57.3%, only slightly better than the worst-performing model, K Nearest Neighbors. This leads us to several key conclusions:

1. Limited Model Improvements: Other modeling techniques exist, but our three models performed similarly, which suggests that alternative approaches would offer limited gains.

2. Data Bias and Challenges: The bias in the V2X dataset, along with round trips and obfuscated coordinates, may make it a poor fit for our modeling approach. Exploring cleaner data sources could yield better insights.

3. Random Factors in Travel Time: Travel time between locations within a city seems influenced more by random factors than predictable attributes like time of day or weather, challenging the usefulness of comparing actual to theoretical travel times based solely on coordinates.

4. Route Considerations: Our analysis didn’t include driver route choices, a significant factor in travel time. Incorporating route data would require advanced technical capabilities beyond our project’s scope.

Moving forward, we advise caution in replicating our analysis. Future teams should address data bias, explore alternative data sources, and consider route information for more accurate analyses. Communication with data owners about potential data issues upfront is also crucial for better outcomes.
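For anyone revisiting the accuracy figures reported above, one useful reference point is a baseline that always predicts the more common class. The sketch below shows that comparison on invented labels. This post does not report the class balance of the real dataset, so the numbers here say nothing about our actual models, and both function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Invented labels: 1 = trip ran over its projection ("late"), 0 = it did not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

print(accuracy(y_true, y_pred))            # 0.7
print(majority_baseline_accuracy(y_true))  # 0.6
```

A classifier whose accuracy sits close to this baseline is adding little beyond knowledge of the class balance.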

References

Arribas-Bel, D. (2017). contextily: context geo tiles in Python. Retrieved from https://contextily.readthedocs.io

Jordahl, K. (2014). GeoPandas: Python tools for geographic data. Retrieved from https://geopandas.org/en/stable/

Liu, Y., & Wu, H. (2017). Prediction of road traffic congestion based on Random Forest. 2017 10th International Symposium on Computational Intelligence and Design (ISCID). https://doi.org/10.1109/iscid.2017.216

McDonnell, K., Murphy, F., Sheehan, B., Masello, L., Castignani, G., & Ryan, C. (2021). Regulatory and technical constraints: An overview of the technical possibilities and regulatory limitations of vehicle telematic data. Sensors, 21(10), 3517. https://doi.org/10.3390/s21103517

National Geospatial-Intelligence Agency. (2017). Map Projections for GEOINT Content, Products, and Applications (NGA.SIG.0028_1.0_MAPPROJ 2017). https://nsgreg.nga.mil/doc/view?i=4478&month=12&day=10&year=2023

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Visual Crossing Corporation. (2024). Visual Crossing Weather (2022–2022). [data service]. Retrieved from https://www.visualcrossing.com/