Published in 99P Labs

Understanding data-driven driver safety personas — OSU MTDA Capstone Reflection

Project Overview

Each year in Columbus, there are 29,654 traffic crashes that result in an average of 594 serious injuries and another 101 deaths (Ohio State Highway Patrol, 2016–2021). A fair portion of these accidents are caused by driving under the influence, speeding, and driving while distracted — all completely preventable behaviors. In recognition of the vital role vehicle and user safety can play in preventing accident-related injuries and deaths, 99P Labs has set a goal to improve driver safety. To realize this goal, it will be necessary to enhance vehicle safety features and engineer the vehicle interface to optimize the users’ ability to focus. This analysis uses data transmitted from 99P Labs vehicles to identify driver personas and their specific behaviors.

For our Ohio State capstone project, we sought to identify specific driver personas and their safety behaviors using data from 99P Labs customers in Columbus, and to use this information to develop user-specific safety recommendations. This project was completed as part of our Master of Translational Data Analytics (MTDA) program.

The overarching goal of this study was to find opportunities to make safety improvements that ultimately prevent serious injuries and fatalities from automobile accidents. We approached this problem through a predictive analysis that led to the creation of driver personas. For each persona group created, we suggested driver-specific safety features that may help members of this group exercise more judicious driving behaviors. While it is impossible to control the actions of every single driver in every car, 99P Labs can champion safety ideals that are accepted as best practices in its vehicles and work with other companies to establish cross-industry standards. Understanding common causes of accidents can help OEMs design their vehicles with the mission of empowering drivers to be proactively safe. By being a good steward of safety principles, 99P Labs can bolster consumer trust in its products.


Our analysis was based on the hypothesis that individuals can be categorized into groups of V2X drivers based on patterns in their driving behaviors. Specifically, statistically significant differences in trip duration, distance, speed, and warning messages can determine the level of safety at which a driver travels.

H0: μ1 = μ2 = μ3 = μ4 = … = μi
There are no significant differences in driving behaviors across V2X drivers.
H1: At least one significant difference in driving behaviors is present between V2X drivers.

Process

Our analysis used data transmitted from vehicles in Columbus to identify driver behaviors and motivations with the intent of finding opportunities to make safety improvements. The process began with cleaning and exploring the V2X dataset. The V2X (Vehicle-to-Everything) data came from our client and gave us dimensions with which to identify and explore driver behaviors. The V2X dataset consists of the following five tables:

  • Summary — summary measures used to characterize the trip
  • Host — measures from the host vehicle
  • RvBsm — Remote Vehicle Basic Safety Message — basic safety measures from other equipped remote vehicles
  • EvtWarn — Event Warning — imminent warnings to the host driver regarding interactions with remote vehicles
  • SPaT — Signal Phase and Timing — measures from smart intersections

We chose to focus only on the Host and Summary tables for the scope of our work. Cleaning involved removing extraneous variables, outliers, invalid values, and missing value placeholders. We then transformed the data from trip-level to device-level metrics (n = 69) and began creating new variables to describe driver behaviors. We created three categories of behaviors to start: speed, distance, and duration. For speed, we calculated maximum and average speeds, as well as the number and portion of trips during which the driver exceeded various thresholds (70, 80, 90, and 100 mph). For distance, we calculated minimum, maximum, and average distance, as well as the number and portion of trips during which the driver traveled over 5, 30, 75, 100, and 150 miles. Finally, we calculated minimum, maximum, and average trip duration.
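The trip-to-device transformation described above can be sketched with a pandas group-by. This is a minimal illustration, not the project's actual pipeline: the column names (`max_speed_mph`, `distance_mi`, and so on), the sample rows, and the `device_metrics` helper are all hypothetical, and only a single speed threshold is shown.

```python
import pandas as pd

# Hypothetical trip-level rows; column names are illustrative, not the real V2X schema.
trips = pd.DataFrame({
    "device": ["a", "a", "a", "b", "b"],
    "max_speed_mph": [72.0, 65.0, 81.0, 55.0, 60.0],
    "avg_speed_mph": [48.0, 41.0, 59.0, 33.0, 37.0],
    "distance_mi": [12.0, 4.0, 31.0, 6.0, 9.0],
    "duration_min": [20.0, 9.0, 35.0, 14.0, 18.0],
})


def device_metrics(trips: pd.DataFrame, speed_threshold: float = 70.0) -> pd.DataFrame:
    """Collapse trip-level rows into one row of summary metrics per device."""
    # Flag each trip whose maximum speed exceeds the chosen threshold.
    flagged = trips.assign(over_speed=trips["max_speed_mph"] > speed_threshold)
    return flagged.groupby("device").agg(
        n_trips=("max_speed_mph", "size"),
        max_trip_speed=("max_speed_mph", "max"),
        avg_trip_speed=("avg_speed_mph", "mean"),
        min_trip_dist=("distance_mi", "min"),
        max_trip_dist=("distance_mi", "max"),
        avg_trip_dist=("distance_mi", "mean"),
        avg_trip_time=("duration_min", "mean"),
        num_trip_over_speed=("over_speed", "sum"),    # count of flagged trips
        prop_trip_over_speed=("over_speed", "mean"),  # share of flagged trips
    )


metrics = device_metrics(trips)
print(metrics)
```

Taking the mean of a boolean flag gives the portion of trips directly, so the same pattern extends to the distance and duration thresholds.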

For the variables with multiple thresholds, we used the 80th percentile to help us determine which speed and distance to use in our analysis. We then created new columns based on this finding: portion and number of trips where the driver exceeded the 80th percentile of the maximum speed (78.267 miles per hour) and portion and number of trips where the driver exceeded the 80th percentile of the average speed (57.347 miles per hour). We calculated these variables for distance (80th percentile = 26.415 miles traveled) and duration (80th percentile = 27.662-minute trip) as well.
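A minimal sketch of the percentile-based features follows. It assumes the 80th percentile is taken across all trips pooled together, which the write-up does not state explicitly; the data, column names, and the `percentile_features` helper are hypothetical.

```python
import pandas as pd

# Hypothetical trip-level data; values and column names are illustrative only.
trips = pd.DataFrame({
    "device": ["a"] * 5 + ["b"] * 5,
    "max_speed_mph": [50, 60, 70, 80, 90, 40, 45, 55, 65, 75],
})


def percentile_features(trips: pd.DataFrame, column: str, q: float = 0.80):
    """Return the q-th percentile of `column` and per-device counts/portions above it."""
    # Assumption: the percentile is computed over every trip, not per device.
    threshold = trips[column].quantile(q)
    exceeded = trips[column] > threshold
    features = exceeded.groupby(trips["device"]).agg(["sum", "mean"])
    features.columns = [f"num_trip_over_{column}", f"prop_trip_over_{column}"]
    return threshold, features


threshold, features = percentile_features(trips, "max_speed_mph")
print(threshold)
print(features)
```

The same helper could be called once per variable (maximum speed, average speed, distance, duration) to produce the number and portion columns for each.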

We attempted to integrate the event warning table into the analysis but were unable to use this data asset because the table was too sparse. To join the event warning table with the metrics data frame, we wanted to calculate each warning type as a proportion of total drives, along with a metric for the portion of trips with any kind of warning. Unfortunately, it was not possible to use the event warning data to create clustering features. When looking only at events that are likely to be driver-caused risky situations, we found event warning data for 4 unique devices. Given that the full metrics dataset contains 69 unique devices, creating clustering features from this variable would introduce bias, since the data is not missing at random. Had we been able to use this data asset, we would have filtered to only warning messages with the following event codes: FCW, IMA, BSWLCW, RSZW, CSW, PDA, LTA.
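The intended warning feature and the coverage check that ruled it out could look roughly like the sketch below. The table layout, the `trip_id` join key, and the sample rows are assumptions made for illustration, and `XYZ` is a made-up placeholder for any code outside the filter list.

```python
import pandas as pd

# Event codes named in the text as likely driver-caused risky situations.
DRIVER_RISK_CODES = ["FCW", "IMA", "BSWLCW", "RSZW", "CSW", "PDA", "LTA"]

# Hypothetical tables; the real schema and join key are not given in the article.
trips = pd.DataFrame({"device": ["a", "a", "b", "b", "c"], "trip_id": [1, 2, 3, 4, 5]})
warnings = pd.DataFrame({
    "trip_id": [1, 1, 3, 5],
    "event_code": ["FCW", "IMA", "XYZ", "LTA"],  # "XYZ" is a placeholder code
})

# Keep only the driver-risk warnings, then mark trips that had at least one.
risky = warnings[warnings["event_code"].isin(DRIVER_RISK_CODES)]
trips_flagged = trips.assign(has_warning=trips["trip_id"].isin(risky["trip_id"].unique()))

# Portion of each device's trips with any qualifying warning.
prop_trip_warning = trips_flagged.groupby("device")["has_warning"].mean()

# Share of devices with any warning data at all (the article reports 4 of 69).
coverage = (prop_trip_warning > 0).sum() / trips["device"].nunique()
print(prop_trip_warning)
print(coverage)
```

A low coverage value like the one reported is the signal that the feature would be driven mostly by which devices recorded warnings rather than by how people drive.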

Cleaning and manipulating the dataset led us from 47 columns to 17. These columns included device, average trip speed, maximum trip speed, number and portion of trips where the driver exceeds 78 mph, number and portion of trips where the driver exceeds 57 mph, number and portion of trips where the trip exceeds 27 minutes, number of trips that exceed 26 miles, minimum, maximum, and average trip distances, and minimum, maximum, and average trip durations. We explored principal component analysis (PCA) as a means of reducing the dimensionality of our dataset. The resulting components did not translate well to the context of driver behaviors, so we did not end up integrating the technique into our workflow.

Next, we used various visualization techniques to proceed through the analysis. First, we used heatmaps to identify any collinearity across variables and removed columns that were too closely correlated (absolute correlation above 0.8). We also used an elbow plot to determine the optimal number of clusters (k = 3) and proceeded with k-means clustering. Finally, we used boxplots to display the clustering results.
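The correlation pruning, elbow check, and clustering steps can be sketched with scikit-learn as below. This is an illustration under assumptions: the data is synthetic (69 simulated devices), the `drop_correlated` helper is hypothetical, and standardizing features before k-means is a common practice that the write-up does not confirm was used. Plotting is omitted; the `inertias` values are what an elbow plot would display.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def drop_correlated(df: pd.DataFrame, cutoff: float = 0.8) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds cutoff."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)


# Synthetic device-level metrics: three well-separated groups of 23 devices each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(23, 2)) for c in centers])
metrics = pd.DataFrame(points, columns=["avg_trip_speed", "avg_trip_dist"])
# A near-duplicate column that the correlation filter should remove.
metrics["max_trip_dist"] = metrics["avg_trip_dist"] * 2.0 + rng.normal(scale=0.01, size=len(metrics))

reduced = drop_correlated(metrics)
scaled = StandardScaler().fit_transform(reduced)  # assumption: features are standardized

# Within-cluster sum of squares for each k; the "elbow" is where the drop flattens.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
    for k in range(1, 7)
}

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(list(reduced.columns))
print(inertias)
print(np.bincount(labels))
```

Dropping the later column of each correlated pair is one simple convention; which member of a pair to keep is ultimately a judgment call about interpretability.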

The final step of our analysis involved conducting a one-way analysis of variance (ANOVA) followed by Tukey HSD to test for statistically significant differences. For two columns (num_trip_dist_26 and max_trip_time) the homogeneity of variance assumption was not met, so a Welch’s t-test was performed instead.
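A minimal sketch of this testing sequence using SciPy follows. The three groups are synthetic stand-ins for one metric split by cluster, Levene's test is assumed as the homogeneity-of-variance check (the write-up does not name the test used), and `scipy.stats.tukey_hsd` requires SciPy 1.8 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic values of one metric (e.g. average trip speed) for three clusters.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, scale=2.0, size=23) for mu in (50.0, 58.0, 66.0)]

# Homogeneity of variance check (assumption: Levene's test).
levene_stat, levene_p = stats.levene(*groups)

# One-way ANOVA across the three clusters.
f_stat, anova_p = stats.f_oneway(*groups)

# Tukey HSD post hoc test: pairwise_p[i, j] is the p-value for groups i and j.
tukey = stats.tukey_hsd(*groups)
pairwise_p = tukey.pvalue

# Welch's t-test for one pair, which does not assume equal variances.
welch = stats.ttest_ind(groups[0], groups[1], equal_var=False)

print(f"Levene p = {levene_p:.3f}, ANOVA p = {anova_p:.3g}")
print(pairwise_p)
print(f"Welch p (group 0 vs 1) = {welch.pvalue:.3g}")
```

When the Levene p-value is small for a given column, the equal-variance ANOVA result is less trustworthy, which is the situation where a Welch-type test is the usual fallback.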


Statistically significant differences led us to identify three unique driver groups. The table below displays where the significant differences were and the relative mean value for each group. Red indicates the highest mean value among the three groups, green indicates the lowest mean value, and yellow falls in between. An equal sign (=) is used to signify that the groups were equal on a given measure.

Using what we learned about the three distinct groups, we developed personas to help contextualize these differences.


An individual’s behavior behind the wheel can have significant implications for their own safety and the safety of others. In developing personas, it will be important to consider a few key descriptors of each driver type. For example, the relative risk level for a certain behavior and who it can jeopardize will be important to note. It may also be helpful to consider the motivation behind the drive and the circumstances that could be triggering certain behaviors. Another element worth considering is the level of driving expertise and proficiency. Lastly, a driver’s personality and life situation will shape their driving style. For example, a teenage boy might be more likely to speed and drive aggressively than a father of young children. The project team’s data assets do not contain demographic characteristics, but the sponsor may have access to additional contextual information that would allow them to glean these types of insights. Even if some driver characteristics are not easily measurable, it can still be helpful to consider them when developing interventions. The following groups aim to categorize drivers by easily recognizable characteristics. These groups are not mutually exclusive, and one driver could fit into multiple categories.

Driver Expertise Levels

• Fresh to the Streets
o Drivers in this group have limited exposure to the roads and may not be fully aware of best practices and traffic norms. Because they are novice drivers with fewer than five years of experience, they are more prone to cautious navigation and could be captured by slower speed or frequent braking. Examples of drivers in this category could be student drivers or people new to driving in the US. This could also include drivers in unfamiliar areas on new road types, like someone from a rural area driving in the city. Members of this group would respond well to helpful driving tips and reassurances for good practices. They may also appreciate tutorials that explain how to best utilize vehicle safety tools, like the backup camera.

• Budding Expert
o Members of this group are young professionals who have been driving for 5–10 years and are well versed in a variety of driving experiences. They understand and abide by most traffic norms and regulations. Interventions that nudge drivers in the direction of better long-term driving habits would be good for this group.

• Seasoned Veteran
o These drivers have been driving for more than 10 years and have well-established driving practices. Their experiences have helped them develop careful driving reflexes, and they are generally safe. Some members of this group may have an inflated sense of confidence and be prone to arrogant maneuvering. Drivers in this group may not require as much safety programming if they have well-established practices. However, if someone in this group does require safety interventions, it may be more difficult to break their long-established precarious habits.

Personality and Life Situation Considerations

  • Age
  • Years of Driving
  • Types of roads commonly traveled
  • Gender
  • Household type
  • Occupation
  • Patience level
  • Emotional wellbeing
  • Ability to focus
  • Other responsibilities (e.g., children)


  • There are a variety of driver types in the dataset. We have developed several features and user persona prototypes that can help classify unique drivers into easily recognizable groups. These groups could inform 99P Labs of the best types of safety-related programming to include in the user interface and messaging system of a vehicle. They could also incorporate these findings into a driver application or website with safety tips. Given that humans are extremely diverse and often impossible to perfectly predict, it will be critical to build feedback cycles into deployment of these tools to understand which features help each driver prioritize safety.
  • The event warning dataset has information for only a limited number of devices. Conducting additional research into the collection of relevant variables and assessing data governance would be a prudent next step. This seems likely to be a helpful asset for the ongoing goals of the project and it would offer additional insights into driver behavior if it could be integrated into the analysis pipeline.
  • The next team working on this project should prioritize…

o Validating and building upon the current analysis pipeline
o Finding unexpected data points and curating additional persona groups that categorize these behaviors
o Continuing to map cluster groups to suggested personas
o Exploring industry best practices for in-vehicle user safety features and suggesting them for driver persona groups

Thanks to 99P Labs!

The project team would like to extend our sincerest gratitude to 99P Labs for allowing us to work on an analysis in this exciting problem space using their data assets. We are truly appreciative of the time they spent working with us on our capstone project.


Ohio State Highway Patrol. (2016–2021 Annual Averages). Crash Dashboard. Retrieved May 1, 2022, from https://www.statepatrol.ohio.gov/ostats.aspx#gsc.tab=0

The National Law Review. (July 17, 2017). Permanent Headlight Usage Shown to Reduce Car Accidents. Retrieved May 1, 2022, from https://www.natlawreview.com/article/permanent-headlight-usage-shown-to-reduce-car-accidents

NHTSA. (n.d.). Speeding. [Text]. Retrieved May 1, 2022, from https://www.nhtsa.gov/risky-driving/speeding


