Exploring the Uncharted: Unleashing the Potential of Telematics Data in Data Science

Michael Florip
Published in 99P Labs
9 min read · May 10, 2024

The 99P Labs x UC Berkeley team is part of the Data Science Discovery Program at the University of California, Berkeley.
Researchers: Michael Florip, Lauren Chu, Jinn Lim, Vivrd Prasanna, Peter Wang

Project Overview
Over the Spring 2024 semester, our team explored the complex field of telematics data, aiming to transform its challenges into opportunities for innovation and practical application across various domains. Our project develops new insights and methodologies that advance research and development in telematics data science.

Preliminary Data Exploration
The dataset provided to our team, called v2x_columbus_trips, is a comprehensive collection of telematics datapoints, including both vehicle telematics and diagnostic data. Its tables include ‘evtwarn’, ‘host’, ‘hvcan’, ‘per’, ‘rvbsm’, ‘spat’, and ‘summary’. Each table captures specific aspects of vehicle and driver behavior, including alert levels, vehicle identification, event applications, geographical positioning (latitude, longitude, elevation), vehicle statuses (e.g., brake status, turn signal status), acceleration, and speed. Key attributes include unique vehicle identifiers, timestamps, and a range of status indicators for both host and remote vehicles, encompassing brake status, vehicle class, elevation, heading, and speed.
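For orientation, the sketch below assumes each table has been exported and loaded as a pandas DataFrame, which is how the later snippets work with the ‘summary’ table. The file name is a placeholder, since the actual loading step depends on how the data is served.

import pandas as pd

# Assumes the 'summary' table has been exported to CSV; the file name
# below is a placeholder, not the actual path we used.
summary = pd.read_csv('v2x_columbus_trips_summary.csv')

# First look: shape, column names, and a few rows
print(summary.shape)
print(summary.columns.tolist())
print(summary.head())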


Literature and Topical Review
Each team member performed a literature review on telematics research, identifying its applications, methodologies, and critical insights. The reviews underscored telematics’ interdisciplinary reach across industries like actuarial science, safety engineering, and environmental science. Key themes included telematics’ integration into insurance models, its impact on young driver safety, and its contributions to sustainable urban planning and efficient fleet management.

From there, we synthesized the information from our literature review into a list of possible topics:

  • Evaluation of Fleet Vehicles for Electric Vehicle (EV) Transition
  • Evaluation of Ride-Sharing Trip Costs to Better Price Quote
  • Mobility Clustering: Match Users of Similar Mobility on Ride Sharing Services
  • Auto Insurance Price Prediction: Inform Driving Risk and Impact
  • Urban Freight Environmental Impact Analysis

For each topic, we evaluated the data needed, the data we already had, and any missing data and where to find it. From this shortlist of starting points, we decided to explore “Evaluation of Fleet Vehicles for Electric Vehicle (EV) Transition” first. This most notably required EV data, which we determined we could source from Open Charge Map and from research publications that link to the datasets they used.

Evaluation of Fleet Vehicles for Electric Vehicle (EV) Transition
We initiated our project with the goal of promoting energy-efficient mobility by evaluating the transition potential of vehicles to electric vehicles (EVs). Initially, we utilized telematics data to assess the energy-saving potential and efficiency gains of switching from internal combustion engine vehicles to EVs. Our analysis began by comparing one EV with one non-EV, focusing on distance driven and the potential savings in time, money, and fuel. We aimed to determine the best and worst performers in both categories.

We developed a simple model using three efficiency metrics: energy consumption per mile/km, cost savings per trip, and time savings per trip. Data for this analysis came from the ‘Electric Vehicle Specifications and Prices’ dataset on Kaggle, which included details like battery capacity and efficiency ratings.
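As a minimal sketch of how such metrics can be computed, the snippet below uses hypothetical per-trip records and illustrative EV and gasoline-vehicle parameters; the column names and constants are placeholders, not the exact fields from our datasets.

import pandas as pd

# Hypothetical per-trip telematics records; placeholder values only
trips = pd.DataFrame({
    'trip_distance_km': [12.4, 5.1, 30.7],
    'trip_duration_min': [25.0, 11.0, 42.0],
})

# Illustrative EV and gasoline-vehicle parameters (assumptions, not
# values from the Kaggle dataset)
EV_KWH_PER_KM = 0.17            # EV energy consumption (kWh/km)
ELECTRICITY_USD_PER_KWH = 0.15
GAS_L_PER_KM = 0.09             # gasoline consumption (L/km)
GAS_USD_PER_L = 1.0

# Metric 1: energy consumption per km for the EV equivalent of each trip
trips['ev_energy_kwh'] = trips['trip_distance_km'] * EV_KWH_PER_KM

# Metric 2: cost savings per trip (gasoline cost minus electricity cost)
gas_cost = trips['trip_distance_km'] * GAS_L_PER_KM * GAS_USD_PER_L
ev_cost = trips['ev_energy_kwh'] * ELECTRICITY_USD_PER_KWH
trips['cost_savings_usd'] = gas_cost - ev_cost

# Metric 3: time savings per trip (e.g., refueling vs. home charging).
# This depends heavily on charging assumptions, so a per-trip constant
# stands in here purely for illustration.
trips['time_savings_min'] = 2.0

print(trips)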

However, integrating extensive and nuanced telematics data with limited EV data proved challenging due to the disparity in data granularity. This issue highlighted the difficulties in using telematics data to evaluate fleet vehicles for EV transitions.

Future telematics research should focus on collecting and analyzing data from electric vehicles to match the comprehensive data available for gasoline vehicles. This approach will facilitate more detailed comparisons and better-informed decisions in the transition to sustainable mobility.

Automatic Classification of Driving Trips based on Telematics Data
Next, we explored the automatic classification of driving trips based on the telematics data given to us. Our goal was to develop a system that automatically sorts driving trips into categories using unsupervised learning techniques, specifically clustering algorithms that group trips by similarities in their telematics data. From this classification, we can identify the most frequented routes, which could help with traffic management. We started by clustering trips on their start and end coordinates.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Make a copy of summary for this approach
second_approach = summary.copy()

# Copy the start and end coordinates into shorter column names
second_approach['start_long'] = second_approach['startlongitude']
second_approach['start_lat'] = second_approach['startlatitude']
second_approach['end_long'] = second_approach['endlongitude']
second_approach['end_lat'] = second_approach['endlatitude']

# Clustering columns
selected_cols = ['start_long', 'start_lat', 'end_long', 'end_lat']

# Filter the dataset to include only selected columns
selected_two = second_approach[selected_cols]

# Perform K-means clustering
k = 5  # Number of clusters
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(selected_two)

# Add cluster labels to the dataset
second_approach['cluster'] = clusters

# Visualize the start and end points, colored by cluster label
plt.scatter(second_approach['start_lat'], second_approach['start_long'], c=clusters, cmap='viridis', label='Start')
plt.scatter(second_approach['end_lat'], second_approach['end_long'], c=clusters, cmap='viridis', label='End')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.title('Clusters of Driving Trips')
plt.legend()
plt.show()

After this, we identified the most frequent routes in each cluster by aggregating the routes per cluster, ranking them by frequency, and creating a map centered on the average coordinates of all routes, with polylines connecting each route’s start and end coordinates.

import random
import folium
from folium.plugins import HeatMap

# Step 1: Filter Data by Cluster
cluster_id = 0  # Choose the cluster for which you want to identify frequent routes
cluster_data = second_approach[second_approach['cluster'] == cluster_id]

# Step 2: Aggregate Routes
routes = cluster_data.groupby(['start_lat', 'start_long', 'end_lat', 'end_long']).size().reset_index(name='count')

# Step 3: Rank Routes by Frequency
ranked_routes = routes.sort_values(by='count', ascending=False)

# Create a map centered around the average coordinates of all routes
center_lat = ranked_routes[['start_lat', 'end_lat']].mean().mean()
center_long = ranked_routes[['start_long', 'end_long']].mean().mean()
mymap = folium.Map(location=[center_lat, center_long], zoom_start=12)

# Define a list of colors
colors = ['red', 'blue', 'green', 'orange', 'purple', 'pink', 'yellow', 'cyan', 'magenta', 'lime']

# Add polylines for routes between start and end points
for index, route in ranked_routes.iterrows():
    start_point = (route['start_lat'], route['start_long'])
    end_point = (route['end_lat'], route['end_long'])
    points = [start_point, end_point]
    color = random.choice(colors)  # Randomly select a color
    folium.PolyLine(locations=points, color=color).add_to(mymap)

# Add a heat map layer to visualize the density of route endpoints.
# HeatMap expects [lat, lng] pairs, so start and end points are stacked.
heat_data = (ranked_routes[['start_lat', 'start_long']].values.tolist()
             + ranked_routes[['end_lat', 'end_long']].values.tolist())
HeatMap(heat_data, radius=15).add_to(mymap)

# Display the map
mymap.save('frequent_routes_map.html')
mymap

K-Means Clustering with Multiple Features
Building upon our initial model of clustering vehicle routes by start and end coordinates, we incorporated more features in hopes of producing more comprehensive cluster categories. The features we included in this approach were ‘trip_distance’, ‘trip_duration’, ‘numintersectionencounters’, ‘numspatrx’, ‘numshadowbsmrx’, and ‘numnormalbsmrx’. Trip distance and duration were derived directly from the raw columns: distance via the haversine formula applied to each trip’s start and end coordinates, and duration as the difference between each trip’s end and start times. The following code reflects the feature engineering process.

import math

def haversine(lat1, lon1, lat2, lon2):
    # Radius of the Earth in km
    R = 6371.0

    # Convert latitude and longitude from degrees to radians
    lat1 = math.radians(lat1)
    lon1 = math.radians(lon1)
    lat2 = math.radians(lat2)
    lon2 = math.radians(lon2)

    # Calculate the change in coordinates
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    # Calculate the distance
    distance = R * c

    return distance

# Make a copy of summary for this approach
third_approach = summary.copy()

# Getting distance traveled with the trips.
third_approach['trip_distance'] = third_approach.apply(lambda row: haversine(row['startlatitude'], row['startlongitude'], row['endlatitude'], row['endlongitude']), axis=1)

# Getting the duration of the trip.
third_approach['trip_duration'] = third_approach['endlocaltime'] - third_approach['startlocaltime']

# Selecting features to use in the model.
third_columns = ['trip_distance', 'trip_duration', 'numintersectionencounters', 'numspatrx', 'numshadowbsmrx', 'numnormalbsmrx']
third_data = third_approach[third_columns]

From here, we preprocessed the data with the StandardScaler() method from the Scikit-Learn preprocessing module. By standardizing each feature to zero mean and unit variance, features with larger raw values do not exert outsized influence on the K-Means model. By fitting and transforming the DataFrame of selected features, we produced a matrix of scaled values representing the original dataset.

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Standardizing the data
scaler = StandardScaler()
third_scaled = scaler.fit_transform(third_data)
third_df = pd.DataFrame(third_scaled)
third_df.columns = third_columns
third_df

After the preparation and preprocessing, we were finally able to fit the data with Scikit-Learn’s K-Means implementation. One parameter, however, had to be chosen for the overall quality of the classification: the number of clusters. A common approach to finding the optimal number of clusters is to iterate through a wide range of candidate values. Here, we iterated over cluster counts from 1 to 15 and compared the within-cluster sum of squares (WCSS) as the measure of error. After running these iterations and logging the WCSS values, we graphed the performance for each cluster count in an Elbow Plot. From the graph, the optimal number of clusters appeared to be around 6 or 7.

# Initialize a list to store the within-cluster sum of squares (WCSS) for each cluster number.
wcss = []

# Try different cluster numbers (1 through 15).
for i in range(1, 16):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(third_df)
    # Append the WCSS to the list
    wcss.append(kmeans.inertia_)

# Plot the Elbow graph.
plt.plot(range(1, 16), wcss, marker='o')
plt.title('Elbow Plot')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
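With a cluster count chosen from the elbow plot, the final model can be fit and the labels attached back to the trips. The snippet below is a minimal sketch assuming k = 6, one of the values the plot suggested:

# Fit the final model with the cluster count suggested by the elbow plot
# (k = 6 here; k = 7 was a comparable candidate).
kmeans = KMeans(n_clusters=6, random_state=42)
third_df['cluster'] = kmeans.fit_predict(third_df[third_columns])

# Characterize each cluster by its average (scaled) feature values
print(third_df.groupby('cluster')[third_columns].mean())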

Based upon this information, telematics research can build upon our work by refining the classification system to incorporate additional telematics features, such as speed, acceleration, braking patterns, and time of day. This would allow a more nuanced understanding of driving behaviors, which could be used to predict traffic conditions and optimize route planning within large, bustling cities.
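As a sketch of what such an extension could look like, the snippet below derives a time-of-day feature and names two hypothetical per-trip aggregates (mean_speed, hard_brake_count) that would still need to be engineered from the raw tables; none of these are columns we actually used.

# Sketch of an extended feature set for future work. 'mean_speed' and
# 'hard_brake_count' are hypothetical per-trip aggregates that would
# need to be engineered from the raw telematics tables; 'start_hour'
# is derived here, assuming startlocaltime parses as a timestamp.
third_approach['start_hour'] = pd.to_datetime(third_approach['startlocaltime']).dt.hour

extended_columns = third_columns + ['start_hour', 'mean_speed', 'hard_brake_count']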

However, given the increasing collection and use of telematics data, it’s important to consider data privacy. The collection and analysis of detailed telematics data, including temporal location patterns, raises significant privacy issues, particularly around tracking individual vehicles and potentially identifying drivers’ habits and locations. To balance these concerns while maximizing societal impact, researchers must ensure that all telematics data used for analysis is anonymized. Researchers should also collect and store only the minimum data necessary for the specific research or application, and keep that data behind authorized access.

Most importantly, it is imperative to implement clear and transparent consent mechanisms that inform drivers about what data is collected, how it will be used, and who will have access to it. Ongoing informed consent is crucial to sustaining data collection, which in turn equips researchers with richer information and better results. One way to support this is to let drivers opt in to or out of data collection programs, with a plain explanation of the extent of the collection and what the information will be used for.

Trip Duration Prediction Modeling
To continue exploring the potential of the telematics data provided to our team, we took a supervised modeling approach to predict trip duration from features in the data.
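As a minimal sketch of this approach, the snippet below assumes the engineered features from the clustering section, a numeric trip_duration target (e.g., in seconds), and an illustrative random forest regressor; the exact features and model in our full analysis may differ.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Features and target; assumes the engineered columns from the
# clustering section. The model choice here is illustrative.
feature_cols = ['trip_distance', 'numintersectionencounters',
                'numspatrx', 'numshadowbsmrx', 'numnormalbsmrx']
X = third_approach[feature_cols]
y = third_approach['trip_duration']  # assumed numeric, e.g. seconds

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate with mean absolute error on the held-out trips
preds = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, preds))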

See full code and insights at this GitHub repository link.

Conclusion
As outlined in the project objective, our research centered on identifying and exploring use cases for telematics data for social good that future telematics researchers can build upon. Our exploration of telematics data for evaluating fleet vehicles for EV transition and for automatically classifying driving trips lays a foundation for future work in urban mobility, from sustainable transportation to more efficient urban travel patterns.
