Vehicle Location and Dwell Time Prediction

Hondezvous · Published in Geek Culture · May 24, 2021 · 10 min read

Introduction

Our team was fortunate to be selected by 99P Labs, an innovation lab supported by Honda and The Ohio State University, to develop machine learning methods that predict a vehicle’s future location and dwell time (how long a car will remain stationary). These predictions enable a host of services: making a car a delivery location, getting services on the move, and entering a personal car into the gig economy to earn passive income.

Over the course of five months, we developed two models to predict a vehicle’s next location and its associated dwell time. To determine the destination of a given car from its starting position and time, we developed a Markov model. We then creatively combined the DBSCAN, k-NN, and XGBoost algorithms to achieve accurate dwell time forecasts. Once the two models were built, they were consolidated into an efficient UI for service providers.

Data Cleaning and Preprocessing

We gathered data from the 99P Labs Developer Portal API. It is a live, event-based dataset that collects spatio-temporal data from cars in the United States at roughly millisecond intervals. The data was geofenced to Columbus, Ohio, to simplify modeling efforts and prevent potential model bias arising from differences between cities. Because the vast majority of records were NaN-valued due to the event-based data collection systems implemented by 99P Labs, we filtered out the NaN values in the call to the API. To pull a sufficient amount of data to train and test our models, we created a pagination technique with the help of a 99P Labs software engineer intern, Tommy Tran. The data retention efficiency from this process was less than 1%, meaning that for every 100 rows pulled in, we kept one.

The data was then aggregated into individual trips for each car. Further logic extracted the start and end location of each trip sequence, and the time elapsed between engine cycles (the difference in timestamp from the last stop to the next start) was calculated as the dwell time duration. These durations were then placed into three interval bins: 0–3hr, 3–6hr, and 6+hr, which effectively changed our task from regression to classification. Pivoting to a classification problem allowed us to optimize the accuracy of our product to fit our service providers’ use case, rather than trying to predict a number on a continuous range with low accuracy.
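The binning step can be sketched as a small helper; the function name and exact bin labels here are illustrative, not the original code:

```python
def bin_dwell_time(hours):
    """Map a continuous dwell duration (in hours) to one of the three
    interval classes used for classification: 0-3hr, 3-6hr, 6+hr."""
    if hours < 3:
        return "0-3hr"
    elif hours < 6:
        return "3-6hr"
    return "6+hr"

# Example: a car parked overnight falls in the open-ended top bin.
print(bin_dwell_time(9.5))  # -> 6+hr
```

Binning this way trades precision for reliability: the model no longer has to pinpoint an exact duration, only pick the interval most useful to a service provider.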

Location Prediction

Figuring out which model to implement for vehicle location predictions was extremely challenging, but Markov models were the most intuitive and commonly implemented approach for location predictions. This model works by having a set of states, for which each pair of states i, j in the set has an associated probability of moving from state i to state j. To apply a Markov model to our data, we initially set location clusters as unique states in the Markov chain and calculated the probabilities of moving from the current location to a different location for each car. To process the data for the Markov model, we dropped all rows with no dwell time durations in order to only use completed trips, and kept latitude and longitude for only start and end locations.

For the model architecture, we first set a distance buffer to avoid treating coordinates in the same general area (e.g. a driveway vs. the street in front of the house) as distinct, grouping such locations under one state. After we found the set of all unique states (read: locations) a car visited, we wrote a transition function that calculates the probability of the car starting in location i and ending in location j for all locations in the set of states. To do this, we utilized the prob140 data science library, created for UC Berkeley’s Data 140 course.
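Under these assumptions, estimating the transition probabilities amounts to counting trips. A minimal pure-Python sketch (the prob140 library offers richer tooling, but the arithmetic is the same; state names are illustrative):

```python
from collections import defaultdict

def transition_probabilities(trips):
    """Estimate Markov transition probabilities from a list of
    (start_state, end_state) trip pairs for a single vehicle."""
    counts = defaultdict(lambda: defaultdict(int))
    for start, end in trips:
        counts[start][end] += 1
    probs = {}
    for start, ends in counts.items():
        total = sum(ends.values())
        probs[start] = {end: n / total for end, n in ends.items()}
    return probs

# Four completed trips for one car: three home->work, one home->gym.
trips = [("home", "work"), ("home", "work"), ("home", "work"),
         ("home", "gym"), ("work", "home")]
probs = transition_probabilities(trips)
print(probs["home"])  # -> {'work': 0.75, 'gym': 0.25}
```

Each row of the resulting mapping sums to 1, which is exactly the transition-matrix property a Markov chain requires.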

After calculating the transition probabilities for every state, we wrote a function that takes a car’s ID number and returns its most likely next states, with their corresponding probabilities, given its last recorded state in the dataset. Our goal was to connect all the models at the end, so that entering a single VIN would return the location prediction, its corresponding probability, and the associated dwell time. However, we couldn’t fully integrate them in the time we had, since each model used slightly different columns and filtered data frames. We therefore converted the Markov chain’s final output from a series to a dataframe and printed, for each vehicle and trip, the top next locations whose probabilities summed to 80%. We then exported this dataframe to a CSV file and added the dwell times in a separate column to feed our Streamlit UI, giving users the ability to see dwell time and location predictions for a specific vehicle or vehicles within a set date range.
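The “top next locations summing to 80%” step might look like the sketch below; function and variable names are our own, not the original code:

```python
def top_next_states(probs, state, cumulative=0.8):
    """Return the most likely next states for `state`, sorted by
    probability, keeping just enough of them to cover `cumulative`
    probability mass."""
    ranked = sorted(probs.get(state, {}).items(),
                    key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for s, p in ranked:
        kept.append((s, p))
        total += p
        if total >= cumulative:
            break
    return kept

# Toy transition probabilities for one vehicle's current state.
probs = {"home": {"work": 0.6, "gym": 0.25, "store": 0.15}}
print(top_next_states(probs, "home"))  # -> [('work', 0.6), ('gym', 0.25)]
```

Cutting off at 80% keeps the output short for common cases (one or two dominant destinations) while still surfacing several candidates when a vehicle’s behavior is less predictable.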

Example of a Markov Chain with Transition Probabilities

Limitations with Markov Chains and Potential Future Approaches

In order to specify the states of the car, we wanted to add additional variables to our states (e.g. time of day, day of week, etc.) to achieve more accurate probabilities and overall better model performance. However, we realized that this could cause model overfitting, especially if our new variables were too specific, which would’ve been detrimental to our model’s performance since it wouldn’t have been able to predict as efficiently with unseen data.

To solve this problem, we would have liked to implement another model, known as a Hidden Markov Model, for location predictions if we had more time. This model differs from Markov chains because it makes predictions based on the entire original dataset rather than a simplified one grouped by individual vehicle and trip. Because of this, the model would be able to identify jumps in the changes between trips, give these changes specific weights, and update those weights as the car continues to drive over time. Each trip would then contribute hundreds of observations (one for every additional second driven), which would help the model learn the population distribution more effectively.

This model would be especially useful because it would allow us to identify which cars are currently on the road, observe up-to-date changes in driving patterns, and add time-series data for more accurate predictions. It would also be able to predict future locations for new cars with no stored data by analyzing other drivers’ behaviors. For example, if a new user is currently on the road and a delivery worker wants to predict where they will go next, the model would look at other vehicles that started from the same point, determine the most common final destination from that starting point, and predict that the new car will most likely follow the equivalent route.

Although the Hidden Markov Model is extremely useful and advanced, it requires a significant amount of time and knowledge to build, train, and improve. While this would’ve been an exciting task to attempt as our next steps, the Markov model that we created was still able to make accurate predictions based on the proportion of trips in each vehicle’s trip history. Therefore, it can serve as a good starting point for further implementation of additional variables and other models, such as the Hidden Markov Model, as well.

Dwell Time Prediction

Two distinct methodologies were undertaken in pursuit of accurate dwell time prediction. The first combined spatio-temporal clustering with supervised learning. We created a network of clusters that can identify trends and associations between locations, time of day, and dwell times. After establishing this network, we implemented a k-NN algorithm to predict which cluster a new data point belongs to, letting the predicted cluster act as an auxiliary for dwell time prediction. We can then use the summary metrics of this predicted cluster to predict the dwell time of that new data point, as well as provide context to augment decision-making processes. Contextual information includes the median, mode, mean, and variance of the predicted cluster. In essence, we create a network of clusters, predict which cluster a new point belongs to, and then use that cluster’s information to make a final prediction.

Method 1: DBSCAN + k-NN

To create a network of clusters based on the parameters dwell time, location, and time of day, we implemented the DBSCAN clustering algorithm, which creates groups of data in n-dimensional space based on patterns found between the given features. For example, a school would likely have three primary parking locations: a teacher parking lot, a student parking lot, and a parent drop-off area. Ideally, our DBSCAN model would split up the data points at the school and find these clusters based on how long cars park in a given area, what time of day it is, and where the cars are parked. Another example is an airport parking lot. Perhaps there are two clusters at an airport: one for people parking temporarily to pick someone up, and the other for people leaving their cars while they go on vacation.
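A minimal sketch of this clustering step using scikit-learn’s DBSCAN on toy points; the feature values, `eps`, and `min_samples` below are illustrative, not the tuned values from the project:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy points: [latitude, longitude, hour_of_day, dwell_hours].
# Two dense groups mimic short morning stops vs. long-term parking.
X = np.array([
    [40.00, -83.00,  8.0,  0.20],
    [40.00, -83.00,  8.1,  0.30],
    [40.00, -83.00,  8.2,  0.25],
    [40.10, -82.90, 10.0, 72.0],
    [40.10, -82.90, 10.5, 70.0],
    [40.10, -82.90, 11.0, 75.0],
])

# Scale features so a single eps is comparable across dimensions,
# then let DBSCAN find the dense groups (-1 would mark noise points).
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(X_scaled)
print(labels)  # e.g. [0 0 0 1 1 1]
```

Because DBSCAN is density-based, it discovers the number of clusters on its own and flags isolated stops as noise, both useful properties when the parking patterns per location are unknown in advance.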

Example of generated cluster network for Columbus, Ohio

We then trained and tuned a k-NN model to predict which cluster a given point belongs to. The difficulty lies in the fact that we predict with one fewer dimension than was used to create the clusters (dwell time is not known for a new point). Luckily, with access to Deepnote’s GPU and most powerful CPU machines, we were able to tune our model’s hyperparameters over a large search space, greatly improving performance. Once we had the predicted cluster for every point, we used the mode dwell time interval of its associated cluster as the prediction. The model achieved a test accuracy of ~78%.
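The two-stage prediction, where k-NN assigns a cluster from location and time only and the cluster’s mode dwell interval then becomes the prediction, can be sketched as follows (toy data, labels, and names are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training features exclude dwell time: [latitude, longitude, hour_of_day].
# The labels are the cluster IDs produced by the DBSCAN step.
X_train = np.array([
    [40.00, -83.00,  8.0],
    [40.00, -83.00,  8.2],
    [40.10, -82.90, 10.0],
    [40.10, -82.90, 11.0],
])
clusters = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, clusters)

# Per-cluster mode of the binned dwell times acts as the final answer.
cluster_mode_dwell = {0: "0-3hr", 1: "6+hr"}

# A new stop near the first group, at a similar time of day.
new_point = np.array([[40.001, -83.001, 8.1]])
predicted_cluster = int(knn.predict(new_point)[0])
print(cluster_mode_dwell[predicted_cluster])  # -> 0-3hr
```

Alongside the mode, the same cluster lookup can return the median, mean, and variance mentioned above, giving a service provider a sense of how trustworthy the prediction is.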

Method 2: XGBoost on a binary interval

In competition with ourselves, we created a second, alternative methodology. This model did not rely on a network of clusters for prediction. Instead, it trained an XGBoost model (a gradient-boosted decision tree classifier) on the raw data (location and time of day) to predict a dwell time interval directly. Additionally, we simplified the outcome space from a multiclass classification problem to a binary one. Rather than binning dwell time into three intervals, we trained the model to predict whether a given car’s dwell time was greater or less than some value k.

The model’s accuracy rose as the value of k increased, since larger dwell times are less common and seem to have stronger associations with location. In the end, we used 10 hours as our threshold, a decision reached by analyzing the distributions of dwell times and the tradeoff between accuracy and generalizability. This model achieved a test accuracy of 82%.
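A hedged sketch of the binarization and training step; scikit-learn’s GradientBoostingClassifier stands in for XGBoost here (both are gradient-boosted tree classifiers), and the toy data, feature layout, and function names are our own:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

K_HOURS = 10  # threshold from the post: dwell > 10h vs. <= 10h

def binarize_dwell(hours, k=K_HOURS):
    """1 if the car dwells longer than k hours, else 0."""
    return int(hours > k)

# Toy raw features [latitude, longitude, hour_of_day]: evening stops
# in one area dwell overnight; midday stops elsewhere are brief.
rng = np.random.default_rng(0)
long_stops = np.column_stack([40.10 + rng.normal(0, .01, 10),
                              -82.90 + rng.normal(0, .01, 10),
                              rng.uniform(18, 22, 10)])
short_stops = np.column_stack([40.00 + rng.normal(0, .01, 10),
                               -83.00 + rng.normal(0, .01, 10),
                               rng.uniform(11, 14, 10)])
X = np.vstack([long_stops, short_stops])
y = np.array([binarize_dwell(h) for h in [12.0] * 10 + [0.5] * 10])

model = GradientBoostingClassifier(n_estimators=50).fit(X, y)
print(model.predict([[40.10, -82.90, 20.0]]))  # -> [1], a long dwell
```

Collapsing to a single yes/no question (“will this car sit longer than 10 hours?”) is what lets the classifier trade three-way granularity for the higher 82% accuracy reported above.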

User Interface

Interpretability was key for interaction with our models. We built a user interface using Streamlit.io to give users the ability to see the clusters we made and to filter results for specific vehicles. In highlighted region A of the figure below, the sidebar filters for a specific vehicle or cluster. Highlighted region B shows the different vehicle clusters we generated within Columbus, Ohio. Filtering the sidebar for a specific VIN returns an interactive, labeled map in highlighted region C with potential next locations for that vehicle, along with the associated binary dwell time. Similarly, filtering for a specific cluster generates an interactive map in highlighted region D with that cluster’s location.

Our user interface was built to help with the interpretation of our models, which were trained primarily on past data. Connecting our current system to a constant stream of data would open the gates to even more possibilities. Implemented into our user interface, one could imagine granting users, like an Amazon driver, the ability to find vehicles currently on the road near them with adequate dwell times for services like package delivery — saving time and resources for both the driver and customer.

Unlike our current user interface, a UI meant for someone like an Amazon driver would omit results that don’t match the service provider’s requirements. For example, vehicles that have a dwell time under 4–6 hours, as well as vehicles that are headed home, would be omitted from the user interface. The ability to filter for specific vehicles and date ranges would also be removed; however, filters for vehicles in a particular cluster would remain. Simplicity, clarity, and ease of use would be key for service providers.

Hondezvous Dashboard — Built with Streamlit.io

Conclusion

Model Limitations and Improvements

As our models were built specifically on data from the Columbus region, the results cannot be directly extrapolated to other cities in the United States. For the models to work elsewhere, a new network of clusters would need to be constructed from data for that city.

Although we built our models with millions of data points, predicting at this granularity, down to a specific location or building, requires much more information to be consistently accurate. Ideally, having numerous points for each city and building would tremendously improve the accuracy of our predictions.

We believe significant improvements to our models’ performance could be achieved by augmenting the dataset with additional features. If we had access to demographic data about a vehicle’s owner and other similar information, we could immensely enhance overall model performance and cluster generation for more precise dwell time predictions.

Acknowledgements

We would like to thank Rajeev Chhajer, Brian Nutwell, Tommy Tran, Tony Fontana, and Kent Broestl from 99P Labs for guiding us on this project and providing crucial help and advice along the way.

A massive thanks to Elizabeth Dlha and the team at Deepnote who gave us free access to their most powerful GPU and CPU to train our models.

Another thanks to Professor Ed Henrich and Professor Arash Nourian from UC Berkeley’s Data-X course for their guidance and ideas.

Written by:

Adam Huth, Charlie Duarte, Ebru Odok, Isabel Zavian, Nikhil Dutt, Xuerui Song AKA Team Hondezvous
