P-Rex: Personalized Load Recommendation System

CloudTrucks Engineering
Jun 25, 2024

Introduction

The first core value at CloudTrucks is to put drivers first. One way to serve drivers is by providing personalized load recommendations upon search — similar to how Netflix recommends shows or how YouTube suggests videos based on user preferences. We aim to achieve the same for our drivers, while acknowledging the differences between trucking load boards and TV shows.

From a data and engineering perspective, we focused on selecting the best machine learning (ML) algorithms and infrastructure to serve drivers, considering the nature of the data. From a product perspective, we prioritized frequent updates with incremental improvements based on user feedback, rather than infrequent major updates.

Meet Fred the P-Rex, our Personalized Load Recommendation System’s friendly dinosaur. Join us on Fred’s journey, from a notebook to a recommender with high user engagement.

Fig. 1: Fred the P-Rex, who is the mascot of the P-Rex project (as well as a personal holiday crochet project of our data scientist)!

How We Built Fred: Development Process

Our core assumption for P-Rex was that a driver’s preferences for booking loads, such as trip distance or pickup location, are stable over time: a driver’s home address, and their preference for being home at night, rarely change. Thus, a driver’s past booking behavior is a good predictor of future bookings. We validated this assumption through a booking similarity analysis, which gave us the confidence to proceed to the next step.
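As an illustration of that similarity analysis, here is a minimal sketch (the data layout and column names are hypothetical, not our production schema): it compares how similar a driver’s own bookings are to each other versus to bookings drawn from the whole population.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical booking history: one row per booked load.
bookings = pd.read_parquet("bookings.parquet")
feature_cols = ["trip_distance", "deadhead_miles", "rate_per_mile"]

# Standardize features so similarity isn't dominated by large-valued columns.
z_cols = [f"z_{c}" for c in feature_cols]
bookings[z_cols] = StandardScaler().fit_transform(bookings[feature_cols])

def mean_pairwise_similarity(frame: pd.DataFrame) -> float:
    """Average cosine similarity between all pairs of bookings in a frame."""
    sims = cosine_similarity(frame[z_cols])
    upper = np.triu_indices_from(sims, k=1)
    return float(sims[upper].mean())

# Drivers with at least a few bookings: how similar are their own bookings to each other?
within = (
    bookings.groupby("driver_id")
    .filter(lambda g: len(g) >= 3)
    .groupby("driver_id")
    .apply(mean_pairwise_similarity)
    .mean()
)
# Baseline: similarity among bookings sampled across all drivers.
overall = mean_pairwise_similarity(bookings.sample(min(len(bookings), 2000), random_state=0))

print(f"within-driver similarity: {within:.3f} vs. population baseline: {overall:.3f}")
```

If past behavior really does predict future behavior, the within-driver number should sit well above the population baseline.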

While collaborative filtering is widely used to build recommendation systems (as highlighted by Netflix’s notable research), it proved unsuitable for our case: once a driver books a load, it becomes unavailable to others, unlike a movie, which can be recommended to many users. We therefore opted for an XGBoost classification model, using the model’s predicted probability of the positive label (in our case, booking a load) as the ranking score. XGBoost uses decision trees to effectively capture feature interactions. It manages multicollinearity by selectively dropping redundant features and keeping the most significant ones, which makes reliable feature importance easier to determine. Optimized to scale to large datasets through parallelization, XGBoost also handles missing values efficiently, making it well suited for P-Rex.
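A minimal sketch of this setup, with hypothetical feature names: we train a binary classifier on historical driver/load pairs labeled by whether the load was booked, then use the predicted booking probability to rank a driver’s candidate loads.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical training set: one row per driver/load pair, booked = 1 if the driver booked it.
df = pd.read_parquet("driver_load_pairs.parquet")
features = ["trip_distance", "deadhead_miles", "rate_per_mile", "distance_vs_driver_avg"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["booked"], test_size=0.2, random_state=42
)

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    reg_lambda=1.0,        # regularization is one of the knobs we tune
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Ranking: score a driver's candidate loads and sort by predicted booking probability.
candidates = X_val.head(20).copy()
candidates["score"] = model.predict_proba(candidates[features])[:, 1]
ranked = candidates.sort_values("score", ascending=False)
```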

We created an initial model prototype using a small subset of the data. Working with high-quality, small data samples enables rapid development and testing of different models in the initial phase of model development. This phase also allowed for experimenting with various feature engineering techniques.

For feature engineering, we started with features that reflect drivers’ preferences and remain consistent over time. We normalize each user’s data for ML by comparing their current values to their past preferences. For example, a driver’s pickup locations tend to be fairly consistent. Additionally, we create new features by calculating ratios, which enhances our model’s accuracy. Alongside these personalized features, we incorporated features that capture general preferences of any user — for instance, all drivers prefer to generate more revenue and have fewer deadhead miles.
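A sketch of the normalization and ratio features described above, with hypothetical column names: each candidate load is compared to the driver’s own historical preferences rather than judged on absolute values.

```python
import pandas as pd

def add_personalized_features(candidates: pd.DataFrame, history: pd.DataFrame) -> pd.DataFrame:
    """Compare candidate loads to each driver's past bookings (hypothetical columns)."""
    prefs = (
        history.groupby("driver_id")
        .agg(
            avg_trip_distance=("trip_distance", "mean"),
            avg_rate_per_mile=("rate_per_mile", "mean"),
        )
        .reset_index()
    )
    out = candidates.merge(prefs, on="driver_id", how="left")

    # Ratio features: how does this load compare to what the driver usually books?
    out["distance_vs_usual"] = out["trip_distance"] / out["avg_trip_distance"]
    out["rate_vs_usual"] = out["rate_per_mile"] / out["avg_rate_per_mile"]

    # A general, non-personalized preference: fewer deadhead miles per revenue dollar.
    out["deadhead_per_revenue"] = out["deadhead_miles"] / out["revenue"]
    return out
```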

While developing P-Rex, we experimented with numerous parameters and model hyper-parameters, such as the look-back window for the training period, different feature sets, and regularization parameters in XGBoost. We evaluated the Area Under the Curve (AUC) for various parameter sets and selected those that yielded the best results. We also emphasized model interpretability and utilized the SHAP (SHapley Additive exPlanations) package. In addition to checking the SHAP values of user-specific (i.e., personalized) features to understand user behavior and inform subsequent actions, we also reviewed the SHAP values of revenue related features to ensure they have the expected impact as a sanity check for the model.
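As a sanity-check sketch (reusing the model and validation frame from the earlier XGBoost sketch), SHAP’s TreeExplainer works directly with XGBoost and lets us verify that revenue-related features push predictions in the expected direction:

```python
import shap

# TreeExplainer works natively with tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# Global view: mean absolute SHAP value per feature, i.e. which features matter most.
shap.summary_plot(shap_values, X_val, plot_type="bar")

# Direction check for a revenue-related feature: higher rate_per_mile should,
# on average, push the predicted booking probability up.
shap.dependence_plot("rate_per_mile", shap_values, X_val)
```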

Feeding Fred: Preparing the Data

While feature selection and engineering are at the core of what makes ML models great, the structure of your data sets is crucial for model performance. Common pitfalls can be avoided by:

  • Ensuring that data used for training reflects the state as it would be for real-time inference
  • Accurately reflecting historical states when using historical aggregates during training, computing them the same way as for live inference
  • Ensuring consistent application of feature engineering between training and inference

To address these issues, remember that all features are ultimately derived from how entities (in our case, “drivers” and “loads”) interact over time. We separate our feature sources into information that is relevant “now” for a given driver and load, and information regarding historical preferences and behavior.

Timely Data

We combine the information known about the driver and the load at the time of prediction to develop features relevant to driver/load compatibility, such as the driver’s current distance from the pickup point. When the model is deployed, you simply extract this information at the time of inference. However, during training, you must capture the state of these entities as they would have been for a live prediction. The ETL process must capture the entities’ state across multiple points in time to avoid target leakage.
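One way to build point-in-time correct training rows is an as-of join: for each historical driver/load event, attach the latest driver state known at or before that moment, never a later one. A minimal pandas sketch with hypothetical table and column names:

```python
import pandas as pd

# Label events: each row is a load surfaced to a driver, with whether it was booked.
events = pd.read_parquet("driver_load_events.parquet").sort_values("event_time")

# Periodic snapshots of driver state (location, hours of service, ...) over time.
driver_state = pd.read_parquet("driver_state_snapshots.parquet").sort_values("snapshot_time")

# As-of join: pick the most recent snapshot at or before each event, so training
# features match what a live prediction would have seen (no target leakage).
training_rows = pd.merge_asof(
    events,
    driver_state,
    left_on="event_time",
    right_on="snapshot_time",
    by="driver_id",
    direction="backward",
)
```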

Historical Driver Preference

To estimate a driver’s compatibility with a load, it’s helpful to look at the driver’s past behavior. Some drivers prefer local routes, while others prefer long hauls. Computing such features requires long-running aggregates, and their definitions need to be standardized. Failing to do so can lead to several issues:

  • Expensive offline aggregations that are not easily reproducible in production
  • Inconsistent definitions between offline and real-time applications
  • Incorrectly associating feature values with their observed time, leading to major performance discrepancies between training and prediction — this relates to label leakage

To address these issues, use a properly designed feature store (see the sketch after this list) where:

  • The same process generates values for both offline and real-time applications
  • The process loads the most recent feature values to a low-latency serving store
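
A sketch of that single-process idea, under the assumption of a pandas-based batch job (store names and the online client are hypothetical): one function defines the aggregates, an offline backfill replays it at historical cut-off times for training, and the same function materializes the latest values for serving.

```python
import pandas as pd

bookings = pd.read_parquet("bookings.parquet")  # hypothetical raw booking history

def driver_preference_features(bookings: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Single definition of the long-running aggregates, shared by backfill and serving."""
    window = bookings[
        (bookings["booked_at"] < as_of)
        & (bookings["booked_at"] >= as_of - pd.Timedelta(days=90))
    ]
    return (
        window.groupby("driver_id")
        .agg(
            avg_trip_distance_90d=("trip_distance", "mean"),
            local_route_fraction_90d=("is_local", "mean"),
            bookings_90d=("load_id", "count"),
        )
        .reset_index()
        .assign(feature_time=as_of)
    )

# Offline: replay the same definition at many historical cut-offs to build training features.
backfill = pd.concat(
    driver_preference_features(bookings, ts)
    for ts in pd.date_range("2024-01-01", "2024-06-01", freq="W")
)

# Online: the same function produces the latest values, which a separate job would
# load into the low-latency serving store (placeholder call below).
latest = driver_preference_features(bookings, pd.Timestamp.now())
# online_store.write("driver_preference_features", latest)
```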

Unleashing Fred: Deployment and Monitoring

Deploying ML models to production in a scalable, performant way requires several key components:

  • A convenient workflow for machine learning engineers and data scientists, with model life cycle management from training to monitoring live performance
  • Solid integration of the feature store for both offline and online applications
  • Monitoring and scalability of the model in production

Since we are on GCP (Google Cloud Platform) and our data scientists were already using Vertex AI Jupyter notebooks, the Vertex AI tool stack was a natural starting point for us. We deployed our first version of P-Rex that way but soon encountered some bottlenecks:

  • Deploying custom code along with your model artifact isn’t straightforward. The general inference interface expects all feature engineering to be done before sending the prediction request over the wire to the hosted model, requiring us to:
      • Replicate feature engineering in both the training job and the real-time inference code
      • Make the client application aware of the ML specifics of the model
      • Make changes in several places whenever we want to modify feature engineering
  • The feature store service and the model service are decoupled. The feature store is called differently for training and inference. Moreover, the client application needs to make a separate call to the feature store client and perform feature engineering before calling the model service. This is a poor abstraction

While our goal is not a comparative evaluation of different MLOps products, we switched to Qwak in subsequent iterations. Their model abstractions allow us to define the model, feature engineering, and inference behavior in a single place. This eliminates code or logic replication between applications. Moreover, their feature store client is deployed to the same cluster as the model for low-latency serving, and the integration is seamless. The application client only needs to call the model with a payload composed of understandable entities, not vectors and matrices of numerical values.
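The shape of that abstraction, sketched generically below; this is an illustration of the pattern rather than Qwak’s actual SDK, and the feature store interface and column names are hypothetical. Feature retrieval, feature engineering, and scoring live behind one class, so the client only sends business entities such as a driver ID and candidate load IDs.

```python
from dataclasses import dataclass
from typing import Protocol
import pandas as pd
import xgboost

class FeatureStore(Protocol):
    """Hypothetical low-latency feature store client."""
    def get_driver(self, driver_id: str) -> pd.DataFrame: ...
    def get_loads(self, load_ids: list[str]) -> pd.DataFrame: ...

FEATURE_COLUMNS = ["trip_distance", "rate_per_mile", "distance_vs_usual"]

@dataclass
class PRexService:
    model: xgboost.XGBClassifier
    feature_store: FeatureStore

    def predict(self, driver_id: str, load_ids: list[str]) -> pd.DataFrame:
        # 1. Fetch the stored features for the entities named in the request.
        driver = self.feature_store.get_driver(driver_id)
        loads = self.feature_store.get_loads(load_ids)

        # 2. Apply the same feature engineering used at training time, in the same place.
        loads = loads.assign(
            distance_vs_usual=loads["trip_distance"] / driver["avg_trip_distance"].iloc[0]
        )

        # 3. Score and return ranked loads; the caller never handles raw feature vectors.
        loads["score"] = self.model.predict_proba(loads[FEATURE_COLUMNS])[:, 1]
        return loads.sort_values("score", ascending=False)[["load_id", "score"]]
```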

Fig. 2: Architectural diagram of personalized load recommendation system. The key to any successful real-time ML service is to centralize the feature data generation and the feature engineering to be identical in offline training and real-time inference.

Ensuring Fred’s Growth: Continuous Improvement

To ensure Fred’s continuous improvement and growth with each new iteration, we focus on three key areas. First, we collect feedback from drivers through surveys and their direct interactions with the Ops team. Second, based on this feedback and brainstorming sessions with teams beyond engineering, such as Product, Design, and Ops, we develop additional features to enhance our model’s intelligence. Finally, we maintain a process of continuous retraining and deployment. We have automated weekly deployments, triggered by an AUC threshold. We also monitor model performance and interpretability by checking the drift of AUCs for both test and future holdout datasets, as well as feature importance. Training data quality is monitored by checking the null fraction of features.
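A sketch of the kind of gate such an automated deployment can run (the threshold and null-fraction values here are illustrative, not our exact configuration): the retrained model is only promoted if AUC on both the test set and a future holdout clears the bar and basic data-quality checks pass.

```python
from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.85        # illustrative bar, applied to test and future holdout AUC
MAX_NULL_FRACTION = 0.10    # illustrative data-quality bar on training features

def should_deploy(model, X_test, y_test, X_future, y_future, training_frame) -> bool:
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    future_auc = roc_auc_score(y_future, model.predict_proba(X_future)[:, 1])
    worst_null_fraction = training_frame.isna().mean().max()

    print(
        f"test AUC={test_auc:.3f}, future holdout AUC={future_auc:.3f}, "
        f"max null fraction={worst_null_fraction:.2%}"
    )
    return (
        test_auc >= AUC_THRESHOLD
        and future_auc >= AUC_THRESHOLD
        and worst_null_fraction <= MAX_NULL_FRACTION
    )
```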

Fig. 3: P-Rex model’s performance over time, using AUC as the metric. The AUC scores for both the test data (blue line) and the future holdout data (orange line) are plotted against deployment dates. Despite some fluctuations, both lines generally stay above 0.85, indicating strong model performance. The use of test and future holdout data ensures robustness, showing the model performs well on current and future unseen data. Continuous monitoring and evaluation help maintain and improve the model’s accuracy over time.

What Fred Could Do: Performance & Application

Search Ranking

Our first use case for P-Rex is to improve search relevance ranking for each driver when they search for work. Drivers search for loads similarly to how one may look for a flight. Traditionally, we defaulted to sorting the results with parameters such as rate per mile as a simple heuristic for desirability, with the ability for the driver to re-sort by other fields and further filter. Our hypothesis was that using P-Rex, we could achieve a default sorting that better reflects the user’s preferences. Our experiment indicated that using this approach increased the booking rate by 12% over the control group.

Push Recommendations (Push-Rex)

While many of us find constant app notifications annoying, our hypothesis was that our drivers would find it useful if we automatically sent them relevant options for their next job. By talking to customers, we learned that because they spend much of their day driving, proactively sending them relevant recommendations helps them. When ranking search results, we already know what the user is looking for. For pushes, however, we first need to determine which loads fit our drivers’ schedules and preferred geographies before ranking them. We also want to ensure that the options pay above the market benchmark for the lane before recommending them.
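A simplified sketch of that filter-then-rank pipeline (helper names, columns, and thresholds are hypothetical): candidate loads are first narrowed to those that fit the driver’s schedule, preferred geography, and a market-rate benchmark, and only then scored with the same model used for search ranking.

```python
import pandas as pd

def push_candidates(
    driver: dict,
    loads: pd.DataFrame,
    market_rates: pd.DataFrame,
    model,
    feature_cols: list[str],
    top_k: int = 3,
) -> pd.DataFrame:
    """Filter-then-rank for push recommendations (hypothetical columns)."""
    candidates = loads.merge(market_rates, on="lane_id", how="left")

    # 1. Filter: schedule fit, preferred geography, and pay above the lane benchmark.
    candidates = candidates[
        (candidates["pickup_time"] >= driver["available_from"])
        & (candidates["pickup_state"].isin(driver["preferred_states"]))
        & (candidates["rate_per_mile"] > candidates["lane_benchmark_rate"])
    ].copy()

    # 2. Rank: score the survivors with the P-Rex model and keep only the top few.
    candidates["score"] = model.predict_proba(candidates[feature_cols])[:, 1]
    return candidates.nlargest(top_k, "score")
```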

Our drivers find this feature very useful, with consistently over 11% of push notifications being clicked. Moreover, the booking rate from push notifications is almost 3x that of organic search.

In the future, we aim to make this feature even more relevant, as bespoke recommendations help our drivers stay active and earn more money.
