Identifying Commute Trips at Lime

Nick Dulchin · Lime Engineering · Apr 6, 2021

Understanding why riders choose to ride with Lime is key to designing new product features, customizing how we talk to different riders, deciding which promotions to offer, and understanding what’s driving some of the long-term trends we are observing on the ground.

In this post, we will walk through how and why we developed a machine learning model to identify commute trips specifically. We follow the commute use case very closely because it helps us understand if we are meeting our reliability goals; reliability must be high for riders to trust us with their commutes.

Why Build a Classification Model for Commute Trips?

Lime regularly polls riders and non-riders about why they do (or do not) use our service. Some of those surveys are administered by our User Research team, some by third parties, and some are embedded directly into our product. In fact, we ask most riders at the end of their trip to tell us why they used Lime (see below).

End Trip Survey

While this data can give us general information about Lime usage over a large region, it is too sparse to give us a precise understanding of local or individual behavior. That is, we cannot use this data to micro-adjust our strategy at the local level or send customized marketing or promotions to individual riders.

For that, we need an approach that can take what we learn from the few trips we do have information on and generalize it to all trips.

Enter machine learning.

Building the Model — Feature and Model Selection

We chose to model the problem as binary classification: every trip is classified as either “commute” or “not commute”.

Step 1 — Feature generation

The first step is to collect and choose features to build the model. We initially considered a very large set of potential features based on our understanding of what might correlate with a commute trip. They can be categorized into trip-level features, geographic context features, and rider context features.

Trip level features (examples)

  • Total trip duration, trip distance, trip displacement (straight line distance between start and finish): Commute trips are usually shorter in duration and more direct in route than recreational trips.
  • Time of day, day of week: Commute trips tend to concentrate on weekdays and rush hours.
  • Group ride: Group rides are unlikely to be commutes.
  • Weather: Bad weather usually reduces the number of recreational trips more so than commute trips.

Geographic context features (examples)

  • Trip region: Different cities and neighborhoods have different base rates of commuting.
  • OpenStreetMap points of interest: Simple counts of different point of interest categories (e.g. tourist landmarks, public transit, etc.) can encode information about the origin or destination of a trip, which can correlate with commute or non-commute trips.

Rider context features¹ (examples)

  • Count of past trips: Having many past trips over a long time period makes it more likely that a trip is a commute; having many trips on the same day or in the past few hours makes it less likely.
  • Repeat trips: Commutes are more likely to be repeats of past trips. We can measure this by looking at past trip start and end locations.
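
To make this concrete, here is a minimal sketch of how a few of the trip-level features above could be derived with pandas. The `trips` DataFrame and its column names (`started_at`, `ended_at`, `distance_m`, start/end coordinates) are hypothetical placeholders, not our production schema:

```python
import numpy as np
import pandas as pd

def build_trip_features(trips: pd.DataFrame) -> pd.DataFrame:
    feats = pd.DataFrame(index=trips.index)

    # Total trip duration in minutes.
    feats["duration_min"] = (
        trips["ended_at"] - trips["started_at"]
    ).dt.total_seconds() / 60.0

    # Straight-line displacement between start and finish (haversine, meters).
    lat1, lng1 = np.radians(trips["start_lat"]), np.radians(trips["start_lng"])
    lat2, lng2 = np.radians(trips["end_lat"]), np.radians(trips["end_lng"])
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    feats["displacement_m"] = 2 * 6_371_000 * np.arcsin(np.sqrt(a))

    # A ratio near 1.0 means a direct, commute-like route.
    feats["directness"] = feats["displacement_m"] / trips["distance_m"].clip(lower=1)

    # Temporal features: commutes cluster on weekday rush hours.
    feats["hour_of_day"] = trips["started_at"].dt.hour
    feats["day_of_week"] = trips["started_at"].dt.dayofweek

    return feats
```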

Step 2 — Feature selection

We use Shapley values as our feature importance metric.² We started with ~100 features and iteratively trained, tuned, assessed performance, and dropped features until model performance degraded significantly.

The excellent SHAP package implements computationally efficient ways to calculate Shapley values and allows us to visually communicate feature importance. Each dot below is a trip, and its color indicates a high (red) or low (blue) value for that feature. Dots to the left of 0 decrease the predicted likelihood of a commute; dots to the right increase it. Using the day-of-week feature as an example, we see that the lightest red values (which represent the encodings for Saturday and Sunday) are highly predictive of non-commute trips.

Feature Importance
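
A beeswarm plot like the one above takes only a few lines with SHAP. A minimal sketch, assuming a trained LightGBM classifier `model` and a validation feature matrix `X_valid` (both placeholder names):

```python
import shap

# TreeExplainer computes exact Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Note: for binary classifiers, some SHAP versions return one array per
# class; if so, pass the positive-class entry, e.g. shap_values[1].
shap.summary_plot(shap_values, X_valid)
```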

Step 3 — Model selection

We evaluate different models using ROC curves. An ROC curve illustrates the trade-off between the true positive rate (labeling a true commute as a commute) and the false positive rate (labeling a true non-commute as a commute). Better models sit toward the top left of the chart because they accurately label many true commutes (high true-positive rate) while avoiding mislabeling non-commutes (low false-positive rate). The plot below shows how different models and simple rules of thumb compare. We looked at other metrics as well, but chose to plot ROC over, for example, precision-recall because there is no extreme imbalance between the proportions of commute and non-commute trips.

As illustrated in the chart below, machine learning models improve significantly upon simple heuristics including:

  • classifying all weekday trips between 7AM-10AM and 4PM-7PM as commutes
  • labeling all repeat trips as commutes
  • flipping a coin (random classification)

Many models and feature sets were evaluated in the course of this exercise — and we should stress the iterative part of this process — but only a handful of them are shown below. In the graph we plot the performance of an L1-regularized logistic regression (scikit-learn implementation, light blue), a gradient boosting model (LightGBM implementation, dark blue), and an ensemble model that combines those two models into one meta-model (mlxtend implementation, pink).

Model Performance
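
For illustration, here is a minimal sketch of how three such candidates could be fit and compared on ROC/AUC. The hyperparameters and the `X_train`/`y_train`, `X_test`/`y_test` matrices are placeholder assumptions, not our tuned settings:

```python
from lightgbm import LGBMClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# L1-regularized logistic regression (liblinear supports the l1 penalty).
lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
gbm = LGBMClassifier(n_estimators=500, learning_rate=0.05)

# Meta-model that stacks the two base models on their predicted probabilities.
ensemble = StackingCVClassifier(
    classifiers=[lr, gbm],
    meta_classifier=LogisticRegression(),
    use_probas=True,
    cv=5,
)

for name, model in [("logistic", lr), ("lightgbm", gbm), ("ensemble", ensemble)]:
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)  # points to plot the ROC curve
    print(name, "AUC:", roc_auc_score(y_test, scores))
```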

Generalizing the model to all Lime trips

The model described in the previous section is trained and tested on a subset of all trips — those that have labels generated by the end-trip survey. This subset is not generated at random and we could run into some issues if the model’s performance degrades when it is exposed to different data.

In other words — how can we trust that the new model performs well over all Lime trips when it was developed on a small, non-random subset of trips?

As an example, it could be that healthcare workers — who are likely over-represented in our dataset because it was collected during the Covid lockdown — tend to commute very early in the morning. Our model would then learn that commute trips generally take place early in the morning. When the lockdown ends and the rest of the commuters come back, our model might fail to detect commutes that occur later in the day.

To be sure, not all biases will cause a degradation in the model’s performance and some of them can be readily fixed with reweighting techniques. But some of them cannot and we still want to estimate how much of an issue they represent.

Identifying potential sources of bias and measuring their impact on model performance

In this section, we discuss three potential sources of bias and how we addressed them:

  • Covid might be disrupting commute patterns
  • End-trip survey data (i.e. our training set) is only collected on 4- and 5-star trips
  • End-trip survey response rates vary across regions

We used non-random splits of training and test data to measure the first two sources of bias and used a modified loss function to address the third source of bias.

Covid or seasonality induced bias

Covid is disrupting commute patterns over time and a model trained during the height of Covid might not generalize to a post-Covid world. To understand the extent of this bias, we compare the model’s performance when the training and testing datasets are generated at random to when the model is trained on (chronologically) earlier data and tested on later data. All of the survey data was collected during the pandemic so this is not perfect, but mobility behavior has still changed over time during this period, so we consider it a useful proxy.

We do not see significant loss of performance with the chronological split (vs. the randomized split). We therefore estimate that the model generalizes well enough for now. We are continuously monitoring this as more of our riders get vaccinated and we collect new data.
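
A minimal sketch of this check, assuming a labeled `trips` DataFrame with a `started_at` column; `FEATURES` and the `is_commute` label column are hypothetical names:

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def fit_and_score(train, test):
    # Hypothetical helper: fit on the train split, return AUC on the test split.
    model = LGBMClassifier()
    model.fit(train[FEATURES], train["is_commute"])
    scores = model.predict_proba(test[FEATURES])[:, 1]
    return roc_auc_score(test["is_commute"], scores)

# Random split: assumes train and test look alike.
train, test = train_test_split(trips, test_size=0.2, random_state=0)
auc_random = fit_and_score(train, test)

# Chronological split: train on earlier trips, test on later ones.
trips = trips.sort_values("started_at")
cutoff = int(len(trips) * 0.8)
auc_chrono = fit_and_score(trips.iloc[:cutoff], trips.iloc[cutoff:])

# A large drop from auc_random to auc_chrono would suggest the model does
# not generalize across time (e.g., shifting pandemic-era behavior).
print(auc_random, auc_chrono)
```

The same pattern applies to the trip-rating split described in the next subsection: train on 5-star trips, test on 4-star trips, and compare against the randomized baseline.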

Trip rating bias

We only collect survey data for 4- and 5-star rated trips, yet we want to generalize the model to all trips, rated and unrated. Again, we do not see a significant loss in model performance when training on 5-star trips and testing on 4-star trips, so we feel that the model generalizes well enough in this respect.

Regional distribution bias

Survey response rates vary across regions and we do not have enough labeled data to train region-specific models for all regions. In order to make sure that the model “cares about” or focuses on optimizing for the right data, we need a way to tell the model which regions occur more often in the survey dataset than in the all trips dataset and vice versa.

To do that, we calculate each region’s percentage of all trips and its percentage of surveyed trips, and pass the ratio as a weight in the model’s loss function. This penalizes bad predictions more heavily in regions that are underrepresented in the survey and less heavily in regions that are overrepresented.
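
A minimal sketch of that weighting, assuming hypothetical `all_trips` and `train` DataFrames that each carry a `region` column (`FEATURES` and `is_commute` are again placeholder names):

```python
from lightgbm import LGBMClassifier

# Each region's share of all trips vs. its share of surveyed (labeled) trips.
share_all = all_trips["region"].value_counts(normalize=True)
share_survey = train["region"].value_counts(normalize=True)

# Regions underrepresented in the survey get weight > 1, overrepresented < 1.
weights = train["region"].map(share_all / share_survey)

model = LGBMClassifier()
model.fit(train[FEATURES], train["is_commute"], sample_weight=weights)
```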

Pseudo-labeling and active learning as data augmentation methods

In the remainder of this section, we discuss pseudo-labeling and active learning, which are techniques that increase the amount of labeled data available and in doing so attempt to increase the generalizability and performance of the model. Note that these techniques are currently under consideration but have yet to be implemented in the training or serving of our model.

Pseudo-labeling uses semi-supervised learning to create labels for previously unlabeled data. Unlabeled data in our case are trips without a survey response.

Different algorithms exist and they generally work by calculating some measure of distance between data points and then performing clustering. Unlabeled data points in close distance to highly uniform labeled data are assumed to have that label. This is done iteratively until clusters converge. Finally, the model is retrained using the new pseudo-labels in addition to the originally labeled data.
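
Implementations vary; as one illustration, here is a minimal sketch of a common variant, confidence-based self-training, which pseudo-labels the points the model is already very sure about rather than clustering on distances. `X_labeled`, `y_labeled`, and `X_unlabeled` are assumed to be numpy arrays:

```python
import numpy as np
from lightgbm import LGBMClassifier

# Fit on the labeled (surveyed) trips only.
model = LGBMClassifier()
model.fit(X_labeled, y_labeled)

# Keep only very confident predictions on the unlabeled trips.
proba = model.predict_proba(X_unlabeled)[:, 1]
confident = (proba > 0.95) | (proba < 0.05)

# Treat those predictions as pseudo-labels and retrain on the union.
X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
y_aug = np.concatenate([y_labeled, (proba[confident] > 0.5).astype(int)])
model.fit(X_aug, y_aug)
```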

This approach is often used when only a small amount of data is labeled. This is the case for us in smaller and newer regions where we operate. Initial tests show small performance increases and we see the method as useful for expanding the distributional coverage of our data and giving us increased model performance as a result.

The plot below illustrates the intuition behind pseudo-labeling. Note that only labeled data are shown: commutes in blue, non-commutes in orange. Pseudo-labeling takes advantage of clusters that exist in the data. If an unlabeled data point (again, no unlabeled data is plotted below) is close to many commute data points, we might pseudo-label it a commute trip and use it as training data. As a side note, to provide this type of visual intuition we first reduce the model’s many-dimensional feature space to 3 dimensions using t-distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique that attempts to preserve in the lower-dimensional space the relative distances between data points in the original space.

Dimensionality Reduced Visualization of Data
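
The projection behind a plot like this is essentially a one-liner with scikit-learn; a sketch assuming `X` is the numeric feature matrix used by the model:

```python
from sklearn.manifold import TSNE

# Project the high-dimensional feature space down to 3 dimensions.
embedding = TSNE(n_components=3, random_state=0).fit_transform(X)
# `embedding` has shape (n_trips, 3) and can be scatter-plotted,
# colored by the commute / non-commute label.
```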

Finally, we are also considering active learning to improve the generalization of the model to all Lime trips. Active learning focuses on the efficient use of manual labeling resources. The goal is to determine which unlabeled data points would be most informative to have labels for. We could then manually derive labels for those data points and in doing so improve the performance and generalizability of the model at a relatively low cost.
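
As a sketch of the simplest such strategy, uncertainty sampling, we could surface the trips whose predicted commute probability sits closest to 0.5; `model` and `X_unlabeled` are placeholder names:

```python
import numpy as np

# The model is least certain where the predicted probability is near 0.5.
proba = model.predict_proba(X_unlabeled)[:, 1]
uncertainty = -np.abs(proba - 0.5)

# Indices of the k trips whose labels would be most informative to collect.
k = 100
query_idx = np.argsort(uncertainty)[-k:]
```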

Here is a look at the entire process, from raw data, through transformation, train-test splitting, feature selection, hyperparameter tuning, and finally model selection and deployment.

ML Pipeline

Summary

Identifying why a rider took a trip enables a better understanding of the business, a tailored experience for different rider types, and better tracking of product and operational goals. At Lime, we use a rider survey to generate labeled commute data and ML modeling to generalize detection capabilities to all Lime trips.

In the post above, we go through feature and model selection, assessment of model performance, and key considerations for generalizing model output to all trips. We also discuss pseudo-labeling and active learning techniques that we are exploring to further improve performance. The current model significantly outperforms previously existing heuristics, is driving new thinking to better serve commuters, and is used to assess service reliability.

There are always new approaches to consider and improvements to make and we would love to hear your thoughts — please comment below to continue the conversation.

Thanks for reading and stay tuned for future posts about how we are using modeling and machine learning techniques to extend our rider behavior insights!

Thanks to Dounan Tang, Justin Bozonier, Tristan Taru, Ruben Kogel, Jeh Lokhande and the rest of the Data Science and Analytics team for their valuable comments and suggestions.

Thanks to Michael Kronthal and Zach Kahn for their user research insights.

Lastly, thanks to Jianfeng Hu and Arne Huang for helping troubleshoot the unforeseen complexities of putting this model in production.

[1]: The decision to use rider context features requires consideration. Downstream, we plan to use commute trips to classify riders as commuters, so ideally each trip would be considered independently of rider context. Unfortunately (and unsurprisingly), it is hard to get good performance without considering rider context, so ultimately we decided to include these features.

[2]: Shapley values are a way to assign credit for a joint task among a group of contributors that satisfies a particular set of important properties. Applied to predictive models, the features are the contributors, prediction is the task, and Shapley scores are the credit that each feature deserves. The method reduces the inconsistency and arbitrariness that plagues some other feature importance measures by considering all possible feature permutations when calculating values.
