How we measure at Zoba

Jay Cox-Chapman · Published in Zoba Blog · Nov 12, 2020

This piece is co-authored by Evan Fields, Head of Data Science at Zoba, and Jay Cox-Chapman, Director of Product at Zoba.

At Zoba, we fervently believe in measuring how well our products work. In a previous post, we explored why spatial optimization is so important to mobility operators: if supply fails to keep up with demand, a fleet will forfeit up to a third of possible rides. Zoba Move, our deployment and rebalancing product, recommends optimal supply patterns for mobility operators.

Therefore, we are always asking — and helping our customers ask — how can we tell if a supply strategy was successful and what value it provided? How can we tell whether we’re deploying to ideal locations or rebalancing the right vehicles? We care about demonstrating the incremental rides and revenue that result from following Zoba’s recommendations because it means we’re helping our customers succeed. Today we’re going to discuss some of the metrics that we use to capture how well our recommendations are working.

The real world is too complex for A/B testing

As a first principle, the goal of any supply strategy is to improve some high-level objective such as total trips served or total revenue generated. This holds whether you’re moving micromobility scooters, delivering packages, or managing an autonomous vehicle fleet. Measuring the effect of a new supply strategy should therefore be easy: just look at the change in the preferred objective. That answer is philosophically correct but borderline useless in practice, where weather, seasonality, day of week, variations in fleet size, competition, regulation, marketing, hardware changes, and more can hopelessly confound any analysis of a change in supply strategy. With so many factors in play, it’s impossible to determine to what extent a change in supply strategy caused a change in a high-level objective. For example, if total rides increased when a new supply strategy started, was that increase caused by the new supply strategy, or because it was the first warm week of spring?

Most software companies, when they want to test the effect of a change, set up an A/B test to check whether the change has a statistically significant effect on an outcome they care about. Zoba has to contend with messy real-world effects that are not present in a software-only user experience. Even switchback testing, a specialized form of A/B testing that tries to control for time-based effects, is hard to apply here, because real-world operations cannot pivot quickly between configurations and seasonal effects still confound the comparison.

What about regression?

To properly understand the effects of supply interventions, we need a way to control for confounding factors. Many modelers will naturally look to regression as a way of looking at a market as a whole, or “top-down”. At Zoba, we’ve found that regression can work well enough for measuring the effects of operational interventions, but it’s an unreliable tool for two key reasons. First, key mobility objectives are determined by an array of factors typically too numerous to enumerate — let alone quantify well enough for use in regression. Consequently, parsimonious regression models built with available data tend to have large unexplained variance. It may come as a surprise that simply gathering more days’ worth of data does not solve the problem: regression models covering long time periods have to account for slow-but-impactful market evolutions due to seasonality, changing travel behavior, hardware upgrades, new pricing models, and so forth¹. Second, mobility exhibits significant temporal dependencies. Think about how the rides that happen today depend on how the fleet was arranged this morning, which in turn depends on the rides that happened yesterday. These time dynamics have complex nonlinear structure and are difficult to capture in an interpretable regression model.

While top-down approaches like A/B tests or regression can be very useful, we find it more practical to use detailed, granular trip-by-trip and vehicle-by-vehicle data to conduct natural experiments. For example, we can control for important day-by-day effects such as weather and day of week by comparing sets of deployments which occurred on the same day. We might compare ride or revenue outcomes for vehicles deployed according to Zoba’s recommendations on July 4th to outcomes for vehicles deployed outside Zoba’s recommendations on the same day. Since both sets of deployments happen on the same day and share the same weather, differences in outcomes between the groups are unlikely to be driven by day of week effects, the holiday, or weather conditions, and we get closer to isolating the effects of our supply strategy change.

Not all metrics are equal

Even with granular data, developing effective metrics requires subtlety due to the spatio-temporal interactions inherent in shared mobility. Users (and potential users) of shared mobility have vehicle preferences, typically for vehicles that are nearby, highly charged, and undamaged. Each user may choose the vehicle they deem best. Consequently, an idle shared vehicle awaiting a user competes with all nearby vehicles (from both the same and competing services) for rides. That means the next event that happens on a given vehicle depends on the locations and states of nearby vehicles. Those vehicles’ next events in turn depend on their nearby vehicles. Trace this network of inter-vehicle dependencies out, and you’ll realize that in general, the next event on any vehicle depends on the locations and states of all other vehicles in the market. To illustrate vehicle dependencies, we plotted the locations of vehicles in a live market (blue dots), with lines connecting vehicles within 200 meters of each other — close enough that they’re competing for rides. Notice how each vehicle competes with only a few neighbors, but almost the entire fleet is joined together by this “competition network:”

Each dot is a vehicle. The lines connect vehicles within 200m of one another.
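
To make the idea concrete, here is a minimal sketch of how such a competition network could be built from raw vehicle coordinates. It is illustrative only: the vehicle records, field names, and the pure-Python pairwise approach are assumptions, not Zoba's production pipeline.

```python
# Hypothetical sketch: building the "competition network" of vehicles
# within 200 m of one another. Field names are assumptions for illustration.
from math import radians, sin, cos, asin, sqrt
from itertools import combinations

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def competition_edges(vehicles, threshold_m=200.0):
    """Return pairs of vehicle ids whose locations are within threshold_m."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(vehicles, 2)
        if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= threshold_m
    ]

# Example: three idle vehicles; the first two compete, the third is isolated.
fleet = [
    {"id": "v1", "lat": 42.3601, "lon": -71.0589},
    {"id": "v2", "lat": 42.3608, "lon": -71.0585},  # roughly 85 m from v1
    {"id": "v3", "lat": 42.3700, "lon": -71.0400},  # over a kilometer away
]
print(competition_edges(fleet))  # [('v1', 'v2')]
```

For a full fleet, a spatial index such as a k-d tree would avoid the quadratic pairwise comparison, but the structure of the resulting network is the same.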

Vehicle locations and states are in turn dependent on previous events: most obviously, a vehicle usually ends up somewhere because a customer put it there. In short, we have competition creating spatial dependencies and rides creating temporal dependencies, and thus every vehicle’s next event depends on all previous events from all vehicles in the market.

Some common industry metrics are confounded by these spatio-temporal interactions. One such metric is the “time to first ride after deployment.” At face value, this seems like a useful metric: if I’ve placed a vehicle in a good spot, it should see a ride relatively quickly. However, “time to first ride” does a poor job of describing whether a supply strategy is optimal or not, for three reasons. First, next events on vehicles are subject to vehicle interactions, so other vehicles in the neighborhood of the deployed vehicle will have a significant effect on the outcome. If a vehicle is placed next to four other vehicles, the other four may get a ride first, even if there is strong demand. Second, a short time-to-first-ride is not always optimal; operators will often rightly deploy in the morning to capture demand for the afternoon commute. Finally, time-to-first-ride will be heavily skewed by the time of day that the deployment was done.

The web of dependencies between vehicles suggests caution — or at least epistemic modesty — when designing metrics. Because vehicles interact, no metric can measure how “good” a single deployment or rebalance action is. Indeed, the very notion of the quality of a single supply adjustment is ill-defined exactly because all contemporaneous supply adjustments jointly affect all vehicles. Supply interactions are complex — mobility is complex! — and no metric can capture a practically useful notion of how successful a supply intervention was while accounting for these interactions. This is doubly the case for simple, intuitive metrics. Instead, the best we can do is develop a bouquet of metrics, each of which uniquely captures something important about how well supply interventions work. And when analyzing metric data, we must remain aware of the inevitable biases of each metric. In general, we believe that these metrics actually understate Zoba’s impact, but we feel the tradeoff is worth it for making our impact easier to understand.

How we measure at Zoba

When a new customer begins using Zoba to inform their deployment and rebalancing optimizations, we keep track of exactly which supply interventions followed Zoba’s recommendations. This is essential; you can’t measure how effective an operational intervention was without knowing whether (or to what extent) the intervention was actually implemented. We then present the customer with a set of “per-vehicle” metrics that track exactly what happens to an individual vehicle after that vehicle is deployed or rebalanced.

Zoba’s effect can be estimated by comparing metrics for vehicles touched according to Zoba’s recommendations to metrics for vehicles rebalanced or deployed outside of our recommendations. Whenever possible, we perform a paired comparison between comparable sets of deployments or rebalances, such as deployments with and without Zoba that happened on the same day. For a given day, we look at the average metric for all Zoba deployments and all legacy deployments. So we can say something like “On November 3rd, Zoba deployments had a 90% ride probability, and legacy deployments had a 60% ride probability. Thus on this day, Zoba deployments had a 50% better ride probability.”
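
As a rough sketch of that same-day comparison, the snippet below groups deployments by date and by whether they followed a Zoba recommendation, then computes each group's ride probability and the relative lift. The column names and the toy data are assumptions for illustration, not Zoba's actual schema.

```python
# Minimal sketch of a same-day paired comparison, assuming each record
# carries a date, whether the deployment followed a Zoba recommendation,
# and whether the vehicle got a ride within 24 hours.
import pandas as pd

deployments = pd.DataFrame({
    "date": ["2020-11-03"] * 6,
    "followed_zoba": [True, True, True, False, False, False],
    "ride_within_24h": [True, True, False, True, False, False],
})

daily = (
    deployments
    .groupby(["date", "followed_zoba"])["ride_within_24h"]
    .mean()                          # ride probability per group per day
    .unstack("followed_zoba")
    .rename(columns={True: "zoba", False: "legacy"})
)
daily["relative_lift"] = daily["zoba"] / daily["legacy"] - 1
print(daily)
# On a day where Zoba deployments hit 90% and legacy 60%, relative_lift
# would be 0.5, i.e. the 50% better ride probability described above.
```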

At Zoba, we’re continually working to develop principled metrics that capture some of the competing goals when performing a supply adjustment. The remainder of this piece describes three such metrics: rides after deployment, 24-hour ride probability, and wasted deployments.

Rides after deployment

Ideally, a user rides a vehicle from one high demand location to another, so that the vehicle can quickly serve another ride. This doesn’t always happen. Consider a commuter riding a vehicle from a train station to their suburban home at the end of the workday — the vehicle is unlikely to serve another ride for the rest of the day. To capture the notion that vehicles should be deployed such that users keep them in good circulation, we measure each vehicle’s rides after deployment, which is simply the number of rides the vehicle serves within the 48 hours after it’s deployed or until it’s re-deployed, whichever happens first. Why 48 hours? We find that for most markets, 48 hours is a long enough time horizon to give vehicles the chance to capture multiple downstream rides, but it’s also short enough to reflect that vehicles should get rides soon after deployment.
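
A minimal sketch of this computation follows, assuming we have a vehicle's deployment timestamp, its next deployment timestamp (if any), and the timestamps of its rides. The event structure is illustrative, not Zoba's internal data model.

```python
# Hedged sketch of the rides-after-deployment metric: count a vehicle's
# rides in the 48 hours after deployment, or until its next deployment,
# whichever comes first.
from datetime import datetime, timedelta

def rides_after_deployment(deploy_time, next_deploy_time, ride_times,
                           window=timedelta(hours=48)):
    """Number of rides between deploy_time and the earlier of
    deploy_time + window and next_deploy_time (if any)."""
    cutoff = deploy_time + window
    if next_deploy_time is not None:
        cutoff = min(cutoff, next_deploy_time)
    return sum(deploy_time < t <= cutoff for t in ride_times)

deploy = datetime(2020, 11, 3, 7, 0)
rides = [datetime(2020, 11, 3, 8, 30), datetime(2020, 11, 3, 17, 15),
         datetime(2020, 11, 4, 9, 5)]
print(rides_after_deployment(deploy, None, rides))  # 3
```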

24-hour ride probability

This metric measures the likelihood that a vehicle will receive a ride within 24 hours after deployment. After all, it’s bad if vehicles sit idle. We use 24 hours because it smooths over cycles of high and low demand and avoids time-of-day effects. The ride probability is a powerful and highly interpretable metric, but it can be subject to some confounding. In particular, recall that whether or not a vehicle gets a ride depends on the locations of nearby vehicles. An individual vehicle’s ride probability can be depressed by over-deploying in the same area. As an intuitive example, suppose Zoba recommends a single deployment at a street corner, but instead 20 vehicles are placed at the corner. Then Zoba would count the first vehicle deployed as following its recommendation, but with so many vehicles present, even that Zoba-recommended vehicle might not capture any rides in its first 24 hours after deployment.
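
For concreteness, a hedged sketch of the metric itself: a deployment “succeeds” if the vehicle serves at least one ride within 24 hours, and the ride probability is the share of deployments that succeed. The function and field names are assumptions for illustration.

```python
# Minimal sketch: 24-hour ride probability over a set of deployments.
from datetime import datetime, timedelta

def got_ride_within_24h(deploy_time, ride_times):
    """True if the vehicle served at least one ride within 24h of deployment."""
    cutoff = deploy_time + timedelta(hours=24)
    return any(deploy_time < t <= cutoff for t in ride_times)

def ride_probability(observations):
    """Share of (deploy_time, ride_times) observations with a ride within 24h."""
    if not observations:
        return float("nan")
    return sum(got_ride_within_24h(d, r) for d, r in observations) / len(observations)

obs = [
    (datetime(2020, 11, 3, 7), [datetime(2020, 11, 3, 9)]),  # ridden 2h later
    (datetime(2020, 11, 3, 7), []),                          # never ridden
]
print(ride_probability(obs))  # 0.5
```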

Wasted deployments

The problem with deploying 20 vehicles to the corner when a single deployment is optimal brings us to our next metric: wasted deployments. Deploying too many vehicles doesn’t lead to more rides than deploying just a few. Zoba captures this intuition by counting wasted deployments, calculated as follows. A market is divided into small spatial cells. If the number of vehicles in a given cell never falls below k > 0 vehicles over a given day, then the same set of rides could have been served with k fewer deployments to that cell, and thus the last k deployments to the cell are wasted. The following figure shows two cases at the same location, one where the 2 scooter deployments were valuable, and one where they were wasted.
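
Under the definition above, the wasted-deployment count for a cell on a given day is simply the lowest vehicle count the cell reaches that day (zero if the cell ever empties out), capped at the number of deployments actually made there. A minimal sketch, assuming we already have the cell's vehicle-count time series; the cell grid, snapshot cadence, and field names are assumptions for illustration.

```python
# Hedged sketch of the wasted-deployments count for one spatial cell
# over one day.

def wasted_deployments_in_cell(vehicle_counts, deployments_to_cell):
    """If the cell never drops below k > 0 vehicles over the day, the last
    k deployments to the cell were unnecessary (capped at the number of
    deployments actually made to the cell that day)."""
    if not vehicle_counts:
        return 0
    return min(min(vehicle_counts), deployments_to_cell)

# Case 1: the cell empties out during the day, so its deployments were used.
print(wasted_deployments_in_cell([2, 2, 1, 0, 1, 0], deployments_to_cell=2))  # 0

# Case 2: the cell never falls below 2 vehicles, so 2 deployments were wasted.
print(wasted_deployments_in_cell([3, 2, 2, 4, 3, 2], deployments_to_cell=2))  # 2
```

Summing this count over every cell in the market grid would give a market-level waste figure; the cell size and snapshot cadence are choices we leave unspecified here.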

By jointly analyzing the rides after deployment, ride probability, and wasted deployment metrics, we can form a holistic picture of immediate and downstream effects on deployed or rebalanced vehicles as well as how the other vehicles in the neighborhood are affected. When a suite of metrics tells a similar story of increased ride probability, increased downstream rides, and decreased waste — typically the case when Zoba begins informing market supply operations — we can be confident the corresponding supply strategy is working.

If you’re interested in optimizing your fleet and digging deeper into fleet performance, drop us a line at info@zoba.com.

[1] In our experience, regression models of shared mobility supply interventions work best with about 60 days of data. With much less data, estimated coefficients have untenably large uncertainty. With much more data, seasonal effects and other long-term changes add too many additional confounding factors. But even near that 60-day sweet spot, regression-estimated effects still have frustratingly wide uncertainty and have a habit of suggesting physically implausible results.
