Introducing the rideOS Ridehail API

Frank Huang · Published in rideOS · Oct 11, 2019

In this series of blog posts we will take a closer look at the technologies that power the rideOS ridehailing platform, starting with the Ridehail API.

The rideOS Ridehail API provides backend functionality critical to running a ridehail network — automatically matching riders to vehicles, tracking available vehicles, sending push notifications, visualizing the fleet in a web dashboard, and providing an archive of trip history. In this blog post, we highlight some of the details of how we built a robust and efficient ridehailing backend.

The most efficient assignment of vehicles to trips changes over time as vehicles move around and new riders request trips. To keep assignment quality high as the fleet changes, the Ridehail API continually calls rideOS’ Fleet Planner to keep assignments optimized. To keep things running smoothly, the Ridehail API must handle two concerns:

  1. Fleet Planner is called asynchronously, so we must handle concurrent state changes consistently.
  2. Fleet Planner considers all vehicle and trip states together, but the Ridehail API acts on one vehicle or trip state at a time.

Concurrent State Changes

Fleet Planner may take a few seconds to return results for larger fleets. In the meantime, the state of the fleet may have changed in conflicting ways. Consider this example:

  1. A rider requests a trip, which causes the Ridehail API to call Fleet Planner to find out which vehicle to assign the trip to.
  2. The rider cancels the trip.
  3. At around the same time, the Fleet Planner response arrives, assigning the trip to some vehicle.

There are now two potential states for the trip: cancelled, or active and assigned to a vehicle. To handle this, we model ridehailing trips with a set of rules that keep state changes consistent. In this example, cancellation is treated as a terminal state of the trip, so the trip ends up cancelled no matter the order of 2) and 3). These rules must be applied both when handling state changes and when handling Fleet Planner responses.
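
As a rough illustration, here is a minimal sketch of this kind of rule in Java. The names (TripState, applyTransition) are hypothetical, not the actual rideOS implementation; the point is that terminal states make the outcome independent of event ordering.

```java
// Hypothetical trip state machine: terminal states can never be left,
// so a late Fleet Planner assignment cannot resurrect a cancelled trip.
enum TripState {
    REQUESTED, ASSIGNED, COMPLETED, CANCELLED;

    boolean isTerminal() {
        return this == COMPLETED || this == CANCELLED;
    }
}

final class Trip {
    private TripState state = TripState.REQUESTED;

    // Returns true if the transition was applied. An assignment arriving
    // after a cancellation (step 3 after step 2 above) is simply rejected.
    synchronized boolean applyTransition(TripState next) {
        if (state.isTerminal()) {
            return false;
        }
        state = next;
        return true;
    }
}
```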

Consistently applying state changes at scale is a large part of the value that the Ridehail API brings, and is used as the foundation to build our multi-instance optimization scheme.

Local to Global

Fleet Planner considers the state of all vehicles and trips at once, but the Ridehail API is a local API that focuses on the states of individual vehicles and trips over time. We must reconcile this mismatch when calling Fleet Planner to update assignments.

The simplest way to keep assignments updated is to call Fleet Planner for every state change, but this is expensive. We want to limit costs while still keeping the latency of assignments low, so we developed a strategy that both queues and batches changes. This scheme batches frequent changes, but allows infrequent changes to be sent out to Fleet Planner immediately.

In this scheme, for a given fleet:

  1. Only one Fleet Planner call may be ongoing at a time.
  2. At most one additional Fleet Planner call may be queued behind the currently-ongoing one.

Concurrent Fleet Planner calls may produce conflicting decisions for the same vehicle or trip. The Ridehail API handles such conflicts by accepting only one of the results, so the other is effectively wasted work. We eliminate this waste by prohibiting concurrent Fleet Planner calls.

Any state change that happens during an ongoing call queues another call. Because Fleet Planner is a global solver, each Fleet Planner call takes into account all the state changes that happened before it started, so a queue of size one is sufficient.
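
A single-process sketch of this scheme might look like the following. The names and the single-thread executor are illustrative (the real system coordinates across instances, as described in the next section), and callFleetPlanner is a stand-in for the actual RPC.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of "one ongoing call, one queued call" batching for a single fleet.
final class FleetOptimizer {
    // The single-thread executor guarantees calls run one at a time.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // 0 = idle, 1 = one call running, 2 = one running plus one queued.
    private final AtomicInteger pending = new AtomicInteger(0);

    // Invoked on every state change (trip requested, cancelled, and so on).
    void onStateChange() {
        // Cap at 2: the queued call starts after this change, and Fleet
        // Planner reads the full fleet state, so this change is never lost.
        if (pending.getAndUpdate(n -> Math.min(n + 1, 2)) < 2) {
            executor.submit(this::runOnce);
        }
    }

    private void runOnce() {
        callFleetPlanner();
        pending.decrementAndGet();
    }

    private void callFleetPlanner() { /* global re-optimization RPC */ }
}
```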

Coordinating Across Multiple Instances

Our load balancing works on a per-request basis: calls to the Ridehail API for the same fleet, vehicle, or trip may be handled by different instances. This simplifies the design of the load balancer, but it means the Ridehail backend must coordinate the optimization for a fleet across multiple instances. We do this with inter-process locks and semaphores built on Zookeeper, based on recipes from the Apache Curator framework.

To represent the queue, we use a semaphore with 2 leases (one for the current call, one for the queued call). If an instance tries to acquire the semaphore and there are no leases remaining, then a Fleet Planner call has already been queued and nothing needs to be done. If an instance successfully acquires the semaphore, then it has joined the queue and is ready to call Fleet Planner.

At this point, two instances have potentially acquired the semaphore. To ensure that only one is actively calling Fleet Planner, we have an inter-process lock on the Fleet Planner call itself. Only the instance holding the lock can call Fleet Planner; once it is finished, it releases the lock and the other instance (the “queued” one) can begin its call.
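
Here is a sketch of how these two recipes compose, using Curator’s InterProcessSemaphoreV2 and InterProcessMutex. The Zookeeper paths and the callFleetPlanner method are hypothetical placeholders, not the actual rideOS code.

```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2;
import org.apache.curator.framework.recipes.locks.Lease;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch of per-fleet coordination across instances with Curator recipes.
final class DistributedFleetOptimizer {
    private final InterProcessSemaphoreV2 queueSemaphore; // 2 leases: running + queued
    private final InterProcessMutex plannerLock;          // at most one active call

    DistributedFleetOptimizer(CuratorFramework client, String fleetId) {
        queueSemaphore = new InterProcessSemaphoreV2(
                client, "/fleets/" + fleetId + "/optimize-queue", 2);
        plannerLock = new InterProcessMutex(
                client, "/fleets/" + fleetId + "/planner-lock");
    }

    void onStateChange() throws Exception {
        // Try to join the queue without blocking. If no lease is available,
        // a call is running and another is already queued, so this state
        // change will be picked up by the queued call; nothing to do.
        Lease lease = queueSemaphore.acquire(0, TimeUnit.MILLISECONDS);
        if (lease == null) {
            return;
        }
        try {
            // Of the (at most two) lease holders, only the lock holder
            // actually calls Fleet Planner; the queued one waits its turn.
            plannerLock.acquire();
            try {
                callFleetPlanner();
            } finally {
                plannerLock.release();
            }
        } finally {
            queueSemaphore.returnLease(lease);
        }
    }

    private void callFleetPlanner() { /* global re-optimization RPC */ }

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        new DistributedFleetOptimizer(client, "demo-fleet").onStateChange();
        client.close();
    }
}
```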

When using distributed locks, it’s important to consider the failure cases. In the Ridehail API, if the semaphore guarantee is broken (more than 2 instances acquire the semaphore), we queue more optimizations than necessary. If the lock guarantee is broken (more than 1 instance holds the lock), then concurrent optimizations can happen, but the state machine will reject any conflicting results. In both cases there is no consistency failure; the only consequence is wasted CPU time. Because of this, we consider distributed locking an acceptable design for this scheme.

The Next-Generation Ridehail API

There is still a lot to do to prepare the Ridehail API for the next generation of ridehailing fleets. We are exploring ideas like using gRPC to offer async, streaming-based APIs, and adopting languages with non-blocking, event-driven I/O instead of multiplexing I/O across threads.

If you’re excited about these kinds of problems and want to help us build the next generation of dispatch for ridehailing, we invite you to apply for open positions here!
