Migrating Glovo’s dispatching service from a single machine to a distributed system

Lorenzo Arribas
Aug 20, 2019 · 4 min read

At Glovo, we have one vision: Connecting our customers to their cities. We do that by facilitating access to different services (like restaurants, supermarkets, and other stores) through a single mobile application and a fleet of independent couriers.

Anything you want. Delivered in minutes.

Of course, that is easier said than done. Behind the scenes, there are hundreds of factors that affect the experience of our customers and couriers.

One of the key factors in this equation is who is the perfect courier to deliver an order. This is very important for us because finding the best match means our customers will receive their orders faster, our couriers will deliver more orders every hour (and earn more money), and our partners will be able to receive more orders.

Answering that question is the main job of our dispatching team. And more specifically: the job of Jarvis.

Meet Jarvis, our Dispatching Service

Image for post
Image for post
This is how I feel our team works when we’re having a live issue with Jarvis. Source

Jarvis is Glovo’s dispatching service. Its goal is to make cities as efficient as possible. I’d say Jarvis stands out from other services in three different ways:

  • Its domain is very mathematical. We are running an optimisation problem based on estimations we get from machine learning models (e.g. how much time will it take the courier to reach a point? When will the partner have the order ready?).

From pets to cattle

A few months ago, we had a big problem: Jarvis was one of the most critical pieces in our system. However, it was also one of the most fragile.

This is how everything worked back then:

Image for post
Image for post
Jarvis ran on one single instance. This instance used the actor concurrency model to make decisions for different cities concurrently. Whenever there was a release (and a rollback), we would have a few minutes of downtime, delaying order delivery.

We used the actor model to run hundreds of concurrent actors. Each actor would control a particular city. Our main problem was that two actors could not be in charge of the same city, as they could potentially make conflicting decisions. This meant we could not distribute the service: we needed to rely on a single physical machine.

As a result, every time we deployed a new version we would have a few minutes of downtime, increasing our delivery time (one of our most important KPIs). If the version was faulty and we needed to roll back, we would have twice as much downtime.

Last quarter, we decided to change this and transition from a we-have-a-single-machine-please-cross-your-fingers architecture to a highly available, distributed architecture.

The task was at the same time daunting and very interesting. We had to make some fundamental changes to our system, but it was absolutely essential to make them gradually, without disrupting the service at all. To put it another way: we needed to change the plane’s engine in the middle of the flight.

Whiteboard session No.1: Requirements

The first thing we did was gather the whole team and come up with a list of requirements. Roughly speaking:

  1. The service should be distributed. At any time, there should be multiple machines in different availability zones or regions, competing to run the assignment problem for a city.

Whiteboard session No.2: (And beyond…)

We don’t want to make this post overly long and, to tell the truth, we would be really glad to hear from you and how you would approach this problem! So let’s leave it here for now.

In the next article, we’ll follow up with our approach, with a couple of surprising plot twists that made us redefine our solution.

glovo-engineering

Discover some of the challenges and funny stories we…

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store