Randomised data — don’t get too excited :)

Get Real! — Transitioning to Real-Time Marketplace Analytics with Ops Data Brain

As one might imagine, being in an on-demand logistics business means that our customers interact with our product hundreds of times a minute. Consequently, to monitor all these interactions in real time, we need to handle such volumes of data flawlessly while also providing a superior user experience to our stakeholders. This article takes you on a journey through marketplace monitoring at GOGOX, describes the issues we have run into and outlines our future plans.

What and Why?

This is how we started

Randomised data as well :’)

When I first joined, we used the above heatmap to monitor our marketplace in “(quasi)real-time”. This tool was used on a daily basis by the Ops & Customer Service teams to monitor our orders & drivers and react ad-hoc to any issues that might occur.

However, as we scaled, we started to experience more and more issues with that approach:

  1. Driver locations lagging by a few minutes.
  2. Pending orders still being shown after they had already been assigned to drivers.
  3. Heavy workloads being put on the databases.
  4. The map crashing due to too many data points.
  5. The tool being quite static, only allowing for ad-hoc reactions to anomalies in our ecosystem.

Embarrassing: me using my non-existent design skills to create the first ever mockup of the ODB UI. Randomised data.

While points 1–3 were mainly caused by our legacy infrastructure at the time (a single big relational database holding all the data), points 4–5 stemmed from limitations of the tool’s design and justified the decision to build a new tool.

First Step — GOGOTRACK

Our first ever GOGOTRACK design diagram

Our first goal was to move away from relying on the production replica database, which had been one of the primary points of failure in the past, both because of the sheer volume of data that had to be queried and because of the replica lag.

Since at that time we already had three different systems (GOGOVAN, GOGODELIVERY & GOGOVAN KR), the idea was simple: bring all these systems to a common denominator, focusing on user actions rather than on how they are processed on the backend.

Thus, GOGOTRACK, our unified events schema was born!

Why such an approach?

First trouble — it’s hard to send some events

The first issue we ran into as we started discussing GOGOTRACK with the Platform Team was the fact that some data we needed was not readily available in the schemas (at least not without performing expensive joins).

Thus, being pragmatic, we stripped our requirements down to a POC containing only the following two events:

  • DRIVER_LOCATION_UPDATED — containing lat/lon location information about our drivers.
  • ORDER_CREATED — containing information available when the order is created (such as pickup/destination).

(both have quite descriptive names :) )
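For illustration, here is a minimal sketch of what these two events could look like as Python dataclasses; the field names are assumptions made for the sake of the example, not the actual GOGOTRACK schema.

```python
# Illustrative sketch only: field names are assumptions, not the real schema.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DriverLocationUpdated:
    driver_id: str
    lat: float
    lon: float
    recorded_at: datetime
    event_type: str = "DRIVER_LOCATION_UPDATED"


@dataclass
class OrderCreated:
    order_id: str
    pickup_lat: float
    pickup_lon: float
    destination_lat: float
    destination_lon: float
    created_at: datetime
    event_type: str = "ORDER_CREATED"


# Example instance for a driver ping in Hong Kong.
ping = DriverLocationUpdated(
    driver_id="driver-123",
    lat=22.3193,
    lon=114.1694,
    recorded_at=datetime.now(timezone.utc),
)
```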

It already meant that:

  1. We no longer have to query for new orders: the database would now only be used to track changes in order status.
  2. We no longer have to constantly poll for new driver locations (with 8m registered drivers, that is a lot of data to check for), as these are now also sent to GOGOTRACK.

That alone meant getting rid of 80–90% of the database workload and, combined with the approach described below, allowed us to level up our real-time analytics.

Unfortunately, that was not enough to receive the complete order funnel — but it was still a good start for only 2 weeks of dev work ;)

Enhancing the data

Having started to receive these two events, we faced a new challenge:

How do we get the complete order funnel in our pipelines?

We ended up creating stream processors that enhanced each of these events and created new ones (covering the GOGOTRACK events we could not implement directly in production, all still following the GOGOTRACK schema). That way, we still removed some huge workloads from the database (such as constantly checking for new orders) and simply queried for record updates: the latency immediately went down from a couple of minutes to 1–5 seconds! We quickly managed to find a solution that worked and removed our basic bottlenecks.
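As a rough sketch of the idea (not our production code), deriving a funnel event for a single, already-known order could look something like this; the status names and the fetch_order_status helper are hypothetical:

```python
# Hypothetical sketch: instead of scanning the whole orders table for changes,
# we only look up the records of orders we already know about and turn any
# status change into a GOGOTRACK-style event.
def derive_order_status_event(order_id, last_known_status, fetch_order_status):
    """fetch_order_status is assumed to be a cheap primary-key lookup."""
    current_status = fetch_order_status(order_id)
    if current_status == last_known_status:
        return None  # nothing changed, nothing to emit

    return {
        "event_type": f"ORDER_{current_status.upper()}",  # e.g. ORDER_ASSIGNED
        "order_id": order_id,
        "previous_status": last_known_status,
        "status": current_status,
    }


# Example: pretend the order just moved from "pending" to "assigned".
event = derive_order_status_event("O-123", "pending", lambda _: "assigned")
print(event)  # {'event_type': 'ORDER_ASSIGNED', ...}
```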

Yet we also planned how we should transition it in the future:

Solution for the future — mediation layer

Our first design diagram of the analytics mediation layer

What we did find out along the way is that there will always be differences between what matters to the Data team and what matters to the Platform team from the transactional perspective. Thus, going forward, we plan on having a so-called mediation layer, with the idea being simple:

The Platform team should be able to dump the data in there in whatever schema they use; the mediators would then transform those events (or perform joins) into the GOGOTRACK format and pass them on, all with the minimum latency possible.
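Conceptually, a mediator is just a small translation function per source system and event type. A hypothetical sketch, with field names on both sides assumed purely for illustration:

```python
# Hypothetical mediator: translate a platform-specific payload into the
# GOGOTRACK format. Field names on both sides are assumptions.
def mediate_order_created(platform_event: dict) -> dict:
    return {
        "event_type": "ORDER_CREATED",
        "order_id": platform_event["id"],
        "pickup_lat": platform_event["pickup"]["latitude"],
        "pickup_lon": platform_event["pickup"]["longitude"],
        "destination_lat": platform_event["dropoff"]["latitude"],
        "destination_lon": platform_event["dropoff"]["longitude"],
        "created_at": platform_event["created_at"],
    }


# One mediator per (source system, event type), all emitting the same schema.
MEDIATORS = {
    ("GOGOVAN", "order_created"): mediate_order_created,
}
```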

Events Processing on the Data Backend

Our Events Processing MVP Infrastructure

All of our events processing could be done more easily (and more simply) with Kafka; however, Kafka is a beast and requires a lot of support, which was hard for a team of our size to provide at the time (recently, however, we have launched our first Kafka cluster 🎊). Additionally, a hosted solution costing tens of thousands of USD was simply not acceptable for a brand-new, unproven project.

So we did what we always strive to do: we pushed for pragmatism and kept it simple. Bear in mind that while we know there are better options out there, this was the best and easiest solution to achieve our goals with the size of the team at the time (2 Data Engineers).

All of the GOGOTRACK events are dumped to our Kinesis stream.

These are then fanned out based on the event type to our Redis pub/sub infrastructure (one “topic”, i.e. one Redis channel, per event type).
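A minimal sketch of that fan-out step, assuming a single-shard stream called gogotrack and one Redis channel per event type (stream, shard and channel names are illustrative):

```python
# Illustrative fan-out worker: read GOGOTRACK events from Kinesis and publish
# each one to a Redis pub/sub channel named after its event type.
import json
import time

import boto3
import redis

kinesis = boto3.client("kinesis")
r = redis.Redis()

shard_iterator = kinesis.get_shard_iterator(
    StreamName="gogotrack",              # assumed stream name
    ShardId="shardId-000000000000",      # single-shard assumption
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    response = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    for record in response["Records"]:
        event = json.loads(record["Data"])
        # e.g. channel "ORDER_CREATED" or "DRIVER_LOCATION_UPDATED"
        r.publish(event["event_type"], json.dumps(event))
    shard_iterator = response["NextShardIterator"]
    time.sleep(1)  # be gentle with the Kinesis read limits
```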

We also have RxPY processors (with Redis-backed state) for Complex Event Processing and for enhancing these events with more data from the database.
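For a flavour of what such a processor could look like, here is a minimal RxPY-style sketch (using the current reactivex package) that keeps a per-region pending-orders counter in Redis and emits an enriched event; the channel, key and field names are assumptions:

```python
# Illustrative RxPY processor: consume ORDER_CREATED events from Redis pub/sub,
# keep simple state in Redis, and publish an enriched event back out.
import json

import redis
import reactivex as rx
from reactivex import operators as ops

r = redis.Redis()


def order_created_stream():
    """Yield ORDER_CREATED events from the Redis pub/sub channel."""
    pubsub = r.pubsub()
    pubsub.subscribe("ORDER_CREATED")
    for message in pubsub.listen():
        if message["type"] == "message":
            yield json.loads(message["data"])


def enrich(event):
    # State lives in Redis, so the processor can restart without losing counters.
    region = event.get("region", "unknown")
    pending = r.hincrby("pending_orders_by_region", region, 1)
    return {**event, "event_type": "PENDING_ORDERS_UPDATED", "pending_in_region": pending}


rx.from_iterable(order_created_stream()).pipe(
    ops.map(enrich),
).subscribe(lambda enriched: r.publish("PENDING_ORDERS_UPDATED", json.dumps(enriched)))
```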

All of the final events are then sent to our backend using socket.io.

Why have we chosen such technologies?

Their main advantage: time to go live. It took us ~2–3 weeks to get everything up and running, at minimal cost!

Anyone who has ever done a tiny bit of Reactive Programming (learning ReactJS was worth it after all!), used Redis as a cache and set up socket.io could easily achieve the same.

That also meant we could quickly just move on to validating and iterating on the main component — Ops Data Brain itself.

User Interface — Ops Data Brain 🧠

Around the time we started considering deprecating the old heatmap and moving on to a more advanced solution, the Data Viz team at Uber released a beta of Kepler.gl, their in-house geodata visualisation tool. After a few quick POCs, we fell in love with it and decided to adapt it to our use case.

Embarrassing v2: the first design mockup of ODB I made after we decided to use Kepler.gl, still visualising some random NYC data. Yet already not that far from how it looks now :) Randomised data

Our first problem was the fact that Kepler was designed mainly to work with static data; the idea was simply to upload a CSV file you want to visualise.

As the first (and very inefficient) iteration, we thus ended up emitting the whole dataset every few seconds via socket.io to our frontend, which then visualised it. Surprisingly (and mainly thanks to how well Kepler handles large datasets), it worked well, giving the user a comfortable experience: you could not tell that the dataset was being reloaded; rather, it looked as if each individual data point was being updated, which is exactly the experience we had aimed for from the start.
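For illustration, that emit-everything-every-few-seconds approach could be sketched with python-socketio roughly as follows; the event name, the 5-second interval and the load_current_snapshot() helper are assumptions rather than our actual implementation:

```python
# Illustrative server-side sketch: periodically push the full snapshot to all
# connected Ops Data Brain clients over socket.io.
import eventlet
import socketio

sio = socketio.Server(cors_allowed_origins="*")
app = socketio.WSGIApp(sio)


def load_current_snapshot():
    # Assumed helper: would return all active drivers and pending orders.
    return {"drivers": [], "pending_orders": []}


def push_loop():
    while True:
        # Re-send the complete dataset; Kepler.gl on the frontend re-renders it
        # smoothly enough that it looks like individual points are updating.
        sio.emit("marketplace_snapshot", load_current_snapshot())
        sio.sleep(5)


if __name__ == "__main__":
    sio.start_background_task(push_loop)
    eventlet.wsgi.server(eventlet.listen(("", 5000)), app)
```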

Screenshot of final Beta Version of Ops Data Brain (Randomised data)

Our primary goal was to keep all the functionality of the old heatmap, while also making the UX more seamless and allowing for more advanced analytical use cases, as well as data-driven recommendations.

Let me walk you through Ops Data Brain components step-by-step:

KPIs Panel

Randomised data

This panel provides a general overview of how we are doing right now. It is particularly useful during peak-demand days, when the Ops team quickly needs to know what the current performance bottlenecks are. It also helps us quickly spot and investigate anomalies, as well as track the impact of downtime and inform our reactions to it.

Supply Lines

Randomised data

Supply Lines give us a general understanding of how our driver levels have varied in the last 24 hours and whether there are any gaps. They are also useful when there are issues with some part of our driver app, as it is easy to then see at which point something fails (for example, drivers might be able to select orders, but not accept them).

Unmet Demand Sparklines

Randomised data

As described in our previous article on Unmet Demand, these are used by our Operations team to understand where we will need more drivers in the next hour: in fact, in the past we ran experiments on automatically sending push notifications to nearby drivers, encouraging them to move to these regions!

Visualisations

Randomised data

This is the fanciest bit! Powered by Kepler.gl, we provide a visualisation of all of our active drivers and pending orders, along with a heatmap of both. This is very useful for our Customer Service team, as it allows them to quickly spot long-pending orders, find idle drivers nearby and reach out to them right away to encourage them to take the order, sometimes offering incentives. We also use it to monitor our GOGOBUSINESS drivers.

Recommendation Feed

Randomised data

For me personally, this is the most exciting part of Ops Data Brain (although it does not look the part). Under the hood, it is powered by our Machine Learning models, which quickly send recommendations as to what actions should be taken next.

It was our stepping stone to introducing Machine Learning models into semi-production, something we like to call Human-Supervised Machine Learning.

The idea was simple: the CS & Ops teams react to these recommendation messages, reason about whether they are sensible or not and, if so, execute the recommended actions.
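To make that concrete, a recommendation message and the human feedback recorded against it could look roughly like this; every field name here is an illustrative assumption, not the actual Ops Data Brain format:

```python
# Illustrative shapes only: not the actual Ops Data Brain message format.
recommendation = {
    "recommendation_id": "rec-42",
    "model": "driver_assignment",  # which ML model produced the suggestion
    "message": "Assign order O-123 to driver D-456 (idle, ~400 m from pickup)",
    "suggested_action": {"order_id": "O-123", "driver_id": "D-456"},
}

# Recorded once a CS/Ops person has reviewed the recommendation.
feedback = {
    "recommendation_id": "rec-42",
    "accepted": True,                       # did the human judge it sensible?
    "executed_at": "2020-01-01T12:00:00Z",  # when the action was carried out
}
```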

Such an approach allowed us to quickly validate a lot of our ideas without risking a bad user experience. It helped us nail down issues early on and paved the way for our first fully productionised in-app Machine Learning model: JARVIS, our driver order assignment engine.

Moreover, it also laid the groundwork for GOGOML, our Machine Learning serving layer.

How do Teams use Ops Data Brain?

Given that our business is on-demand logistics, it is essential that teams in our company understand what is happening to logistics supply and demand in real time, as well as observe how the marketplace changes over time.

As I’ve already mentioned, Ops Data Brain is used across teams, but each one of them finds it useful for different reasons, which I’ll explain now :)

Customer Service

They are the team this was built for in the first place: the use case is to be able to quickly find long-pending orders and idle drivers, as well as to locate drivers when our clients reach out to us with enquiries.

Additionally, they use some of the recommendations in the feed to assist them in the actions they take: for example, deciding which driver to assign an order to.

Operations

The Ops team relies heavily on Ops Data Brain, mainly during our peak-demand days, as it helps them assess the current situation in our ecosystem and decide what actions to take to drive our KPIs.

Product

The Product team uses it on an ad-hoc basis to find out information about our product, for example the number of active drivers in the last 24 hours, or to answer questions regarding the typical number of pending orders at a given hour in a given region.

Senior Management Team

They use it in a similar fashion to the Product team: to answer ad-hoc queries and monitor our marketplaces.

Data

We mainly use it as the test platform for our new ideas: most of the data-driven recommendations first end up in Ops Data Brain, and we then work with Ops/CS to quickly validate them before deploying in-app. Additionally, Ops Data Brain has proved essential for monitoring purposes, as well as for understanding our ecosystem better. We also use it to support the Engineering team during app outages and to monitor the recovery.

Next Steps

That was a rather long introduction to our real-time interactive heatmap, Ops Data Brain. We have come a long way, but we are nowhere close to being done; some of our future projects include:

  • Transitioning to Kafka
  • Better integration with Customer Service Admin Panel
  • More data-driven recommendations and experiments

If you are interested in the research work we have done, please read our articles on Route Optimisation here and on predicting Unmet Demand here.

If you want to find out more about our Data Team, please see our Head of Data’s article here.

We are always looking for top-notch Applied Operations and ML Research talent. Please get in touch if interested! (Onsite and remote)
