Migration from an offline to a real time capacity engine

Sourabh Goyal
Urban Company – Engineering
5 min read · Jul 17, 2019


One of the many problems we are solving at UC is predicting and reserving our (UC Partner) capacity for incoming requests. Optimising this capacity allocation helps us answer difficult questions such as: how many orders should we take for a valid combination of service type, time and location at any given moment? Answering such questions involves calculating complex predictive metrics from multiple signals: partner availability, skills, current inventory, preferences, previous history, etc. If we are too conservative in our predictions and allocations, we take fewer orders and lose revenue; if we are too aggressive, we take more orders than we can deliver, resulting in a bad customer experience. Like most things in life, it’s a trade-off problem.

Before digging into the migration of the engine, let’s shed some light on how the offline engine worked and what its limitations were.

Our first capacity engine was a simple event-based system in which events were consumed by other subsystems to trigger updates to a few counters. In simple terms, it can be compared to a table maintaining counters for the number of partners, the number of eligible partners, requests taken, and the impact of requests taken in nearby time slots. Here’s what a typical table would look like:

Whenever we had to calculate capacity for a given location and time, the system could query this table and find the available capacity using a simple arithmetic formula.

Capacity = num_available_partners - num_request_taken - num_impact
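As a rough illustration, the counter table and the formula can be sketched in a few lines of Python. The three field names come from the formula above; the `SlotCounters` class, the key layout, the `on_request_taken` handler, the sample numbers, and the clamping at zero are all assumptions made for this sketch, not the original schema.

```python
from dataclasses import dataclass


@dataclass
class SlotCounters:
    """Counters for one (service type, locality, time slot) row. Hypothetical schema."""
    num_available_partners: int = 0
    num_request_taken: int = 0
    num_impact: int = 0  # impact from requests taken in nearby time slots

    def capacity(self) -> int:
        # Capacity = num_available_partners - num_request_taken - num_impact
        # Clamping at zero is an assumption of this sketch.
        remaining = (self.num_available_partners
                     - self.num_request_taken
                     - self.num_impact)
        return max(0, remaining)


# One row of the counter table, keyed by (service type, locality, slot).
# The key and the numbers are made up for illustration.
KEY = ("salon", "locality_a", "2019-07-17T10:00")
table = {KEY: SlotCounters(num_available_partners=10,
                           num_request_taken=6,
                           num_impact=2)}


def on_request_taken(key) -> None:
    """Event-style trigger: a new request bumps a counter on the matching row."""
    table[key].num_request_taken += 1


print(table[KEY].capacity())  # capacity before the new request
on_request_taken(KEY)
print(table[KEY].capacity())  # capacity after the new request
```

Every new signal in such a design means another column and another trigger, which is where the maintenance cost comes from.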

This system served us well for almost a year, as we just kept adding columns and triggers and updating the formula. However, we reached a point where it was no longer easy to maintain these triggers, nor was it possible for a developer to tell whether the counters were being updated correctly. As UrbanClap was growing rapidly, it became extremely important to understand why we could or couldn’t take demand for a particular time. Explaining this was nearly impossible, even after logging each and every update in the system. All our efforts went into fixing these triggers, periodically recalculating counters offline, etc., while we remained totally unaware of a much bigger problem.

The problem that led us to rethink our capacity system was something else: the system cared only about the absolute number of jobs taken and remained agnostic of partner availability. Our capacity formula had two parts: (a) total partners tagged in the locality, and (b) total orders received in the locality. However, partners from nearby localities could also pick up such orders, especially if partners in the current locality didn’t respond soon enough. This caused an unintended outcome: we would often find certain areas blocked from taking any new orders while partners in those areas were still available, without any jobs. We realised the impact of this problem only during a marketing campaign, when the additional supply on-boarded to fulfil the expected surge in demand remained idle. We realised that we needed a better system, one that could calculate capacity on the basis of real-time partner availability.
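A small, made-up example shows how a locality can look full on paper while its own partners sit idle. The locality names, partner ids, and numbers below are hypothetical and only illustrate the mismatch described above.

```python
# Hypothetical data: partners tagged per locality and orders received per locality.
tagged_partners = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
orders_received = {"A": 3, "B": 0}

# Who actually picked up the three orders placed in locality A:
# one local partner and two partners from the nearby locality B.
served_by = ["a1", "b1", "b2"]


def counter_capacity(locality: str) -> int:
    """Counter-based view: tagged partners minus orders received."""
    return len(tagged_partners[locality]) - orders_received[locality]


def idle_partners(locality: str) -> list:
    """Availability-based view: tagged partners who hold no job."""
    busy = set(served_by)
    return [p for p in tagged_partners[locality] if p not in busy]


print(counter_capacity("A"))  # the counters say A has no room left
print(idle_partners("A"))     # yet these partners in A have no job
```

Here the counter-based view blocks locality A, while the availability-based view still finds two free partners there.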

To do this in the shortest time, we decided to tweak our matchmaking system to do the required task. The role of the matchmaking system is to find the best professional who can serve an order. To do so, we match orders with professionals based on the time of the order, location, skill, real-time location of the partner, ratings of the partner, etc.

At the time, our matchmaking system had three stages:

  1. Partner search: Mechanism to search for partners who have the availability (location and time) and the ability (based on skills, specialisation, etc.) to serve the request.
  2. Preference engine: After partners are found through ‘partner search’, this mechanism filters out partners based on partner and customer preferences and their past interactions on the platform.
  3. Fanout & scheduling mechanism: Engine to schedule and strategise matchmaking for an order, and to send leads and communications to the selected partners.
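The three stages above can be pictured as a simple pipeline. The following is only a minimal sketch: the `Partner` and `Order` fields, the function names, the `blocked_customers` preference rule, and the batch size are assumptions for illustration and do not reflect the actual matchmaking code.

```python
from dataclasses import dataclass, field


@dataclass
class Partner:
    partner_id: str
    skills: set
    localities: set
    free_slots: set
    blocked_customers: set = field(default_factory=set)  # hypothetical preference signal


@dataclass
class Order:
    customer_id: str
    skill: str
    locality: str
    slot: str


def partner_search(order, partners):
    """Stage 1: keep partners with availability (location, time) and ability (skill)."""
    return [p for p in partners
            if order.locality in p.localities
            and order.slot in p.free_slots
            and order.skill in p.skills]


def preference_engine(order, candidates):
    """Stage 2: filter out candidates ruled out by preferences or past interactions."""
    return [p for p in candidates if order.customer_id not in p.blocked_customers]


def fanout(order, candidates, batch_size=2):
    """Stage 3: choose which partners receive the lead in the first batch."""
    return [p.partner_id for p in candidates[:batch_size]]


def matchmake(order, partners):
    return fanout(order, preference_engine(order, partner_search(order, partners)))


partners = [
    Partner("p1", {"ac_repair"}, {"A"}, {"10:00"}),
    Partner("p2", {"ac_repair"}, {"A"}, {"10:00"}, blocked_customers={"c1"}),
    Partner("p3", {"salon"}, {"A"}, {"10:00"}),
]
order = Order("c1", "ac_repair", "A", "10:00")
print(matchmake(order, partners))
```

In this sample, p3 drops out at search (wrong skill) and p2 at the preference stage, so only p1 receives the lead.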

We started extending the Search and Preference engines to create a module that could serve our purpose. The matchmaking system was, by design, an offline system that worked asynchronously. Because of this, the Search and Preference engines were not optimised to perform fast in real time. They were also designed to compute for a single time and location at a time, so reusing them meant running all the required requests in parallel, causing sudden surges in throughput and server resource usage. So, just to be safe and to release fast (like really fast), we tweaked a few pieces and introduced this beast only at the final capacity calculation. To minimise the impact on the matchmaking infrastructure, we decided to search for available partners only when the capacity for a time slot was calculated as full. If at least one partner was available for that time slot, we reopened the slot.
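The reopening rule can be sketched roughly as follows. The function name `is_slot_open` and the callable `find_available_partners` are hypothetical stand-ins for the counter check and the extended Search and Preference engines; the point of the sketch is only that the expensive search runs when the counters report a full slot, and not otherwise.

```python
from typing import Callable, Iterable


def is_slot_open(counter_capacity: int,
                 find_available_partners: Callable[[], Iterable]) -> bool:
    """Hybrid check: trust the counters first, search only when they say 'full'."""
    if counter_capacity > 0:
        # The counters still show room, so the matchmaking engines are not touched.
        return True
    # The slot is full according to the counters: fall back to a real-time search
    # and reopen the slot if at least one partner is actually available.
    return any(True for _ in find_available_partners())


# Usage with stand-in search functions (assumptions for this sketch).
search_calls = []


def search_with_one_partner():
    search_calls.append("called")
    return ["partner_42"]


print(is_slot_open(3, search_with_one_partner))  # open via counters, no search
print(is_slot_open(0, search_with_one_partner))  # reopened by the search
print(is_slot_open(0, lambda: []))               # stays closed
```

Passing the search as a callable keeps it lazy, which mirrors the goal of adding as little load as possible to the matchmaking infrastructure.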

The new hybrid system worked. We were able to take 10–15% more requests than before, and the average utilisation of partners improved. But due to inefficiencies in the system, we could not roll it out in all of our categories. While this proof of concept was attracting attention from other category heads, we were figuring out how much effort it would take to make the system scalable. That effort took us through a journey of surgically removing inefficiencies and putting more robust solutions in place, without hampering a core, sensitive part of the UrbanClap marketplace.


Sourabh Goyal
Urban Company – Engineering

Engineering Manager @ UrbanCompany | Ex Senior Engineer, Carrier Commercial HVAC systems