UC’s journey of 5X API performance improvement

By — Shashank Chaudhary (Engineer, Supply Team)

UC Blogger
Urban Company – Engineering
7 min read · Dec 2, 2022


At UC, our Microservices platform is our legacy and is one of the most stable platforms you will come across. This stability is achieved through:

  • Standardisation of the tech stack and platform
  • Very robust alerting and monitoring available in the platform
  • Paranoia in ensuring each and every alert is attended to within minutes

However, there was one area that needed to be addressed. When we benchmarked our APIs, we realised our latencies could improve. This blog chronicles our journey in bringing about a 5x improvement in our partner-facing API latencies.

Most of the solutions explored in this blog are simple but very effective. We believe many startups and young teams will find them relevant and should be able to deploy similar optimisations.

How do we measure API performance at UC:

All API performance is tracked through Grafana dashboards, which our platform provides by default. The metric we are usually interested in is the P95 latency of our APIs.

So what was the performance optimisation that was needed:

The average of P95 latencies across our partner-facing APIs was high. Our issue was not high throughput or scale; in fact, we did fairly well in our performance tests, and all APIs scaled well at 2x and 3x throughput. This ruled out the usual suspect:

Infra was not a bottleneck. We are on AWS and our containers were scaling seamlessly with higher throughput (and higher CPU utilisation). All parameters like CPU utilisation, memory utilisation, IO ops, etc. were under control.

This got us thinking…

Our thought process and areas of attack

Any API’s latency is the sum of:

  • Total time spent on computation within the microservice
  • Time spent in all serial external-service RPC calls
  • Time spent in all database calls

We had problems across all of these.

1. Database performance issues:

Our primary databases are largely MongoDB (and in some cases MySQL for transactional use cases). For some queries, the DB latencies were high. We analysed these queries and found they were filtering collections on fields that were missing indexes.

The data growth in these collections/tables was gradual and hence latencies increased over time.

(Figure: an API becoming slower over time as the data size behind it grows.)
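As a rough sketch of what the fix looked like in practice, here is how a missing index can be spotted and added with the MongoDB Node.js driver. The database, collection, and field names (supply, reviews, partnerId, createdAt) are illustrative assumptions, not our actual schema.

```typescript
// Illustrative sketch with the MongoDB Node.js driver; db/collection/field
// names are hypothetical, not UC's actual schema.
import { MongoClient } from "mongodb";

async function fixMissingIndex(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  const reviews = client.db("supply").collection("reviews");

  // Inspect how an existing query executes. A COLLSCAN stage in the winning
  // plan means the whole collection is being scanned.
  const plan = await reviews.find({ partnerId: 12345 }).explain();
  console.log(JSON.stringify(plan.queryPlanner?.winningPlan, null, 2));

  // Add the missing index so the same query becomes an index scan (IXSCAN).
  await reviews.createIndex({ partnerId: 1, createdAt: -1 });

  await client.close();
}
```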

To catch such issues going forward, our platform team has put in place a dashboard that alerts us whenever a production query runs for longer than a threshold; such queries become candidates for indexing.
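The post does not describe how that platform dashboard is built, but as one possible sketch, MongoDB's built-in database profiler can record queries slower than a threshold, which a periodic job can then scrape for alerting. The 250ms threshold and database name below are illustrative assumptions.

```typescript
// One possible way to surface slow queries: MongoDB's database profiler.
// Threshold and database name are illustrative assumptions.
import { MongoClient } from "mongodb";

async function findSlowQueries(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db("supply");

  // Profiling level 1 records only operations slower than `slowms`.
  await db.command({ profile: 1, slowms: 250 });

  // Slow operations land in the system.profile collection, which a periodic
  // job can read and push into a dashboard / alerting pipeline.
  const slowOps = await db
    .collection("system.profile")
    .find({ millis: { $gt: 250 } })
    .sort({ ts: -1 })
    .limit(10)
    .toArray();
  console.log(slowOps);

  await client.close();
}
```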

However, indexing is not an elixir; it has a cost. Indexes need extra memory (RAM) and disk, and they add write overhead: the more indexes you have, the more work the DB has to do to update them whenever you insert/update/delete records in a table. This impacts write performance and overall DB performance, so indexes should be used judiciously.

Sometimes indexes are just not enough. No matter how many indexes you add, they may not help if the columns/fields have low cardinality. Example: at UC, we now have 50+ million reviews. When querying reviews for a given partner, even with an index on partner id, the query for some very old partners returned over 10K reviews and ended up choking our DB.

This is where archival comes in. We are exploring ways to archive this data: it is already stored in our data warehouse, so it should be safe to delete from our primary store. However, once we archive data, some queries become non-trivial and require extra effort from the developers’ end.

Continuing with the reviews example, we need the count of reviews for a partner. With archival, we have to maintain an aggregate collection where the aggregate (the total count of reviews) is updated whenever a rating is inserted into the system, so that we know this value even after archival and need not query across multiple stores.
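A minimal sketch of that write-path aggregate, assuming hypothetical reviews and review_counters collections:

```typescript
// Minimal sketch: bump a per-partner aggregate at write time so the total
// review count survives archival of old rows. Collection names are hypothetical.
import { Collection } from "mongodb";

interface Review {
  partnerId: number;
  rating: number;
  createdAt: Date;
}

interface ReviewCounter {
  partnerId: number;
  totalReviews: number;
}

async function addReview(
  reviews: Collection<Review>,
  counters: Collection<ReviewCounter>,
  review: Review
): Promise<void> {
  await reviews.insertOne(review);

  // Upsert the aggregate so "how many reviews does this partner have?" never
  // needs to fan out across the live store and the archive.
  await counters.updateOne(
    { partnerId: review.partnerId },
    { $inc: { totalReviews: 1 } },
    { upsert: true }
  );
}
```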

2. Cost of experimentation and configurations:

UC as a startup moves fast. We try out multiple flows and features, experiment, and learn from them. But experiments came at a cost, especially because we controlled them from a central microservice: the Experiment service.

This cost hit us harder because we hadn’t sunset legacy flows and experiments: their execution stayed on in the code even after they were no longer relevant to the business. When we removed such stale experiments from our codebase, we saw a 40% drop in the response time of the service (across APIs).

Then we came to the age-old conundrum of choosing between a dynamic configuration and a static constant in code. Examples: cities in which a feature was live, rating thresholds for retraining, insurance amounts, etc.

Fetching configurations had an extra cost since we fetched them from a central service, and we were fetching too many configs. To improve performance, we took a call to move configs out to constants (a small sketch follows the list below) if:

  • the config was updated rarely (once in a few months, or at most once in a few weeks), and
  • we were OK with changes reflecting only after the next deployment rather than immediately.
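As a tiny illustration of the trade-off (the service name, keys, and values here are made up):

```typescript
// Before: every request paid a network hop to a central config service, e.g.
//   const liveCities = await configService.get("feature_x_live_cities");
// (configService is a hypothetical client, shown only for contrast.)

// After: rarely-changing values live as constants and change with the next deploy.
export const FEATURE_X_LIVE_CITIES = ["bangalore", "delhi", "mumbai"];
export const RETRAINING_RATING_THRESHOLD = 4.0;

export function isFeatureXLive(city: string): boolean {
  return FEATURE_X_LIVE_CITIES.includes(city.toLowerCase());
}
```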

Finally, we came to the biggest optimisation in experimentation that our platform team delivered. Was it really necessary to serve experiment-related data from a central Experiment microservice? The central experiment service was serving at very high throughput, was using the most resources, and was adding to the p95 latency of every service. It was also a single point of failure.

This is when our platform team took a pivotal call to move away from a central experimentation service to a library, called within each service, that reads from a local application cache. The experiment config is now populated locally from the central experimentation service asynchronously, whenever experiments are updated. This saves the inter-microservice network calls, yet, being a library, it still provides standard support for using experimentation easily. The change led to a 40% decrease in p95 latencies across most UC services.
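Our library’s actual interface isn’t covered in this post, but the pattern looks roughly like this sketch: an in-process cache refreshed asynchronously, so reads on the request path never leave the service. Here, fetchAllExperiments is a hypothetical stand-in for whatever poll/push mechanism syncs from the central service.

```typescript
// A sketch of the pattern, not UC's actual library: experiment config lives in
// an in-process cache and is refreshed asynchronously, so request-path reads
// never make a network call.
type ExperimentConfig = { name: string; enabled: boolean; variant: string };

class ExperimentClient {
  private cache = new Map<string, ExperimentConfig>();

  constructor(
    private fetchAllExperiments: () => Promise<ExperimentConfig[]>, // hypothetical sync hook
    refreshMs = 30_000
  ) {
    // Initial load plus periodic async refresh.
    void this.refresh();
    setInterval(() => void this.refresh(), refreshMs).unref();
  }

  private async refresh(): Promise<void> {
    try {
      const experiments = await this.fetchAllExperiments();
      this.cache = new Map(experiments.map((e) => [e.name, e]));
    } catch {
      // Keep serving the last known config if the central service is unreachable.
    }
  }

  // Synchronous read: no network call, no added p95 latency on the request path.
  getVariant(name: string): string | undefined {
    return this.cache.get(name)?.variant;
  }
}
```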

3. Synchronous (Blocking) Heavy Computation:

Use case 1: We found many high-throughput read (GET) APIs that were being called with the same parameters and returning the same response. These APIs had fairly high latencies (~500ms p95) because they were doing a lot of transformation on every call. Example: the APIs serving the “Recent updates” section of the partner home page, where updates are aggregated from the Meeting service (new meetings) and the Notification service (new notifications). We had two options here:

Option 1: Cache the API response with a TTL. This is a simple solution but can lead to stale data (if something changes during the TTL).
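A minimal in-memory sketch of Option 1 (the key format and the 60-second TTL are illustrative; a shared cache such as Redis would follow the same shape):

```typescript
// Minimal read-through TTL cache for a read-heavy GET endpoint (Option 1).
const TTL_MS = 60_000;
const cache = new Map<string, { value: unknown; expiresAt: number }>();

async function getRecentUpdates(
  partnerId: string,
  compute: (partnerId: string) => Promise<unknown> // the expensive aggregation/transforms
): Promise<unknown> {
  const key = `recent-updates:${partnerId}`;
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value; // served from cache; may be stale for up to TTL_MS
  }
  const value = await compute(partnerId);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```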

Option 2: Listen to state-transition events of the base entities and keep the transformed, aggregated data ready (computed asynchronously) for the GET API to consume. However, this approach introduces engineering complexity. Example: instead of calling the Meetings and Notifications microservices every time the home page is loaded, the Homepage aggregator service can listen to events from those services and keep the homepage response precomputed.
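And a rough sketch of Option 2, where the aggregator consumes events and keeps the response ready (the event shapes and handler wiring are hypothetical):

```typescript
// Sketch of Option 2: precompute the aggregate on events, read it on GET.
type HomepageUpdates = { meetings: string[]; notifications: string[] };

const precomputed = new Map<string, HomepageUpdates>();

// Handler for a hypothetical "meeting created" event from the Meetings service.
function onMeetingCreated(event: { partnerId: string; title: string }): void {
  const current =
    precomputed.get(event.partnerId) ?? { meetings: [], notifications: [] };
  current.meetings.push(event.title);
  precomputed.set(event.partnerId, current);
}

// The GET API now just returns the ready-made aggregate; no fan-out RPCs to
// the Meetings/Notifications services on the request path.
function getHomepageUpdates(partnerId: string): HomepageUpdates {
  return precomputed.get(partnerId) ?? { meetings: [], notifications: [] };
}
```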

So finally, we took a call on a case-by-case basis: stale reads in some places, and consistent reads with the more complex solve in others.

Use case 2: We had some Write (POST) APIs that had extremely high latencies because processing was happening synchronously.

Example: Our partners uploaded thermometer images, and we were validating those images (OCR followed by validation) synchronously.

In many cases, we made the processing in these write (POST) APIs asynchronous: we acknowledge the user, then process the request via a queue, thereby delinking the user and the associated connection from the processing, and notify the user once the processing completes.
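A sketch of that acknowledge-then-process flow, assuming an Express service and a generic queue client (the route, job shape, and Queue interface are all hypothetical, not our actual stack):

```typescript
// Sketch: acknowledge the upload immediately, push the heavy work to a queue.
import express from "express";
import { randomUUID } from "crypto";

interface Queue {
  publish(topic: string, payload: unknown): Promise<void>; // hypothetical client
}

export function registerUploadRoute(app: express.Express, queue: Queue): void {
  app.use(express.json());

  app.post("/thermometer-images", async (req, res) => {
    // 1. Acknowledge immediately: the user and their connection are delinked
    //    from the heavy OCR + validation work.
    const jobId = randomUUID();
    await queue.publish("thermometer-image-validation", {
      jobId,
      imageUrl: req.body.imageUrl,
      partnerId: req.body.partnerId,
    });
    res.status(202).json({ jobId, status: "processing" });
  });

  // 2. A separate worker consumes the queue, runs OCR + validation, and
  //    notifies the partner (e.g. a push notification) once processing completes.
}
```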

Summary

In this blog, we have talked about unindexed and unoptimised DB queries, avoidable network calls and experiments, and finally unnecessary synchronous processing. These are simple, intuitive solutions, but they will be the most useful ones for young startups and teams. Another important thing to note: more often than not, these measures do have trade-offs, whether it is increased cost, engineering complexity, or consistency. Understanding those trade-offs is sometimes more important than blanket attempts to fix performance. Please do reach out if you have any feedback, questions, or alternative thoughts on the points discussed above; we would be happy to engage on them.

The team :

Shashank Chaudhary, who sets the bar high on ownership and owned many of these changes.

Jatin Rungta is the quintessential geek who loves tinkering with our systems and challenging the status quo.

Sourabh Jajoria, a UC engineering veteran who has always championed technical excellence.

Rishabhdhwaj Singh who loves solving complex business problems using tech and is paranoid about System performance.

and the entire Supply Engineering team at UC for their relentless pursuit of tech excellence :)

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

You can read up more about us on our publications —
https://medium.com/uc-design
https://medium.com/uc-engineering
https://medium.com/uc-culture

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
