Establishing robust communication between Micro-services

Pravar Mandloi
Engineering @ Housing/Proptiger/Makaan
4 min read · Dec 26, 2022

--

In this article, we tell the story of how we built robust communication between three micro-services that together handle a payment volume of over ₹1,000 crore (10,00,00,00,000+ INR) a month, with the daily load going up to ₹100 crore (1,00,00,00,000+ INR).

Current Architecture

We have a gateway service, Apollo, which handles all order requests; a payment service, Fortuna, which integrates with the payment gateway; and a database service, Mystique, which validates and stores data in MongoDB.

Apollo handles all the order requests, then communicates with Fortuna and Mystique after order creation.

The order data is stored in both MongoDB and SQL. We use MongoDB because it provides faster queries and better scalability, and SQL because some very complex queries run in our system.

Mystique is the service used for storing data in MongoDB. It is used by many other services and saves data in the same pattern, hence we have a common, separate service for MongoDB.

After order creation, Fortuna interacts with the payment gateway through webhooks and does the processing. Fortuna handles payments, payouts and refunds, each with its own configuration and logic.

Problem Statement

When a user makes a payment, an order is created in Apollo and Mystique, and Apollo then requests Fortuna to initiate the transaction. Fortuna initiates the transaction and waits for webhooks from the payment gateway. Once Fortuna receives a webhook, it processes the event and sends a callback to Apollo. Apollo does the required processing, stores the current state in SQL, and sends a request to Mystique to save the data in MongoDB.

As traffic increased, failures in the callbacks between Fortuna and Apollo increased, mainly due to pessimistic locks and API timeouts from Mystique. Not only did the callbacks fail, but we also faced status disparity between Apollo and Mystique. This disparity further caused retry callbacks to fail because of the configured rules.

Impacts

  • Fortuna: load increased as it retries all the failed callbacks, usually triggered by a cron.
  • Apollo: load increased as it caters to all the retry callbacks from Fortuna.
  • Mystique: load increased as it caters to all the retry requests from Apollo.

Approach

Optimisations were made in Apollo to consume the callbacks efficiently, such as offloading non-critical operations from the API call to a queue, so as to decrease the response time of the callback APIs.
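A minimal sketch of this optimisation, with illustrative names (the actual Apollo handlers and persistence calls are not shown in the article): the callback handler performs only the critical write synchronously and defers the rest to a background worker, so the API can respond quickly.

```python
import queue
import threading

# Hypothetical sketch: the callback handler does only the critical
# write, and defers non-critical work (reconciliation, notifications)
# to a background worker queue so the API responds fast.

background_tasks: "queue.Queue" = queue.Queue()
processed = []  # record of deferred work, for illustration

def worker():
    while True:
        task = background_tasks.get()
        if task is None:              # sentinel to stop the worker
            break
        processed.append(task())      # run deferred work off the request path
        background_tasks.task_done()

def handle_callback(order_id: str, status: str) -> dict:
    # Critical path: persist the status synchronously (stubbed here).
    saved = {"order_id": order_id, "status": status}
    # Non-critical path: defer, so the API returns quickly.
    background_tasks.put(lambda: f"synced {order_id} -> {status}")
    return saved

t = threading.Thread(target=worker, daemon=True)
t.start()
response = handle_callback("ORD-1", "SUCCESS")
background_tasks.join()               # wait for deferred work (demo only)
background_tasks.put(None)
t.join()
```

In production the in-process queue would typically be a durable broker, but the shape of the optimisation is the same: keep the request path short and move everything else behind it.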

A retry mechanism was developed which synced the order status between Mystique and Apollo and then requested Fortuna to resend the callbacks.

The main problem was still to reduce the load at Fortuna’s end. This was done by introducing a hybrid architecture of sync and async communication for sending callbacks: a system involving both message queues and APIs, configurable on the basis of state. By sending callbacks through a queue, threads were freed, since the responsibility for delivering the callback now lies with the message broker. We used RabbitMQ as our message broker. Only urgent callbacks were configured to be sent through APIs; the rest went through the queues.
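The state-based dispatch described above can be sketched as follows. This is an illustrative model, not Fortuna's actual code; the state names and transport stubs are assumptions.

```python
# Hypothetical sketch of the hybrid dispatch: each payment state is
# configured to go either synchronously (direct API call) or
# asynchronously (published to a broker). Names are illustrative.

SYNC_STATES = {"PAYMENT_SUCCESS"}   # urgent: caller needs it immediately
# everything else defaults to the queue

api_calls, queued = [], []

def send_via_api(event):        # stand-in for an HTTP callback to Apollo
    api_calls.append(event)

def publish_to_queue(event):    # stand-in for publishing to RabbitMQ
    queued.append(event)

def dispatch_callback(order_id: str, state: str) -> str:
    event = {"order_id": order_id, "state": state}
    if state in SYNC_STATES:
        send_via_api(event)     # thread blocks until Apollo responds
        return "sync"
    publish_to_queue(event)     # broker owns delivery; thread is free
    return "async"

dispatch_callback("ORD-1", "PAYMENT_SUCCESS")
dispatch_callback("ORD-2", "REFUND_INITIATED")
```

The value of making the transport a per-state configuration is that urgency can be tuned without code changes: promoting a state to synchronous delivery is a one-line config edit.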

Queue Configuration and Architecture

A queue configuration with 0 retries was chosen, as a retry delays the other events waiting in the queue to be consumed. This ensured that no transaction gets delayed waiting for retries to finish.

A dead-letter queue was configured on the main queue: it syncs the transaction that failed in the queue and then retries the callback. We did the order syncing through the dead-letter queue so as to ensure the event eventually succeeds.
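The flow of the last two paragraphs can be sketched with plain in-memory queues (a simulation, not RabbitMQ itself; the handler and order IDs are invented for illustration): the main consumer gives each event exactly one attempt, failed events are dead-lettered, and the dead-letter consumer syncs the order state before retrying the callback.

```python
from collections import deque

# Hypothetical simulation of the topology: main queue with 0 retries,
# failed events dead-lettered; the dead-letter consumer syncs order
# state first, then retries the callback.

main_queue, dead_letter_queue = deque(), deque()
synced, delivered = [], []

def consume_main(handler):
    while main_queue:
        event = main_queue.popleft()
        try:
            handler(event)              # 0 retries: one attempt only,
            delivered.append(event)     # so later events are never delayed
        except Exception:
            dead_letter_queue.append(event)

def consume_dead_letters(handler):
    while dead_letter_queue:
        event = dead_letter_queue.popleft()
        synced.append(event["order_id"])  # sync Apollo/Mystique first
        handler(event)                    # then retry the callback
        delivered.append(event)

attempts = {}
def flaky_handler(event):
    # Fails the first time it sees ORD-2, succeeds afterwards.
    oid = event["order_id"]
    attempts[oid] = attempts.get(oid, 0) + 1
    if oid == "ORD-2" and attempts[oid] == 1:
        raise RuntimeError("transient failure")

main_queue.extend([{"order_id": "ORD-1"}, {"order_id": "ORD-2"}])
consume_main(flaky_handler)
consume_dead_letters(flaky_handler)
```

With RabbitMQ, the equivalent wiring is typically done by declaring the main queue with an `x-dead-letter-exchange` argument so rejected messages are routed to the dead-letter queue by the broker itself.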

We created 6 consumers at Apollo’s end to consume the events rapidly.

Results

  • The average time taken for the transaction to complete reduced by 79%.
  • The error count in Apollo reduced from 2.5k instances a day to fewer than 500. The error count in Mystique reduced from 10k to virtually zero.
  (Figure: error rate trend)
  • All 12,395 previously out-of-sync orders were synced between Apollo and Mystique.
  • Our callback system became robust, with close to 0% failures. The callbacks that did fail were in any case synced by a cron which retries callbacks for the failed orders.

Limitations

This architecture achieved very robust inter-service communication, but it did not guarantee sequentiality between inter-related callbacks.

We came across multiple cases where parallel consumers concurrently consumed callbacks for the same order but with different statuses, leading to dirty reads and hence failing callbacks. Some statuses take more time to process, so the next status gets consumed before the previous one completes.

There were also cases where a callback failed and was retried through the dead-letter queue; within that window, the next callback got processed and failed as well.

Although these cases are very few (less than 0.01%), we still need to create an architecture which solves this issue, keeping in mind the increasing traffic in our system, and makes the communication completely robust.
