Handling Downtime in Payments

By — Aryan Arora (Engineering, Payment Platform)

Apr 11, 2020

The world is switching to cashless transactions, and we all use online payments for something or other. It is simpler and more convenient than handling cash or making change. But while it is simple for the end user, the system that drives it is much more complex. This system is what we call a Payments System.

It involves a myriad of entities that work together to move a payment from one account to another. To simplify this, we have broken the whole payment flow into five levels, which can be seen in the diagram below.

Flow of an online payment

The first point where a transaction initiates is Level 1, the organisation's (Urban Company's) own servers. This is the entry point for any transaction in the system, and from there the flow proceeds through the different levels, depending on the payment method and our own implementation. It is clear from the diagram that if any of the entities fails, it causes a breakage in the chain, eventually cascading into the payment failing at the top. This is what we refer to as downtime in the system: payments start failing, orders stop getting placed, a catastrophic situation.

These payment downtimes are usually specific to a payment method (e.g. Paytm Wallet, HDFC Debit Card, SBI Net Banking), a payment mode (e.g. wallet, card, UPI), a gateway, or, very rarely, the entire gateway aggregator (basically, a gateway composed of multiple gateways).

The bottom line is that downtimes can occur because of issues at any level. It is also interesting to note that a downtime at Level 1 has a higher degree of impact than a downtime at Level 3 (the payment gateway), with the impact lowest closest to Level 5. This is because the fault tolerance of the system increases with the levels: it is proportional to the number of alternate entities available at a level to keep the payment flow successful, or to how many payment flows would be affected if no alternates are available.

Why do we even need to handle downtimes? Why can’t we just ignore them till things are back to normal?

I wish we could ignore them, but we cannot, for the following obvious reasons.

  1. It leads to poor user experience and increased customer frustration.
  2. It leads to failed or lost orders that we could have saved by prompting the user to try an alternative method.
  3. It often increases the inflow of helpline calls.
  4. We wouldn't like any of the above three to come true.

Having explained the problem, let us now dive into how we solved it.

We divided this problem into three parts:

  1. Detection: Accurately catching the downtime when it happens.
  2. Disabling: Disabling the affected payment methods.
  3. Re-enabling: Re-enabling the payment methods when the system is up again.

All these steps are important. If we fail to catch a downtime, it can impact the user experience and the business badly. If we fail to re-enable the payment options on time, online GSV goes for a toss. So we had to come up with a healthy mixture of precision and recall in our model. Please refer to this for more details on these parameters.
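In this context (using the standard definitions), precision is the fraction of the downtimes we flag that are real, and recall is the fraction of the real downtimes that we manage to flag. Low precision means we disable healthy payment methods unnecessarily; low recall means we miss genuine outages and keep showing broken options.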

To explain this clearly, we made the flow diagram below to paint a visual picture.

Detection, creation and application of downtime

We will use this as a reference for the explanation.

Part 1 — Detection

To accurately detect a downtime, we need the relevant data. Let's call this relevant data Transaction logs, as in the diagram above (the flow prefixed with the letter A). Every transaction that goes through the system, failed or successful, is logged here. Each record has all the relevant information: the gateway used, the payment mode, the time of the attempt, the type of card, net banking or wallet used, and so on.
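To make that concrete, here is a minimal sketch (in TypeScript) of what one such record could look like. The field names are illustrative, not our actual schema.

```typescript
// Illustrative shape of a transaction log record; field names are hypothetical.
interface TransactionLog {
  transactionId: string;
  status: 'success' | 'failure';
  gateway: string;                // gateway/aggregator that processed the attempt
  paymentMode: 'card' | 'netbanking' | 'wallet' | 'upi';
  paymentMethod: string;          // e.g. "HDFC Debit Card", "SBI Net Banking"
  attemptedAt: Date;              // time of the attempt
  failureReason?: string;         // set only for failed attempts
}
```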

Now that we have these logs, we need to utilise them. We have a cron job that runs every Y minutes and calculates the success rate for each payment method and gateway separately. If the success rate is down for a payment gateway as a whole, we create downtimes for all the payment methods that go via that gateway (or simply switch to a different gateway where possible). If the success rate for a particular payment method is down, we create a downtime just for that method. Creating a downtime simply means making an entry in our system, the Downtimes data store in the diagram above. While creating this entry, we also send ourselves a notification about the downtime.
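Here is a simplified sketch of what that job could look like, building on the TransactionLog shape above. The window, threshold and helper functions (fetchLogsSince, createDowntime, notifyOnCall) are assumptions for illustration, not our actual values or APIs.

```typescript
// Hypothetical detection job, run every Y minutes by the cron scheduler.
const WINDOW_MINUTES = 15;           // Y: how far back each run looks
const SUCCESS_RATE_THRESHOLD = 0.7;  // below this, we flag a downtime

declare function fetchLogsSince(since: Date): Promise<TransactionLog[]>;
declare function createDowntime(key: string, rate: number): Promise<void>;
declare function notifyOnCall(key: string, rate: number): Promise<void>;

async function detectDowntimes(): Promise<void> {
  const since = new Date(Date.now() - WINDOW_MINUTES * 60 * 1000);
  const logs = await fetchLogsSince(since);

  // Group attempts by (gateway, payment method) and count successes.
  const stats = new Map<string, { total: number; success: number }>();
  for (const log of logs) {
    const key = `${log.gateway}:${log.paymentMethod}`;
    const s = stats.get(key) ?? { total: 0, success: 0 };
    s.total += 1;
    if (log.status === 'success') s.success += 1;
    stats.set(key, s);
  }

  // Flag every combination whose success rate has dropped below the threshold.
  for (const [key, s] of stats) {
    const successRate = s.success / s.total;
    if (successRate < SUCCESS_RATE_THRESHOLD) {
      await createDowntime(key, successRate);  // entry in the Downtimes data store
      await notifyOnCall(key, successRate);    // tell ourselves about it
    }
  }
}
```

The gateway-level check works the same way, just aggregated on the gateway alone instead of on each (gateway, method) pair.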

Part 2 — Disabling

Now that we have all the information about the current downtimes, we need to use it. Whenever a user requests the available payment options, we check our Downtimes data store and disable all the options that are facing a downtime. This flow is prefixed with the letter C in the diagram.
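A sketch of flow C, again with hypothetical names: when building the payment options list, anything with an active downtime entry gets marked as disabled.

```typescript
// Illustrative flow C: filter the options shown to the user against active downtimes.
interface PaymentOption {
  method: string;    // e.g. "HDFC Debit Card"
  gateway: string;
  enabled: boolean;
}

declare function activeDowntimeKeys(): Promise<Set<string>>; // reads the Downtimes store

async function availableOptions(all: PaymentOption[]): Promise<PaymentOption[]> {
  const down = await activeDowntimeKeys();
  return all.map(opt => ({
    ...opt,
    // Disable rather than hide, so the app can show a "facing issues" label.
    enabled: opt.enabled && !down.has(`${opt.gateway}:${opt.method}`),
  }));
}
```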

Part 3 — Re-enabling

To re-enable the affected payment methods, we need to know whether the current downtimes in the system are still active. We cannot just take a payment method down for a fixed period and then enable it again; that is not precise. We might keep it disabled for longer than the actual downtime, or for shorter, and both scenarios are equally bad.

We need more of a constant feedback mechanism to make this decision. We achieved this with A/B testing. When there is a downtime, we disable that payment method for A% of users, but keep it enabled (with a high-failure-rate warning) for B% of users. It is the users in bucket B who help us reach a decision, because they keep generating transaction logs. Based on these logs, if the success rate for a payment method or gateway improves and a downtime entry exists for it, we remove that entry so the method gets enabled again.
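Here is a sketch of how the bucketing and the re-enable decision could fit together. The percentages, thresholds and helpers are, once again, hypothetical.

```typescript
// Hypothetical A/B split for a method that is under downtime.
const B_PERCENT = 10;  // keep the method enabled for this slice of users

function inBucketB(userId: string): boolean {
  // Deterministic hash so a given user stays in the same bucket.
  let h = 0;
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 100 < B_PERCENT;
}

const RECOVERY_THRESHOLD = 0.9;  // bucket-B success rate needed to lift a downtime
const MIN_SAMPLE_SIZE = 50;      // minimum attempts before we trust the signal

declare function bucketBStats(key: string): Promise<{ total: number; success: number }>;
declare function removeDowntime(key: string): Promise<void>;

async function maybeReenable(key: string): Promise<void> {
  const s = await bucketBStats(key);
  if (s.total < MIN_SAMPLE_SIZE) return;  // not enough transactions yet
  if (s.success / s.total >= RECOVERY_THRESHOLD) {
    await removeDowntime(key);  // the method is enabled again for everyone
  }
}
```

Hashing on the user ID, rather than picking randomly per request, keeps a given user's experience consistent for the duration of the downtime.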

This is a calculated trade-off: some users will still face payment failures, but in return they give us the signal we need to improve our precision. Of course, this depends on the number of transactions attempted by users in bucket B, because we need a certain sample size to make any decision, so we had to set that threshold carefully too.

When we don't get enough transactions to make a decision, we keep the method down for a configured expiry time and then enable it automatically. We also reshuffle the payment options list using a dynamic priority based on downtimes, which helps promote the methods that are working fine. Both ideas are sketched below.
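A sketch of both fallbacks, reusing the PaymentOption shape from the disabling example. The expiry value is hypothetical.

```typescript
// Hypothetical expiry fallback: a downtime older than this is lifted automatically.
const DOWNTIME_EXPIRY_MINUTES = 60;

interface Downtime {
  key: string;        // "<gateway>:<method>"
  createdAt: Date;
}

function isExpired(d: Downtime): boolean {
  return Date.now() - d.createdAt.getTime() > DOWNTIME_EXPIRY_MINUTES * 60 * 1000;
}

// Dynamic priority: push disabled options to the bottom of the list.
// Array.prototype.sort is stable, so the original order is kept within each group.
function reorder(options: PaymentOption[]): PaymentOption[] {
  return [...options].sort((a, b) => Number(b.enabled) - Number(a.enabled));
}
```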

Results

  • We now handle all our downtimes in a 100% automated fashion.
  • The overall system downtime has become more transparent. We receive immediate notifications for any downtime in the system; earlier, we didn't even know how frequently these occurred.
  • We can now raise alerts to our third-party partners immediately when we witness a downtime in their systems.
  • And most importantly, we have improved the customer experience.

Here is a sample screenshot from the app that shows this.

Downtime visibility at app side

About the author

An engineer on weekdays and an explorer on weekends, Aryan Arora is a young engineer who works on the Monetisation Team and brainstorms different ways we can use tech to make our payments ecosystem better.

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing it on your favourite social networks (Twitter, LinkedIn, Facebook, etc.).

You can read more about us in our publications:
https://medium.com/uc-design
https://medium.com/uc-engineering
https://medium.com/uc-culture

https://www.urbancompany.com/blog/humans-of-urbanclap

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
