How Airtel set up system and processes for ensuring Payment Health and Intelligent Routing!

Rajat Jain
Airtel Digital
Published in
17 min read · Jul 21, 2022

Businesses need to give their customers the option to pay online. Customers can pay through different payment modes such as:

  • Net Banking
  • Credit Card
  • Debit Card
  • Wallet
  • UPI

Several payment service providers in the industry offer this kind of integration; they are referred to as payment gateways. Each payment gateway enables customers to pay online using any of the banks available in the industry.

The following are some of the payment gateways available in the industry:

  • PayU
  • RazorPay
  • CCAvenue
  • PayTM
  • JusPay

The payment gateways are in turn integrated with different bank APIs and provide a standard interface to their clients for seamless integration. The business client only needs to close contracts for the banks it wants to enable on its website for customer payments.

Online Payment Flow

Thus, in a simple payment flow, the customer pays on the client's website, the client hands the transaction to a payment gateway, and the payment gateway completes it with the bank.

However, things don't work as smoothly as they sound. There are API integrations across multiple layers, i.e. client, payment gateways, and banks. All these layers are deployed on completely different servers and are maintained by different vendors. These services are subject to the following practical challenges:

  • Power outages
  • Server downtime due to maintenance/upgrades
  • Data Migration

Because of these outages, users cannot make payments online. Businesses cannot afford such outages, which is where redundancy comes in. The client integrates with multiple payment gateways to aim for 100% service availability for its customers. For example, when the user selects HDFC net banking and the client has integrations with both PayU and CCAvenue, the client decides, based on certain parameters, which payment gateway will execute the transaction. This selection of a payment gateway based on certain parameters is called routing.

Routing Types

  1. Static Routing: A fixed percentage of traffic is given to each payment gateway.
  2. Dynamic Routing: The percentage of traffic to each payment gateway may vary based on parameters such as success rate, response time, or charges.
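Both routing types end with the same per-transaction decision: pick one gateway in proportion to its configured traffic share. The sketch below illustrates that idea in Python. The function name `pick_gateway` and the 50/30/20 split are assumptions made for illustration, not Airtel's implementation.

```python
import random

def pick_gateway(routing, rng=random):
    """Pick one payment gateway at random, in proportion to its traffic share.

    `routing` maps a gateway name to its share of traffic (any non-negative
    weights; they do not have to sum to 100).
    """
    gateways = list(routing)
    weights = [routing[g] for g in gateways]
    # random.choices draws with probability proportional to each weight.
    return rng.choices(gateways, weights=weights, k=1)[0]

# Hypothetical static split for one bank and payment mode.
static_routing = {"PayU": 50, "CCAvenue": 30, "RazorPay": 20}
print(pick_gateway(static_routing))
```

With static routing the shares in `static_routing` never change; with dynamic routing the same lookup is used, but the shares are recomputed periodically.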

Another way to mitigate revenue loss due to bank downtime is to stop users from selecting the affected bank on the payment options page, with clear communication of the downtime. This improves the customer experience because users know the current status of each bank before they pay. Showing a bank's health per payment mode is called a health check.

We will cover both concepts in detail: the logic behind the health check calculation and dynamic routing, with the aim of maximizing revenue collection for any business.

Airtel runs several businesses, including travel booking, mobile bill payments, and insurance premiums. Each business has a separate customer base; let's call each of these businesses a LOB (Line of Business). The business owners of each LOB decide which payment modes to enable for their customers. They can also engage different payment gateways according to their contractual commission for payment collection. Thus, as a payments platform, we need to support the following integration capability:

PROBLEM STATEMENT

The goal of each business is as follows:

  1. 100% online payments system availability.
  2. Maximum success rate of transactions.

Health of Payment Instrument

To accomplish both goals, we capture data at the organizational level to determine the health of payment instruments. All of Airtel's businesses use Net Banking, Credit Card, Debit Card, UPI, and Wallet, and under these payment modes banks such as SBI, HDFC, and ICICI are enabled. The uptime of these banks is independent of Airtel's businesses: if SBI is down on the PayU PG (payment gateway), it is down for all 3 LOBs, i.e. travel, mobile, and insurance. Thus, the health of any payment instrument is essentially a function of the PG and the payment mode.

Health (SBI) = fx (PG, Payment Mode)

Now, how do we determine the health of SBI Net Banking on PayU? There are 2 data sources:

  1. PayU PG Health Check API: Each PG provides an API that reports the real-time downtime of any bank enabled on its network. We call this the API data source.
  2. Airtel Transaction Data: Airtel has 3 businesses collecting payments online, so it can query its own transaction data to determine whether SBI has been performing well for its customers in the recent past. A dip in the success rate indicates downtime of SBI on the PayU PG. Several tools can aggregate transaction data as a time series for this purpose, such as Influx. We call this the INFLUX data source.

Thus, the health of any payment instrument is now also a function of the data source.

Health (SBI) = fx (PG, Payment Mode, Data Source)

PG Health Check API

As mentioned, each PG can report the real-time health of any payment instrument. PGs can provide this information because they receive scheduled downtime alerts from the banks and also monitor the performance of banks on their network. The values provided by a PG are as follows:

  1. UP: Transactions will be successful.
  2. FLUCTUATING: Transactions might fail.
  3. DOWN: Transactions will fail.

Sample PayU PG Health Check Response

Airtel Transaction Data

For Airtel, the greatest advantage is that it can draw on data from its multiple businesses to determine the health of any bank. The ultimate goal is to maximize the success rate and keep the system 100% available. There can be cases where the bank has no downtime but, because of faulty code in the Airtel payments integration, SBI Net Banking is not working for customers. Such system failures are best detected by deriving health from Airtel's own transaction data.

The flow for determining health status is as follows:

The user logs on to the Airtel recharge portal, chooses the recharge amount, and proceeds to the payment page. The user selects SBI Net Banking as the payment option, PayU PG is selected to complete the transaction, and the user is redirected to the bank page to complete it. The transaction is captured in the database as follows:

This is one transaction record in the database. At Airtel's scale, we can expect hundreds of such transactions per minute across different businesses for SBI Net Banking. As the next step, we therefore aggregate these transactions over time. The aggregation is as follows:

SELECT COUNT(*) FROM transactions WHERE time > now() - 30m GROUP BY paymentMode, bankCode, PG, paymentStatus, time(1m)

This aggregated data lives in the Influx database. The health check service queries Influx to retrieve the data and calculates the success rate and health check in the format below:

We calculate the success rate as a weighted average of the per-minute success rates over the last 30 minutes. The intent is to identify performance degradation in near real-time, so the highest weight is assigned to the time interval closest to the current time. The same is explained with a working example for SBI Net Banking on PayU PG.

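Here C is the success rate for each minute and D is the weight for that minute (1 for the oldest minute up to 4 for the most recent).

Avg (SR) = Sum (C) / Count (C) = (71.42 + 75 + 66.67 + 81.25) / 4 = 73.58

Wt Avg (SR) = Sum (C * D) / Sum (D) = (71.42 + 150 + 200.01 + 325) / (1 + 2 + 3 + 4) = 74.64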

As you can see, there is a clear difference between the weighted success rate and the plain average for the last 4 minutes. This is because SBI peaked at 81.25% in the most recent minute, i.e. 10:03 AM. It is evident that, in near real-time, SBI Net Banking is performing well on PayU PG. Hence, the weighted success rate gives a much clearer picture than the plain average.
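The weighted average above can be reproduced with a few lines of Python. This is a minimal sketch that assumes linearly increasing weights (1 for the oldest minute, n for the most recent), which is what the worked example uses; the function name is hypothetical.

```python
def weighted_success_rate(per_minute_sr):
    """Weighted average of per-minute success rates, listed oldest first.

    The oldest minute gets weight 1 and the most recent minute gets the
    highest weight, so recent behaviour dominates the result.
    """
    if not per_minute_sr:
        raise ValueError("at least one per-minute success rate is required")
    weights = range(1, len(per_minute_sr) + 1)
    weighted_sum = sum(sr * w for sr, w in zip(per_minute_sr, weights))
    return weighted_sum / sum(weights)

# Per-minute success rates from the worked example, oldest first.
rates = [71.42, 75, 66.67, 81.25]
print(round(weighted_success_rate(rates), 2))  # 74.64
```

In production the list would hold up to 30 entries, one per minute of the configured window.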

Health Determination

We now have the performance metric of SBI Net Banking on PayU PG in the form of a weighted success rate. The next step is to interpret health from it. We do so by comparing the success rate against two threshold values, the up threshold and the down threshold.

  • Up Threshold: 70%
  • Down Threshold: 30%

Case 1: Down Health
0 <= Weighted SR <= Down Threshold

Case 2: Fluctuating Health
Down Threshold < Weighted SR <= Up Threshold

Case 3: Up Health
Up Threshold < Weighted SR <= 100

Thereby, at 10:05 AM for SBI Net Banking at PayU PG, we calculated Weighted SR as 74.64 %. As per scale, it lies under Case 3:

70 < 74.64 <= 100
Hence, health is deduced as UP.
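The three cases translate directly into a small classification function. The sketch below uses the 30% and 70% thresholds from the text as defaults; the function name is an assumption, and a success rate of exactly 0 is treated as DOWN.

```python
def classify_health(weighted_sr, down_threshold=30.0, up_threshold=70.0):
    """Map a weighted success rate (0 to 100) to a health status."""
    if weighted_sr <= down_threshold:
        return "DOWN"          # Case 1
    if weighted_sr <= up_threshold:
        return "FLUCTUATING"   # Case 2
    return "UP"                # Case 3

print(classify_health(74.64))  # UP
```

Keeping the thresholds as parameters mirrors the fact that the business team can tune them.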

Health is calculated every minute from the transactions of the last 30 minutes. This is driven by a scheduler service that queries the transaction data at regular intervals.

The business team can configure the following parameters as per their traffic and SLA:

  1. Duration of transactions to be considered for calculating the weighted success rate.
  2. Down threshold.
  3. Up threshold.

Now, we have the health information of SBI Net Banking on PayU PG across 2 data sources as follows:

We combine health from both of these sources to deduce final health as follows:

  1. Use only the API data source.
  2. Use only the Influx data source.
  3. Pessimistic Approach: Take the AND combination of health across both data sources.
  4. Optimistic Approach: Take the OR combination of health across both data sources.
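The four strategies can be sketched as follows. The article does not spell out how AND and OR apply to three-valued health, so this sketch assumes an ordering DOWN < FLUCTUATING < UP, with AND taking the worse of the two statuses and OR taking the better one. The strategy names are hypothetical labels.

```python
# Assumed ordering of statuses from worst to best.
RANK = {"DOWN": 0, "FLUCTUATING": 1, "UP": 2}

def combine_health(api_health, influx_health, strategy):
    """Combine health from the API and Influx data sources."""
    if strategy == "API_ONLY":
        return api_health
    if strategy == "INFLUX_ONLY":
        return influx_health
    if strategy == "PESSIMISTIC":
        # AND: the instrument is only as healthy as the worse source says.
        return min(api_health, influx_health, key=RANK.__getitem__)
    if strategy == "OPTIMISTIC":
        # OR: the instrument is as healthy as the better source says.
        return max(api_health, influx_health, key=RANK.__getitem__)
    raise ValueError(f"unknown strategy: {strategy}")

print(combine_health("UP", "FLUCTUATING", "PESSIMISTIC"))  # FLUCTUATING
```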

The final result of Health (Bank) = fx (PG, Payment Mode, Data Source) is as follows:

Airtel now has a large database of the real-time health status of banks across payment modes and payment gateways. It can predict a bank's behavior by drawing on its large customer base. This data can also be exposed as an external API to be consumed by other businesses.

Now, let's move on to how Airtel solves its original problem statement:

  1. 100% online payments system availability.
  2. The maximum success rate of transactions.

Health Enquiry

Health enquiry is the concept that supports 100% online payment system availability. It works by showing users the real-time health status upfront on the payment options page. Thus, if SBI is having downtime, we can disable that payment option and let users choose an alternative.

The health of any bank is a function of the LOB, Payment Mode, PG, Routing Value

Health (SBI) = fx (LOB, Payment Mode, PG, Routing Value)

The LOB parameter is critical in determining the health of a banking instrument because, as we have seen, different businesses have contracts with different PGs. It is also their prerogative to decide the traffic distribution among those PGs, so we need to factor in each PG's routing contribution to calculate the accurate health of the bank for that particular LOB.

The flow for calculating health for LOB is as follows:

The user needs to make payment via SBI Net Banking for his recharge on the Airtel website.

Let’s fetch health from the master table for SBI Net Banking across all PGs, we give health a numeric value as shown in the table.

The next step is to fetch traffic distribution across different PGs for Travel LOB in the case of SBI Net Banking.

Now we will combine the health of SBI Net Banking on each PG with that PG's routing value to get a consolidated value.

Health (Bank) = SUM (Health Value (PG) * Routing Value (PG))

In the above case,

Health (SBI) = 1*0.5 + 0.5*0.3 + 0*0.2 = 0.65

The next step is to interpret health from this value. We do so by comparing the health value against two threshold values, the up threshold and the down threshold, maintained at the LOB level.

  • Up Threshold: 70%
  • Down Threshold: 30%

Case 1: Down Health

0 <= Health Value <= Down Threshold

Case 2: Fluctuating Health

Down Threshold < Health Value <= Up Threshold

Case 3: Up Health

Up Threshold < Health Value <= 100

Thereby, for SBI Net Banking for the Travel LOB, we calculated the Health Value as 65%. As per the scale, it lies under Case 2:

0.3 < 0.65 <= 0.7

Hence, health is deduced as FLUCTUATING.
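The LOB-level calculation can be sketched in Python as below. The numeric health values (UP = 1, FLUCTUATING = 0.5, DOWN = 0) and the 0.5/0.3/0.2 split come from the worked example; which PG carries which status and share is an assumption made for illustration, as is the function name.

```python
# Numeric value assigned to each health status, as in the worked example.
HEALTH_VALUE = {"UP": 1.0, "FLUCTUATING": 0.5, "DOWN": 0.0}

def lob_health(pg_health, routing, down_threshold=0.3, up_threshold=0.7):
    """Return (health value, status) of a bank for one LOB.

    `pg_health` maps PG name to status; `routing` maps PG name to the
    fraction of the LOB's traffic sent to that PG (fractions sum to 1).
    """
    value = sum(HEALTH_VALUE[pg_health[pg]] * share
                for pg, share in routing.items())
    value = round(value, 4)  # guard against floating-point noise at thresholds
    if value <= down_threshold:
        return value, "DOWN"
    if value <= up_threshold:
        return value, "FLUCTUATING"
    return value, "UP"

# Hypothetical assignment of statuses and shares for the Travel LOB.
health = {"PayU": "UP", "CCAvenue": "FLUCTUATING", "RazorPay": "DOWN"}
travel_routing = {"PayU": 0.5, "CCAvenue": 0.3, "RazorPay": 0.2}
print(lob_health(health, travel_routing))  # (0.65, 'FLUCTUATING')
```

A LOB that sends all of its traffic to a PG whose status is UP gets a health value of 1.0 and the status UP, even if other PGs are degraded.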

Simulating the algorithm for another use case:

The user needs to make a payment via SBI Net Banking for an insurance premium on the Airtel website.

1. Health from Master Table

2. Traffic distribution of SBI Net Banking for Insurance LOB

3. Cross of Health Value with Routing Value

Health (SBI) = 1 * 1.0 = 1

4. Health Calculation based on threshold

Routing Value 100

Thereby, for SBI Net Banking for the Insurance LOB, we calculated the Health Value as 100%. As per the scale, it lies under Case 3:

0.7 < 1.0 <= 1.0
Hence, health is deduced as UP.

As you can see, the health of SBI Net Banking differs across businesses in the same company. For travel customers, transactions might fail, but for insurance customers, transactions will be successful. The primary reason is that the performance of the same bank can differ across the integrated PGs.

Traffic Sampling

Consider a case where SBI Net Banking is down across PGs. Through the health enquiry at 10 AM, Airtel identifies the downtime by analyzing its transaction data and decides to disable the payment option. To re-enable it, Airtel must analyze traffic performance again at 10:05 AM, which requires a few transactions to calculate the success rate. Hence, a small sample of 5% of traffic is kept open even when SBI Net Banking is down, so that the success rate can be calculated iteratively and further informed decisions can be taken. This concept is called traffic sampling.
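One simple way to realize traffic sampling is a random draw per request. The article does not describe the actual mechanism, so the sketch below is an assumption; only the 5% sample rate comes from the text, and the function name is hypothetical.

```python
import random

def is_option_enabled(health, sample_rate=0.05, rng=random):
    """Decide whether to offer a payment option to the current user."""
    if health != "DOWN":
        return True
    # Even when the instrument is DOWN, let a small random sample through
    # so that fresh transactions keep feeding the success-rate calculation.
    return rng.random() < sample_rate

print(is_option_enabled("UP"))  # True
```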

Dynamic Routing

Let's discuss the critical feature of dynamic routing and the logic of traffic distribution across the PGs configured for a bank and payment mode of a LOB. There are primarily 2 approaches for changing traffic distribution dynamically based on the real-time health of the payment instrument:

  1. Soft Routing
  2. Hard Routing
    a. Pessimistic Approach
    b. Optimistic Approach

We will discuss the algorithm, applicability, limitations, and advantages of these approaches in detail. The main source of data for dynamic routing is Airtel transaction data, which is aggregated as a time series using the Influx tool.

Soft Routing

With soft routing, we shift traffic from one PG to another in small steps at regular intervals of time. Traffic is shifted from the low-performing PG to the high-performing PGs.

The shift is gradual, and both the step value and the time interval are configurable. There is also a minimum and a maximum threshold that limit how much traffic any PG can be assigned.

Now we will simulate soft routing for SBI Net Banking for a LOB on the Airtel website. Let's fetch the success rate from the master table for SBI Net Banking across all PGs.

Thus, the PG with the lowest success rate for SBI Net Banking is currently RazorPay. Since both PayU and CCAvenue are performing better than RazorPay, we will route traffic from RazorPay to PayU and CCAvenue in steps of the step value.

The next step is to fetch traffic distribution across different PGs for Travel LOB in the case of SBI Net Banking.

For calculating the updated routing value, we use the following configuration.

Step Value = 5
The traffic will be transferred from 1 PG to another in chunks of 5%.

Minimum Threshold = 5
No PG can be assigned traffic less than 5%

Maximum Threshold = 95
No PG can be assigned traffic more than 95%

The algorithm is as follows:

  1. Routing Value (Low SR PG) = Routing Value (PG) - ((Count of PG - 1) * Step Value)
  2. If Routing Value (Low SR PG) < Minimum Threshold, then break.
  3. Else, Routing Value (Other PG) = Routing Value (PG) + Step Value
  4. If Routing Value (Other PG) > Maximum Threshold, then break.
  5. Else, repeat Steps 3 and 4 for the remaining PGs.
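The steps above can be sketched in Python as follows. The sketch treats "break" as "leave the routing unchanged for this iteration". The starting split (50/30/20) and the success rates are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def soft_route(routing, success_rate, step=5, min_threshold=5, max_threshold=95):
    """Shift `step` percent of traffic from the lowest-SR PG to every other PG.

    Returns the new routing, or an unchanged copy if the update would push
    any PG below `min_threshold` or above `max_threshold`.
    """
    lowest = min(routing, key=lambda pg: success_rate[pg])
    others = [pg for pg in routing if pg != lowest]
    updated = dict(routing)
    updated[lowest] = routing[lowest] - len(others) * step
    if updated[lowest] < min_threshold:
        return dict(routing)  # minimum threshold violated: skip this iteration
    for pg in others:
        updated[pg] = routing[pg] + step
        if updated[pg] > max_threshold:
            return dict(routing)  # maximum threshold violated: skip
    return updated

# Hypothetical inputs for SBI Net Banking on one LOB.
routing = {"PayU": 50, "CCAvenue": 30, "RazorPay": 20}
success_rate = {"PayU": 70, "CCAvenue": 65, "RazorPay": 40}
first = soft_route(routing, success_rate)
print(first)  # {'PayU': 55, 'CCAvenue': 35, 'RazorPay': 10}
```

Running `soft_route(first, success_rate)` again would take RazorPay to 0%, which is below the minimum threshold, so the routing is returned unchanged.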

Thus, updated routing values are

Last but not least, persist the updated values in the database.

The frequency of dynamic routing can be tuned based on business SLA and traffic. As an estimate, a user is able to complete a transaction in 2 minutes. Since the main source of data for dynamic routing is transactional data, an interval of at least 5 minutes is recommended to capture substantial data across PGs for the success rate calculation.

Now we will simulate another iteration to check the case of a minimum or maximum threshold violation. Let's see the data 5 minutes after the routing update.

Success Rate of SBI Net Banking across PG.

As we can see, after the routing update, the success rates of PayU and RazorPay increased while CCAvenue maintained its success rate. As per the algorithm, RazorPay still qualifies for soft routing, as it has the lowest success rate among all PGs.

The traffic distribution is as follows:

Let’s calculate the updated values using step value.

As we can see, after the calculation the traffic assigned to RazorPay is 0%, which is lower than the minimum threshold, so we do not update the routing for this iteration. The reason is that the source of the dynamic routing update is Airtel transaction data, so we need some transactions at each PG to calculate the success rate. Hence, in each 5-minute interval, we route at least 5% of traffic even to the lowest-performing PG, so that it can compete with the better-performing PGs in the next iteration.

Hard Routing

Soft routing updates the traffic distribution gradually. With a time interval of 5 minutes and a step value of 5%, it may take up to 30 minutes before maximum traffic is routed to the best-performing PG. There might be cases where a particular PG crashes suddenly, and the entire traffic needs to be shifted from that PG to another PG. This concept of bulk-shifting traffic from one PG to another is called hard routing. There are 2 approaches to implementing hard routing; we will discuss them in detail.

Pessimistic Approach

The pessimistic approach covers the case where a PG has crashed suddenly and its traffic is routed to the other, better-performing PGs. In this case, we minimize the damage by phasing out the crashed PG, but the diverted traffic is distributed equally among the other PGs, which might not yield the best success rate. The flow is as follows:

Now we will simulate pessimistic hard routing for SBI Net Banking for a LOB on the Airtel website. Let's fetch the success rate from the master table for SBI Net Banking across all PGs.

Thus, as per the success rate calculation, PayU PG has a success rate of only 3%, which is less than the minimum threshold. This makes PayU PG eligible for hard routing. We will divert all but the minimum threshold of traffic from PayU PG to the other, better-performing PGs.

The next step is to fetch traffic distribution across different PGs for Travel LOB in the case of SBI Net Banking.

For calculating the updated routing value, we use the following configuration.

Minimum Threshold = 5
No PG can be assigned traffic less than 5%

Maximum Threshold = 95
No PG can be assigned traffic more than 95%

The algorithm is as follows:

  1. Traffic to Be Routed = Routing Value (Low SR PG) - Minimum Threshold
  2. Routing Value (Low SR PG) = Minimum Threshold
  3. Step Value = Traffic to Be Routed / (Count of PG - 1)
  4. Routing Value (Other PG) = Routing Value (PG) + Step Value
  5. Repeat Step 4 for the remaining PGs
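The steps above can be sketched in Python as below. The eligibility check (success rate below the minimum threshold) follows the narrative rather than the numbered steps, the starting split and the RazorPay success rate are hypothetical, and the function name is an assumption. The maximum threshold is not applied because the listed steps do not use it.

```python
def hard_route_pessimistic(routing, success_rate, min_threshold=5):
    """Cut the lowest-SR PG down to `min_threshold` and split the freed
    traffic equally among all other PGs."""
    lowest = min(routing, key=lambda pg: success_rate[pg])
    others = [pg for pg in routing if pg != lowest]
    to_route = routing[lowest] - min_threshold
    # Only act when the PG looks crashed and there is traffic to move.
    if success_rate[lowest] >= min_threshold or not others or to_route <= 0:
        return dict(routing)
    step = to_route / len(others)
    return {pg: (min_threshold if pg == lowest else share + step)
            for pg, share in routing.items()}

# Hypothetical inputs; PayU's 3% and CCAvenue's 68% are from the text.
routing = {"PayU": 50, "CCAvenue": 30, "RazorPay": 20}
success_rate = {"PayU": 3, "CCAvenue": 68, "RazorPay": 55}
print(hard_route_pessimistic(routing, success_rate))
# {'PayU': 5, 'CCAvenue': 52.5, 'RazorPay': 42.5}
```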

Thus, updated routing values are

Last but not least, persist the updated values in the database.

Thus, although we have saved our system from crashing by routing traffic from PayU to the other PGs, the traffic is split equally between CCAvenue and RazorPay even though CCAvenue has a higher success rate. This is a limitation of pessimistic hard routing, as the maximum success rate is not achieved. To overcome this limitation, optimistic hard routing is implemented.

Optimistic Approach

The optimistic approach covers the case where maximum traffic is routed to the PG with the highest success rate. All other PGs are left with a minimal share of traffic. This also covers the case where a PG has crashed suddenly, while maximizing the success rate. The flow is as follows:

Now we will simulate optimistic hard routing for SBI Net Banking for a LOB on the Airtel website. Let's fetch the success rate from the master table for SBI Net Banking across all PGs.

Thus, as per the success rate calculation, CCAvenue PG has the highest success rate, 68%, so it should receive maximum traffic. We will divert traffic to CCAvenue PG up to the maximum threshold.

The next step is to fetch traffic distribution across different PGs for Travel LOB in the case of SBI Net Banking.

For calculating the updated routing value, we use the following configuration.

Minimum Threshold = 5
No PG can be assigned traffic less than 5%

Maximum Threshold = 80
No PG can be assigned traffic more than 80%

The algorithm is as follows:

  1. Routing Value (Highest SR PG) = Maximum Threshold
  2. Step Value = (100 - Maximum Threshold) / (Count of PG - 1)
  3. Routing Value (Other PG) = Step Value
  4. Repeat Step 3 for the remaining PGs
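The steps above can be sketched in Python as follows, using the maximum threshold of 80 from the configuration. The RazorPay success rate and the starting split are hypothetical, and the function name is an assumption.

```python
def hard_route_optimistic(routing, success_rate, max_threshold=80):
    """Give the highest-SR PG `max_threshold` percent of traffic and split
    the remainder equally among all other PGs."""
    best = max(routing, key=lambda pg: success_rate[pg])
    others = [pg for pg in routing if pg != best]
    if not others:
        return {best: 100}  # a single PG necessarily carries all traffic
    step = (100 - max_threshold) / len(others)
    return {pg: (max_threshold if pg == best else step) for pg in routing}

# Hypothetical inputs; PayU's 3% and CCAvenue's 68% are from the text.
routing = {"PayU": 50, "CCAvenue": 30, "RazorPay": 20}
success_rate = {"PayU": 3, "CCAvenue": 68, "RazorPay": 55}
print(hard_route_optimistic(routing, success_rate))
# {'PayU': 10.0, 'CCAvenue': 80, 'RazorPay': 10.0}
```

With three PGs and a maximum threshold of 80, each of the other PGs receives 10%, which stays above the 5% minimum threshold.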

Thus, updated routing values are

Last but not least, persist the updated values in the database.

Comparative Analysis

As we can see, each dynamic routing approach has its own set of pros and cons. Thus, the choice of approach depends on the business requirements.
