A closer look at failed payments and refunds at redBus
Every payment system has to deal with failed transactions (DNCs), which arise from user abandonment or system issues. One of its jobs, then, is to correctly determine the final state of the transaction (success or failure) and decide whether to auto-refund the customer. The more accurate and transparent this process is, the higher the customer satisfaction.
The payment system at redBus deals with payment data that flows to and from multiple internal systems and third-party payment gateways. The final status of an order needs to be reconciled across all of them. The challenge is how to send and receive payment updates across such a heterogeneous set of systems so that the final updates are correct, traceable, and suggestive of future actions and improvements. Failing to do so contributes to increased turnaround time (TAT) and miscommunication, thereby affecting the customer experience and hence the Net Promoter Score (NPS).
How do we solve this?
We wanted to capture the processing of failed orders in a way that provides better data management to consuming applications (in terms of uniformity and performance), and decided to build a data pipeline for it. At the heart of this pipeline is Apache Kafka, a distributed message broker that is surprisingly resilient for an event system and provides multiple consumption channels.
The problems that we tried to fix in the handling of failed payments led us to a set of design choices:
- Build a scalable system and move towards an event-driven architecture
- Provide accurate communication to reduce anxiety for the customer
- Always have a definitive PAYMENT_STATE which maps to some operational workflow
- Logging and Real Time Alerting
These are reflected in the way we architected the solution and got to the root of each problem:
Gather Payment Updates From Heterogeneous Systems
Pull based approach for evented systems > For internal transaction systems which publish events, the “Producers” are responsible for polling the queue and pushing payments data to Kafka.
Push based approach for non-evented systems > For internal transaction systems which lock payments data in relational stores or third party systems, an API is exposed to push payments data to Kafka. This “Publish API” sits behind a layer of bearer token based authentication.
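To make the push-based path concrete, here is a minimal sketch of what a token-protected publish endpoint could look like in Go. It is not the actual redBus Publish API: the `PaymentUpdate` fields, the `Publisher` interface, the `/publish` path and the `example-token` value are all assumptions made for illustration, and the Kafka producer is replaced by a stand-in that only logs.

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// PaymentUpdate is a hypothetical payload; the real contract is not shown in this post.
type PaymentUpdate struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
}

// Publisher abstracts the Kafka producer so the sketch stays library-agnostic.
type Publisher interface {
	Publish(topic, key string, value []byte) error
}

// logPublisher is a stand-in producer that only logs what it would send.
type logPublisher struct{}

func (logPublisher) Publish(topic, key string, value []byte) error {
	log.Printf("publish topic=%s key=%s value=%s", topic, key, value)
	return nil
}

// authorized reports whether the Authorization header carries the expected bearer token.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	// Constant-time comparison avoids leaking token contents through timing.
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// publishHandler authenticates the caller, validates the body and forwards it to the "sink" topic.
func publishHandler(p Publisher, token string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		var u PaymentUpdate
		if err := json.NewDecoder(r.Body).Decode(&u); err != nil || u.OrderID == "" {
			http.Error(w, "invalid payload", http.StatusBadRequest)
			return
		}
		value, err := json.Marshal(u)
		if err != nil {
			http.Error(w, "encoding failed", http.StatusInternalServerError)
			return
		}
		if err := p.Publish("sink", u.OrderID, value); err != nil {
			http.Error(w, "publish failed", http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	// Exercise the handler in-process instead of opening a port.
	h := publishHandler(logPublisher{}, "example-token")
	body := `{"order_id":"ORD-1","status":"FAILED"}`
	req := httptest.NewRequest(http.MethodPost, "/publish", strings.NewReader(body))
	req.Header.Set("Authorization", "Bearer example-token")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code) // 202 when the update is accepted
}
```

Keeping the token check in a small pure function makes it easy to test separately from the HTTP plumbing.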
Single entry point > In both the above approaches, Kafka gets real time updates on a payment order (from initiation to completion). The received message is mapped to a Kafka message and pushed to a designated topic “sink”. This topic acts as a single entry point where message validations are run and transaction properties are assigned. The downstream flows cannot add or remove these properties but are free to modify them for further consumption.
Uniform message contract > A Kafka message is a key-value pair (of implementer’s choice), with the key as a unique Id and the value as the serialized model of transaction properties. This model contains more than 30 properties, used in parts by multiple consumers. From the standpoint of a single consumer, this denormalized message structure can look like a waste of precious system and network resources, but it removes the complexity of writing data checks in every consumer and avoids multiple round trips to the database, providing better read performance and simpler code.
Lean, Mean and Fast
No batching > Batching is the norm for relational stores, but in a streaming architecture, messages are consumed rather than queried. No batching also means that every transaction is given an equal chance of state confirmation, so the kind of drop and the source of the final state can be accurately determined.
Strong coherency between topic and consumers > Messages that belong to a given Kafka topic have strong coherency with the consumers connected to that topic. Each consumer sees only those messages which it can blindly compare and classify into a state and/or move to another topic.
Parallel and lightweight consumers > With redBus quickly expanding to new geographies and entering multiple verticals other than buses, the underlying payment and refund systems are bound to see a higher volume of transactions and so should be able to scale well. Thus, in case of a surge in the volume of failed transactions, we deploy parallel consumers to clear the queue and keep turnaround times within acceptable limits.
This is possible mainly because Kafka’s design allows multiple consumers (recognized by their group.id identifier) per channel. Unlike traditional queuing systems, Kafka divides a topic into multiple partitions. If no custom partitioning strategy is defined, a round robin algorithm is used for data distribution within a topic. A consumer can read from 0 or more partitions, depending on the number of partitions and the running state of other consumers in the group. In an ideal scenario, each consumer is assigned a single partition by the Kafka runtime. If a consumer in a group dies, its orphaned partition is rebalanced across the live consumers, so no messages are lost.
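The effect of partition assignment and rebalancing can be illustrated with a small simulation. This is only a toy model: the `assign` function below is a hypothetical round robin spread, not Kafka's actual group coordinator or assignor, and the consumer names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// assign spreads partitions 0..partitions-1 over the live consumers in round robin order.
// Consumers are sorted first so the result is deterministic.
func assign(partitions int, consumers []string) map[string][]int {
	out := make(map[string][]int, len(consumers))
	if len(consumers) == 0 {
		return out
	}
	live := append([]string(nil), consumers...)
	sort.Strings(live)
	for _, c := range live {
		out[c] = nil // idle consumers still appear, with no partitions
	}
	for p := 0; p < partitions; p++ {
		c := live[p%len(live)]
		out[c] = append(out[c], p)
	}
	return out
}

func main() {
	// Ideal case: one partition per consumer.
	fmt.Println(assign(2, []string{"c1", "c2"})) // map[c1:[0] c2:[1]]
	// c1 dies: its partition moves to the surviving consumer.
	fmt.Println(assign(2, []string{"c2"})) // map[c2:[0 1]]
	// More consumers than partitions: the extra consumer reads from 0 partitions.
	fmt.Println(assign(2, []string{"c1", "c2", "c3"})) // map[c1:[0] c2:[1] c3:[]]
}
```

The last case shows why adding consumers beyond the partition count does not add throughput until the topic itself gets more partitions.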
The consumers are written in Go, which fits our purpose really well because it is lightweight in dealing with network calls (SQL connections and HTTP APIs), which is really what we needed. Concurrency is handled via goroutines, which are cheap, lightweight threads managed by the Go runtime, and CSP-style message passing over channels. This allows us to easily dispatch more Kafka consumers as depicted in above diagram. Another thing to like here is that the Go toolchain (using the gc compiler) generates a single binary for the target OS and architecture. Compilation is also fast compared with many other compiled languages. The resulting binary needs neither a VM nor a separately installed Go runtime (the runtime is linked into the binary), which takes additional load off the servers and shortens deployment cycles.
In production, we run Kafka with 15 topics of 2 partitions each (with support for dynamically extending the partitions, if needed). This allows us to run a multi-consumer setup on the same or different nodes. In the diagram below, top reveals the percentage of memory and CPU utilisation per Go process on a single node. The 7 Go processes listed (“golang-c”) together use 1.6% memory and less than 6% CPU during peak time.
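The goroutine and channel model mentioned above can be sketched in a few lines. The example below is not our consumer code: `Message`, `process` and `runConsumers` are hypothetical names, and an in-memory slice stands in for a Kafka topic so that the sketch runs without a broker.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a simplified stand-in for a consumed Kafka record.
type Message struct {
	Key, Value string
}

// process is a placeholder for the real per-message work (status checks, DB writes).
func process(m Message) string {
	return m.Key + ":handled"
}

// runConsumers starts n goroutines that read from a shared channel until it is
// closed, and collects one result per message.
func runConsumers(n int, msgs []Message) []string {
	if n < 1 {
		n = 1
	}
	in := make(chan Message)
	out := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range in {
				out <- process(m)
			}
		}()
	}

	// Feed the workers, then signal that no more messages are coming.
	go func() {
		for _, m := range msgs {
			in <- m
		}
		close(in)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	msgs := []Message{{"ORD-1", "FAILED"}, {"ORD-2", "FAILED"}, {"ORD-3", "PENDING"}}
	fmt.Println(len(runConsumers(2, msgs)), "messages handled")
}
```

Scaling up is then a matter of raising `n`; the order of results is not guaranteed, which is acceptable when each message is handled independently.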
It is fitting at this point to look at RDS metrics as well (24 hrs)
Classification of transaction states > To reduce failures, it is important to recognise the kind of drop and the source of the final state. redBus classifies drops as IDNC (internal drops between the payment and transaction systems) and EDNC (external drops between payment gateways and the payment system). EDNCs are further classified by the source that confirmed them: SF (salesforce status check) and RECON (settlement sheet). This keeps us informed about the source of failures and about how effective these disparate systems are at identifying true refunds.
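One way to picture this classification is as a small decision function. The sketch below is an interpretation, not our production rules: the `Observation` fields and the exact conditions are assumptions derived from the definitions above (a drop "between" two systems is modelled as the upstream system reporting success while the downstream one does not).

```go
package main

import "fmt"

// DropKind is the kind of drop detected for an order.
type DropKind string

const (
	NoDrop DropKind = "NONE"
	IDNC   DropKind = "IDNC" // drop between the payment and transaction systems
	EDNC   DropKind = "EDNC" // drop between a payment gateway and the payment system
)

// Source is where the final state of an EDNC was confirmed.
type Source string

const (
	SourceUnknown Source = ""
	SourceSF      Source = "SF"    // salesforce status check
	SourceRecon   Source = "RECON" // settlement sheet
)

// Observation is a hypothetical view of one order across the three systems.
type Observation struct {
	GatewayCharged   bool   // the gateway reports the money as captured
	PaymentRecorded  bool   // the payment system recorded a success
	BookingConfirmed bool   // the transaction system completed the order
	ConfirmedBy      Source // who confirmed the gateway-side state
}

// classify applies the assumed rules in order, external drops first.
func classify(o Observation) (DropKind, Source) {
	switch {
	case o.GatewayCharged && !o.PaymentRecorded:
		return EDNC, o.ConfirmedBy
	case o.PaymentRecorded && !o.BookingConfirmed:
		return IDNC, SourceUnknown
	default:
		return NoDrop, SourceUnknown
	}
}

func main() {
	kind, src := classify(Observation{GatewayCharged: true, ConfirmedBy: SourceRecon})
	fmt.Println(kind, src) // EDNC RECON
	kind, _ = classify(Observation{GatewayCharged: true, PaymentRecorded: true})
	fmt.Println(kind) // IDNC
}
```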
Visibility Surrounding Transaction States
With multiple interrelated processes running on an order, it is important to have visibility into the state of an order as it is handled by every consumer. The consumers publish state changes for an order to MongoDB in a dated collection. Each such log entry carries a timestamp, the source of the log, a note, and data, if any. The historical changes on an order are at the disposal of the SF executive to keep the anxious customer well-informed. Handled errors and exceptions are also published to MongoDB, while occasional unhandled exceptions are logged to disk.
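A log entry of this shape could be modelled as below. This is a sketch under assumptions: the field names, the JSON encoding and the `order_state_YYYYMMDD` collection naming are hypothetical, and the MongoDB write itself is left out so the example has no driver dependency.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StateLog is one state change published by a consumer.
type StateLog struct {
	OrderID   string                 `json:"order_id"`
	Timestamp time.Time              `json:"timestamp"`
	Source    string                 `json:"source"` // which consumer wrote the entry
	Note      string                 `json:"note"`
	Data      map[string]interface{} `json:"data,omitempty"` // optional payload
}

// collectionFor derives a dated collection name (one per UTC day, by assumption).
func collectionFor(prefix string, t time.Time) string {
	return prefix + "_" + t.UTC().Format("20060102")
}

// exampleTime is an arbitrary timestamp used only for the demonstration.
var exampleTime = time.Date(2019, time.March, 5, 23, 30, 0, 0, time.UTC)

func main() {
	entry := StateLog{
		OrderID:   "ORD-1",
		Timestamp: exampleTime,
		Source:    "ednc-consumer",
		Note:      "gateway reports capture, refund initiated",
		Data:      map[string]interface{}{"payment_state": "REFUND_INITIATED"},
	}
	doc, err := json.Marshal(entry)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(collectionFor("order_state", entry.Timestamp)) // order_state_20190305
	fmt.Println(string(doc))
}
```

Dated collections keep each day's history small and make it cheap to expire old logs by dropping a whole collection.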
Real Time Monitoring
We use Grafana to monitor the overall health of the system and detect anomalies. These anomalies not only help us detect whether a certain consumer is doing its job, but also tell us about sudden increases in external drops from a payment gateway or an internal transaction system going down. More payment failures mean more failed states detected. This cause and effect view is laid out in a grid of time series graphs, providing a snapshot of the system at any given time.
At redBus, we experiment a lot with new world payment instruments and flows. Failed payments and refunds are as much a problem for us as they are for our customers. But in my experience, there is no single, straightforward answer to a difficult problem. It is almost always a combination of simple and effective answers. This is what we tried to achieve.
Thanks for reading!