Replacing the engine mid-flight — How we rebuilt our card processor
In 2019, we at Qonto took our core banking system (or CBS) in-house. Part of this involved building our own integrations with the various payment systems we support — Mastercard, SEPA, SWIFT and so on.
By and large, we’ve been happy with our core banking system, but as with any system it has its imperfections. One of those is that adding new features related to card payments was more difficult than we’d like, so in late 2020 we undertook a project to rebuild our card processor, which we completed in November last year.
What’s a Card Processor anyway?
When you tap your Qonto card in a store, the merchant sends a message to their bank, who send it to Mastercard, who send it to our connectivity partner, who send it on to us. We need to quickly make a number of checks such as:
- Is your card activated and enabled?
- Do you have enough money in your bank account?
- Is the purchase within your card’s limits?
- Have you blocked this type of transaction from being made on this card?
All of these checks need to be done — and a response sent all the way back to the merchant — in less than a second.
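The checks above can be sketched as a single authorization function. This is an illustrative simplification, not Qonto’s actual code — the types, field names and decline reasons are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Card:
    active: bool
    single_txn_limit: int        # in cents
    blocked_categories: set      # merchant categories the holder has blocked

@dataclass
class Account:
    available_balance: int       # in cents

def authorize(card: Card, account: Account, amount: int, category: str) -> tuple[bool, str]:
    """Run the issuer-side checks for an authorization request.

    Returns (approved, reason). All checks must complete quickly enough
    for the response to reach the merchant within the network's deadline.
    """
    if not card.active:
        return False, "card_inactive"
    if category in card.blocked_categories:
        return False, "category_blocked"
    if amount > card.single_txn_limit:
        return False, "limit_exceeded"
    if amount > account.available_balance:
        return False, "insufficient_funds"
    return True, "approved"
```

A real processor runs many more checks (fraud scoring, PIN/CVC verification, and so on), but the shape is the same: a fast sequence of rules, each able to decline with a specific reason.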
Assuming we approved the transaction, this isn’t the end of the process, however:
- Fuel pumps place an initial block on your account — for, say, €100 — and then send a second message to inform us of the actual amount of fuel you pumped
- You might return the product, in which case the merchant sends a reversal
- Finally, the merchant needs to actually clear the transaction in order to receive funds. This happens as an entirely separate process through batch files.
Even this is a simplification — if you dispute a transaction, there may be several more messages exchanged as we raise a chargeback, the merchant contests it, and so forth.
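One way to picture this lifecycle is as a small state machine. The states and transitions below are a deliberately simplified, hypothetical model — the real message flows are far richer:

```python
# Simplified, hypothetical model of the messages a card transaction
# can receive after the initial authorization.
ALLOWED_NEXT = {
    "authorized":    {"advice", "reversal", "clearing"},  # e.g. fuel pump's final amount
    "advice":        {"reversal", "clearing"},
    "reversal":      set(),                               # funds released, flow ends
    "clearing":      {"chargeback"},                      # settlement; disputes may follow
    "chargeback":    {"representment"},                   # merchant contests the chargeback
    "representment": {"chargeback"},                      # cycles can repeat for months
}

def is_valid_step(current: str, incoming: str) -> bool:
    """Check whether an incoming message type is valid given the
    transaction's current state."""
    return incoming in ALLOWED_NEXT.get(current, set())
```

Modelling the lifecycle explicitly like this makes it easy to reject out-of-order or duplicate messages instead of silently corrupting a transaction’s state.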
Linking it all together
Properly handling the lifecycle of a payment requires us to link each of these messages together: when we receive a clearing message, we need to know whether it relates to an existing transaction or is a wholly new one. Doing this correctly is very important: if we get it wrong, we might debit your account twice, or we might conflate two different transactions with each other, making your statement confusing.
The card networks specify various fields by which messages may be linked together — a process we term Matching — stating things like “the value of this field from the initial authorisation message must be repeated in subsequent messages related to the same transaction”. In theory, you just look at these fields and everything should work — and the good news is that in practice, this works in greater than 99.9% of cases.
The bad news is that 99.9% isn’t good enough when you’re processing hundreds of thousands of transactions a day. Because of this, every issuer — including ourselves — has heuristics to fall back on for payments which don’t follow the letter of the rules.
The Overburdened Ledger
Inside our CBS, there are two services which we jointly consider to be “the ledger”, and which are maintained by our ledger team. One of them, which we call autho, tracks your account’s “available” (or “authorization”) balance, updated in real time when e.g. you tap your card on a payment terminal; the other tracks your account’s “actual” balance, updated when merchants complete and settle transactions.
autho, as initially built, contained logic for every payment system we integrated with. There are a couple of reasons for this:
- When we were first integrating with the partner who provides our connectivity to Mastercard, they advised us that it was imperative we respond to messages quickly. We therefore wished to reduce the number of services in the processing path.
- Just as premature optimisation is the root of all evil, so is premature abstraction: it’s sometimes hard to design the correct interface to support multiple systems until you have the sort of understanding which only comes from building those systems.
- When processing a payment, there are a number of variables — such as your available balance and the usage of your card’s daily and monthly limits — that all need updating atomically. Doing this all within one service and one database is easier than doing it across multiple
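That last point — atomically updating the balance and the limit counters together — is what a single service with a single database makes easy. A minimal sketch, using SQLite purely for illustration (the schema and column names are assumptions):

```python
import sqlite3

def reserve_funds(conn: sqlite3.Connection, account_id: int, card_id: int, amount: int) -> None:
    """Atomically debit the available balance and consume daily limit.

    Both updates happen inside one database transaction: if either check
    fails, everything rolls back and the account is left untouched.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE accounts SET available = available - ? "
            "WHERE id = ? AND available >= ?",
            (amount, account_id, amount))
        if cur.rowcount == 0:
            raise ValueError("insufficient_funds")
        cur = conn.execute(
            "UPDATE cards SET daily_used = daily_used + ? "
            "WHERE id = ? AND daily_used + ? <= daily_limit",
            (amount, card_id, amount))
        if cur.rowcount == 0:
            raise ValueError("daily_limit_exceeded")
```

Achieving the same guarantee across two services and two databases would require distributed-transaction machinery (sagas, two-phase commit, or compensating actions), which is exactly the complexity the original design avoided.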
The card networks are particularly complex payment systems, because of their age, the number of participants, and the number of different payment flows they support. They are also rather unusual in a number of ways — they’re the only common payment system that is both pull based (i.e. a merchant “pulls” money out of your bank account) and offers immediate confirmation of payment. They’re also unusual because they’re a “dual message” system — an instantaneous authorisation request is sent to request confirmation of funds, followed by a later ‘clearing’ message to actually perform settlement.
To deal with this, the ledger needed to contain an awful lot of logic to handle the specificities of card payments (and store an awful lot of associated card specific data). The amount of card-specific logic in the ledger services dwarfs that of any other payment system.
The good news is that in the time since building this initial implementation, we’ve learned both that we have more than enough time to introduce a new service into the flow (the time overhead of doing so is minimal), and how to build the right abstractions to make the ledger truly payment-system agnostic — letting each team work on its own area of expertise, and hence move more quickly.
Before we could begin, we needed to do a bit of preparatory work. Card transactions are not “one shot and done” matters — in the typical case, there’s a one or two day delay between when the merchant authorises the transaction (blocking the funds in your account) and when they settle it (actually collecting those funds). In some cases, merchants have up to 30 days from when they authorise a transaction to when they clear it, and in the case of a transaction which is disputed and goes through multiple chargeback cycles, it can take up to 6 months before reaching a final outcome. We therefore knew that our new system would need to be able to pick up and continue transactions which began under the old one.
The first impediment to this was that the card transaction information was previously stored in the same database (and sometimes the same table) as the information relating to all kinds of transactions. Before we could migrate to a new system, we’d need to separate this out into its own database that our new service could talk to.
The ledger team took up this project, additionally making major refactors to autho’s database schema which improved performance. During this refactor, they avoided making any behavioural modifications at all — in some cases necessitating (temporarily) re-implementing bugs they found in the legacy service — and verified correctness by running all transactions through both the old and new logic simultaneously and cross-checking the results.
This groundwork laid, we were able to begin building our new system — and moving to a world where all card logic and data is self-contained.
Building the new system
Whilst the ledger team were doing the aforementioned refactor, we began working on our new processor. Straight away, there were some improvements we knew we wanted to make:
- As the number of customers using our CBS has grown, so have the ways in which we have seen merchants mess up the matching fields in their messages. With that knowledge, we knew several ways we could improve our transaction matching, and with it our customer experience.
- The reasons given to our support team for why a transaction was declined were not always as clear as we’d like, and we wanted to improve that.
- We simultaneously wanted to be able to share more common code between the various message types, but also be able to tune the behaviour of each more specifically, requiring a generally different structure.
A consequence of these changes is that we would not be bug compatible with the previous system — the structural differences between the two processors would result in a baseline level of differences.
Instead, we would need to test things intensely, and proceed carefully. As a result, our new processor has extensive suites of both unit tests (testing portions individually) and integration tests (validating its behaviour when combined with the real versions of our other services).
We also set up a test environment based upon a copy of the production database, and played a copy of all messages received in production into the new system running in that test environment. Where the outcome differed, we manually compared them: we didn’t expect 100% matches, but we expected the behaviour to be the same in a big enough fraction of cases that the comparison was manageable. To help with this process, we consciously decided to mirror the behaviour of the old processor where doing so was easy, even if we had decided that the existing behaviour was not what we wanted. We could then revisit these pieces after migrating to the new processor, and release each change incrementally.
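The core of this replay harness is simple: run every production message through both processors and keep only the disagreements for a human to review. A minimal sketch, assuming each processor is just a callable that returns a decision:

```python
def compare_outcomes(messages, old_processor, new_processor):
    """Replay messages through both processors and collect the cases
    where their decisions differ, for manual review.

    The processors are any callables taking a message and returning a
    decision; what "decision" means is up to the caller.
    """
    mismatches = []
    for msg in messages:
        old_result = old_processor(msg)
        new_result = new_processor(msg)
        if old_result != new_result:
            mismatches.append((msg, old_result, new_result))
    return mismatches
```

The value of the approach comes from keeping the mismatch list short: if both systems agree on the vast majority of real traffic, every remaining difference is worth a human’s attention.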
Rolling out safely
It’s very rare that a feature rollout goes 100% according to plan. Because of this, we normally perform incremental rollouts using LaunchDarkly.
We’re generally very happy with LaunchDarkly, but didn’t want an external service involved in the flow of such a critical part of our banking system. Instead, we embedded a simpler version of similar logic which decided whether to route the transaction to the old or new system based upon an allow list of explicitly selected cards, and a hash of the card’s primary account number (the 16-digit number printed on your card).
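Such routing logic can be sketched in a few lines. This is an illustrative assumption about the shape of the decision, not the production code — in particular, a real system would hash a tokenised PAN, never the raw card number:

```python
import hashlib

def use_new_processor(pan: str, rollout_percent: int, allow_list: set) -> bool:
    """Decide whether a transaction is routed to the new processor.

    Cards on the allow list always use the new system. Otherwise, a
    stable hash of the PAN assigns each card to a bucket from 0 to 99;
    buckets below the rollout percentage are migrated. Because the hash
    is deterministic, a given card is routed consistently, and raising
    the percentage only ever moves cards from old to new.
    """
    if pan in allow_list:
        return True
    bucket = int(hashlib.sha256(pan.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent
```

Hashing rather than random sampling is the key design choice here: every message for the same card lands on the same processor, which matters when transactions span multiple messages over days or weeks.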
Initially, we deployed our new system to production but with only explicitly listed card tokens enabled. This let us validate our behaviour in production with the real card networks, first using our test cards, and then by enabling the cards used by Qonto employees. Once we were happy with that, we started rolling out to small fractions of our users — at first, a small enough fraction that we could monitor the behaviour of all transactions going through our system in real time. Slowly we rolled out the new processor to a greater and greater percentage of cards, pausing periodically to monitor behaviour, until recently we migrated 100% of customers over.
Despite the magnitude of the change, the rollout went very smoothly. While we had several bugs, they were all minor and generally found either before causing any customer impact, or after affecting only single-digit numbers of customers.
Most of the bugs pertained to corner cases involving transactions started on the old system and continued on the new system — an area which was particularly difficult to test. One bug involved the opposite — transactions authorised on the new system and cleared on the old system (we rolled the two halves of the processor out independently).
Qonto customers shouldn’t notice any changes — card payments should “just work” in all the situations they used to, and work even better in certain corner cases where you might have noticed problems before. Although the new processor makes more database queries, each query is simpler, and the overall result is that our response times have become both faster and more consistent.
During the rollout, we asked our customer support teams to be extra vigilant for odd behaviours involving card transactions that they spotted while responding to queries (reporting to us even cases that they thought were probably merchant issues); an unanticipated side effect of this was uncovering a number of bugs in the old system that we were previously unaware of. In one case, we’d accidentally replicated the bug in the new system too, and so were able to fix it in both.
Because the card logic is now self-contained within its own service owned by the cards team, both we and the ledger team are able to develop and iterate on our respective code bases more rapidly.
Our co-workers on other teams are working on (or have already completed) their own projects to migrate their payment systems entirely towards the ledger’s new generic layer. In the end, all payment-specific logic should be separated entirely (and cleanly) from the ledger.
What worked well
- Our approach to behavioural changes. We constrained them to be small enough in number to understand why our two processors made different decisions on the same message, while still giving us the flexibility to make the improvements we needed to make
- Our incremental rollout. While a handful of Qonto customers encountered issues while we were rolling out our new processor, on the whole we were able to identify and rectify issues
What worked less well
- We had to play “catch up” with some card-related product changes during the course of the project. With closer attention paid to the product roadmap, we could have avoided some re-work by implementing the new behaviour to begin with, rather than re-working areas we had already built
There’s some debate to be had as to whether — when reimplementing a system — one should go for a 100% behavioural match with the old one (including matching bugs), or whether to allow the two systems to differ. We now have experience with both, as the 100% match is the approach the ledger team took when splitting the database within autho.
In general, I think the approach has to be dictated by the goals of your project. Our goals weren’t achievable inside a processor structured for 100% bug compatibility — and so we would have had to accomplish them as a part of a second follow-on refactoring project, necessarily extending an already large project. In light of that, we instead took our approach of allowing behavioural changes, while reducing the risks that could arise from that as much as possible.
Qonto is a finance solution designed for SMEs and freelancers founded in 2016 by Steve Anavi and Alexandre Prot. Since our launch in July 2017, Qonto has made business financing easy for more than 200,000 companies.
Business owners save time thanks to Qonto’s streamlined account set-up, an intuitive day-to-day user experience with unlimited transaction history, accounting exports, and a practical expense management feature.
They have more control, whilst being able to give their teams autonomy via real-time notifications and a user-rights management system.
They have improved visibility on cash-flows through tools such as smart dashboards, transaction auto-tagging, and cash-flow monitoring.
They also enjoy stellar customer support at a fair and transparent price.
Interested in joining a challenging and game-changing company? Consult our job offers!