Switching Databases while Processing Billions of Dollars

Matan Cohen Abravanel
Melio’s R&D blog
Sep 18, 2022


Like many other companies, Melio began as a small start-up with only a few employees.

This changed two and a half years ago when COVID forced small businesses to find alternatives for writing physical checks and delivering them to vendors.

Within the first six months of COVID, Melio started processing about 4,000 times the amount of cash it was previously processing.

This kind of scale has many implications for the company. We are going to look at R&D growth, and more specifically at its effect on our architecture. In such a short time, Melio grew from just a few software engineers to a whopping 120 engineers (out of 600 total employees).

How did we get to the point of needing to switch databases?

From an architectural point of view, Melio started like most start-ups do, with a simple client-server architecture.

A main database on the server side and a client:

Simple server-client architecture

As our R&D grew and our product solved additional use cases, more teams connected to the same main database, adding new entities, new fields to existing entities, and new business logic:

Multiple teams accessing the same database

Multiple teams depending on the same database had some unfortunate implications on our development process. We experienced issues with:

  • Domain encapsulation and ownership: Any team could edit, update, and read data without the domain-owning team being aware of it. At times it was unclear who owned which table, so domain boundaries became blurred. A team that is unaware of changes to its own domain entities faces a data integrity issue.
  • Coupled model: Teams depending on a specific database entity structure made database entity changes more difficult and sometimes even impossible as other teams’ logic depended on it.
  • A maximum number of concurrent writes: Because Melio’s main database is MySQL, there is a hard limit on the number of writes per second.
  • The database connection limit is shared across teams: A team using many connections could result in the starvation of another team’s process.
  • Scaling headcount: It’s hard to scale headcount when everyone is “stepping on each other’s toes”. A single database migration can affect an unknown number of developers and services.
  • Development environments: Because all services communicate through the same database, there is no easy way to test a single service locally without mounting all of the other services. This slows development, since every developer has to be aware of the implementation details of all locally mounted services, most of which fall outside their own domain.

The first team that decided to move out of the main database was my team — the payments processing team. To understand the implemented changes, you first need to understand how payments are processed.

How are payments processed?

Payments are created directly in the main database by our front-end clients’ aggregate service, with a “future” scheduled processing date.

When a payment’s scheduled date arrives, the processor fetches it by querying for “scheduled” payments by date. Processing takes 1–3 days, during which the payments processor writes “transactions” that indicate the different financial processing steps (authorized, cleared, etc.).

High-level architecture
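
As a rough illustration (not Melio’s actual code), the fetch step boils down to querying for payments in a “scheduled” status whose date has arrived. The table and field names below are assumptions:

// Hypothetical sketch of the processor's fetch step.
interface ScheduledPayment {
  id: string;
  status: 'scheduled' | 'processing' | 'completed';
  scheduledDate: Date;
}
async function fetchPaymentsDue(
  db: { query<T>(sql: string, params: unknown[]): Promise<T[]> },
): Promise<ScheduledPayment[]> {
  // Pick up every payment whose scheduled processing date has arrived
  // and that has not started processing yet.
  return db.query<ScheduledPayment>(
    'SELECT id, status, scheduledDate FROM payments WHERE status = ? AND scheduledDate <= ?',
    ['scheduled', new Date()],
  );
}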

Each payment has two transactions, each with its own life cycle. The first transaction debits the payor and moves the funds into Melio’s bank account, and the second transaction credits the payee from Melio’s bank account.

Payors using Melio can pay however they like (via ACH, credit card, etc.), and payees can receive the funds however they choose (via check, ACH, wire, single-use card, etc.).

The important thing to understand is that we had two entities, payments and payment transactions, that needed to be migrated to a new database with a new entity structure.
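
Sketched in TypeScript, those two entities looked roughly like this; the field names are illustrative, not the real schema:

// Illustrative shapes only; the actual new entity structure is more detailed.
type TransactionDirection = 'debit' | 'credit'; // debit the payor, credit the payee
type TransactionStatus = 'created' | 'authorized' | 'cleared' | 'failed';
interface PaymentTransaction {
  id: string;
  paymentId: string;
  direction: TransactionDirection; // each payment has one debit and one credit transaction
  deliveryMethod: 'ach' | 'credit-card' | 'check' | 'wire' | 'single-use-card';
  status: TransactionStatus; // each transaction has its own life cycle
}
interface Payment {
  id: string;
  scheduledDate: Date;
  status: 'scheduled' | 'processing' | 'completed';
  transactions: PaymentTransaction[]; // one payor-side debit, one payee-side credit
}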

Migration requirements:

  • Decouple the payment processor from the old database: The payment processor should only read and write to the new database.
  • Write and process payments using a new entity structure that better suits business needs: More about the new structure in this previous blog post.
  • Front-end client data must be accessible through the old database: We had to sync the old database with the new one.
  • Zero downtime: Our end-users and front-end clients cannot be affected by the database migration.
  • Gradual rollout: Start with handpicked payments, then specific payment methods, then growing percentages of the full volume.
  • Always have the ability to roll back.

Let’s go!

The first thing was adding a pre-processing step that enqueues payments for a separate process, which writes them to the “new” database.

Translating old to new entities and writing them to the new database
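
A minimal sketch of that step, with a generic queue client and hypothetical read/translate/write helpers injected as dependencies:

// Hypothetical sketch: enqueue due payments so a separate consumer can translate
// them into the new entity structure and write them to the new database.
interface QueueClient {
  send(messageBody: string): Promise<void>;
}
async function enqueueDuePayments(queue: QueueClient, payments: Array<{ id: string }>): Promise<void> {
  for (const payment of payments) {
    await queue.send(JSON.stringify({ paymentId: payment.id }));
  }
}
// Consumer side: read from the old database, translate, and write to the new one.
async function handleEnqueuedPayment(
  message: { paymentId: string },
  deps: {
    readFromOldDatabase(paymentId: string): Promise<unknown>;
    translateToNewStructure(oldPayment: unknown): unknown;
    writeToNewDatabase(newEntities: unknown): Promise<void>;
  },
): Promise<void> {
  const oldPayment = await deps.readFromOldDatabase(message.paymentId);
  const newEntities = deps.translateToNewStructure(oldPayment);
  await deps.writeToNewDatabase(newEntities);
}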

Our front-end clients unfortunately still needed the old database to be updated with payment and transaction history in order to present it to Melio’s end users.

We essentially needed two databases synced with the same data. Simply adding another writer to the new database could cause data inconsistencies in cases of write failures.

oldDatabase.updatePayment(payment.id, paymentUpdate);
newDatabase.updatePayment(payment.id, paymentUpdate); // this throwing would cause inconsistencies

It would be ideal to do something like:

oldDatabase.createTransaction(transaction => {
  newDatabase.updatePayment(payment.id, paymentUpdate, transaction);
});

But in reality, it is not possible. There is no native way of creating a database transaction between two different database instances.

A common pattern for cross-database transactions is a “saga”. We preferred to keep our flow simple, without possible rollbacks. We thus decided to treat our new database as the “source of truth” and the old database as simply a “cache”.

The solution was to develop a “sync” service that runs on each payment update, reads from the “new” database, and translates it to the “old” database:

Syncing the new database to the old database for front-end clients to consume

The sync service is triggered by payment status change events. It does not read the event content; instead, it reads the current state of the payment and its transactions from the “new” database and translates them into the “old” database. If the sync wrote to the old database using the event content, events would have to be processed in exactly the order they were emitted, and a single failed event would block the following events from being processed, which is something we wanted to avoid.
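
A rough sketch of that flow, again with hypothetical helpers injected as dependencies:

// Hypothetical sketch of the sync service handler.
interface PaymentStatusChangedEvent {
  paymentId: string; // the event only tells us *which* payment changed
}
interface SyncDeps {
  readFromNewDatabase(paymentId: string): Promise<unknown>; // current state, the source of truth
  translateToOldStructure(payment: unknown): unknown; // new entities -> old tables' shape
  upsertIntoOldDatabase(legacyRecords: unknown): Promise<void>; // overwrite the "cache"
}
async function syncPaymentToOldDatabase(event: PaymentStatusChangedEvent, deps: SyncDeps): Promise<void> {
  // Ignore the event's content and re-read the current state, so events can be
  // processed in any order and retried independently.
  const payment = await deps.readFromNewDatabase(event.paymentId);
  const legacyRecords = deps.translateToOldStructure(payment);
  await deps.upsertIntoOldDatabase(legacyRecords);
}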

Instead of continuing to update the old database, we could have shifted front-end clients to APIs exposing payments and their transactions. However, that requires changes on the clients’ side, both functional and product-wise. It would also affect clients’ request latency (a direct database query vs. an HTTP request), so we preferred to deal with those issues separately.

Introducing this change on a large volume of payments at once is risky, which is why we added the ability to roll it out gradually, slowly moving to write all payments on the new database.

Our mission was to first process specific payments by IDs using the new database, then move on to process specific payment methods, and then release percentages of the full volume.

First selecting specific payments to be processed in the “new” way
Then gradually opening percentage rollout for specific payment methods

While rolling out it was important to make sure that the new processing logic was correct and that our new infrastructure and database could handle the scale.

LaunchDarkly is our tool for gradually releasing new features, and it has some handy capabilities. For example, decisions can be based on multiple parameters (in our case, payment method type and a percentage of payments). It also guarantees that once a decision is made, the result is consistent across all places in the system evaluating the same feature toggle.

Feature toggle enables us to roll out the new database gradually
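
For illustration, a check against the LaunchDarkly Node server SDK could look roughly like this; the flag key, attribute names, and environment variable are made up:

import * as LaunchDarkly from 'launchdarkly-node-server-sdk';
const ldClient = LaunchDarkly.init(process.env.LAUNCH_DARKLY_SDK_KEY ?? '');
// Decide whether a given payment should be processed against the new database.
// Keying on the payment id keeps the decision consistent everywhere the same
// toggle is evaluated, while the flag's targeting rules combine the payment
// method type with a percentage rollout.
async function shouldUseNewDatabase(payment: { id: string; deliveryMethod: string }): Promise<boolean> {
  await ldClient.waitForInitialization();
  return ldClient.variation(
    'process-payments-on-new-database', // hypothetical flag key
    { key: payment.id, custom: { deliveryMethod: payment.deliveryMethod } },
    false, // default: keep using the old database
  );
}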

After opening 100% of payments to be processed using the new database and entity structure, we started migrating historical payments and transactions from the old database to the new one.

Clients are being migrated to use dedicated APIs to stop reading from the old database.

Eventually, the old tables would be dropped.

Taking a look back, here are some things I’ve learned and would do differently:

  • Treating one database as a cache and the other as the “source of truth” is a valid decision, but in that case the “cache” database should be overwritten from the source of truth rather than only appended to with new data. Trying to update a “cache” that cannot be overwritten is hard, especially once the cache has been updated with invalid data.
  • Payment update events were published separately from the actual database writes. This caused data inconsistencies for consumers like the sync service. I will elaborate on how we are tackling this issue, and event ordering in general, using CDC, Debezium, and the outbox pattern in another blog post.
  • When releasing changes that affect many areas of the code, some feature toggles will affect others. There is a built-in feature in LaunchDarkly for defining dependencies between feature toggles. This is something we should have used earlier but did not know existed.
  • When defining a new entity structure in a new database, it should be as close to perfect as possible; later migrations are hard. Do not postpone tough entity-model decisions.

Thanks for stopping by. I hope you enjoyed the read!

Feel free to comment below, and don’t forget to push the “clap” button.

Visit our career website
