Invisibly Migrating Billions of Database Records

Ryan Baker · Published in Singularity
May 10, 2023

How we upgraded a database with well over a billion records without downtime.

We handle a lot of requests, and even more data that we have to keep up to date. On top of all that, we were running a very old primary database. We recently upgraded it, and the upgrade involved several changes at once, so we wanted to be sure we could carry it out with no negative customer impact. In this article, I will go over the methods we used at Singularity Energy to achieve this and the lessons we learned along the way.

Live shot of the Singularity team happily watching our database upgrade.

Reasons to upgrade

We had a lot of reasons to upgrade our primary database, but for us it came down to two:

  1. It was over 3 years old and was running a version of PostgreSQL that would soon no longer be supported.
  2. It was originally the MVP database and was missing security and redundancy parameters.

Since we use AWS’ Relational Database Service (RDS), we get a lot of upgrading support for free. Out of the box, they offer minor version upgrades with configurable maintenance and backup windows. We were upgrading from PostgreSQL 10 to 14 (the latest version offered by RDS at the time of the upgrade). Based on the AWS documentation, we needed to upgrade the minor version to be able to upgrade to the latest 14.x version. Luckily, it’s simple to upgrade the minor version without downtime in AWS. Once we did that, we were ready to upgrade the major version.
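Since our instances are managed in RDS, the minor bump is a single API call (or a couple of clicks in the console). A minimal boto3 sketch of what that looks like, with a placeholder instance identifier and target version rather than what we literally ran:

import boto3

rds = boto3.client('rds')

# Bump the minor version first (required before the jump to 14.x).
# ApplyImmediately=False defers the change to the next maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier='primary-db',   # placeholder identifier
    EngineVersion='10.21',               # placeholder target minor version
    ApplyImmediately=False,
)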

Steps to upgrade

To make sure there would be no data loss, the upgrade consisted of four parts:

  1. Dual write
  2. Create a secondary DB from the snapshot
  3. Upgrade secondary DB
  4. Cutover

Steps #2 and #3 meant that the secondary database (the one we were upgrading) would be unreachable for a while. It was therefore important to implement step #1 in a way that could wait until the upgrade was complete before writing to the new database.

Additionally, we wanted step #4 to require as little code change as possible, to minimize the chance of a failure caused by a coding error.
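One detail worth calling out for step #1: whatever captures the writes has to hold on to them for longer than the whole restore-and-upgrade window. We used SQS, where that is just a queue attribute (retention maxes out at 14 days). A sketch of creating such a queue, using the same queue name as the code further down:

import boto3

sqs = boto3.client('sqs')

# Keep captured writes for the maximum 14 days so they survive the whole
# snapshot / restore / upgrade window before a worker drains them.
sqs.create_queue(
    QueueName='dual-write-db',
    Attributes={'MessageRetentionPeriod': str(14 * 24 * 60 * 60)},  # 1,209,600 seconds
)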

Dual writing

When we cut over to using the new database, we wanted to make sure that it had all the data from the old database. For this reason, we took a snapshot of the old database and created a new one from the snapshot.

However, what happens in the time between taking the snapshot and creating a new database from the snapshot? Our system processes millions of events per day, and we can’t just stop it because we want to upgrade our database.

The solution here is to “dual write” the data to both databases. There is a catch, though. Since we wanted to dual write before the new database existed, we needed to capture what we would have written to it. For our architecture, it seemed fitting to create a queue that would hold copies of everything written to the old database, with a worker that would later replay that data into the new database to catch it up.

When the API wrote to the database, it also needed to send a copy to the queue. The code for that looked like this:

from contextlib import contextmanager
import json

import boto3


@contextmanager
def get_sqs_client():
    sqs_client = boto3.client('sqs')
    try:
        yield sqs_client
    finally:
        sqs_client.close()


def get_queue_url(sqs_client, queue_name):
    queue_url = sqs_client.get_queue_url(QueueName=queue_name)['QueueUrl']
    return queue_url


def send_dual_write_msg(operation: str, table: str, data: dict):
    with get_sqs_client() as sqs:
        sqs.send_message(
            QueueUrl=get_queue_url(sqs_client=sqs, queue_name='dual-write-db'),
            MessageBody=json.dumps({
                'operation': operation,  # insert, update, or delete
                'table': table,
                'data': data,
            })
        )

Then using that code in a model class of ours looked like this:


class CRUDMixin(object):
    """Mixin that adds convenience methods for CRUD operations.

    All of our SQLAlchemy model classes inherit from this."""

    def _add_to_session_and_commit(self):
        # get_session() is our application's session-factory context manager
        with get_session() as session:
            session.add(self)
            session.commit()
            session.refresh(self)
            return self

    @classmethod
    def create(cls, **kwargs):
        """Create a new record and save it to the database."""
        instance = cls(**kwargs)
        # all of our SQLAlchemy classes have a __tablename__
        # class variable declared
        send_dual_write_msg('insert', cls.__tablename__, kwargs)
        instance._add_to_session_and_commit()
        return instance

    ...
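The other half of the pipeline is the worker that drains the queue into the new database once it is reachable. Ours is tied into our own infrastructure, but a minimal sketch of the idea, assuming the message format above and a hypothetical new_db_engine pointed at the new database, looks something like this:

import json

import boto3
import sqlalchemy as sa

# Assumption: this engine points at the new (upgraded) database.
new_db_engine = sa.create_engine('postgresql://user:pass@new-db-host/app')
metadata = sa.MetaData()


def apply_dual_write_msg(body: dict):
    """Replay one captured write against the new database."""
    table = sa.Table(body['table'], metadata, autoload_with=new_db_engine)
    with new_db_engine.begin() as conn:
        if body['operation'] == 'insert':
            conn.execute(table.insert().values(**body['data']))
        # update and delete handling omitted for brevity


def drain_queue(queue_url: str):
    sqs = boto3.client('sqs')
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get('Messages', []):
            apply_dual_write_msg(json.loads(msg['Body']))
            # delete only after the write has been applied successfully
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])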

Restore from snapshot & upgrade the database

To copy the majority of the data, we created a snapshot of our database in AWS. After creating the snapshot, AWS made it pretty simple to encrypt it (in case the original database was not encrypted) and then create a new database from that snapshot.
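In API terms, the snapshot → encrypted copy → restore sequence looks roughly like the boto3 sketch below. The identifiers and KMS key are placeholders, and each waiter simply blocks until the previous step has finished:

import boto3

rds = boto3.client('rds')

# 1. Snapshot the old database.
rds.create_db_snapshot(
    DBInstanceIdentifier='primary-db',
    DBSnapshotIdentifier='primary-pre-upgrade',
)
rds.get_waiter('db_snapshot_available').wait(
    DBSnapshotIdentifier='primary-pre-upgrade')

# 2. Copy the snapshot with encryption (needed if the source wasn't encrypted).
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier='primary-pre-upgrade',
    TargetDBSnapshotIdentifier='primary-pre-upgrade-encrypted',
    KmsKeyId='alias/aws/rds',
)
rds.get_waiter('db_snapshot_available').wait(
    DBSnapshotIdentifier='primary-pre-upgrade-encrypted')

# 3. Restore the secondary database from the encrypted snapshot.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier='secondary-db',
    DBSnapshotIdentifier='primary-pre-upgrade-encrypted',
)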

Luckily for us, we weren’t using any features that had been deprecated or removed between the major versions of PostgreSQL, so we simply followed the AWS guide on upgrading the major version once the restored database was live.
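The major version jump on the restored secondary is then a single modify_db_instance call. A sketch with placeholder values, safe only because the secondary isn’t serving any traffic yet:

import boto3

rds = boto3.client('rds')

rds.modify_db_instance(
    DBInstanceIdentifier='secondary-db',   # placeholder identifier
    EngineVersion='14.6',                  # placeholder: latest 14.x offered at the time
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)
rds.get_waiter('db_instance_available').wait(DBInstanceIdentifier='secondary-db')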

Cutover

This was the most delicate step in the process. Before cutting traffic over to the new database, we had to make sure there would be no interruption or data loss, so we carefully combed through the records to confirm the data matched between the two databases.
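“Combing through” can be as simple as comparing row counts table by table (plus spot-checking recently written rows). A rough sketch of that check, with placeholder connection URLs:

import sqlalchemy as sa

old_engine = sa.create_engine('postgresql://user:pass@old-db-host/app')  # placeholder
new_engine = sa.create_engine('postgresql://user:pass@new-db-host/app')  # placeholder


def compare_row_counts(table_names):
    """Report any tables whose row counts differ between the two databases."""
    for table in table_names:
        query = sa.text(f'SELECT COUNT(*) FROM {table}')
        with old_engine.connect() as old_conn, new_engine.connect() as new_conn:
            old_count = old_conn.execute(query).scalar()
            new_count = new_conn.execute(query).scalar()
        if old_count != new_count:
            print(f'{table}: old={old_count} new={new_count}')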

The next step was making sure the cutover itself was invisible. To do that, we simply changed the config variable holding the database URL, started new tasks that pointed at the new database, and slowly rotated them into use while rotating the old ones out.
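Because the application builds its engine from a single config value, the cutover is a config change plus a task rotation rather than a code change. A simplified sketch of that pattern (names are illustrative):

import os

import sqlalchemy as sa

# The only thing that changes at cutover is this config value: new tasks
# come up reading the new URL, old tasks keep the old one until rotated out.
DATABASE_URL = os.environ['DATABASE_URL']

engine = sa.create_engine(DATABASE_URL)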

We now had the reverse of the earlier setup: all traffic went to the new database, and the dual-write worker wrote to the old one.

We kept it that way for the next few days in case we needed to switch back. Luckily, nothing bad happened and we were able to use the new DB without issue!

Conclusion

While nerve-wracking, the upgrade went off without a hitch! This blog post makes it sound fairly simple, but I can’t advocate enough for having a documented plan. Share that plan with the team, and ask at least one other team member to be on duty to help troubleshoot if something goes wrong.

Interested in problems like these? Thoughts on how we could do it better? Reach out! We’d love to hear from you 🌱
