Major Upgrade without Downtime 🚧

How we upgraded our PostgreSQL Database (RDS) with AWS DMS without interruptions in production

J B Man
Axel Springer Tech
7 min read · Nov 25, 2022


An AI-generated oil painting of a mechanic on a ladder working on a large machine. Created with https://openai.com/dall-e-2/

In this post, I will share my experience with the AWS Database Migration Service (DMS): what our problem was, how I solved it, and, of course, the pitfalls along the way.

The Problem …

Let’s start at the beginning. We are developing and maintaining a Content Management System (CMS) for editorial purposes. With our application you can create or update articles, videos, and other types of documents. You can also publish or unpublish these documents.

A major database upgrade can be like changing a tire while driving

One of our main customers is the online edition of Germany’s highest-circulation newspaper, which is one of the ten most visited websites in Germany. It is extremely important for a newsroom to be able to publish news around the clock. Therefore, a major upgrade could not involve any CMS downtime; an in-place upgrade would have meant roughly 30 minutes of it.

In detail, I had to upgrade four Relational Database Service (RDS) instances on Amazon Web Services (AWS). For our microservice architecture we use, among other things, four PostgreSQL RDS databases. These were running version 10.18, and AWS deprecated version 10 on November 10, 2022, meaning it would not receive any security patches after that date. My approach was to look at which versions are available in AWS for our database and to test the latest ones against our services. The result: version 13.4 worked without any changes to the services.

… and the Solution 💡

Click the Upgrade Button

It should also be mentioned that I use Terraform/Terragrunt as the tooling for our infrastructure as code. Normally, upgrading an RDS instance is done with a few clicks, or by changing a variable in the .tfvars file and running terraform apply.
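
As a rough sketch of that normal path (identifiers, sizes, and credentials below are placeholders, not our actual configuration), the upgrade boils down to bumping the engine version:

variable "engine_version" {
  description = "PostgreSQL engine version, bumped via the .tfvars file for an upgrade"
  type        = string
  default     = "10.18" # set to "13.4" in terraform.tfvars to trigger the upgrade
}

resource "aws_db_instance" "cms" {
  identifier                  = "cms-db"       # placeholder
  engine                      = "postgres"
  engine_version              = var.engine_version
  instance_class              = "db.t3.medium" # placeholder
  allocated_storage           = 50
  username                    = "cms"
  password                    = "change-me"    # use a secrets manager in real setups
  allow_major_version_upgrade = true           # required for a 10.x to 13.x jump
  apply_immediately           = true           # otherwise the change waits for the maintenance window
}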

That is also how I did it in our non-production environments. While an RDS instance is being upgraded on AWS, it is unavailable for a period of 10 to 30 minutes. Ideally, an application can still be used to a limited extent even while a database is briefly unavailable. For us, that was true for two of the four services that use the corresponding four databases. During the upgrade of the other two databases, our application, the CMS, was no longer usable.

Our CMS during the upgrade of a database: the content could not be loaded and users could not work. Luckily, that was just staging.

❗️So my tip is here:

“Test what happens to your application during restarts of different databases”

How AWS Database Migration Service Works

AWS DMS is a service that copies data from one database (the source) to another database (the target). If required, the data is continuously replicated from the source to the target. There are many different use cases that can be solved with this tool. You can find an overview in the documentation here:

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html

simple diagram of how we used AWS DMS

We used DMS to switch from a PostgreSQL version 10.18 RDS to a PostgreSQL version 13.4 RDS without downtime for our CMS.

We also enabled encryption at rest at the same time; this, too, is not possible in AWS without downtime for an existing instance. After the data was copied successfully, all subsequent write operations were mirrored to the target database. Thus, the data in the source and target databases remained identical.

Database migration tasks showing a completed full load and ongoing replication
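
For orientation, a stripped-down sketch of the core DMS resources in Terraform could look like this. All names, hostnames, and credentials are placeholders; the module linked in the next section creates the full set of resources:

resource "aws_dms_replication_instance" "this" {
  replication_instance_id    = "cms-dms"       # placeholder
  replication_instance_class = "dms.t3.medium"
  allocated_storage          = 50
}

resource "aws_dms_endpoint" "source" {
  endpoint_id   = "cms-source"
  endpoint_type = "source"
  engine_name   = "postgres"
  server_name   = "old-db.example.com"         # PostgreSQL 10.18 RDS (placeholder)
  port          = 5432
  database_name = "cms"
  username      = "dms_user"
  password      = "change-me"
}

resource "aws_dms_endpoint" "target" {
  endpoint_id   = "cms-target"
  endpoint_type = "target"
  engine_name   = "postgres"
  server_name   = "new-db.example.com"         # PostgreSQL 13.4 RDS, encrypted (placeholder)
  port          = 5432
  database_name = "cms"
  username      = "dms_user"
  password      = "change-me"
}

resource "aws_dms_replication_task" "this" {
  replication_task_id      = "cms-migration"
  migration_type           = "full-load-and-cdc" # full copy first, then ongoing replication
  replication_instance_arn = aws_dms_replication_instance.this.replication_instance_arn
  source_endpoint_arn      = aws_dms_endpoint.source.endpoint_arn
  target_endpoint_arn      = aws_dms_endpoint.target.endpoint_arn
  table_mappings = jsonencode({
    rules = [{
      "rule-type"      = "selection"
      "rule-id"        = "1"
      "rule-name"      = "include-all"
      "object-locator" = { "schema-name" = "%", "table-name" = "%" }
      "rule-action"    = "include"
    }]
  })
}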

The Module as Code

To create the AWS resources, I wrote the following Terraform module. You can copy and adapt it for your own purposes. It creates all the necessary AWS resources shown in the diagram above.

→ https://github.com/jbuettnerbild/dms

It creates everything you need. However, you still have to change two values in the parameter group of the source DB yourself (see the Terraform sketch further below):

  1. Set the rds.logical_replication parameter in your DB CLUSTER parameter group to 1
  2. Set the wal_sender_timeout parameter in your DB CLUSTER parameter group to 0

You can find out more about this in the AWS documentation under “To enable logical replication for an RDS PostgreSQL DB instance”

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html

Example of the parameter group with the needed values for the data migration service
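
If the parameter group is also managed with Terraform, the two changes could look roughly like this (a sketch for a non-Aurora RDS instance; the group name and family are placeholders):

resource "aws_db_parameter_group" "source" {
  name   = "cms-source-pg10" # placeholder
  family = "postgres10"

  # Static parameter: only takes effect after the reboot described in the next section
  parameter {
    name         = "rds.logical_replication"
    value        = "1"
    apply_method = "pending-reboot"
  }

  # Disable the WAL sender timeout so long-running replication connections are not cut off
  parameter {
    name         = "wal_sender_timeout"
    value        = "0"
    apply_method = "pending-reboot"
  }
}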

🚨 A Reboot is Necessary

Unfortunately, these changes are only applied by rebooting the DB. For us, this took about 15 seconds. Since two of our services no longer worked properly during the reboot, we performed it automatically at night using AWS Systems Manager. You can find instructions for this at:

https://aws.amazon.com/de/blogs/database/schedule-amazon-rds-stop-and-start-using-aws-systems-manager/

scheduled reboot at night with AWS Systems Manager

After the reboot, the migration task can be started (manually in our case). It only took a few minutes until all the data had been migrated from source to target. This procedure had to be done one by one for all four PostgreSQL databases.

Here you can see a screenshot of the manually started migration task for one of our databases. The table statistics give you an overview of the migration progress. Once the Database Migration Service is done, the load state changes to “Table completed”.

Table statistics of one migration task: migrating all tables of a 30 GB DB took 8m 45s.

What Must Be Done Afterwards

It is essential to create the Hibernate sequence in the target DB and set its value slightly higher than on the source DB. The larger value is necessary to leave enough buffer for writes that still go to the source DB during the switchover (a sketch follows below). After that, the application has to be pointed at the new DB. Since we deploy with rolling updates in Kubernetes, the switch was non-disruptive. Finally, the migration task must be stopped and, of course, the AWS resources that are no longer required, such as the source DB and all DMS resources, should be cleaned up.
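
As a loose sketch of the sequence step, again in Terraform (hibernate_sequence is Hibernate’s default sequence name; the host and the start value are placeholders, and the start value must sit comfortably above the current value on the source DB):

# Assumes psql is installed where Terraform runs and the target DB is reachable
resource "null_resource" "hibernate_sequence" {
  provisioner "local-exec" {
    command = <<-EOT
      psql "host=new-db.example.com dbname=cms user=cms" -c \
        "CREATE SEQUENCE IF NOT EXISTS hibernate_sequence START WITH 1000000 INCREMENT BY 1;"
    EOT
  }
}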

The Devil is in the Details 👻

Tricky Stuff with DMS

AWS DMS doesn’t migrate secondary indexes, sequences, default values, stored procedures, triggers, synonyms, views, and other schema objects that aren’t specifically related to data migration. As already mentioned, I had to create the Hibernate sequences myself. Indexes and secondary keys were initially created by our application. Note, however, that creating an index puts load on the database and takes time. You should therefore do this before switching to the target DB to enable a smooth transition to the new database. AWS offers another tool, the Schema Conversion Tool, to copy indexes and sequences, but I haven’t tried it.

Source DB != Target DB

We had another problem with large objects (LOBs) in one table. They could not be copied because the column in the target DB was set to “Not Nullable” (as in the source DB) rather than “Nullable”, which DMS needs in order to write the LOB data. The solution was to manually change the column in the target DB to nullable, run the migration, and change it back to the desired state afterwards.

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.AssessmentReport1.html

Transaction Logs Explosion

During testing in our non-production environment, I paralyzed a database: I rebooted it after the parameter group changes (see the parameter group example above) but did not activate the migration task afterwards. Over the weekend, 40 GB of transaction logs were written until the disk was full. So once the rds.logical_replication parameter is set to 1 and has been applied by a restart of the DB, the migration task should always be started; otherwise the transaction logs keep piling up. Here you can find a solution for cleaning up the logs:

https://aws.amazon.com/de/premiumsupport/knowledge-center/diskfull-error-rds-postgresql/
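
A cheap safety net against exactly this scenario is a CloudWatch alarm on the free storage of the source DB. A minimal sketch (instance identifier, threshold, and SNS topic are placeholders):

resource "aws_cloudwatch_metric_alarm" "rds_free_storage" {
  alarm_name          = "cms-source-db-free-storage"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = "cms-db" } # placeholder
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  comparison_operator = "LessThanThreshold"
  threshold           = 10 * 1024 * 1024 * 1024 # alarm when less than 10 GB is free (bytes)
  alarm_description   = "Source DB is running out of disk space, e.g. due to piling-up transaction logs"
  alarm_actions       = ["arn:aws:sns:eu-central-1:123456789012:alerts"] # placeholder SNS topic
}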

Transaction logs grew beyond 50 GB and free disk space went to zero over the weekend, which stopped the database.

Conclusion

Thank you for reading about my journey to perform a major upgrade of our PostgreSQL databases from v10 to v13. With the help of the Database Migration Service from AWS, we were able to upgrade our production databases without downtime. However, it was not without problems. What bothered us the most was that no complete copy of the database was created; in particular, the Hibernate sequence, which the application needs in order to continue writing where it left off on the old database, had to be recreated by hand. There are approaches to improve this, for example by importing a snapshot of the source DB first, or by using another AWS tool, the Schema Conversion Tool. Overall, the upgrade took quite some time. A fully managed service that offers zero-downtime upgrades, as MongoDB does with Atlas, would be desirable.

Connect with Jonas B:

Jonas B on GitHub
