Migrating databases from Heroku to AWS RDS

Adam Medziński
Published in Fresha Engineering
Sep 23, 2019 · 7 min read

This is the second post in our series about the challenges we encountered during the migration from Heroku to AWS. In this part, we focus on the migration of the most valuable thing we have — our users’ data.

You can read our previous post here:

It explains why we decided to migrate to AWS and how we moved our applications from Heroku dynos to Kubernetes containers.

Preparing for migration

We use PostgreSQL to store our data and run several database servers, the largest of which was over 500 GB on migration day. From the very beginning of the preparation process, the whole Fresha Engineering team was committed to minimizing the impact on our clients. We expected downtime to be inevitable in some cases, but this did not deter us from exploring alternative scenarios.

Photo by Stephen Dawson on Unsplash

Comparing Heroku and RDS instances

Comparing Heroku PostgreSQL instances with their AWS RDS counterparts is a hard task, because we do not have access to all the parameters set by the providers. Still, we had to find an RDS parameter and hardware configuration that would not cause a noticeable decrease in website performance.

Generally, when it comes to resources, databases on Heroku can be migrated one-to-one to databases on RDS (remember that Heroku also runs on AWS), so any differences in performance come down to the fact that:

  • Heroku uses asynchronous replication, while RDS with Multi-AZ uses synchronous replication, so write times on RDS may be higher
  • Heroku ships different base parameters than the AWS defaults (Heroku is more tuned, AWS more of a “safe default”)

Unfortunately, we did not have the infrastructure to test a new database setup against a copy of the production traffic (mirroring/shadowing), and we did not have enough time to build it. The whole process would also have been extremely complicated, because Heroku does not allow replication to extend beyond its infrastructure, so we could not prepare a faithful copy of the production database for testing without downtime. Ultimately, we decided to compare database performance with synthetic tests using pgbench, and to copy the parameters used on Heroku (limited to the parameters we could actually replicate).

We performed two rounds of benchmarking. First we used the scenarios provided by the pgbench tool itself. They showed no significant performance difference between Heroku and RDS. Unfortunately, these tests could not tell us whether the RDS databases would perform the same under our production workload.

Read performance between various configurations (the higher the better).
Write performance between various configurations (the higher the better).
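The first round used pgbench’s built-in scenarios. A minimal sketch of such a run is below; the hostname, user, database name, scale factor, and client counts are placeholders, not our actual values:

```shell
# Populate the pgbench tables; -s sets the scale factor (size of pgbench_accounts).
pgbench -i -s 1000 -h example-db.abc123.eu-west-1.rds.amazonaws.com -U benchmark postgres

# Built-in TPC-B-like mixed read/write scenario: 16 clients, 4 threads, 10 minutes.
pgbench -c 16 -j 4 -T 600 -h example-db.abc123.eu-west-1.rds.amazonaws.com -U benchmark postgres

# Read-only variant using the built-in SELECT-only script (-S).
pgbench -c 16 -j 4 -T 600 -S -h example-db.abc123.eu-west-1.rds.amazonaws.com -U benchmark postgres
```

Running the same commands against both the Heroku and the RDS endpoints gives directly comparable transactions-per-second numbers.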

In the next iteration, we prepared scenarios based on the most popular queries run against our production database, and we tested them on RDS instances with both Heroku-like and default parameters.

Average transactions per second for the default RDS configuration and the Heroku-like one (the higher the better).
Average latencies for the default RDS configuration and the Heroku-like one (the lower the better).
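Custom production-like scenarios can be fed to pgbench as script files. A hedged sketch, where the query, table, and file names are invented for illustration:

```shell
# hot_queries.sql holds parameterized versions of frequent production queries, e.g.:
#   \set id random(1, 1000000)
#   SELECT * FROM bookings WHERE customer_id = :id;
# pgbench substitutes a fresh random :id on every transaction.
pgbench -c 16 -j 4 -T 600 -f hot_queries.sql \
  -h example-db.abc123.eu-west-1.rds.amazonaws.com -U benchmark app_db
```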

The tests showed that, for comparable hardware configurations, the Heroku-like parameters offered 55% faster reads than the RDS defaults in our production-like scenarios, while the pgbench-provided scenarios showed no difference between the Heroku-like and default parameters. Hence another conclusion: always base performance tests on your production workloads rather than on synthetic benchmarks, or you will draw the wrong conclusions.

When it comes to write speeds, none of the tests showed a significant impact from the parameter changes. The main factor was enabling Multi-AZ in the RDS configuration. This caused a significant decrease in write speeds, but that is expected: with Multi-AZ, RDS replicates synchronously to the standby, which significantly reduces the chance of data diverging between the current master and the failover instance.
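For reference, Multi-AZ is a per-instance RDS setting. Assuming the AWS CLI and a hypothetical instance identifier, toggling it on looks roughly like:

```shell
# Convert an existing instance to Multi-AZ (synchronous standby in another AZ).
aws rds modify-db-instance \
  --db-instance-identifier example-db \
  --multi-az \
  --apply-immediately
```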

Minimizing impact on clients

  • Choosing the right time

Traffic levels on most websites are not constant throughout the week, and it is easy to find the period when traffic is lowest. Migrating at that time is the easiest way to reduce disruption to your clients’ work. In our case, clients use our system all over the world, across many timezones. The safest time to migrate was Sunday morning CET, but even then roughly a third of our usual traffic could be affected.

  • Application read-only mode

Some of our services can still provide value to clients when data can only be read. For such cases we prepared a read-only mode: we created a read-only database user and switched the service to it for the duration of the migration. The application code was prepared to handle any resulting write errors gracefully, so users would still have access to their data. This also guaranteed that the data would not change during the migration, which let us dump it and restore it on RDS while the system stayed live.
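On Heroku Postgres, additional credentials are normally managed through the Heroku CLI (`heroku pg:credentials:create`) rather than raw SQL, but the grants involved look roughly like the sketch below. The role, database, and schema names are hypothetical:

```shell
psql "$DATABASE_URL" <<'SQL'
-- Hypothetical role that can read everything but write nothing.
CREATE ROLE app_readonly LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE app_db TO app_readonly;
GRANT USAGE ON SCHEMA public TO app_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_readonly;
-- Cover tables created after this point as well.
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO app_readonly;
SQL
```

Pointing the service at this user means any stray write fails at the database level, regardless of application bugs.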

  • Maintenance mode and communication

If a database could not be put into read-only mode because of how it was used, we had to disable the dependent services altogether. This meant critical functionality would be unavailable to our clients, so it was extremely important to inform them in advance that maintenance would be carried out and that they would not be able to access the system during the downtime. This gave them the opportunity to prepare their business for it. Because we also have mobile applications in addition to the website, we had to prepare them for the unavailability of the API: it is unreasonable to let clients see connection errors during a migration — they should see a message explaining what is going on.

  • Fight for reduced downtime!

As downtime was inevitable for some databases, we looked for ways to reduce the time needed for the migration itself. Fortunately, the mere fact that a service was offline opened up new possibilities: we no longer had to worry about how heavily the dump loaded the database. For live read-only databases, we used the native Heroku backup tool, which is quite slow but does not burden the database too much. For offline systems, we could use pg_dump directly with as many parallel jobs as we liked. This small change shortened one of the migration steps from 2 hours to 20 minutes for our largest database (500 GB).
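The two paths above can be sketched as follows. App name, connection strings, and job counts are placeholders; note that parallel pg_dump requires the directory output format:

```shell
# Live read-only databases: Heroku's native backup tool (slow, but gentle on the DB).
heroku pg:backups:capture --app example-app

# Offline databases: parallel directory-format dump straight from Heroku,
# then a parallel restore into RDS.
pg_dump "$HEROKU_DATABASE_URL" --format=directory --jobs=8 --file=dump_dir
pg_restore --dbname="$RDS_DATABASE_URL" --jobs=8 --no-owner dump_dir
```

`--no-owner` sidesteps ownership mismatches, since the Heroku-generated role names do not exist on the RDS side.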

Migration procedure

The migration process itself was quite straightforward (and, thanks to prior preparation, not stressful). To reduce communication overhead, the whole process was carried out by a small task force in the company’s office on Sunday morning. We also decided to split the migration into batches performed over several weekends.

For databases switched to read-only mode, we changed the credentials used by the applications to ones with limited permissions. Then we backed up the database using Heroku tools and restored it on RDS. The final step was switching the applications to the database on AWS.

For databases requiring downtime, we first enabled maintenance mode in the mobile apps (the application periodically queries the API for configuration changes) and started displaying an information board on our web domains. We could then shut down all our services on the Kubernetes cluster and start a database migration very similar to the read-only case (dump on Heroku, restore on RDS). For the migration of the largest database, we decided to let clients back onto our portal gradually: based on previous traffic data, we prepared IP ranges from which we progressively allowed traffic into the cluster.

Summary

After this step, all of our main services are hosted on AWS. We still have RabbitMQ and Redis on Heroku, but they are not as heavily used as the databases, and moving them is only a matter of time.

What went well

Because of the time we invested in preparation, the whole project ran smoothly and without significant delays. There were minor problems, of course, but we were able to mitigate them as we went. Splitting the migration into smaller batches let us learn from the less critical databases and refine the process before tackling the most critical ones.

What could be better

The migration process was semi-automated, and there were errors in the manual configuration. For example, for one database we forgot to set provisioned IOPS (luckily for us, the EBS baseline performance was really high, as we had attached a 1 TB disk to the database instance). Our first migration also ended with the applications still connecting to the old database, because of a race condition between the configuration change and the reload in our PgBouncer container (the reload was triggered before the change had actually landed). We have added a checklist to our migration procedure to ensure such situations are detected quickly next time.
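A sketch of the ordering fix, assuming PgBouncer’s admin console on its default port; the config-rendering step is a hypothetical stand-in for whatever templating your deployment uses:

```shell
# Render the new config first, and make sure the write has landed on disk.
render_pgbouncer_config > /etc/pgbouncer/pgbouncer.ini   # hypothetical templating step
sync

# Only now trigger the reload via PgBouncer's admin console.
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'

# Sanity check: confirm application connections land on the new server.
psql "$APP_DATABASE_URL" -c 'SELECT inet_server_addr();'
```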
