A guide to migrating your app to different cloud providers
Switching the underlying infrastructure your application runs on, whether you’re moving from bare metal to the cloud or migrating between cloud providers, can seem like a daunting task, especially if it’s a legacy app that is not containerized and has had time to sink its roots into its current environment. Most of the steps are the same if you want to run your application on multiple cloud providers. Having been a part of two such migrations (most recently from AWS to Heroku), here’s how to make sure it goes smoothly:
We’re going to assume a standard Ruby on Rails web app, though this can be easily extended to other languages and frameworks. We’re also going to assume only the application is being moved and not the attached resources like databases and caches. That’s worth a separate post by itself. This guide will still be applicable when it’s time to move the application.
Before you start
Map out your existing process types (web workers/background jobs/cron or scheduled jobs) to your destination provider and make sure everything is supported. Typically, the biggest mismatch is going to be your cron jobs. If you’re already using a distributed scheduler, you’re home free. If not, the easiest approach is going to be to run your scheduled jobs via your background job framework.
If you’re using the excellent Sidekiq library, consider something like the Enterprise scheduler or sidekiq-scheduler. This is easier said than done, though: you have to make sure the jobs are idempotent and don’t run too long.
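With sidekiq-scheduler, for example, each crontab entry becomes a schedule block in the Sidekiq config. A sketch, where the job name and class are made up:

```yaml
# config/sidekiq.yml (job name and class are hypothetical)
:schedule:
  nightly_report:
    cron: '0 4 * * *'        # was a crontab entry on the old host
    class: NightlyReportJob  # must be safe to run more than once
```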
Determine how you’re going to move your web traffic. If your application is behind a reverse proxy (OpenResty is fantastic for this), that would be the best location to make this change. If not, you’ll have to make do with DNS load balancing, so make sure your DNS provider supports it. You might also want to lower your DNS TTLs to a minute or so in preparation for the move.
Create a Docker image
We’re going to start by containerizing the application. Your application may already be containerized, in which case you can skip this step. Even if your chosen provider doesn’t support Docker (e.g. Heroku Private Spaces doesn’t support Docker yet), it’s extremely useful for gaining an understanding of your application’s dependencies and its implicit assumptions about the environment.
For a standard Ruby app, the official Ruby Docker image is the best place to start. A minimal Dockerfile would look something like this:
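(A sketch; the Ruby version, server command, and paths are assumptions.)

```dockerfile
# Minimal starting point; pin the Ruby version your app actually uses
FROM ruby:2.5

WORKDIR /app

# Install gems first so this layer is cached across code changes
COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . .

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
```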
This may not actually work, because the application will likely have lower level dependencies that are not present in the image.
Determine your application’s dependencies
We’ll start with the lowest level and work our way up.
System level dependencies
Things like the glibc version, OS version, native library dependencies for your gems, etc. For example, if you are using the mysql2 gem, you’ll likely need the mysql-client or libmysqlclient-dev packages. To support this, we can add something like this to the Dockerfile, right below the FROM line:
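For example, assuming a Debian-based image (the package list will vary with your Gemfile):

```dockerfile
# Native libraries your gems compile against
RUN apt-get update \
    && apt-get install -y --no-install-recommends libmysqlclient-dev \
    && rm -rf /var/lib/apt/lists/*
```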
This is also a good place to figure out if there are any dependencies on system level environment variables and set them up. For example, if your app depends on a particular time zone, you might want to add this:
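(Assuming Pacific time here; use whatever your current hosts are set to.)

```dockerfile
ENV TZ=America/Los_Angeles
```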
If your app depends on any local files that are not checked into source control, bake them into the Docker image. Perhaps you depend on the Maxmind GeoIP database:
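(The bucket and paths here are hypothetical; adjust to wherever you host the file.)

```dockerfile
# Bake the GeoIP database into the image so it's present at boot
RUN mkdir -p /app/vendor \
    && curl -fsSL https://your-bucket.s3.amazonaws.com/GeoLite2-City.mmdb \
       -o /app/vendor/GeoLite2-City.mmdb
```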
You might want to host such files in your own S3 bucket so the image build doesn’t depend on 3rd parties being available. If you have a lot of such dependencies, version them in a separate file (like the Aptfile) so your Dockerfile stays DRY. Make sure to do this near the beginning of the file: Docker caches a layer for each instruction, so you want to keep the less frequently changing parts of your build near the top.
If your application is writing to any local files, it’s best to avoid that if possible. The most common case is going to be log files. Ideally, configure the app to write to stdout and have your log aggregator/archiver/cloud provider pick up the logs from there.
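In Rails this is a small config change; a sketch for config/environments/production.rb:

```ruby
# Log to stdout instead of log/production.log;
# the platform's log router collects it from there
config.logger = ActiveSupport::Logger.new($stdout)
```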
If you’re dealing with uploaded files, or similar, write them directly to a blob store like S3 from the app, or even better have the clients upload it directly so the web server doesn’t have to deal with slow clients.
Service level dependencies
This would be things like MySQL, Memcached, Redis, S3, etc. You likely have most of these set up locally for your development environment. You can either configure Docker to communicate with the services on the host, or run them as containers too. My preferred approach is the latter, using docker-compose.
At this point, assuming the application is a standard Rails app, the configuration is going to be gated on Rails.env checks. But the Docker image should be treated as its own instance of the development environment. Trying to run the app in Docker now will not work, as it will be trying to connect to development resources that are not accessible from within the image.
The idiomatic way to handle this is to move all such configuration to environment variables and not rely on Rails.env checks. The dotenv gem makes this pretty straightforward. Actual environment variables take precedence over the .env* files. Now, you're ready to wire everything up with docker-compose. A typical docker-compose.yml file could look like:
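(Service versions and names below are placeholders; mirror your production setup.)

```yaml
version: "3"
services:
  web:
    build: .
    ports:
      - "3000:3000"
    env_file: .env.docker   # overrides the development defaults
    depends_on:
      - mysql
      - redis
  mysql:
    image: mysql:5.7
    environment:
      MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
  redis:
    image: redis:4
  memcached:
    image: memcached:1.5
```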
Other environment specific checks
We’re going to use environment variables to help the app distinguish between the old and the new provider.
Search the code (e.g. with ag or grep) for all Rails.env.production? checks and convert them to environment variables if the behavior is going to be different between the old and the new provider.
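A converted check might look like this; the variable name and helper are hypothetical:

```ruby
# Gate provider-specific behavior on an explicit variable
# instead of Rails.env.production?
def running_on_heroku?
  ENV["INFRA_PROVIDER"] == "heroku"
end

ENV["INFRA_PROVIDER"] = "heroku"
puts running_on_heroku?  # prints "true"
```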
Are there any 3rd party dependencies on your application? For example, are there any IP whitelists you need to update with 3rd parties? Are you planning on updating your SSL certificates? If so, are there any 3rd parties who rely on certificate pinning that might need to be informed?
The final Dockerfile would look like:
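(A sketch pulling the earlier pieces together; versions, packages, and the bucket URL are assumptions.)

```dockerfile
FROM ruby:2.5

# System-level dependencies change rarely, so they go first to stay cached
RUN apt-get update \
    && apt-get install -y --no-install-recommends libmysqlclient-dev \
    && rm -rf /var/lib/apt/lists/*
ENV TZ=America/Los_Angeles

WORKDIR /app

# Local file dependencies (hypothetical bucket)
RUN mkdir -p vendor \
    && curl -fsSL https://your-bucket.s3.amazonaws.com/GeoLite2-City.mmdb \
       -o vendor/GeoLite2-City.mmdb

COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . .

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
```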
We’ve essentially converted our app to a 12-factor app.
Finally, make sure all your tests pass in the Docker/docker-compose environment, and manually test the app to make sure there are no other hiccups.
Now that the application is configured, let’s get the destination environment set up. The first step is to make sure the old and the new environments can talk to each other.
Ideally, your services and resources are already in a VPC, in which case you just need to create a VPC in your destination provider and set up peering between them. VPC peering allows two VPCs to securely communicate with each other. Google, AWS, and Heroku all support it.
If you’re not in a VPC, consider creating one in your new environment, and make sure to whitelist the new egress IPs in your existing security rules. A downside is that the VPC peering link becomes a single point of failure for your application, so strongly consider keeping it temporary and eventually moving your resources into the same VPC as the application. A possible exception is if you’re planning to run the app on both providers at the same time, in which case it might be an acceptable risk, depending on how long it takes to fully fail over to the VPC containing the resources.
Set up the environment
Time to configure your environment. Treat it as a brand new environment in your monitoring systems. We’re big fans of Datadog; we configured the new provider as a separate environment and set up dashboards with key metrics like request latency, request throughput by status code, background job latencies, and job throughput, broken down by environment.
Similarly, configure it as a separate environment in your error reporting tool (we use Rollbar). These two steps are very important for making sure any problems during the migration can be isolated and tracked down quickly. Set up the environment variables and deploy manually to make sure everything works as expected.
While this is a huge change under the hood, you want to minimize the impact on the other developers on the team. This means maintaining your existing workflow as much as possible. Configure your existing deploy scripts/CI/CD integrations to deploy to both environments at the same time, and treat them exactly the same, even though the new environment is not receiving traffic yet.
Start with the low risk background processes
Now the fun part. Let’s begin to serve production traffic from the new environment. Ideally, start with a low-risk and idempotent background job. In our case, we started running additional Sidekiq workers for a few low-risk queues first and let them run for a day or two to catch any issues. We let them run on both environments for a while and gradually scaled up the new env while scaling down the other.
Once you have this working, repeat for the rest of the queues until all your jobs have been moved over.
Move the web traffic
The first step is to make sure your SSL certs are configured correctly. Configure the load balancers in the new provider and make sure they can serve HTTPS traffic.
Next, make sure your DNS provider supports DNS load balancing; this is going to be a lot less stressful if you’re not moving all traffic at once. Most major providers, like Route 53 and Cloudflare, support it. This is not without downsides and risks, though: some DNS clients don’t fully respect the TTLs, so a small percentage of clients will go to the wrong host. DNS changes are also not immediate (unless you’re behind a proxy like Cloudflare), which could be especially problematic if you need to roll back the DNS update due to issues. But you should be able to mitigate any potential impact by starting with a small enough traffic slice. Set up a new DNS name, spin up your web processes, and make sure SSL works (e.g. https://heroku.tophatter.com/health).
Now, time to move the traffic. Do you have any low-risk subdomains you can move over first? If so, start with those. We started by sending 1% of ruby.tophatter.com and gradually ramped it up to 100% after a few days.
Keep a close eye on your error monitoring and dashboards, and give it a few days for issues to arise. If everything looks good, move the main domain, following the same procedure.
Pay special attention to webhooks if you have them. Make sure your total request rates and status code rates are in line with the traffic distribution.
Some issues we ran into
Our CDN was configured to fetch assets from the app on a cache miss. However, since our new provider does rolling deploys over a few minutes, there’s a period of time where both versions of the app are served. This means it’s possible for a client to hit the new version of the app, which will serve URLs to an asset that doesn’t exist in the old version. This is fine, unless the CDN happens to hit the old version of the app and returns a 404.
Our DNS load balancing was configured like this:
DNS                 Type   Target                Weight
ruby.tophatter.com  CNAME  heroku.tophatter.com  1
ruby.tophatter.com  CNAME  ec2.tophatter.com     99
This works for *.tophatter.com, but you cannot CNAME the root/naked domain tophatter.com. Some DNS providers, like Cloudflare and DNSimple, provide their own custom record types (e.g. ALIAS) that achieve the same thing, but Route 53 does not. The downside is that providers like DNSimple don’t support weighted load balancing.
In our case, we had already validated the new environment by sending sufficient traffic through the subdomains (since they are essentially the same application underneath), so we just switched our DNS provider to DNSimple when it was time to go to 100%.
A relatively obscure issue we ran into was that HTTPS requests from really old Android devices (Android 5.0 Lollipop and below) were failing with this error:
javax.net.ssl.SSLHandshakeException: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.
We suspected an issue with our SSL certificates or SNI, but could not pin down the cause. Various online tools were able to successfully validate our certificate. Running this also worked:
openssl s_client -servername tophatter.com -connect tophatter.com:443
We hooked into the TrustManager and were able to determine that an intermediate cert was failing to validate. Turns out, we had uploaded a version of our SSL cert without the intermediates and the Android 5.0 devices were falling back to a local expired version of the intermediate cert and failing to validate.