Porting a legacy application to the cloud

Vignesh Venkataraman
Curai Health Tech
Jul 23, 2019

At Curai, our mission is to scale the world’s best healthcare for every human being. While we’ve stayed under the radar thus far as we try to build a product that achieves that goal, one high-profile consequence of our quest was the acquisition of First Opinion, a chat-based platform that allows users to chat with health coaches for free. The deal was finalized in the early summer of 2018. Shortly thereafter, we had a decision to make: should we try to roll the First Opinion users over to our nascent platform, or commit to overhauling and developing features on the existing First Opinion stack? We chose the latter option — which meant taking control of and modernizing a large legacy codebase. This was a significant undertaking for a young startup, at the time just a year old with a single-digit-sized engineering team.

If you’ve gotten this far, why should you read further? Because, if you work for a tech company, chances are you’re thinking about porting or migrating something — and if you’re not, you soon will be :) These kinds of projects are difficult — but, if you get things right, they act as significant accelerants to your engineering team’s productivity. So, here’s our story — hopefully the details help you too!

The Status Quo

The legacy First Opinion application stack consisted of the following components:

  • A public-facing website, hosted on WordPress Engine
  • An iOS application, written in Objective-C and packaged through the iOS App Store
  • A customer-facing web application, written in Backbone with many first-party extensions
  • A health-coach-facing web application, written in Backbone with first-party extensions
  • Some small web applications for things like terms of service, privacy policy, surveys, etc.
  • An API server, responsible for HTTP request/response routes
  • A chat server, responsible for handling WebSocket connections and events
  • A PostgreSQL database as the primary persistent datastore
  • A pgbouncer connection pooler for the Postgres DB
  • A Redis cache as the secondary persistent datastore
  • A slew of cron jobs for things like sending followup messages, sending unread emails, etc.
  • A slew of daemons for message postprocessing and asynchronous callbacks

Other than the public-facing website and the iOS application, the rest of the components lived on a German hosting service. The hosting service provided only machine-level abstraction, meaning that all provisioning and configuration of the boxes had to be handled in-house; the legacy application stack accomplished this through Chef, Puppet, and fab. The difficulty of procuring and fully provisioning new VMs meant that many services in the application stack were single points of failure.

The external hosting service was good for the use case of just provisioning machines — but not so good given the complexity of the First Opinion stack. In particular, the service provided little to no other “cloud” functionality — things like cloud storage, secure VM-to-VM networking, and higher level abstractions like container services or even app engines were not part of the hosting service’s portfolio. The lack of internal VM-to-VM communication meant that the stack needed to handle encryption between services, a complication that meant juggling a variety of spiped symmetric encryption keys. We also had a number of issues in the early months following the acquisition related to a faulty network card in our data center, which resulted in more downtime than we were willing to bear. In order to alleviate some of the above issues, consolidate our cloud providers, and reduce the burden of context switching between how our Curai services operated and how the First Opinion services operated, we decided to embark on the journey of a lifetime: porting the entire application stack to Google Cloud.
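As an aside, to make that spiped complication concrete: every service-to-service link needed a matched encrypt/decrypt pair of spiped daemons sharing a symmetric key file, one key per link. A minimal sketch (hostnames, ports, and key paths are illustrative, not our actual topology):

```sh
# On the database host: decrypt traffic arriving on the exposed port
# and forward it to the local Postgres socket.
spiped -d -s '[0.0.0.0]:5433' -t '[127.0.0.1]:5432' -k /etc/keys/db.key

# On each application host: encrypt local connections to 5432 and
# tunnel them to the database host. Same key file on both ends.
spiped -e -s '[127.0.0.1]:5432' -t '[db.example.com]:5433' -k /etc/keys/db.key
```

Multiply that by every pair of communicating services, and the key juggling adds up quickly.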

Project Goals and Requirements

Goal setting is vital for projects of this scale, and so, before we even started typing into our terminals, we put together the following requirements:

  • Move the First Opinion stack fully out of the external hosting solution and into Google Cloud
  • Minimize service downtime

The only explicit non-goal was porting services already in the cloud from one provider to another. For example, the First Opinion stack already used Amazon S3 and SQS for storage and asynchronous message delivery, and porting these to their Google Cloud equivalents (Cloud Storage and Pub/Sub, respectively), while convenient from a consolidation perspective, would have added complexity to an already sprawling task.

Taking Inventory

The first step in this process was getting the stack to run outside the provisioned German cloud environments. This proved to be mostly straightforward, as the original developers left behind local provisioning scripts for Vagrant that could be reverse engineered.

To avoid any conflicts, and to identify exactly which dependencies each application component required, we worked to get the entire stack running locally on our own development machines. The rationale was that once we had things running locally, we could decide what came next. This led to our first major architectural decision: how we’d ship our code to production.

Enter Docker

(source: https://www.docker.com/resources/what-container)

The problem was fairly textbook at this point: how do you bundle up dependencies reliably, ship simple artifacts, and tie the different artifacts together? One option is to do exactly what the legacy stack did: provision virtual machines and put code onto them to run. But even within the context of First Opinion, this proved fraught — all dependencies had to be installed in the global scope (e.g. `sudo apt-get install python2.7`), and in rare instances, the necessary dependencies would clash. The “modern” approach is to containerize! And if you are thinking about containers, the biggest dog in the fight is Docker. Docker, for the uninitiated, is a way to “securely build, share and run any application, anywhere.” It accomplishes this by building containers — packaged units of dependencies and (potentially) code that can be run nearly anywhere. Docker containers are a higher-level abstraction than virtual machines; in fact, a single virtual machine can run a whole bunch of Docker containers, each built from a different base image, with its own dependencies and its own code to run.
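To make this concrete, containerizing one of the Python services boils down to a short Dockerfile along these lines (the base image, file names, and command are illustrative, not our actual setup):

```dockerfile
# Pin the userland and Python version that used to be installed
# globally on the VM.
FROM python:2.7-slim

WORKDIR /app

# Install dependencies first so Docker caches this layer across
# code-only changes.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the application code itself.
COPY . .

# The process the container runs; previously a daemon managed
# directly on the VM.
CMD ["python", "api_server.py"]
```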

Given the sheer variety of the First Opinion stack’s components, Docker made a lot of sense — and indeed, the Dockerization step was fairly straightforward, usually with a one-to-one correspondence between a Vagrant provisioner step and a Dockerfile directive. Where things got dicey was pulling in code from private repositories. For example, installing a privately-hosted JavaScript library from GitHub requires valid GitHub credentials inside the Docker build at image build time, which in turn requires injecting those credentials (e.g. an SSH key) into your build environment. While initially daunting, this problem was surmounted with a little help from Google Cloud KMS and Docker build arguments. Now we had a bunch of containers, just waiting to be run. They just needed someone (or something) to tell them what to do.
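The shape of that workaround, roughly (key ring, key names, and file paths are illustrative): keep the deploy key encrypted in the repository, decrypt it with Cloud KMS at build time, and hand it to Docker as a build argument.

```sh
# Decrypt the GitHub deploy key with Google Cloud KMS.
gcloud kms decrypt \
  --location global --keyring build-secrets --key github-deploy \
  --ciphertext-file deploy_key.enc --plaintext-file deploy_key

# Pass the key material into the build as a build argument.
docker build --build-arg SSH_PRIVATE_KEY="$(cat deploy_key)" -t api-server .
```

On the Dockerfile side, an `ARG SSH_PRIVATE_KEY` directive receives the key, which gets written to `~/.ssh`, used for the private fetch, and removed, all within a single `RUN` layer. One caution: plain build arguments leak into the image history, so an approach like this is best paired with a multi-stage build that discards the layer that saw the secret.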

Enter Kubernetes

(source: https://kubernetes.io/)

There are a number of container orchestration platforms out there — Apache Mesos and Docker Swarm being two of the canonical examples. But the really popular buzzword is Kubernetes, a former Google project that has grown into its own beast, backed by a large open source community. Given that our existing infrastructure already lived on Google Cloud, using Google Kubernetes Engine (GKE) was a no-brainer. Kubernetes, in its most basic sense, is a way to tell a complex system of containers how to run themselves. It allows services to scale horizontally, containers to restart on failure, and services to discover each other cleanly, among a number of other key features. Kubernetes has abstractions like Deployments, Services, Ingresses, Jobs, and CronJobs that allow arbitrary workloads to be run on a cluster of nodes. Each abstraction is configured through a YAML manifest that declares things like what container to run, what command to run in the container, what environment variables to inject, what volumes to mount, etc. In short order, we were able to assemble working staging and production clusters with all of the necessary components of the stack running on them. Huzzah!
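To make the manifest point concrete, here is a minimal Deployment of the sort we ended up writing for each service (the names, image, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                     # horizontal scaling knob
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: gcr.io/example-project/api-server:v1  # what container to run
          command: ["python", "api_server.py"]         # what command to run in it
          env:
            - name: DATABASE_URL                       # what environment to inject
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
```

A matching Service manifest then gives the pods a stable in-cluster DNS name, which is how the components discover each other.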

Flipping the Switch

Unfortunately, we weren’t quite done. As stated above, one of our goals was to migrate to Google Cloud with as little downtime as possible. This meant that we’d have to find a way to roll over a very large Postgres database and a pretty sizeable Redis cache in a short time window. Alas — at the time we were doing the migration (and at the time this article was written), none of the Google Cloud-managed options for Postgres or Redis supported replicating from an external primary. We were thus forced to roll our own replicas within the cluster, a painstaking process of trial and error that involved a lot of #hacks. One particularly amusing hack: after a cryptic failure, we set a container’s command to `sleep 3600s` so that we could poke around and figure out where exactly a quickly-erroring replication command was going wrong. Side note: this was by far the biggest argument against running stateful services, particularly databases, within the Kubernetes cluster. However, given our requirement to minimize downtime, it was the only mechanism available to us.
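For the curious, the hack is as simple as it sounds: override the container’s command in the pod spec so it idles instead of crash-looping, then shell in and run the failing command by hand. A sketch (image and names hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: replica-debug
spec:
  containers:
    - name: postgres-replica
      image: postgres:9.6          # illustrative image and version
      # Idle instead of running the failing replication command.
      command: ["sleep", "3600s"]
```

From there, `kubectl exec -it replica-debug -- bash` drops you into the container to run the replication command interactively and see exactly where it dies.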

After some effort, we managed to get read replicas of both Postgres and Redis running in our cluster. Then a new issue presented itself: queries against one of our tables were not performing string comparisons correctly. After a lot of hair-pulling, we tracked this down to a single hash index (as opposed to a standard b-tree index) on the original Postgres table; prior to Postgres 10, hash indexes were not WAL-logged, so they do not survive streaming replication intact. We rectified this quickly by reindexing the offender. Finally, after a few practice runs, we declared our plan of attack ready for primetime — and, one fateful morning, we flipped the switch!
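If you ever hit the same wall, the offenders are easy to find and fix. A sketch in SQL (table and index names hypothetical):

```sql
-- Find any hash indexes lurking in the database.
SELECT indexname, indexdef
FROM pg_indexes
WHERE indexdef LIKE '%USING hash%';

-- Rebuild the offender on the new primary so its on-disk
-- contents are valid again.
REINDEX INDEX users_token_hash_idx;
```

(Or recreate it as a plain b-tree index, which sidesteps the problem entirely on pre-10 Postgres.)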

The result: three GKE clusters, each with a fully provisioned First Opinion stack.

The Aftermath

There’s a cliché about the best-laid plans always going awry, and this migration was no exception. One issue that never reared its head in practice runs, but bit us in the actual migration, was a snafu with Redis replication, wherein about 6 hours of data was dropped. We never dug too deeply into how this occurred, as we managed to work around the problem over the course of a 24-hour mitigation period. It also took a wee bit longer than expected for our Cloud DNS changes to propagate to our health coaches’ resolvers; relatedly, the `cert-manager` Kubernetes controller could not issue new TLS certificates until DNS had settled, which further delayed our eventual “all clear” signal. But other than that, things went off without a hitch! Within about 25 minutes of going offline for the migration, we were back online and able to cordon off our legacy servers from the internet at large. We deleted them for good a week or two later.
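For reference, cert-manager issuance is itself driven by manifests; a minimal Certificate resource looks something like this (using the current cert-manager.io/v1 API, which has been renamed since 2019; domain and issuer names are illustrative):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls              # where the issued key pair is stored
  dnsNames:
    - app.example.com
  issuerRef:
    name: letsencrypt-prod         # an ACME ClusterIssuer configured separately
    kind: ClusterIssuer
```

ACME validation requires the domain to resolve to the new cluster, which is why certificate issuance could not complete until the DNS changes above had propagated.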

Would We Do it Again?

(source: https://media.giphy.com/media/ckeHl52mNtoq87veET/giphy.gif)

In a word, absolutely! While there were some missteps along the way, the decision to port the First Opinion stack to Google Cloud has been a fairly resounding success. We are now able to better isolate services from each other, scale them horizontally as necessary, and more tightly control how they communicate and fit together. Our engineering velocity has undoubtedly increased, in no small part due to a better shared understanding of how the entire stack works and fits together. Also, with the caveat that hindsight is 20/20, less than a week after we completed the port, our old hosting provider suffered a major outage that lasted nearly 12 hours. Collectively, the engineering team breathed a sigh of relief, as we completely dodged that bullet. It is worth noting that even now, our infrastructure remains almost entirely portable. While we do use a few GCP and AWS-specific services, if we were to decide to switch providers entirely, we would be able to do so without too much disruption.

Next Steps

To further take control of the First Opinion ecosystem, we proceeded with rewrites of the customer, health coach, and iOS clients using React and React Native, completed a few months later (see this article for more details). We’ve also begun migrating the entire backend off first-party frameworks and onto established open source solutions; this brings Python 3 compatibility, a requirement before Python 2 reaches end-of-life in early 2020. Earlier this year we moved our database out of the Kubernetes cluster (for our own sanity) and into Google-managed Cloud SQL, and we have similar plans for our Redis cache. Finally, we are migrating to Helm to better manage our configuration and deployments.

We’re nowhere close to done! There are ambitious plans afoot. While we continue to improve our existing infrastructure, we’re also investing heavily in the product itself. We’re not just improving the existing First Opinion experience; rather, we’re building a totally new one (using the existing infrastructure as a starting point) that integrates our in-house machine learning solutions and improves how users get access to care. If you’re at all interested in anything that was discussed above, or want to find out more about Curai and First Opinion, feel free to reach out to me personally at viggy AT curai.com, or on Twitter @Viggyfresh. Also check out our available job openings here.
