How FINN.no moved 800 microservices to the cloud in 5 hours
During the night of September 16, we migrated FINN’s production environment from an on-premise data center to Google Cloud Platform (GCP). This meant moving a high-traffic website backed by a complex distributed system consisting of 800+ applications, together with 145 databases and 16TB of data. We had a planned downtime window during the night, but the smaller we could make that window, the better. How did we do it? Read on!
Internal conversations about moving FINN.no out of our data centers and into the cloud started years ago. We’ve been experimenting with various cloud technologies and vendors since. When we chose Kubernetes as our platform in 2016, we were guided by the idea of running FINN in the cloud.
While we’ve had a “cloud ready” mindset for a long time, we haven’t had a strategy or plan for actually making it happen. Parts of our product have been in use since around 1998, and moving them is a daunting task. However, a crumbling data center, the need for more flexible solutions, and our successful migration of Sybase from Solaris to Linux last year gave us lots of incentive to consider it seriously.
We started evaluating different cloud providers in January 2019. The short list was AWS, Google Cloud, IBM Cloud and Azure. We participated in workshops, meetings and calls, evaluated options for managing our services either by ourselves or “as a service” through other vendors, before finally settling on a “polycloud” recommendation with GCP as the preferred option for most of our services. This recommendation was finally approved by Schibsted in the middle of August 2019.
It was getting real. We had to come up with a plan for moving everything we have to GCP, while keeping FINN up and running at the same time. We decided on a gradual migration where developers should be able to move services over time. However, time flies — and we realized that we had less and less of it available. The deteriorating infrastructure, planned network refurbishment in our existing data center, and lack of resources made it difficult to see our plan of a gradual migration through. The global pandemic didn’t make the job easier either. We were forced to make some hard decisions.
In June 2020 we realized that we needed a much more direct approach and set a date for the cutover to GCP. We set September 15 as the target date and got approval from FINN’s management group for a night of downtime for FINN.no. In July, cloud migration became priority number one in FINN, meaning all teams had to finish any cloud migration work they were responsible for before they could work on other planned tasks. It was time to go to work (from home).
When we made the decision to abandon the gradual migration and go for the cutover, the daunting task ahead of us started to materialize. We had to prepare a platform that would allow us to move 800+ applications, 145 databases, more than 16TB of data and 183 virtual machines practically overnight. FINN’s infrastructure team had been preparing for the cloud migration for a long time, but this decision made us refocus. We now had to prioritize ruthlessly, taking time for technical deep dives when necessary but always keeping the goal in sight. In some cases this meant pivoting and abandoning solutions we had invested a lot of time in.
From the moment the summer was over, we worked hard to make this a success. We had to make some tough choices on what we had time to do, and what had to wait. But we tried not to cut too many corners, and stood by our principles, such as Infrastructure as Code. As the days to migration day flew by, and the workload increased, somehow our confidence grew with it.
Changes to FINN.no are normally rolled out around 350 times a day. For the last 24 hours before the cutover, we recommended a “release slush”: changes that neither fixed live production issues nor related to the cloud migration should wait until the next day. The day before the cutover, the rate of changes dropped to about half, and the last production deploy to our on-premise infrastructure happened only minutes before the cloud cutover process started.
The cloud cutover
At 23:00 on September 15, the infrastructure team gathered virtually to get ready and go through the pre-cutover checklist. Since everyone was in different physical locations, we relied on a detailed runbook and video conferencing for collaboration. Changes to FINN.no are deployed without downtime, and the site being down is normally a severity 1 incident. This wasn’t a normal Tuesday night though. At midnight we took FINN.no down by redirecting users to a static fallback page. Half an hour later we were shutting down all applications in our on-premise Kubernetes cluster.
Then we were ready to migrate data. One of the cornerstones in our microservice architecture is Kafka. Almost two billion messages per day — averaging 30,000 per second — go through our Kafka cluster, and its stability is crucial for FINN to be functioning. Migrating our Kafka cluster was handled by our Kafka group temporarily running the cluster in a “stretched cluster” configuration, spanning our on-premise data center and the cloud as a single cluster. The stretched cluster configuration was planned and implemented carefully weeks in advance. Kafka topics were replicated to brokers in GCP in the week before cutover, and the brokers in GCP became primaries during cutover.
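The two-phase idea (replicate to GCP brokers first, move leadership at cutover) can be sketched as a pair of replica-assignment plans. This is only an illustration and not necessarily how our Kafka group did it: the topic name, broker IDs and helper functions are hypothetical. The one thing taken from Kafka itself is the JSON layout, which is the format accepted by the `kafka-reassign-partitions` tool, where the first replica in each list is the preferred leader.

```python
import json
from typing import Dict, List, Tuple

# (topic, partition) -> ordered list of broker IDs holding a replica
Assignment = Dict[Tuple[str, int], List[int]]


def add_gcp_replicas(current: Assignment, gcp_brokers: List[int],
                     copies: int = 1) -> Assignment:
    """Phase 1 (before cutover): keep existing replicas and append GCP brokers.

    GCP brokers are picked round-robin so partitions spread across them.
    """
    copies = min(copies, len(gcp_brokers))
    plan: Assignment = {}
    for i, key in enumerate(sorted(current)):
        picked = [gcp_brokers[(i + k) % len(gcp_brokers)] for k in range(copies)]
        plan[key] = current[key] + [b for b in picked if b not in current[key]]
    return plan


def prefer_gcp_leaders(stretched: Assignment, gcp_brokers: List[int]) -> Assignment:
    """Phase 2 (cutover): reorder replicas so a GCP broker is listed first."""
    gcp = set(gcp_brokers)
    return {
        key: [b for b in replicas if b in gcp] + [b for b in replicas if b not in gcp]
        for key, replicas in stretched.items()
    }


def to_reassignment_json(plan: Assignment) -> str:
    """Render a plan in the reassignment JSON layout."""
    partitions = [
        {"topic": topic, "partition": partition, "replicas": replicas}
        for (topic, partition), replicas in sorted(plan.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)


if __name__ == "__main__":
    # Hypothetical brokers: 1-3 on-premise, 101-102 in GCP.
    on_prem = {("ads", 0): [1, 2], ("ads", 1): [2, 3]}
    stretched = add_gcp_replicas(on_prem, [101, 102])
    print(to_reassignment_json(prefer_gcp_leaders(stretched, [101, 102])))
```

Keeping the two phases as separate plans mirrors the timeline in the text: the first can be applied days ahead while the site is live, and the second is a small change that only needs to happen inside the downtime window.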
Services that need persistent storage generally use either one of our 25 PostgreSQL clusters or our Sybase cluster. We migrated these database clusters by setting up replicas of our databases in GCP beforehand, and switching the database primaries after all applications were stopped during the cutover. At 01:35 on the cutover night, Kafka and all our PostgreSQL and Sybase databases were running in GCP.
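The ordering matters here: with every application stopped, nothing writes to the old primary, so a replica that has caught up can be promoted without losing data. A minimal sketch of such a precondition check for PostgreSQL is shown below. The function names and inputs are hypothetical and not taken from our tooling; the only PostgreSQL-specific fact used is that a log sequence number (LSN) is written as two hexadecimal numbers separated by a slash. Sybase would need its own equivalent check.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer.

    The part before the slash is the high 32 bits, the part after it
    the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def safe_to_promote(running_apps: int, primary_lsn: str, replica_lsn: str) -> bool:
    """Return True only if no writers remain and the replica has caught up.

    running_apps: number of applications still running (hypothetical input,
                  e.g. taken from the cluster's deployment status).
    primary_lsn:  last write position reported by the old primary.
    replica_lsn:  last position replayed by the replica in the new location.
    """
    no_writers = running_apps == 0
    caught_up = parse_lsn(replica_lsn) >= parse_lsn(primary_lsn)
    return no_writers and caught_up


if __name__ == "__main__":
    print(safe_to_promote(0, "16/B374D848", "16/B374D848"))
    print(safe_to_promote(3, "16/B374D848", "16/B374D848"))
```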
After moving the persistent data, we triggered deployments of all applications to our new Google Kubernetes Engine (GKE) cluster. All 800 applications, more than 1500 Kubernetes pods in total, were deployed to GKE by 02:30. At this point we had made it through the night with only minor issues and delays, and were ready for internal testing.
Leading up to the cutover night, all domains had done an excellent job preparing shift and test plans. When the time for testing came in the middle of the night, it was a great relief to us in the infrastructure team to see the green lights come flying in. The self-organized testing of all the different parts of the platform worked even better than we could have imagined!
After good teamwork involving all teams, fixing some deployments, a corrupted database table and some other small issues, FINN in the cloud went live at 04:43 with no major incidents.
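The timestamps mentioned above give a rough picture of where the downtime went. The short script below simply subtracts them; the milestone labels are paraphrased from the text, and the helper names are made up for this illustration.

```python
from datetime import datetime

# Milestones as stated in the text, all after midnight on the same night.
MILESTONES = [
    ("site redirected to static fallback page", "00:00"),
    ("on-premise application shutdown under way", "00:30"),
    ("Kafka, PostgreSQL and Sybase running in GCP", "01:35"),
    ("all applications deployed to GKE", "02:30"),
    ("FINN.no live in the cloud", "04:43"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(milestones):
    """Pair each milestone with the minutes elapsed since the previous one."""
    return [
        (later[0], minutes_between(earlier[1], later[1]))
        for earlier, later in zip(milestones, milestones[1:])
    ]


if __name__ == "__main__":
    for name, minutes in phase_durations(MILESTONES):
        print(f"{minutes:4d} min until {name}")
    total = minutes_between(MILESTONES[0][1], MILESTONES[-1][1])
    print(f"total downtime: {total // 60} h {total % 60} min")
```

By this count the site was down for 4 hours and 43 minutes, with testing and fixing after the deployments taking the largest share.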
We are extremely proud to be able to celebrate the successful cloud cutover!
The success of this migration would not have been possible without the outstanding group of people in FINN Technology. We had buy-in from all parts of the organization, and when the call to action came, everybody stepped up and did their part. During the preparation phase, developer teams were banging their heads against firewalls, network routes and load balancers, discovering issues in our new GCP infrastructure and working with the infrastructure team to resolve them. We also did a cutover exercise for our development environment a few weeks before the cutover night, this time during work hours. This “dress rehearsal” gave us confidence that we could execute the cutover in the designated downtime window, helped us discover and rectify issues with our tooling, and was a great help in refining the runbook for the production cutover. Both the teams’ preparation work and the rehearsal helped a lot in getting the risk down to an acceptable level.
Since we had a hard deadline for this migration, many systems had to be moved using a lift-and-shift approach. Once we have settled into our cloud environment a bit, we look forward to making these parts of our infrastructure more cloud native as well.