From Takeoff to Landing: Replacing Cloud Infrastructure with Zero Downtime

Gonçalo Costa
Published in Miro Engineering
8 min read · Feb 12, 2024

Last year, we explained how Miro’s Data Stream migrated its analytics event tracking system, EventHub, to a brand new infrastructure. In a nutshell, in 2023 we moved from an older setup of raw EC2 instances (handling ~3.5 billion events per day) to managed Kafka (Amazon MSK).

After running the new setup in production, we discovered that the cluster configuration itself was incorrect. Fixing it required recreating the cluster and its underlying network, which would normally mean downtime and, consequently, data loss. Spending another quarter planning and executing the fix was not an option: we had deadlines to onboard more consumers. The sooner we acted, the better.

Photo by Mike Kitchen on Unsplash

This post tells the story of how we successfully replaced the infrastructure of our data backbone with zero downtime (and how we could do it all over again).

(Almost) like flying an airplane

In the business of digital products we can often afford to fail, considering that no human lives will be at risk (disregarding obvious exceptions). In some other industries, this is not the case: aviation is one of them.

Personally, I studied to become an airline pilot long before I began my engineering career, so I got to know the basics of how this industry works (I am also really fond of it). Flying is considered one of the safest means of transport, and yet it involves highly complex machinery which, if operated wrongly, can have disastrous consequences. Because of this, there is a set of activities and principles that help guarantee the safety and success of such operational endeavors.

It got me thinking about how we could guarantee such a level of operational maturity in our work. Why are the same principles not applied across other industries? What if we could achieve that same level of maturity?

The following sections describe how we ran this initiative in the same fashion as operating a flight, applying some of the principles that have existed for decades.

Pre-flight

The first stage is pre-flight preparation, an essential step of any flight. It consists of planning the route, gathering the charts that help us get there, collecting relevant information, briefing the crew, and more.

Draw charts and write a plan

We already knew the existing infrastructure and its relationship to the cluster we needed to migrate. However, information was spread across multiple repositories, making it difficult to grasp the overall architecture. Moreover, we weren’t sure if other teams had production workloads consuming events from our cluster.

Example of Amsterdam approach chart (source).

To get a clear picture of how everything fits together, we started by creating a comprehensive diagram of our cluster’s ecosystem. This visual representation was invaluable for identifying the critical points and resources involved. This knowledge allowed us to proceed with replicating and testing downstream components before making any major changes.

Summary AWS diagram of the involved ecosystem.

With the chart, we could see what the critical points were and how to navigate the ecosystem. The EventHub microservice receives Miro events and routes them to Kafka, so we identified it as a critical component. Every downstream resource could be replicated and tested in isolation, so we opted for that strategy.

However, we didn’t have a plan yet. So, we decided to first replicate the chart’s infrastructure in our staging environment and only then move to production.

Initial steps of the action plan.

We wrote down what needed to be done while implementing it in staging (e.g., creating a new VPC and a new MSK cluster). Details mattered, and finding all of them required collaboration with the teams that owned the impacted components. We then composed these steps and details into an action plan: all the steps necessary to go safely from A to B.

Write checklists for each stage

As we delved deeper into the migration plan, it became increasingly lengthy and error-prone. The sheer number of steps made it easy to overlook a crucial action or skip a seemingly insignificant command. Such an oversight could have disastrous consequences at critical points in the migration.

Example checklist from KLM blog.

So, inspired by the aviation industry’s reliance on checklists, we decided to break down the action plan into manageable and concise stages. This yielded four distinct checklists:

  1. Take-off Checklist: A comprehensive list of pre-migration tasks to ensure a smooth launch.
  2. In-flight Checklist: A vigilant guide to monitoring and validating the replication of the existing stack.
  3. Landing Checklist: A checklist of critical tasks and immediate post-migration checks.
  4. Shutdown Checklist: A meticulously detailed list of cleanup tasks.
Section of the checklist we used for the initiative.

These checklists served as indispensable tools, allowing us to quickly verify if all the steps were completed and if it was safe to proceed. If any doubts arose, we could always refer to the comprehensive action plan for further clarification.
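To make the "is it safe to proceed?" check concrete, here is a minimal sketch of checklists used as gates between stages. The four stage names come from the list above; the class names, the `next_stage` helper, and the individual items are hypothetical placeholders, not the checklists we actually used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChecklistItem:
    description: str
    done: bool = False


@dataclass
class Checklist:
    name: str
    items: List[ChecklistItem] = field(default_factory=list)

    def pending(self) -> List[str]:
        """Return the descriptions of items that are not ticked off yet."""
        return [item.description for item in self.items if not item.done]

    def is_complete(self) -> bool:
        return not self.pending()


def next_stage(stages: List[Checklist]) -> Optional[Checklist]:
    """Return the first checklist with open items, or None when all are complete."""
    for stage in stages:
        if not stage.is_complete():
            return stage
    return None


# Stage names follow the post; every item below is an illustrative placeholder.
stages = [
    Checklist("Take-off", [
        ChecklistItem("New VPC created", done=True),
        ChecklistItem("New MSK cluster created", done=True),
    ]),
    Checklist("In-flight", [ChecklistItem("Dashboards reviewed")]),
    Checklist("Landing", [ChecklistItem("Rollback pull request approved")]),
    Checklist("Shutdown", [ChecklistItem("Old cluster decommissioned")]),
]

current = next_stage(stages)
if current is not None:
    print(f"Blocked at {current.name}: {current.pending()}")
```

The point of the design is that a later stage is never considered until every item of the earlier ones is ticked off, which is the same discipline a paper checklist enforces.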

By employing checklists, we transformed a complex migration process into a series of manageable tasks, significantly reducing the risk of errors and ensuring a seamless transition.

Briefing affected teams

Just as an airline pilot doesn’t fly solo, we knew we couldn’t tackle this migration alone. We gathered representatives from directly impacted teams to brief them on our plans and seek their expertise. This collaborative approach allowed us to identify and remove unnecessary steps from the plan while incorporating valuable insights we might have overlooked.

Photo by Marvin Meyer on Unsplash

By tapping into the collective knowledge of the teams involved, we refined our action plan and updated our checklists. With these enhanced tools in hand, we were confident in our ability to execute the migration seamlessly.

In-flight

There is no lack of automation in an airplane: plenty of human tasks can be handed over to the autopilot. However, pilots still have crucial tasks to accomplish and must keep an eye on the complex instrument panel in front of them.

Take-off phase

With the necessary approvals and preparations in place, we embarked on our migration journey. From this point on, we worked only in production.

Our first order of business was to handle the preliminary tasks, ensuring they didn’t disrupt the existing infrastructure. This included deploying the required components and infrastructure without causing any downtime.

Photo by Jonathan Letniak on Unsplash

We also coordinated with affected teams regarding our plans and timelines. We chose to perform the migration during off-peak hours on a weekday, when traffic was lighter, to minimize any disruptions.

The preliminary steps included establishing the new networking infrastructure, creating the cluster and the IAM roles, applying the Kafka configurations, and deploying the MSK Connect jobs. Once all of them were complete, we were ready for takeoff, and our focus shifted to monitoring.
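One way to keep such preliminary steps in a safe order is to write down what each one depends on and let a topological sort produce the sequence. The sketch below uses Python's standard `graphlib` module. The step names mirror the ones mentioned above, but the dependency edges are assumptions made for illustration, not a record of how our infrastructure code was actually organized.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished before it.
# These edges are assumed for illustration only.
dependencies = {
    "networking": set(),
    "iam_roles": set(),
    "msk_cluster": {"networking", "iam_roles"},
    "kafka_configuration": {"msk_cluster"},
    "msk_connect_jobs": {"kafka_configuration", "iam_roles"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
for position, step in enumerate(order, start=1):
    print(position, step)
```

If a circular dependency sneaks into the plan, `TopologicalSorter` raises `graphlib.CycleError`, which surfaces the mistake before anything is deployed.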

In-flight checks

In the aviation world, pilots rely on a complex array of gauges and dials to guide them through the skies. Similarly, we crafted a set of Grafana dashboards to provide us with real-time insights into the migration’s progress. These dashboards served as our navigational instruments, ensuring we remained on course throughout the journey.

Photo by Mike Petrucci on Unsplash

We divided our dashboards into two primary categories:

  • Full Dashboard: This comprehensive dashboard encompassed all the metrics and indicators we might need during the migration process. It was a wealth of information, but it could also become overwhelming, especially for quick health checks.
  • High-level Dashboard: Inspired by the concept of the “six pack” in aviation, we created a simplified dashboard that distilled the critical information into a compact format. This dashboard, akin to a pilot’s essential instrumentation, provided a quick overview of the system’s health.
High-level side-by-side dashboard used to monitor the production migration.

As we navigated the migration, our high-level dashboard served as our primary reference point, allowing us to monitor the system’s vital signs and make informed decisions. The full dashboard remained at our disposal for more in-depth analysis when necessary.
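The idea behind a side-by-side view can also be expressed as a simple automated comparison: for each vital sign, check whether the new cluster's reading is within a tolerance of the old one. The following is a minimal sketch under stated assumptions: the metric names, the sample values, and the 5% tolerance are all invented for illustration, and in practice the readings would come from whatever feeds the dashboards.

```python
from typing import Dict


def relative_gap(old: float, new: float) -> float:
    """Difference between two readings as a fraction of the old reading."""
    if old == 0:
        return 0.0 if new == 0 else float("inf")
    return abs(new - old) / old


def compare_clusters(
    old_metrics: Dict[str, float],
    new_metrics: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, bool]:
    """For each metric of the old cluster, report whether the new cluster
    has a reading for it that lies within the given tolerance."""
    return {
        name: name in new_metrics
        and relative_gap(value, new_metrics[name]) <= tolerance
        for name, value in old_metrics.items()
    }


# Made-up readings, used only to exercise the comparison.
old = {"messages_in_per_sec": 40_000.0, "bytes_in_per_sec": 52_000_000.0}
new = {"messages_in_per_sec": 39_200.0, "bytes_in_per_sec": 44_000_000.0}

report = compare_clusters(old, new)
for name, healthy in report.items():
    print(f"{name}: {'ok' if healthy else 'investigate'}")
```

A check like this does not replace looking at the dashboard; it only flags which gauge deserves a closer look in the full view.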

Landing phase

As our migration progressed, we meticulously prepared for the critical landing maneuver. This involved pointing our EventHub microservice to the new MSK cluster, effectively steering all traffic to the updated infrastructure. We coordinated critical activities in real time with two other engineers whose work would be impacted. This ensured they were fully informed and had access to the necessary dashboards, checklists, and infrastructure.

Photo by Pascal Meier on Unsplash

Just as pilots have a go-around procedure for aborting a landing, we prepared our own “go-around procedure”: a pre-approved rollback pull request that could be merged quickly in case of any unforeseen issues. We then implemented the breaking change, which started the actual migration between clusters.

Together, we monitored the dashboards and performed the necessary health checks for each required domain (infrastructure, event streaming, and data processing). From implementing the breaking change to validating the health of our systems, the landing took around two hours.
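The landing logic described above (apply the change, check every domain, go around if anything looks wrong) can be summarized in a short sketch. It is a simplification under stated assumptions: in our case the switch and the rollback were pull requests and the checks were performed by people watching dashboards, so the function name `land_or_go_around` and the callables below are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def land_or_go_around(
    switch_traffic: Callable[[], None],
    rollback: Callable[[], None],
    health_checks: Dict[str, Callable[[], bool]],
) -> List[str]:
    """Apply the cutover, run every health check, and roll back if any fails.

    Returns the names of the failed domains; an empty list means the
    landing succeeded and no rollback was triggered.
    """
    switch_traffic()
    failed = [domain for domain, check in health_checks.items() if not check()]
    if failed:
        rollback()
    return failed


# Stand-ins that only record what happened, so the flow can be followed.
events: List[str] = []
failed = land_or_go_around(
    switch_traffic=lambda: events.append("switch"),
    rollback=lambda: events.append("rollback"),
    health_checks={
        "infrastructure": lambda: True,
        "event streaming": lambda: True,
        "data processing": lambda: False,  # simulated failure
    },
)
print(failed, events)
```

All checks run before the decision is made, so the rollback comes with a complete list of what failed rather than only the first symptom.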

Shutting systems down

To finalize the migration process, we undertook a thorough cleanup of the unused resources. This included decommissioning the old MSK cluster, followed by the removal of other unnecessary components. The codebase was archived, the Terraform state was purged, and outdated Kafka consumers and MSK connectors were deleted.

With the final checklist complete, we officially declared the migration a success.

Photo by Ashim D’Silva on Unsplash

Final remarks

To close, it’s worth revisiting the principles covered in the previous sections. We must also emphasize that none of them would have worked without allocating time and collaborating with other teams.

Here’s a summary of the principles applied in this initiative:

  1. Draw charts and write down an action plan
  2. Write clear checklists to help you throughout the process
  3. Brief affected teams and seek feedback from experts
  4. Create overview dashboards for quick monitoring during critical phases
  5. Create extended dashboards for full assessment of your systems
  6. Continuously communicate your intentions to the impacted parties

As a final note, a special thanks to Alexey Seleznev from Event Management and Tomás Machado from Production Analytics for their full support during this process.
