Does Kubernetes really give you multicloud portability?

by Seth Dobson — Principal Cloud Architect II, John Roach — Senior Cloud Architect, McKinsey & Company

We have been working with Kubernetes regularly since 2017, and in that time, we have migrated countless applications to the cloud leveraging Kubernetes as the back-end container orchestration platform. Some of those migrations have gone well, while others have been quite challenging. Similarly, we have leveraged cloud service provider (CSP) native container orchestration solutions to do the same and had similar results with ease of migration. This article is not intended to talk about those experiences, or to champion one technology over the other, but rather to talk about the reasons business leaders and architects choose to leverage Kubernetes.

In our experience, depending on your organizational structure and operating model, leveraging Kubernetes at scale comes with more overhead than leveraging other CSP-native solutions such as AWS Elastic Container Service (ECS), AWS Batch, Lambda, Azure App Service, Azure Functions, or Google Cloud Run. Most of this extra overhead comes into play simply because of what Kubernetes is when compared with the other products.

Kubernetes is an open-source container orchestration engine that is by its very nature designed to run anywhere. Its architecture is brilliant in how it achieves this portability by utilizing plugins and extensions natively. However, it is the responsibility of the operator of the cluster to manage these plugins and operate them. We know that certain services like EKS, GKE, and AKS are working to make this experience better. Even then, you must select your version of Kubernetes, install and configure the plugins, and ensure compatibility between your deployment manifests, application interfaces, and the exposed APIs of the Kubernetes cluster as well as these plugins.

We know this is “normal” maintenance for most enterprises and doesn’t scare them off, but we want to ask why. Why would you take on this maintenance? Why sign up for overhead when the CSP-native solutions maintain backward compatibility of their APIs for years longer than Kubernetes?

When we push on this topic, the most common response is that business leaders and architects are concerned about vendor lock-in and/or feel that their application must run actively either in multiple CSPs or in CSPs and a data center. However, most of these same organizations are leveraging CSP-native solutions for their databases and, in some cases, leveraging functions-as-a-service capabilities for their greenfield applications. If a company were genuinely concerned about vendor lock-in to this level, it would be fully relying on Kubernetes, running its own databases, and hosting all its own tools and systems rather than leveraging the CSP-native solutions at all.

Some industries (high tech) may require the engineering capacity to run Kubernetes at this level or scale, but most industries (banking, automotive, manufacturing, etc.) typically do not have the same business drivers. If you find yourself in such an industry and want to maximize the value that cloud can bring, then this article is for you.

The findings from our experiment (detailed below) show that, given an application designed to run in one CSP’s managed Kubernetes and integrated with that CSP’s other services (e.g., DNS, load balancing, database), it took roughly the same effort to migrate that application to another CSP’s managed Kubernetes as it did to migrate it to another CSP’s native container orchestration service. Given our findings, we feel that organizations that default to Kubernetes solely for future portability are limiting the value that the cloud can provide them, especially given the vast array of broader technology drivers at play. To maximize the value of the cloud, an organization should leverage the highest-order CSP-native services available for a given workload. Publishing a decision matrix similar to the one below is a good way to guide solutions architects and developers within your organization on which compute solutions to choose.

The Experiment

Our working hypothesis is that managed Kubernetes is a nice-to-have, and that application architecture and data gravity are the largest factors in cloud migrations. There are scenarios where Kubernetes is the only choice, such as applications that cannot run in Google Cloud Run and Azure App Service. Not all CSPs offer a service that can orchestrate containers in a manner similar to Kubernetes; rather, those CSPs opted to provide a managed Kubernetes. Thus, we will not be analyzing workloads that fall into this category because they will likely land on Kubernetes by default if multicloud is truly a requirement.

For our experiment, we chose a 12-factor application that is published by Google called the microservices-demo. Then we stood the application up in Google Cloud’s GKE, Azure AKS, AWS EKS, and AWS ECS and measured the effort of moving the workload across all three CSPs using Kubernetes, and the effort of moving the workload from Google GKE to AWS ECS. The results in engineering effort are detailed below, and most of the effort documented is in initial system setup. We feel that further migrations for all three compute solutions would be significantly shorter, but also not differentiated between compute solutions.

The application architecture of the microservices-demo app was as follows:

We considered the “migrations” complete after we had the application running with no errors in the logs and the logs offloaded to a log aggregation solution. We acknowledge that more effort would be required to make the deployment production ready, but the application itself is not a production-ready product, so we omitted this scope. We also added the constraint that we would not modify the source code of the microservices-demo application, as changing the source could make our work easier and influence our findings.

GKE

Google, in its microservices-demo, provides the required Kubernetes deployment configurations; however, it does not provide the code for the necessary infrastructure. We chose to use the GKE-Autopilot type cluster deployment to make the deployment and management a more effortless experience. The Autopilot type deployment ensured GKE provisions and manages the cluster’s underlying infrastructure, including nodes and node pools, giving us an optimized cluster with a hands-off experience. Below is the architecture of what the application looked like running in GKE.

The process for getting the application up and running in the GKE cluster was as follows:

  1. Set up the required VPC for GKE.
  2. Set up the necessary DNS zones. These zones will be used by the external-dns service to create the required DNS records for the application.
  3. Build the GKE cluster using Autopilot
  4. Set up the necessary service account permissions to allow Autopilot to configure the essential cluster monitoring capabilities.
  5. Create the necessary service accounts for external-dns to manage the DNS records.

Once the cluster infrastructure was in place, additional steps were needed for the Kubernetes deployment itself:

  1. Installed the external-dns service.
  2. Defined a ManagedCertificate via the networking.gke.io/v1 API for the SSL certificate used on the load balancer.
  3. Created a Service definition that uses a network endpoint group (NEG) in the GKE VPC-native cluster. Ingress is the recommended way to use container-native load balancing, as it has many features that simplify the management of NEGs. When NEGs are used with GKE Ingress, the Ingress controller facilitates the creation of all aspects of the load balancer, which includes creating the virtual IP address, forwarding rules, health checks, firewall rules, and more.
  4. Created a FrontendConfig definition via the networking.gke.io/v1beta1 API to ensure a rule exists to redirect HTTP traffic to HTTPS.
  5. Created a new Ingress leveraging the previously created Service and FrontendConfig. The external-dns service also uses this Ingress definition to configure the necessary records to point to the load balancer.
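To make the GKE-specific pieces concrete, here is a minimal Python sketch that builds the four objects from the steps above as plain dictionaries and prints them as JSON, which kubectl accepts. It is not the configuration we actually deployed: the object names, the example hostname, the `app` selector label, and the 80/8080 ports are assumptions for illustration, and the function name `gke_frontend_objects` is hypothetical.

```python
import json


def gke_frontend_objects(host, service_name="frontend", service_port=80, target_port=8080):
    """Return ManagedCertificate, FrontendConfig, Service, and Ingress as dicts.

    All names, labels, and ports below are illustrative assumptions.
    """
    cert = {
        "apiVersion": "networking.gke.io/v1",
        "kind": "ManagedCertificate",
        "metadata": {"name": "frontend-cert"},
        "spec": {"domains": [host]},
    }
    frontend_config = {
        "apiVersion": "networking.gke.io/v1beta1",
        "kind": "FrontendConfig",
        "metadata": {"name": "frontend-https-redirect"},
        # Ask the load balancer to redirect HTTP requests to HTTPS.
        "spec": {"redirectToHttps": {"enabled": True}},
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": service_name,
            # Container-native load balancing: have Ingress create NEGs.
            "annotations": {"cloud.google.com/neg": json.dumps({"ingress": True})},
        },
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": service_name},
            "ports": [{"port": service_port, "targetPort": target_port}],
        },
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": "frontend-ingress",
            "annotations": {
                # GKE-only annotations tie the Ingress to the objects above.
                "networking.gke.io/managed-certificates": cert["metadata"]["name"],
                "networking.gke.io/v1beta1.FrontendConfig": frontend_config["metadata"]["name"],
            },
        },
        "spec": {
            # external-dns can derive the DNS record from the rule's host.
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/*",
                    "pathType": "ImplementationSpecific",
                    "backend": {"service": {"name": service_name,
                                            "port": {"number": service_port}}},
                }]},
            }],
        },
    }
    return [cert, frontend_config, service, ingress]


if __name__ == "__main__":
    for obj in gke_frontend_objects("shop.example.com"):
        print(json.dumps(obj, indent=2))
```

Note how much of this is specific to GKE: two of the four objects use `networking.gke.io` API groups, and the other two carry Google-specific annotations.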

Overall, the configuration of the cluster and the deployment of the microservices-demo with the additional configurations took roughly two days.

However, it must be noted that the Ingress configuration, which defined the load balancer and ensured HTTP was redirected to HTTPS, relied on an API still in beta (networking.gke.io/v1beta1). Another critical note on this configuration is that the FrontendConfig also creates another load balancer to forward the traffic, as visualized below.

AKS — Migration Effort Two Days

For AKS, to test out the ease of portability, we decided to deploy the AKS cluster with AKS Virtual Nodes. With virtual nodes, pods are provisioned quickly and you pay per second only for their execution time. You don’t need to wait for the Kubernetes cluster autoscaler to deploy VM compute nodes to run the additional pods. However, we noticed that the microservices-demo’s frontend and redis-cart components would intermittently fail under specific loads. We therefore decided to deploy these components to a separate node pool and allowed the remaining services to deploy to the virtual nodes. Below is the architecture of what the application looked like running in AKS.

To set up the cluster and deploy the microservices-demo, we took the following steps:

  1. Set up Azure Network for AKS. As part of this effort, created three separate subnets, one for Virtual Node (ACI), one to be used by the Gateways, and one for the remaining cluster components.
  2. Set up necessary DNS zones.
  3. Set up Log Analytics Workspace for the AKS cluster.
  4. Set up the AKS cluster.
  5. Enabled the following Kubernetes add-ons:
     • Monitoring: Container Insights monitoring with the cluster
     • Virtual Node (ACI): Use virtual nodes with the cluster
     • ingress-appgw: Application Gateway Ingress Controller with your AKS cluster

Once the infrastructure was complete, the following Kubernetes deployment configurations needed to be done:

  1. Installed and configured external-dns service
  2. Installed and configured cert-manager service
  3. Changed deployment definitions provided by the microservices-demo to allow deployments to be done on Virtual Node node pools by defining the required nodeSelector and tolerations
  4. An Ingress that used the ingress-appgw add-on and cert-manager service needed to be defined. This Ingress definition will also be leveraged by the external-dns service, which will configure the necessary records to point to the gateway.
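The nodeSelector and tolerations change in step 3 can be sketched as a small Python function that patches a Deployment manifest loaded as a dictionary. The selector and toleration values are the ones commonly shown in Azure's virtual nodes documentation and are an assumption here; verify them against the labels and taints on your own virtual node. The function name `schedule_on_virtual_nodes` and the `PINNED_TO_VM_POOL` set are hypothetical.

```python
import copy

# Assumed values; check them against your cluster's virtual node.
VIRTUAL_NODE_SELECTOR = {
    "kubernetes.io/role": "agent",
    "kubernetes.io/os": "linux",
    "type": "virtual-kubelet",
}
VIRTUAL_NODE_TOLERATIONS = [
    {"key": "virtual-kubelet.io/provider", "operator": "Exists"},
    {"key": "azure.com/aci", "effect": "NoSchedule"},
]

# Components we kept on a regular VM-backed node pool.
PINNED_TO_VM_POOL = {"frontend", "redis-cart"}


def schedule_on_virtual_nodes(deployment):
    """Return a copy of the Deployment that targets virtual nodes.

    Deployments named in PINNED_TO_VM_POOL are returned unchanged.
    """
    patched = copy.deepcopy(deployment)
    if patched["metadata"]["name"] in PINNED_TO_VM_POOL:
        return patched
    pod_spec = patched["spec"]["template"]["spec"]
    pod_spec.setdefault("nodeSelector", {}).update(VIRTUAL_NODE_SELECTOR)
    pod_spec.setdefault("tolerations", []).extend(
        copy.deepcopy(VIRTUAL_NODE_TOLERATIONS)
    )
    return patched
```

The application containers stay untouched, but every manifest that should run on virtual nodes picks up Azure-specific scheduling hints.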

Overall, the configuration of the cluster and the deployment of the microservices-demo with the additional configurations took roughly two days.

However, it must be noted that due to the required add-ons and services needed for the whole Kubernetes experience, the amount of toil necessary for maintaining this cluster increased. Unlike in GKE Autopilot, AKS add-ons were required for monitoring, using Virtual Node and Application Gateway. Furthermore, AKS needed the cert-manager service to automate cert-management on the load balancers. All these components require maintenance by the admin of the cluster.

EKS — Migration Effort Two Days

Moving the workload to EKS was less straightforward than you would imagine given that we had the Kubernetes manifests from the GKE deployment. We chose not to utilize Fargate for the EKS implementation because, at the time, logging on Fargate required a sidecar; we opted instead for EC2 nodes with a DaemonSet running to gather logs. The architecture of the EKS migration is below, followed by a description of the migration process.

Environment Configuration

  1. Set up the VPC for EKS.
  2. Set up the Route53 Domain.
  3. Provision a certificate from ACM.
  4. Build the EKS Cluster.
  5. Provision the Managed Node Groups for the cluster.

Migration Effort

  1. Install the Kubernetes plugins:
     • External DNS Plugin
     • AWS LoadBalancer Controller
     • AWS Container Insights with Fluent Bit
  2. Modify the Kubernetes manifests to utilize the new plugins:
     • Modified the nodeSelector and tolerations.
     • Created an Ingress definition for the externally exposed endpoint for the application that handled creating the ALB, managing R53 records, and applying the certificates previously created
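As a rough illustration of that last Ingress definition, the Python sketch below builds an Ingress for the AWS Load Balancer Controller as a dictionary. The annotation keys are real controller annotations, but the values chosen here (internet-facing scheme, `ip` target type, redirect to port 443), the object name, and the function name `eks_alb_ingress` are assumptions rather than our exact configuration. The certificate ARN is a parameter and is never filled in with a real value.

```python
import json


def eks_alb_ingress(host, service_name, service_port, certificate_arn):
    """Return an Ingress dict for the AWS Load Balancer Controller.

    Annotation values are illustrative; adjust them to your environment.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": "frontend-ingress",
            "annotations": {
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
                "alb.ingress.kubernetes.io/target-type": "ip",
                # Certificate provisioned beforehand in ACM.
                "alb.ingress.kubernetes.io/certificate-arn": certificate_arn,
                "alb.ingress.kubernetes.io/listen-ports": json.dumps(
                    [{"HTTP": 80}, {"HTTPS": 443}]
                ),
                "alb.ingress.kubernetes.io/ssl-redirect": "443",
            },
        },
        "spec": {
            "ingressClassName": "alb",
            # The External DNS plugin can create the Route53 record from this host.
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service_name,
                                            "port": {"number": service_port}}},
                }]},
            }],
        },
    }
```

The routing rules are plain Kubernetes; the annotations and the ingress class are the AWS-specific parts.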

This process took us roughly two days, most of which was spent analyzing which plugins we would need to achieve our goals with the EKS ecosystem.

However, much like the AKS configuration, we have a few plugins that need to be installed, monitored, and operated in order for the application to successfully run in the EKS cluster. Organizations will therefore be assuming the burden of upgrades, maintenance, and incident management for these third-party plugins.

ECS — Migration Effort Two Days

Moving the workload to ECS seemed at first to be a large effort but was not all that challenging. We ran into one major challenge while getting the application running: the application was hardcoded to use the insecure setting for its gRPC calls. This led to a few hours of head scratching, as we could reach containers directly but could not reach them through an AWS Application Load Balancer, because at the time of our experiment ALBs did not support unencrypted gRPC traffic. This was not a problem with EKS because service-to-service calls there do not use ALBs for east/west traffic; they use the built-in Kubernetes Services instead. While this may seem like a blocker, we were able to quickly pivot to using AWS Cloud Map for service-to-service traffic. With the gRPC issue solved, the architecture and steps for the ECS solution are as follows:

Environment Configuration

  1. Set up the VPC for ECS.
  2. Set up the Route53 Domain.
  3. Provision a certificate from ACM.
  4. Set up Cloud Map.
  5. Set up ECS Cluster with Fargate & Container Insights configured.

ECS Migration Effort

  1. Used the Kubernetes manifests from the GKE deployment as the reference for writing the Terraform scripts that deploy the ECS tasks and ECS services, create the Route53 records, and configure the ALB and Cloud Map.
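We did this translation in Terraform, but the mapping itself is simple enough to sketch in Python. The function below turns a Kubernetes Deployment (as a dictionary) into the shape of an ECS task definition for Fargate. The function name `to_ecs_task_definition`, the default CPU and memory sizes, and the log group parameter are assumptions for illustration; environment variables that hold Kubernetes service addresses would also need to point at the corresponding Cloud Map names, which this sketch does not attempt.

```python
def to_ecs_task_definition(deployment, region, log_group, cpu="256", memory="512"):
    """Map a Kubernetes Deployment dict to an ECS task definition dict.

    Only literal env values are carried over; valueFrom entries are skipped.
    The cpu and memory defaults are placeholder Fargate sizes.
    """
    name = deployment["metadata"]["name"]
    containers = deployment["spec"]["template"]["spec"]["containers"]

    container_definitions = []
    for container in containers:
        container_definitions.append({
            "name": container["name"],
            "image": container["image"],
            "essential": True,
            "portMappings": [
                {"containerPort": port["containerPort"], "protocol": "tcp"}
                for port in container.get("ports", [])
            ],
            "environment": [
                {"name": env["name"], "value": env["value"]}
                for env in container.get("env", [])
                if "value" in env
            ],
            # Ship container logs to CloudWatch Logs without any extra agent.
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": log_group,
                    "awslogs-region": region,
                    "awslogs-stream-prefix": name,
                },
            },
        })

    return {
        "family": name,
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,
        "memory": memory,
        "containerDefinitions": container_definitions,
    }
```

Because every service in the demo follows the same pattern, one mapping like this (or one Terraform module) covers all of them.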

This process took us roughly two days, and we had the application running and logging with zero VMs and zero plugins needed to achieve logging.

The largest difference in development effort between the ECS deployment and the Kubernetes-based deployments was the creation of the Terraform scripts for the ECS Tasks and Services. Those took an afternoon to write, but once the code was written, we were able to reuse it for every other service. The burden of platform upgrades, maintenance, and incident management is shifted to the AWS side of the shared responsibility model in this scenario, freeing up an organization’s staff to focus more on the differentiated code that drives business value.

Summary

All in all, deploying to a managed Kubernetes can’t be considered fully portable (or the silver bullet for portability), as there are add-ons or services that you would need to install and manage to ensure the application is deployed and configured as it is supposed to be. You are spending less time on the core components of the deployment topology, and most of the cloud-dependent configurations come into play when you wish to have critical capabilities such as:

  • Automated DNS record management
  • Automated managed certs
  • Monitoring
  • Load balancer management
  • Secrets Integration
  • Scaling

If you use fully managed or serverless node options (e.g., AWS Fargate, AKS Virtual Nodes, GKE Autopilot), you will come across limitations that might affect application behavior, such as not being able to host state or use DaemonSet-type deployments. Falling back to node pools that you administer yourself means that, as a cluster admin, you are now responsible for managing upgrades as well as scaling. All of that is to say that Kubernetes is the higher-maintenance solution for an enterprise, but that is not a bad thing, as it is also the most flexible solution.

While there are certainly concerns about cloud portability as it pertains to the CSP services, we feel that those concerns when applied to container orchestration do not hold much water. The effort to migrate from GKE to ECS Fargate was similar to the effort to move from GKE to EKS/AKS, which we feel proves that the “portability” argument doesn’t really stand up. Vendor lock-in in the cloud is somewhat inevitable as you shift toward leveraging the higher-order compute services and start to shift your data to the CSP-managed services as well. Kubernetes is a powerful tool, and if you have a solid technical reason, of which there are many, or simply need the app to run both in and out of the cloud, then Kubernetes just may be right for you. However, we are seeing too many organizations putting Kubernetes at the top of their compute decision matrix and therefore not realizing the full value of the CSP that they are deploying within.

If you want more information about containers and multicloud portability, our colleagues published an article recently talking about just that: https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/getting-the-most-from-cloud-services-and-containers.
