Expedia Group Technology — Platform

Downtime-Free Shift: Transitioning from Instance to IP-Based NLB amid Live Traffic

We’ll cover the steps we took, the challenges we faced, and how we overcame them along the way

Isha Batra
Expedia Group Technology

--

Photo by Jon Tyson on Unsplash

Introduction

This blog post will take you through our experience migrating services from the default Instance Type NLB (Network Load Balancer) to the IP Type NLB. We’ll cover the steps we took, the challenges we faced, and how we overcame them along the way.

Understanding the need for migration

Issue: Clients were experiencing delays and failures due to worker node churn in Kubernetes (K8s) clusters.

Root Cause: The problems stemmed from the abrupt termination of worker nodes during node churn, primarily driven by autoscaling: the Auto Scaling Group (ASG) is unaware of the NLB’s target group, so nodes were terminated without the NLB draining their connections first.

Short-term Fix: We increased the minimum capacity of the ASG to mitigate the immediate issue.

Long-term Solution: Transition to the AWS Load Balancer Controller with the IP target type to improve stability and performance.

With this approach, the NLB establishes direct connections to the pods, eliminating the extra hop through the worker node. It also enables the ingress pods to manage connection draining directly.
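
For context, the difference between the two modes comes down to a pair of Service annotations understood by the AWS Load Balancer Controller (shown here as a minimal sketch; the same annotations appear in our full configuration later in this post):

# Instance target type (the default): the NLB registers worker nodes as
# targets, and traffic takes an extra hop from node to pod.
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance

# IP target type: the NLB registers pod IPs directly as targets.
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip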

Addressing immediate concerns

As we embarked on the migration journey, we encountered immediate concerns that needed attention. During rollouts of the ingress gateway pods, new pods came up fairly quickly, but it took a while for the NLB target group health checks to mark them healthy. This led to periods where K8s had already terminated the old pods but the NLB saw no healthy targets. We worked around this by delaying pod readiness with a fixed initialDelaySeconds of 15 seconds, and by using a conservative maxSurge in the rolling update strategy. However, this workaround posed its own challenges: a fixed delay offers no guarantee, especially if the NLB health checks take longer than 15 seconds, and a maxSurge percentage is less effective when the number of pods is low.
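
For illustration, the interim workaround looked roughly like this on the ingress gateway deployment (a sketch: the 15s delay is the one mentioned above, the 25% surge is an illustrative “conservative” value, and the probe targets Istio’s standard gateway status endpoint):

spec:
  strategy:
    rollingUpdate:
      maxSurge: 25%            # conservative surge so replacement proceeds slowly
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: istio-proxy
          readinessProbe:
            httpGet:
              path: /healthz/ready   # Istio's gateway health endpoint
              port: 15021
            initialDelaySeconds: 15  # fixed delay while NLB health checks catch up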

Fortunately, we discovered a proper solution in the AWS Load Balancer Controller: the Pod Readiness Gate. By incorporating this feature into the ingress gateway pods during rolling updates, we ensure that the pod readiness condition is only set to "True" once the pod is marked "Healthy" in the NLB target group. We deployed the Istio chart with the label "elbv2.k8s.aws/pod-readiness-gate-inject: enabled" on the istio-system namespace; pods connected to an NLB in IP mode then have a readiness gate injected automatically.
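
Applied as a manifest, the namespace label looks like this; the controller then injects a readiness gate into pods in this namespace that sit behind an IP-mode target group:

apiVersion: v1
kind: Namespace
metadata:
  name: istio-system
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled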

How did we begin?

Our journey started with three pilot phases — Alpha, Beta, and Gamma — followed by the final rollout. This approach was designed to build confidence that the solution addressed the problem and to minimize unforeseen challenges during the migration. The final migration was planned for a low-traffic window; the preferred time was IST daytime.
Well, the rule was clear-cut: swap the NLB without causing an outage 🙂
Expect some network drops; perfection isn’t guaranteed! ⭐

Steps involved

1. Creating a temporary IP-based NLB

As part of our migration process, the first step involves setting up a temporary IP-based Network Load Balancer. This NLB, configured in IP mode, is deployed through Istio’s configuration: we add an ingressgatewaytemp service block to the IstioOperator custom resource, which creates the new NLB IP mode service managed by the AWS Load Balancer Controller. All the temp Istio resources are deployed globally in one go across all K8s clusters.

When setting up the new load balancer, several considerations were kept in mind:

  • It should have a different name; the Istio operator gets confused if two ingress gateways share the same name.
  • It should point to the same Fully Qualified Domain Name (FQDN) as the current one.
  • It should be configured as type nlb-ip mode to align with our migration strategy.

It’s essential to note that this temporary NLB won’t affect our traffic flow, as all traffic is currently served by the default Instance-based NLB.

- name: istio-ingressgatewaytemp
  k8s:
    overlays:
      - kind: Deployment
        name: istio-ingressgatewaytemp
        patches:
          - path: spec.template.spec.terminationGracePeriodSeconds
            value: {{ include "istio-ingress.terminateGraceful" . }}
          - path: spec.template.metadata.labels.tags\.datadoghq\.com/service
            value: federated-istio-operator-config
    affinity:
    resources:
      requests:
        cpu: {{ .Values.ingressGateways.resources.requests.cpu | quote }}
        memory: {{ .Values.ingressGateways.resources.requests.memory | quote }}
      limits:
        cpu: {{ .Values.ingressGateways.resources.limits.cpu | quote }}
        memory: {{ .Values.ingressGateways.resources.limits.memory | quote }}
    service:
      externalTrafficPolicy: {{ .Values.ingressGateways.service.externalTrafficPolicy }}
      ports:
        - name: https
          port: 443
          targetPort: 8443
    hpaSpec:
      maxReplicas: {{ .Values.ingressGateways.hpaSpec.maxReplicas }}
      minReplicas: {{ .Values.ingressGateways.hpaSpec.minReplicas }}
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: {{ .Values.ingressGateways.hpaSpec.cpuavg }}
      {{- with .Values.ingressGateways.hpaSpec.behavior }}
      behavior:
        {{- toYaml . | nindent 14 }}
      {{- end }}
    podDisruptionBudget:
      maxUnavailable: {{ .Values.ingressGateways.podDisruptionBudgetSpec.maxUnavailable }}
    strategy:
      rollingUpdate:
        maxSurge: "100%"
        maxUnavailable: "25%"
    podAnnotations:
      proxy.istio.io/config: |
        concurrency: {{ .Values.ingressGateways.concurrency }}
        terminationDrainDuration: {{ .Values.ingressGateways.terminationdrain }}s
        drainDuration: {{ .Values.ingressGateways.drainDuration }}s
    serviceAnnotations:
      external-dns.alpha.kubernetes.io/hostname: "{{ .Values.ingressGateways.dnsSuffix }}"
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: {{ .Values.ingressGateways.lb_tags }}
      service.beta.kubernetes.io/aws-load-balancer-subnets: {{ .Values.ingressGateways.subnet }}
      {{- with .Values.ingressGateways.extraAnnotations }}
      {{- toYaml . | nindent 12 }}
      {{- end }}
  enabled: {{ .Values.ingressGateways.enabled }}

One of our main concerns during migration was ensuring uninterrupted traffic flow to both old and new Istio pods, so we designed the migration strategy to keep traffic flowing to both seamlessly. We achieved this by setting the min and max HPA (Horizontal Pod Autoscaling) to 10 for istio-ingressgatewaytemp, so that when we later bring down the temp NLB and its Istio deployment, only a small share of traffic is running on the temp pods. Traffic from the temp NLB goes to the pods of both the default and temp IngressGateway deployments, because both Istio deployments use the same selectors.
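
Concretely, pinning the temp gateway’s HPA amounts to something like this in the chart values (a sketch; the exact values path depends on how the temp gateway is wired):

ingressGateways:
  hpaSpec:
    minReplicas: 10
    maxReplicas: 10   # pinned so the temp gateway cannot scale up and hold too much traffic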

2. Update all the virtual services to point to the temp IP NLB

Now that we’ve set up our temporary IP-based NLB, the next crucial step is to ensure that all virtual services are directed towards this temp IP NLB. We’re using external-dns to manage this, which allows us to override service endpoints. Usually, external-dns automatically updates Route53 with NLB endpoints. However, when there are multiple ingress NLBs, it might overlook the new one.
The simplest and most efficient solution is to update the current DNS entries to point to the new NLB. This can be achieved with a simple annotation on the VirtualService: “external-dns.alpha.kubernetes.io/target”.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/aws-geolocation-country-code: '*'
    external-dns.alpha.kubernetes.io/set-identifier: <region>
    external-dns.alpha.kubernetes.io/target: <temp NLB DNS>
    meta.helm.sh/release-name: <Name of Helm Release>
    meta.helm.sh/release-namespace: <Namespace of Helm Release>

3. Leveraging Kyverno for automated migration

Of course, we’re cautious about migrating all services to the temp IP NLB at once across all environments to avoid potential widespread issues. Therefore, we’ve broken down the process into phases. Initially, we migrate one application on one cluster, followed by the entire cluster. Then, we tackle one PCI category within a region, followed by another PCI category in the same region. Next, we migrate all clusters within that region. After a week of monitoring for any issues reported by app owners in that region, we repeat the same phased approach for different regions.

But how did we streamline this complex migration process? ⚡

We’ve used Kyverno to automate this. Kyverno is a policy engine built for K8s-native policy management; it enables us to define and enforce policies using K8s Custom Resource Definitions (CRDs).
Specifically, we implemented a Kyverno ClusterPolicy to automate the mutation of VirtualService resources within K8s clusters. This involved rules that mutate resources created during specific operations, as well as mutations applied to existing VirtualServices based on predefined conditions.

Here’s a glimpse of how we utilized Kyverno to migrate all Virtual Services to the temp IP NLB on a single cluster:

- name: nlb-target-mutation-on-request
  match:
    any:
      - resources:
          kinds:
            - VirtualService
  context:
    - name: lbendpoint
      apiCall:
        urlPath: "/api/v1/namespaces/istio-system/services/istio-ingressgatewaytemp"
        jmesPath: status.loadBalancer.ingress[0].hostname
    - name: cluster
      configMap:
        name: cluster-properties
        namespace: kube-public
  mutate:
    targets:
      - apiVersion: networking.istio.io/v1alpha3
        kind: VirtualService
    patchStrategicMerge:
      metadata:
        annotations:
          external-dns.alpha.kubernetes.io/target: "{{ lbendpoint }}"
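
The rule above loads a cluster context from the cluster-properties ConfigMap; restricting it to a single pilot cluster can then be expressed with a precondition along these lines (the ConfigMap key name is illustrative):

preconditions:
  all:
    - key: "{{ cluster.data.clusterName }}"   # hypothetical key in the cluster-properties ConfigMap
      operator: Equals
      value: <pilot cluster name>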

Utilizing GitOps principles, we deploy the Kyverno cluster policies seamlessly: once changes are merged into the GitHub master branch, the policies are automatically rolled out across all clusters. Since the policy above targets one specific K8s cluster, it is applied exclusively to that cluster, ensuring targeted and efficient policy enforcement.

4. Migrating default Instance NLB to IP mode

Having successfully migrated all virtual services to the temp IP-based NLB and thoroughly verified that no requests are being handled by the default Instance NLB, we are now ready to proceed with converting the Instance NLB to an IP NLB.

To enable support for external NLB using IP mode in AWS Load Balancer Controller, we just need to add a specific annotation to the existing ingress gateway service configuration in Istio:

service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

Note that deleting the default ingress gateway block would cause the operator to recreate a default ingress gateway setup, so it’s important not to remove it.

5. Migrating virtual services from temp NLB to default IP NLB

Once again, we’ll deploy a Kyverno rule that dynamically retrieves the hostname of the default LB, which has now been converted to IP mode, and updates the virtual service annotations accordingly. As in previous phases, the rollout is gradual: one application on one cluster, then the entire cluster, then one PCI category within a region followed by the other, then all clusters in that region, and, after a week of monitoring for issues reported by app owners, the same phased approach in the remaining regions.
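
The only material change from the earlier policy is the apiCall target, which now resolves the default ingress gateway service rather than the temp one (sketch below; the service name istio-ingressgateway is Istio’s standard default and is an assumption here):

context:
  - name: lbendpoint
    apiCall:
      # resolve the default gateway service, now converted to IP mode
      urlPath: "/api/v1/namespaces/istio-system/services/istio-ingressgateway"
      jmesPath: status.loadBalancer.ingress[0].hostname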

6. Deleting additional resources created during migration

Lastly, once we’ve confirmed that the temporary NLB isn’t handling any requests, we simply delete it, as it’s no longer necessary.
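
In our setup this amounts to removing the temp gateway block from the IstioOperator configuration, sketched below as a values toggle (the flag name is hypothetical and depends on how the temp gateway is wired into the chart). The AWS Load Balancer Controller then tears down the NLB it created for the temp service.

# Hypothetical values toggle for the temp gateway
ingressGatewayTemp:
  enabled: false   # disabling the istio-ingressgatewaytemp service lets the controller delete its NLB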

Observability

Certainly, we cannot perform the entire migration without carefully monitoring all the components involved in the process. Throughout, we tracked the following:

  • Are AWS LB controllers operating smoothly on all clusters?
  • Are Istio pods functioning correctly?
  • Is there any scenario where the load balancer might still be serving traffic despite virtual services being migrated to another NLB?
  • Are there any Istio temporary services in a pending state? This could potentially cause issues when migrating virtual services to the temp NLB, as it may not find the FQDN of the temp service.
  • Monitoring the NLB flow count, to ensure it remains comparable before and after the migration (see the alarm sketch after this list).
  • Observing Kyverno rule failures, to confirm the rule has been applied to all virtual services and to surface any errors that occurred during the process.
  • Monitoring that the request count has dropped to zero, or near zero, before proceeding with the deletion of the temp NLB.
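
As a concrete sketch of the flow-count check, an alarm along these lines (CloudFormation-style) can flag a temp NLB that is still carrying traffic. ActiveFlowCount and the AWS/NetworkELB namespace are standard NLB CloudWatch metrics; the alarm wiring and names here are illustrative.

TempNlbActiveFlowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Temp NLB is still serving flows; do not delete it yet
    Namespace: AWS/NetworkELB
    MetricName: ActiveFlowCount              # concurrent flows through the NLB
    Dimensions:
      - Name: LoadBalancer
        Value: net/<temp-nlb-name>/<id>      # NLB CloudWatch dimension format
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold # fires while flows remain above zero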

Conclusion

In short, our journey to migrate was challenging but successful. We carefully planned, executed smartly, and automated the process, smoothly moving our services to the IP Type NLB. This ensured our services remained uninterrupted and reliable.
