Update Istio the GitOps way

Thomas Perronin · Published in BlaBlaCar · 9 min read · Mar 18, 2022

At BlaBlaCar, we use Istio in our clusters: an open source service mesh that layers transparently onto existing distributed applications. A common use case of a service mesh is to handle the network traffic in a Service Oriented Architecture. In our case, the mesh is built on top of Kubernetes and offers several features like advanced traffic management, authentication, authorization, observability, and more.

The GitOps philosophy can be seen as an extension of the “infrastructure-as-code” (IaC) concept, where every part of your architecture and applications is described in Git. The main difference lies in the CI/CD, which automatically validates and deploys all your resources.

The BlaBlaCar Engineering team chose to embrace the GitOps methodology, so in this article we will explain why we needed to build our own tooling to handle the deployment of our Istio mesh.

Leaving the Helm chart deployment method

We started to use Istio in production with version 1.0, and we happily used the official Istio Helm chart until version 1.4. After that version, the Istio team announced the end of support for the Istio Helm chart (cf Istio 1.4 — Message of deprecation of Helm install method, and Upgrade Notes — Istioldie 1.6 End of Helm chart).

After some discussion with the Google Tech team, Istio maintainers and community members, we decided to stay on version 1.4 and wait for the release of 1.6, because that version came with the canary update. The canary method makes the cohabitation of two Istio control planes possible, which offers the possibility to test a new version and upgrade without issue, but it also meant we had to leave the Helm chart deployment method. (cf Upgrade from 1.4 to 1.6)

At that time the official Istio method was the istioctl install command. The future option to deploy Istio seemed to be the Istio operator, but that option was not ready yet (and still is not as of 02/2022). So we had to use the new Istio method. (cf Istio install guide)

The istioctl install method uses an IstioOperator custom resource (Istio / IstioOperator Options) to declare and configure the mesh. We could have used it and committed that configuration file to Git to save it. But that would have forced us to update the control plane manually from our computers, or to create a specific CI job to apply it.

As we use the GitOps methodology, having a CI job deploy the Istio mesh does not suit us. Instead we wanted to use Flux CD, as for any other workload in our clusters. Flux is not just there to deploy: it also ensures that the cluster is synced with the source repository, through a reconciliation process between our git repository and the cluster.

The other downside of using the istioctl install method is that we have some custom tweaks in the deployment of the Istio mesh. With the Helm chart, it was easy to edit the templates if the official one did not expose enough configuration. With the istioctl install method, it would not be that simple anymore.

We decided to go with a new solution: using the istioctl manifest generate command plus a homemade script to cover our needs.

Canary deployment with istioctl manifest generate

The Istio team does not recommend using istioctl manifest generate to install the service mesh (cf Istio / Install with Istioctl). However, at BlaBlaCar we find it suitable for our GitOps approach, hence we are sharing it as an example.

All the steps below could be done manually, but we decided to implement a Golang script that automates the generation of resources and avoids human errors during the process.

Our approach fully leverages the canary update feature: for each new major version of Istio we deploy a new revision of the control plane, to smooth the transition and reduce the risk of incidents.

Istio update process steps (light blue are manual, dark blue done by our script)

You can see above all the steps to update our Istio mesh.

  1. Istio admins populate the official istio-operator.yaml file
  2. To start, the script calls the istioctl manifest generate command
  3. Then, the generated resources are split and grouped by Kubernetes Kind in separate folders
  4. Then, unneeded resources are cleaned up
  5. Then, our custom git patches are applied
  6. At the end, everything is committed and deployed with Flux CD.
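
To make the flow above concrete, here is a minimal sketch of what such a script can look like, assuming the generation step is a single istioctl call followed by helper functions. The names, paths and flags below are illustrative, not BlaBlaCar's actual implementation.

package main

import (
    "log"
    "os"
    "os/exec"
    "path/filepath"
)

func main() {
    revision := "istio111"                         // revision name, e.g. for Istio 1.11
    operatorFile := "istio111/istio.operator.yaml" // the IstioOperator configuration
    outDir := filepath.Join("istio111", "generated")

    // Step 2: generate the full manifest from the IstioOperator file.
    out, err := exec.Command("istioctl", "manifest", "generate",
        "-f", operatorFile, "--set", "revision="+revision).Output()
    if err != nil {
        log.Fatalf("istioctl manifest generate: %v", err)
    }

    // Step 3: split the multi-document YAML by Kind into outDir.
    if err := os.MkdirAll(outDir, 0o755); err != nil {
        log.Fatal(err)
    }
    if err := splitByKind(out, outDir); err != nil {
        log.Fatal(err)
    }

    // Steps 4 and 5 (cleanup of obsolete resources and git patches) are
    // detailed further below; step 6 is a regular git commit that Flux CD
    // then reconciles into the cluster.
}

// splitByKind writes each YAML document into a folder named after its Kind;
// a possible implementation is sketched in step 3 below.
func splitByKind(manifest []byte, dir string) error { return nil }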

1. We use the official istio-operator resource

Istio / IstioOperator Options

That resource allows us to modify pretty much every aspect of the Istio installation. That’s the starting point of our Istio configuration.

2. Our Golang script calls the istioctl manifest generate command

out, err := exec.Command(istioctlBin, "manifest", "generate", "-f", operatorFile, "--set", revision).Output()

We are using the Istio canary deployment, and the revision flag we choose is based on the Istio version we are installing (istio110 for Istio 1.10 or istio111 for Istio 1.11). The command above will generate a big output file with all the Kubernetes resources needed, tagged with our revision flag.
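
As an illustration, the revision value can be derived from the version string like this; the helper name and the exact --set revision format are assumptions for the example, not necessarily the script's real code.

package main

import (
    "fmt"
    "strings"
)

// revisionName turns a version string such as "1.11" into a revision
// label such as "istio111".
func revisionName(version string) string {
    return "istio" + strings.ReplaceAll(version, ".", "")
}

func main() {
    rev := revisionName("1.11")
    // The flag passed to istioctl would then look like:
    fmt.Println("--set revision=" + rev) // prints: --set revision=istio111
}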

3. We split all the generated resources into dedicated folders and files

Each resource is separated into a dedicated file and organized in a folder matching its Kind. That way it is easier for us to read, and git diff stays simple: there is no guarantee that the generate command will output the resources in the same order, which would otherwise create unwanted git diffs. We also put all the resources of one revision in its own folder.
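
A possible implementation of that split step, assuming the multi-document YAML output is parsed with gopkg.in/yaml.v3 and written into per-Kind folders like the ones shown below (the package name and file naming are illustrative):

package istiogen

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"

    "gopkg.in/yaml.v3"
)

// kindAndName is the minimal metadata we need from each manifest.
type kindAndName struct {
    Kind     string `yaml:"kind"`
    Metadata struct {
        Name string `yaml:"name"`
    } `yaml:"metadata"`
}

// splitByKind writes each document of a multi-document YAML stream into
// <dir>/<kind>/<name>.yaml, so files are stable across runs and easy to
// diff in git.
func splitByKind(manifest []byte, dir string) error {
    for _, doc := range strings.Split(string(manifest), "\n---\n") {
        if strings.TrimSpace(doc) == "" {
            continue // skip empty documents
        }
        var m kindAndName
        if err := yaml.Unmarshal([]byte(doc), &m); err != nil {
            return err
        }
        if m.Kind == "" {
            continue // not a Kubernetes resource
        }
        kindDir := filepath.Join(dir, strings.ToLower(m.Kind))
        if err := os.MkdirAll(kindDir, 0o755); err != nil {
            return err
        }
        file := filepath.Join(kindDir, fmt.Sprintf("%s.yaml", m.Metadata.Name))
        if err := os.WriteFile(file, []byte(doc), 0o644); err != nil {
            return err
        }
    }
    return nil
}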

The common folder hosts all Kubernetes resources that cannot be duplicated or that are specialized, for example to indicate which revision is in charge.

You will also find in this folder the latest custom resource definitions for Istio, as well as RBAC related files, since they can be used by both revisions.

istio-system/
├── common
│ ├── clusterrole
│ ├── clusterrolebinding
│ ├── customresourcedefinition
│ ├── istiod-svc.yaml
│ ├── role
│ ├── rolebinding
│ └── serviceaccount
├── istio111
│ ├── generated
│ ├── ingress.yaml
│ ├── istio.operator.yaml
│ ├── cabundle-mutatingwebhookconfiguration.patch
│ ├── cabundle-validatingwebhookconfiguration.patch
│ ├── failure-validatingwebhookconfiguration.patch
│ └── sidecar-injector-template.patch
├── istio110
└── istio-system-ns.yaml

In the generated folder we have a folder for each kind of resource (deployment, hpa, …), which keeps things well organized.

generated/
├── clusterrole
├── clusterrolebinding
├── configmap
├── deployment
├── envoyfilter
├── horizontalpodautoscaler
├── mutatingwebhookconfiguration
├── poddisruptionbudget
├── role
├── rolebinding
├── service
├── serviceaccount
└── validatingwebhookconfiguration

4. Clean all the unwanted resources

Indeed, the resources evolve: for example, the Istio / Envoy filters needed for version 1.8 are not the same as those for 1.9 because of an API change (Envoy filters are often pinned to an Istio major version to avoid issues). So the previous version has to be handled, and by default the istioctl install command will generate resources for it. With our setup we keep the previous version in git (GitOps and canary deployment) and we do not live-patch the cluster with the istioctl install command, so we do not need those files targeting the previous version. That is why we delete them.
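
As a sketch, the cleanup can come down to removing the generated files that reference a previous version. The name-based matching below is an assumption for illustration; the real rule depends on how istioctl names those resources.

package istiogen

import (
    "os"
    "path/filepath"
    "strings"
)

// removeObsolete deletes generated files whose name references a previous
// Istio version we no longer deploy (the previous revision already lives in
// its own folder in git, so these duplicates are useless).
func removeObsolete(dir string, previousVersions []string) error {
    return filepath.WalkDir(dir, func(path string, d os.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        for _, v := range previousVersions {
            if strings.Contains(d.Name(), v) {
                return os.Remove(path)
            }
        }
        return nil
    })
}

For example, removeObsolete("istio111/generated/envoyfilter", []string{"1.9", "1.10"}) would drop the Envoy filters generated for older proxy versions, provided they can indeed be recognized by their file name.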

5. We apply our custom patches

The script will then apply all the git patches present in the folder on the generated resources. That allows us to carry our custom modifications or fixes before they become available in the next Istio version. As a positive side effect, being able to apply git patches this way makes it easy to prepare and present contributions to the Istio community.
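
For illustration, applying the patches can be as simple as shelling out to git apply for every .patch file of the revision folder; the layout and flags below are assumptions, not necessarily the exact commands we run.

package istiogen

import (
    "fmt"
    "os/exec"
    "path/filepath"
)

// applyPatches runs `git apply` on every .patch file of the revision folder.
// A patch that no longer applies fails loudly, which is what we want when a
// new Istio version changes the generated manifests.
func applyPatches(revisionDir string) error {
    patches, err := filepath.Glob(filepath.Join(revisionDir, "*.patch"))
    if err != nil {
        return err
    }
    for _, p := range patches {
        if out, err := exec.Command("git", "apply", p).CombinedOutput(); err != nil {
            return fmt.Errorf("applying %s: %v\n%s", p, err, out)
        }
    }
    return nil
}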

6. Deploy all resources via FluxCD

All the resources are committed to our git repository, which is synced by Flux CD and deployed in our cluster.

NB: Due to Flux behavior, you should consider committing the new custom resource definitions first if you want to be sure that the new Istio resources are deployed without harm. Anyway, if you are patient enough, Flux will eventually finish reconciling the cluster and deploy all the needed resources.

Now we need to upgrade istiod

As we said before, we are using the canary deployment to be able to have multiple control planes at the same time. Each resource is tagged with the revision label so we can have several versions at the same time.

In the same way, the Istio resources that do not follow the revision pattern are put in a dedicated folder and match the most recent version of Istio (Istio guarantees backward compatibility with the previous version). Likewise, the Istio CRDs (customresourcedefinition) folder contains the resources at the latest version deployed in our clusters.

├── common
│ ├── clusterrole
│ ├── clusterrolebinding
│ ├── customresourcedefinition
│ ├── istiod-svc.yaml
│ ├── role
│ ├── rolebinding
│ ├── serviceaccount
│ └── validatingwebhookconfiguration

As you can see, we have a special istiod-svc service that targets the active Istio version we are using in the cluster. One usage of that service is for the Kubernetes admission controllers to validate the Istio dedicated resources.

When we deploy a new version of Istio, we start by upgrading our staging cluster, then the pre-production cluster and other secondary clusters, before going to production. We always reproduce the same procedure, using the canary update to ensure as much reliability and availability as possible.

After upgrading the Istio control plane, we can proceed to upgrading the data plane. A new label, istio.io/rev: XXX, needs to be placed on each namespace to be updated. Every BlaBlaCar service team owns their namespaces, and they also have to roll out their deployments to get the new Istio proxy version attached to their workloads. That way our service teams are in control of the update and they are aware of the new Istio version.

Istio canary update for namespaces and workloads in Kubernetes

Our tool will also help us update all the namespaces with the correct label and create a PR for the service team that owns the related namespace. That way each team can update their workloads when it suits them (obviously with some deadlines to avoid divergence of Istio proxy versions).
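
A minimal sketch of what that namespace update amounts to, assuming the Namespace manifests live in git and are rewritten with gopkg.in/yaml.v3 before the PR is opened (re-marshalling can reorder keys, so the real tool may edit the files differently):

package istiogen

import (
    "fmt"
    "os"

    "gopkg.in/yaml.v3"
)

// setRevisionLabel sets the istio.io/rev label on a Namespace manifest,
// e.g. setRevisionLabel("namespaces/my-service.yaml", "istio111").
func setRevisionLabel(nsFile, revision string) error {
    data, err := os.ReadFile(nsFile)
    if err != nil {
        return err
    }
    var ns map[string]interface{}
    if err := yaml.Unmarshal(data, &ns); err != nil {
        return err
    }
    metadata, ok := ns["metadata"].(map[string]interface{})
    if !ok {
        return fmt.Errorf("%s: missing metadata", nsFile)
    }
    labels, ok := metadata["labels"].(map[string]interface{})
    if !ok {
        labels = map[string]interface{}{}
        metadata["labels"] = labels
    }
    labels["istio.io/rev"] = revision // the revision the namespace should use
    out, err := yaml.Marshal(ns)
    if err != nil {
        return err
    }
    return os.WriteFile(nsFile, out, 0o644)
}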

How we manage our gateways

Along with the new version of the Istio control plane we deploy new gateways. We have 2 sets of ingress gateways, both linked to the revision tag (e.g. istio-ingressgateway-110 and istio-ingressgateway-111). Both sets receive the exact same configuration: the Gateway object is the same for the 2 gateways, and sharing it allows all the virtual services registered for public traffic to be available on both gateways.

Each set of gateway pods will receive a dedicated Kubernetes Ingress with a GCLB attached.

Pay attention to cert-manager annotations if you are using it: having the annotations set on both Ingresses can cause issues, so keep one main Ingress dedicated to cert-manager and the certificate generation process.

DNS switch with blue/green Istio gateways

In fact this is a blue/green setup: we are now able to switch the DNS entries one by one to the new ingress public IP.

Because the services are registered on the two gateways, we are able to do that migration before all the namespaces are migrated to the new Istio revision.

When we are sure that the new gateways are handling the traffic correctly, we can totally switch and remove the old ones.

Rollback and Cleanup

In case of any issue during the migration, we are able to roll back any part of the update: the namespace revision, the DNS pointing to the new gateway, or the Istio control plane.

After everything is migrated to the new version, you just have to remove the previous version's folder from git, and all resources dedicated to the previous revision will be cleaned up.

Conclusion

In conclusion, our tool is a simple script, but it allows us to keep our deployment in GitOps and to be confident in any new change, as the diff in git is easily visible and deployed by Flux CD.

Fun fact: the Helm chart is now back and officially supported for deploying the mesh (Istio / Install with Helm), but we will keep our method, which smooths upgrades and allows us to customize any point we need.

I thank the Core-Infrastructure team (past and present members) for the work on this critical part of our software factory.
