Terraform for Existing kOps Managed Kubernetes Clusters

Carlo · Published in Async · Feb 28, 2022 · 4 min read

Creating a safer workflow for managing your cluster.

One thing I’ve always wished for with our kOps-managed Kubernetes cluster was that we had started with terraform configurations. I’m talking about adding --target=terraform to the kOps commands to generate the configuration, then running terraform to make those changes to our infrastructure (see the sketch after the list below). This gives us two things:

  • We can check in the terraform configuration. This gives us better visibility into our intended changes as well as a history of them. Now we can go through our usual version control workflow of creating a PR for these changes, getting reviews, and having CI apply them after merge.
  • One-to-one rollbacks. We don’t have to rely on kOps reproducing our old infrastructure when we revert changes to the cluster or instance group spec.
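For reference, the workflow looks roughly like this; the cluster name and state store below are placeholders, not the ones from our actual setup:

  # Generate terraform configuration instead of applying changes directly
  kops update cluster \
    --name dev.k8s.local \
    --state s3://example-kops-state-store \
    --target=terraform \
    --out=. \
    --yes

  # Check the generated configuration into version control, then apply it
  terraform init
  terraform plan
  terraform apply

From then on, every cluster or instance group change goes through the same generate, review, and apply loop.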

Our experience before terraform configurations

We’ve successfully upgraded our k8s cluster from version 1.8 all the way to 1.21 in regular maintenance cycles without the terraform workflow, but that workflow would have made the journey much smoother. There were instances where rolling back by reverting the cluster spec during an upgrade failed. This stemmed from kOps giving us a different version of kube-proxy or a different AMI from the one we had specified in that same k8s version. With checked-in terraform configurations, we could have reverted changes much more easily.

My workflow of testing changes in a test environment and then bringing them up to production was also more tedious than it would have been with terraform output for those changes. I would need to take detailed notes on what needed to change, whether that was lines in the cluster spec or updates to the instance groups for new resources. With terraform, our changes for one environment are already checked in for reference.

Where to go from here

So with all these advantages of having terraform configurations, have we just missed the boat by not creating the cluster with them in the first place? We could try to manually add all of our existing resources with terraform import (which would take forever!), but I’ve found a better solution: using terraformer to pull existing resources into the terraform state and match them with your current kOps-generated terraform configuration. It’s still a manual process, but it’s definitely worth it.
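To give a sense of why the manual route is so slow: terraform import takes one resource at a time, and a kOps cluster has dozens of them. The addresses and IDs below are hypothetical, just following kOps-style naming:

  # One terraform import per resource, repeated for every ASG, volume,
  # subnet, route table, IAM role, launch configuration, and so on
  terraform import aws_autoscaling_group.nodes-dev-k8s-local nodes.dev.k8s.local
  terraform import aws_ebs_volume.a-etcd-main-dev-k8s-local vol-0123456789abcdef0
  terraform import aws_iam_role.masters-dev-k8s-local masters.dev.k8s.local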

Steps:

  1. Install terraformer 0.8.18 and terraform 0.13.x (assuming we have kOps 1.21.x)
  2. Get terraform configuration from kOps
  3. Get the planfile from terraformer, which lists the resources it intends to import into state: terraformer plan aws --resources=iam,elb,ebs,route,route_table,route_table_association,keypair,auto_scaling --profile=$AWS_PROFILE --connect=false -C --path-pattern {output}/{provider}/. We didn’t include security groups here because there’s no easy way of mapping those resource names, so we’ll just opt to create new ones and then manually destroy the old ones once we’ve run a rolling update (see the cleanup sketch after this list).
  4. Clean up the plan file to match the resource names from kOps and to remove resources not managed by it. We don’t have to be perfect here (we’ll do more in step 8), but there’s a lot of low-hanging fruit. For example, we can clean up the terraformer resource names with substitutions like %s/tfer--//g and %s/002E-//g.
  5. Begin importing the plan: terraformer import plan generated/aws/plan.json
  6. Now we should have a new tfstate file with our existing resources. Replace the configuration files generated by terraformer with the ones we have from kOps.
  7. Run terraform state replace-provider 'registry.terraform.io/-/aws' 'registry.terraform.io/hashicorp/aws' to have the provider match the one kOps is using.
  8. Run terraform plan. At this point, we’ll be manually renaming resources in our state to match the ones in our kOps terraform configuration. One type of resource that we definitely do not want destroyed is our etcd volumes, since we would lose all of our cluster data. We’ll have to use the name from the volume’s tag as the resource name: terraform state mv aws_ebs_volume.vol-076931fdc669b404c aws_ebs_volume.a-etcd-events-dev-k8s-local. The goal is to end up with 0 to be changed and 0 to be destroyed. Remove any resources that aren’t managed by kOps with terraform state rm resource_type.resource_name; this could be your ELB for ingress or some IAM users. (A sketch of this cycle follows the list.)
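As a rough sketch of that cycle (the second volume ID and the removed resource names are illustrative, not from our actual cluster):

  # Re-run the plan after each round of state surgery
  terraform plan

  # Rename imported resources to the kOps-generated addresses,
  # starting with the etcd volumes (named after their tags)
  terraform state mv aws_ebs_volume.vol-076931fdc669b404c aws_ebs_volume.a-etcd-events-dev-k8s-local
  terraform state mv aws_ebs_volume.vol-0123456789abcdef0 aws_ebs_volume.a-etcd-main-dev-k8s-local

  # Drop resources terraformer imported that kOps does not manage
  terraform state rm aws_elb.ingress-elb
  terraform state rm aws_iam_user.deploy-bot

  # Repeat until the plan shows 0 to change and 0 to destroy
  terraform plan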
(Screenshot: Terraform execution plan. Disregard the high number of items to destroy; most of these are security groups that we will clean up after the new security groups are created.)
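For the security groups from step 3, the cleanup once the new ones exist looks roughly like this; the cluster name and group ID are placeholders:

  # Apply the kOps configuration so the new security groups are created
  terraform apply

  # Roll the instances so they attach to the new security groups
  kops rolling-update cluster --name dev.k8s.local --yes

  # Then delete the old, now-unused security groups by hand,
  # via the AWS console or the CLI
  aws ec2 delete-security-group --group-id sg-0123456789abcdef0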

By now you should have a rough idea of what it takes to match your existing resources with your new terraform configuration. It will take some cycles of manipulating the state and re-running terraform plan, or even changing the planfile generated by terraformer to make some batch changes. But keep at it and you’ll eventually have a non-destructive execution plan.

Async builds high performance, reliable, and cost-effective applications by combining technical expertise and deep knowledge of industry trends.

For more information on development services, visit asy.nc
