Cloud story: Spin down unneeded infrastructure quickly with Terraform to save OPEX

The benefit of infrastructure as code and cloud: If you don’t need it, don’t pay for it.

Published in datamindedbe · 3 min read · Mar 24, 2020

The COVID-19 pandemic has put many companies in trouble. Some need to cut costs wherever possible, and development projects may stop completely until cash flow improves. In one case, a client asked us what savings we could find in their infrastructure. They use Kubernetes and Kafka as well as our custom data lake frameworks based on Apache Spark and AWS Glue.

Since we have all infrastructure defined as code, we are able to spin down components that aren’t needed. We disabled the TST and UAT environments completely and reduced the DEV environment to the bare minimum, spinning up the K8S and Kafka clusters only when needed.
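To make this more concrete, here is a rough sketch of what such a spin-down could look like when scripted. It is not our actual tooling: the one-directory-per-environment layout and the variable names enable_k8s and enable_kafka are assumptions made up for illustration. Only terraform destroy and terraform apply -var are real CLI usage. The script prints the commands rather than running them, so each one can be reviewed first.

```python
from pathlib import Path

# Hypothetical layout: one Terraform root directory per environment.
ENV_DIRS = {
    "tst": Path("environments/tst"),
    "uat": Path("environments/uat"),
    "dev": Path("environments/dev"),
}

# Hypothetical input variables that gate the expensive DEV clusters.
DEV_CLUSTER_FLAGS = ("enable_k8s", "enable_kafka")


def spin_down_plan():
    """Return (working_directory, argv) pairs; nothing is executed here."""
    plan = []
    # TST and UAT are removed completely.
    for env in ("tst", "uat"):
        plan.append((ENV_DIRS[env], ["terraform", "destroy"]))
    # DEV keeps its base resources; only the clusters are switched off.
    dev_args = ["terraform", "apply"]
    for flag in DEV_CLUSTER_FLAGS:
        dev_args += ["-var", f"{flag}=false"]
    plan.append((ENV_DIRS["dev"], dev_args))
    return plan


if __name__ == "__main__":
    for workdir, argv in spin_down_plan():
        # Print instead of running, so each command can be reviewed first.
        print(f"(cd {workdir} && {' '.join(argv)})")
```

Setting the same variables back to true and applying again would bring the DEV clusters back when they are needed.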

The production environment is kept running to support existing use cases. But since no new deployments (except for hot-fixes) will be done, we can also scale down here by adjusting all applications’ CPU and memory requests and limits to match measured usage, as recommended by vertical pod autoscaling.
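As an illustration of that right-sizing step, the sketch below turns one Vertical Pod Autoscaler recommendation into a Kubernetes resources block. The mapping itself is an assumption chosen for the example (requests follow the recommended target, limits follow the upper bound), and the numbers are made up. The input is shaped like one entry of status.recommendation.containerRecommendations on a VPA object.

```python
def resources_from_vpa(recommendation):
    """Map one VPA container recommendation to a Kubernetes resources block.

    Assumption for this sketch: requests follow the VPA `target`,
    limits follow the VPA `upperBound`.
    """
    target = recommendation["target"]
    upper = recommendation["upperBound"]
    return {
        "requests": {"cpu": target["cpu"], "memory": target["memory"]},
        "limits": {"cpu": upper["cpu"], "memory": upper["memory"]},
    }


# Made-up recommendation for a hypothetical container named "api".
example = {
    "containerName": "api",
    "lowerBound": {"cpu": "25m", "memory": "128Mi"},
    "target": {"cpu": "50m", "memory": "256Mi"},
    "upperBound": {"cpu": "200m", "memory": "512Mi"},
}

if __name__ == "__main__":
    print(resources_from_vpa(example))
```

Other policies are just as reasonable, for example adding fixed headroom on top of the target. The point is that the new values come from measurements rather than from guesses made at deployment time.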

This can lead to savings of more than 75% of infrastructure costs. Shutting down or minimizing 3 out of 4 environments and reducing the cost of the production environment can be achieved very quickly thanks to infrastructure as code.
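A quick back-of-the-envelope calculation shows how a figure above 75% can come about. The cost numbers below are purely hypothetical: four environments that cost the same before the change, with TST and UAT removed, DEV reduced to a small remainder and production trimmed a little.

```python
def savings_fraction(before, after):
    """Fraction of the total monthly cost saved across all environments."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    return (total_before - total_after) / total_before


# Hypothetical monthly costs in arbitrary units, not real client figures.
before = {"dev": 100, "tst": 100, "uat": 100, "prod": 100}
after = {"dev": 10, "tst": 0, "uat": 0, "prod": 80}

if __name__ == "__main__":
    # (400 - 90) / 400 = 0.775
    print(f"Saved: {savings_fraction(before, after):.1%}")
```

With these made-up numbers the total drops from 400 to 90 units, a saving of 77.5%. Real environments rarely cost the same, so the actual figure depends on how expensive production is relative to the rest.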

It is these moments that show the value of cloud and of the investment made in defining the infrastructure in code. This would simply not be possible with a traditional on-premise setup.

Let’s compare this to an on-premise installation, where hardware was purchased years in advance, or to a more naive cloud approach that uses a set of VMs running 24/7. These architectures and their costs would remain more or less static, regardless of the load. The nice thing about cloud-native architectures is that they not only scale up easily, but also scale down easily.

Gotchas we faced

AWS MSK doesn’t scale down, only up. We decided to delete the entire cluster because it’s not needed and doesn’t contain valuable data. All data stays in production.
Deployments on top of our K8S may have created RDS-managed network interfaces. Since our core infrastructure code is separate from the application infrastructure code, we ran into security group deletion issues. Manually removing these interfaces in the console resolves the issue. Documenting the process is key here.
What if someone else has to spin this back up? Creating a screen recording of the process helps give whoever comes after you an exact picture of what happened.
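
For the security group gotcha, it can help to list which network interfaces still reference a group before removing anything by hand. The following is a minimal sketch, assuming boto3 and an EC2 client. The function name and the security group ID in the usage comment are placeholders, and pagination is ignored for brevity.

```python
def blocking_interfaces(ec2_client, security_group_id):
    """List network interfaces that still reference a security group.

    `ec2_client` is expected to behave like boto3.client("ec2").
    Returns (interface_id, description) tuples. Only the first page of
    results is read, which is a simplification for this sketch.
    """
    response = ec2_client.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": [security_group_id]}]
    )
    return [
        (eni["NetworkInterfaceId"], eni.get("Description", ""))
        for eni in response["NetworkInterfaces"]
    ]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   for eni_id, description in blocking_interfaces(boto3.client("ec2"), "sg-PLACEHOLDER"):
#       print(eni_id, description)
```

The descriptions usually hint at which service created an interface, which also makes useful input for the documentation mentioned above.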

Let’s hope for a quick recovery of Europe and all the other countries affected by this pandemic. When it’s time, we’ll run terraform apply again and bring things back up quickly. Then the feature teams can pick up where they left off and redeploy their applications and services to the platform.

Facing similar issues? You know how to find us.

Written by pascalwhoop