Using Terragrunt and Harness to Improve Configuration Management for Astra DB
Author: John Raber
Can we improve and automate configuration management and build an easy-to-understand picture of the desired state for everyone in an organization? We set out to uncover the answer and came up with a greatly improved config management model for DataStax Astra DB that leverages Terraform, Terragrunt, and Harness.
Configuration management is a set of processes and tools that help establish infrastructure in a desired state. For cloud companies, it's a core capability that we rely on heavily.
We make lots of infrastructure changes every day to DataStax Astra DB, our multi-cloud database-as-a-service built on the powerful NoSQL engine Apache Cassandra®. This ensures that we can meet our own development and testing needs, as well as the production needs of the developers and enterprises who build on it.
At this scale, we needed to improve our configuration management process. In this post, we present an enhanced config management model built with Terraform and Terragrunt. We also wanted to automate the processes after we modeled them in our delivery platform: Harness.io.
Terraform is a well-known open-source infrastructure as code tool from HashiCorp. It lets you define both cloud and on-premises resources in human-readable configuration files which you can version, reuse, and share.
Terragrunt is a thin wrapper for Terraform that provides extra tools to keep your configurations DRY ("Don't Repeat Yourself"). It works with multiple Terraform modules and manages remote state. With Terragrunt, you can define your Terraform code once and promote a versioned, immutable copy of that exact code from one environment to another.
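As a minimal sketch of what that looks like (the repo URL, module path, and input names below are hypothetical), each environment gets a small terragrunt.hcl that pins a versioned module; promoting to another environment means pointing at the same tag:

```hcl
# staging/database/terragrunt.hcl (hypothetical)
include {
  # Pull in remote-state and provider settings defined once at the repo root.
  path = find_in_parent_folders()
}

terraform {
  # An immutable, versioned copy of the Terraform code under test.
  source = "git::git@github.com:example-org/infra-modules.git//database?ref=v1.4.0"
}

inputs = {
  environment = "staging"
  node_count  = 1
}

# prod/database/terragrunt.hcl promotes the exact same code by
# referencing the same ?ref=v1.4.0 tag, changing only the inputs
# (e.g. environment = "prod", node_count = 3).
```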
We came up with this configuration management to solve the challenges that we were facing with our existing system. Let’s take a look at what they are.
Defining the problem
At DataStax, we manage two levels of configuration:
- Infrastructure, via our public cloud providers (AWS, GCP, and Azure).
- Workloads running on these public and government clouds. We achieve this configuration through:
a. A CI process that produces fully baked and tested Docker images.
b. A standard Kubernetes deployment process using Helm charts and values files (config) stored in Git repos.
For the purposes of this post, we’ll limit ourselves to configuration management for level 1 infrastructure.
Various developers with the right level of access and knowledge execute some combination of Terraform scripts, shell scripts, and configuration files with the right input variables to establish level 1 infrastructure.
Our current method of config management for level 1 infrastructure is to document the commands run along with their input variables as .md files in Git. The desired state is essentially expressed by the series of .md files recorded in this Git repository.
While this is a very rudimentary form of configuration management, it’s highly problematic for several reasons, including:
- Comprehension: It’s nearly impossible to build a mental picture of the desired state by reading a series of .md files full of commands. If you jump into the code, it’s very difficult to understand where all the configuration lives and what the process was. We want the desired state to be clear and human-readable, following the same declarative model as Kubernetes. A small set of engineers can then maintain, standardize, and test the Terraform modules, while everyone else can understand how to change and configure the Terragrunt templates when needed.
- Quality: Without a clean description of the desired state, it’s very challenging to ensure our code works and that the actual state mirrors the desired state. Even for an engineer, it’s hard to detect changes or run any kind of reconciliation process to ensure the quality of the solution and keep the declared state in the config in sync with the actual running state.
- Barrier to entry: Making changes requires deep knowledge of a lot of code, not just knowledge of the needed infrastructure changes.
- Accounting: We can’t easily tie any of the infrastructure deployed across clouds back to the purpose it serves for us.
- Access controls: Lots of developers need access to production, and there are no safeguards; for instance, nothing stops someone from running commands directly without following the .md process.
The proposed solution
The solution we put forth is to continue using Terraform to provision our infrastructure and managed services, but to add an additional level of configuration management on top: Terragrunt.
As mentioned at the beginning, we wanted a greatly improved configuration management model that leverages Terraform, since we were already using it. An additional goal was to automate the process(es) after modeling them in our delivery platform: Harness.io.
Harness is a Software Delivery Platform that uses artificial intelligence (AI) to automate DevOps processes, such as CI/CD, Feature Flags, Cloud Costs, and more.
Configuration can get really specific. The idea is to put all the pipelines behind Harness and have an infrastructure manager call the Harness API, which runs the workflows/pipelines. This gives you self-service through a single configuration touchpoint in Git, and you can also call Harness programmatically.
Figure 1 below introduces Terragrunt and Harness.
In Figure 1,
- The Terraform modules represent the code we use to manipulate the state of infrastructure across all the clouds.
We only make changes to the code when we want to leverage new infrastructure capabilities or we want to modify the way in which we leverage existing infrastructure capabilities. Only platform team members would typically make changes to the code.
- The Terragrunt templates represent the desired state of infrastructure we want across all the clouds.
We make changes to these templates whenever we want to add, update, or delete infrastructure. All team members would be able to make changes to these templates as and when they needed to make infrastructure changes.
- The Harness continuous deployment job represents the controller that continuously reconciles desired state with the actual state in the cloud. Harness listens for commits made to the desired state and automatically reconciles these changes with what is deployed across the clouds using the code.
- The infrastructure deployed across the clouds represents the actual state.
With this setup, we won’t have to cut and paste different modules for different customers. It also lets us make configuration changes per cloud platform, per product (e.g. control plane, data plane, database), and per region.
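One way to lay this out (the directory and module names below are illustrative) is a desired-state tree holding one small template per cloud, product, and region:

```hcl
# Hypothetical desired-state repo layout:
#
#   live/
#     aws/
#       control-plane/
#         us-east-1/terragrunt.hcl
#     gcp/
#       data-plane/
#         us-central1/terragrunt.hcl
#         europe-west1/terragrunt.hcl

# live/gcp/data-plane/us-central1/terragrunt.hcl
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::git@github.com:example-org/infra-modules.git//data-plane?ref=v2.1.0"
}

inputs = {
  region = "us-central1"
}
```

Adding a region, product, or cloud then amounts to adding one folder with a small template, rather than copying module code.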
This proposed model solves all of the challenges mentioned above and offers these advantages:
- Easy comprehension: The model produces easy-to-read desired states that reflect how we think about our infrastructure and the purpose it serves.
- High quality: We can now more easily test our code by simply comparing actual state with desired state.
- Low migration effort: We now have a relatively shorter path from where we are currently to this new approach.
- Low barrier to entry: Developers making infrastructure changes don’t need to know the underlying code. They just need to understand our desired state model.
- Simplified accounting: We can more easily tie infrastructure in the cloud back to the purpose it serves through automated, simple-to-understand tagging.
- Locked down production: We can lock down all access to production; all changes will flow only through the desired state process.
- Shared mental model: Ease of comprehension of the desired state isn’t just about easy configurability of system resources. It also answers questions about the cardinality of, and relationships between, resources, which strongly hints at the architectural dependencies and resource lifecycles in a solution architecture.
Combined with the generated architecture diagram(s) from live environment(s), it becomes much easier to reason about the system both statically and dynamically. This knowledge feeds many different teams in an organization to assist decision making and strategic metrics gathering.
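For the accounting advantage above, one sketch (the tag names here are hypothetical) derives tags from each template’s position in the desired-state tree, so every resource is automatically labeled with the purpose it serves:

```hcl
# terragrunt.hcl -- e.g. at live/gcp/data-plane/us-central1/
locals {
  # path_relative_to_include() yields "gcp/data-plane/us-central1"
  # when the root config sits at live/terragrunt.hcl.
  path_parts = split("/", path_relative_to_include())
}

inputs = {
  tags = {
    cloud      = local.path_parts[0]  # "gcp"
    product    = local.path_parts[1]  # "data-plane"
    region     = local.path_parts[2]  # "us-central1"
    managed_by = "terragrunt"         # anything untagged is a drift candidate
  }
}
```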
There used to be a huge knowledge gap regarding config management among engineers, top-level management, and other developers. Access to the Terraform modules is very limited; only platform or security engineers can change them. While they’re responsible for fixing issues, others in the organization should also be able to understand the system and judge whether it’s working properly.
Now, having the desired state match the actual running system, and being able to generate pictorials from it, allows us to understand our system and build a shared mental model. When this happens, everyone on the team can start having more informed discussions.
Throughout the process, we asked ourselves some questions and here are some answers we have for them:
- Is there an easy way for us to extract the actual state out of the cloud? This would help with quality checks and repairs, for example.
There are two ways:
- Engineering: `terragrunt plan` reads the declared state and compares it to the actual state. It reports the planned changes, if any, needed to reconcile the two.
- Visually: Lucidscale lets us generate diagrams showing the point-in-time state of environments.
- Do we need some form of repair to compensate for failures? Are failures just deltas from the desired state?
`terraform plan` can detect failures or drift.
Repair (heavy-handed, illustration only): if drift is detected, you can automatically run `terraform apply` with the declared state. Be aware that this can delete resources even while working toward the declared/desired state, because of how the cloud providers have implemented their managed services. This has to be understood and worked with regardless of the tools used.
- Are there alternatives to the proposed model?
Yes. You can keep Terraform but use an alternative to Terragrunt. The model is based around the solution rather than the tool; in other words, if you want to use something else later, you can.
In the long term, you could use Crossplane as an alternative to Terraform or Terragrunt. Instead of designing everything in Terraform, you can model your environments or regions as custom resource definitions (CRDs), then create them through Kubernetes and provision the underlying managed services.
Just as Terraform has prebuilt modules, each Crossplane provider ships CRDs for its offerings.
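The detection-and-repair answers above can be sketched against a single template (the module path is hypothetical); the commands in the comments are what a reconcile workflow would run on each commit to the desired state:

```hcl
# live/aws/database/us-east-1/terragrunt.hcl -- the declared state.
terraform {
  source = "git::git@github.com:example-org/infra-modules.git//database?ref=v1.4.0"
}

inputs = {
  node_count = 3
}

# Detection: `terragrunt plan -detailed-exitcode` compares this declared
# state with the actual cloud state and exits non-zero when there is
# drift to reconcile.
# Repair:    `terragrunt apply` converges the actual state onto the
# declared state, which, as noted above, can delete resources.
```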
In this post, we showed you how we improved our configuration management by using Terragrunt to keep configuration out of Terraform and Harness to automate our processes. This model allows stakeholders across an organization to have a full picture of the desired state and make more informed decisions.