Enhancing Terraform Infrastructure Management: Challenges

Kerem Yaldız
Published in Trendyol Tech
7 min read · Apr 16, 2024

As the Platform Engineering team responsible for Infrastructure as Code (IaC), we maintain critical Terraform modules for the OpenStack, vCloud, and Alibaba Cloud providers. We also use Packer to manage virtual machine images across multiple cloud platforms, including OpenStack, vCloud, GCP, AWS, and Alibaba Cloud.

Tens of thousands of virtual machines are regularly created and destroyed using these Terraform modules and Packer images. Virtual machines are provisioned with Terraform from images baked with Packer. Other platform teams build their application modules and images on top of our base module and image, so any anti-pattern in the base spreads throughout the entire company, just as any improvement does. We therefore establish and enforce the standards for both the Terraform provisioning and Packer image-building workflows.

We also develop and maintain the pipelines for these workflows, using both GitLab CI and internally developed tools and services.

For quite some time, we made only small changes to the modules and images, and the overall process remained largely unchanged. During this period, we ran into various difficulties in both development and maintenance. Recognizing the need for improvement, we made the bold decision to redesign and rebuild the entire infrastructure from the ground up. After extensive effort, these changes have been implemented successfully.

We have begun writing a series of Medium articles to share this journey and insights, with the first installment focusing on ‘Enhancing Terraform Infrastructure Management: Challenges’.

Infrastructure Overview

The base modules for the three providers are consolidated in a single Git repository. Within this repository, each provider module comprises vm and disk submodules. There are two further modules, instance-name and vcloud-flavor-converter. instance-name dynamically names VMs for all providers, while vcloud-flavor-converter converts imaginary flavor names into CPU and memory values for vCloud, which has no flavors of its own.
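
The idea behind vcloud-flavor-converter can be sketched as a simple lookup. The flavor names, sizes, and the variable and output names below are hypothetical and chosen only for illustration; the actual module may be structured differently.

```hcl
# Hypothetical sketch of a flavor-to-resources lookup for vCloud.
# Flavor names and sizes are illustrative assumptions, not the real catalog.
variable "flavor" {
  type        = string
  description = "Imaginary flavor name, for example \"m1.medium\"."
}

locals {
  flavors = {
    "m1.small"  = { cpus = 2, memory_mb = 4096 }
    "m1.medium" = { cpus = 4, memory_mb = 8192 }
    "m1.large"  = { cpus = 8, memory_mb = 16384 }
  }
}

# Values a vCloud vm submodule could consume instead of a native flavor.
output "cpus" {
  value = local.flavors[var.flavor].cpus
}

output "memory_mb" {
  value = local.flavors[var.flavor].memory_mb
}
```

Keeping the mapping in one module would let every caller use the same flavor vocabulary across providers, which is presumably why it lives next to the base modules.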

Application modules such as Elasticsearch, Kubernetes, Couchbase, and others are built on top of these base modules through Terraform’s module source mechanism. The base modules handle virtual machine creation, disk management, and common provisioning activities, including retrieving secrets and certificates and configuring and launching agents for NTP, service discovery, security, configuration management, and logging.

Application installation and agent configuration are performed when the image is baked with Packer, while configuration that depends on specific requirements is applied at provisioning time.

These modules are integral to our infrastructure, enabling declarative infrastructure definition and streamlined provisioning processes. Ultimately, they facilitate the deployment of tens of thousands of virtual machines.

Challenges in Versioning and Tag Management

Over time, we receive requests for new features in these modules. Some of these features apply to all providers, while others are specific to just one. Some changes are limited to a small part of a module, such as a minor attribute adjustment in the vm submodule. Because all modules live in a single repository, any change, whether it touches an entire module or only a small part, produces a new version of the entire repository.

Occasionally, a change must be applied to all versions. Some of these changes concern the vm and disk submodules, which rarely change. However, because a new tag is created for every feature, numerous tags accumulate. Even though the vm and disk submodules change infrequently, every one of those versions still needs to be retagged.

For example, the Infrastructure team responsible for VMware products wanted to upgrade the vCloud API and asked us to test Terraform compatibility. After thorough testing, everything appeared to work seamlessly. However, a new patch version was released during this period. Treating it as just a patch update, the Infrastructure team proceeded with the upgrade. Unfortunately, the new API enforced a minor change that caused errors during Terraform’s destroy operations.

To resolve the issue, the VMware team promptly provided a fix. However, the solution introduced a new field for a resource, necessitating a switch to the latest release version of the vcd provider for Terraform. As a result, we had to update version constraints for the vcd Terraform provider and integrate the new attribute into the relevant resources.

Addressing this challenge required creating a branch for each tag, implementing the necessary changes, and then retagging them accordingly. This process proved cumbersome and time-consuming.

Additionally, for tasks such as configuration scripts, Terraform templates are utilized, with each provider’s vm submodule housing its respective templates. Consequently, these templates are duplicated across each provider, leading to redundancy in the configuration process for VMs and disks within the provider modules.
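
As a rough illustration of the pattern described above, a provider's vm submodule might render a configuration script from a template that sits in its own directory. The template path and variable names are hypothetical; the point is only that each provider module carries its own copy of such a file.

```hcl
# Hypothetical sketch: file name and variables are illustrative assumptions.
variable "hostname" {
  type = string
}

variable "environment" {
  type = string
}

locals {
  # path.module resolves to this submodule's directory, so every provider's
  # vm submodule needs its own copy of templates/configure.sh.tftpl.
  configure_script = templatefile("${path.module}/templates/configure.sh.tftpl", {
    hostname    = var.hostname
    environment = var.environment
  })
}
```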

Limitations of Git Notation for Module Sourcing

Instead of a dedicated module registry with version constraints, we source modules using Git notation. This confines us to exact version references, because a Git source pins a single ref rather than a version range. Consequently, a new version of one module triggers a cascade of new versions in every module that depends on it: even a non-breaking patch change propagates from the bottom of the dependency tree to the top.
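
The difference between the two sourcing styles can be sketched as follows. The repository URL, registry address, module path, and version numbers are all hypothetical, and module inputs are omitted for brevity.

```hcl
# Git notation: the ref pins one exact tag, and no version range is possible.
# Releasing v1.4.3 of the base module means editing this line and releasing
# a new version of every module that contains it.
module "vm" {
  source = "git::https://gitlab.example.com/platform/terraform-base.git//openstack/vm?ref=v1.4.2"
}

# Registry notation: the version argument accepts a constraint.
# "~> 1.4.2" admits 1.4.2 and any later 1.4.x release, so a non-breaking
# patch can be picked up without changing the dependent module's code.
module "vm_from_registry" {
  source  = "registry.example.com/platform/vm/openstack"
  version = "~> 1.4.2"
}
```

Terraform only supports the version argument for modules installed from a registry, which is why the Git form has to fall back to an exact ref.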

Issues with Resource Creation and State Management

Resources that must be created more than once currently use the count meta-argument. With this approach, the instances are recorded in the state as an index-addressed list, so inserting or removing an element anywhere other than the tail shifts every subsequent index and causes unintended changes to the instances that follow.
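
A minimal sketch of the sliding effect, assuming a list of node names drives the count. The built-in terraform_data resource (available since Terraform 1.4) stands in for a real VM resource here, and the variable name and node names are hypothetical.

```hcl
# Illustrative only: terraform_data stands in for a VM resource.
variable "node_names" {
  type    = list(string)
  default = ["node-a", "node-b", "node-c"]
}

resource "terraform_data" "node" {
  count = length(var.node_names)
  input = var.node_names[count.index]
}

# State addresses with the default list:
#   terraform_data.node[0] -> "node-a"
#   terraform_data.node[1] -> "node-b"
#   terraform_data.node[2] -> "node-c"
#
# After removing "node-b" from the middle of the list:
#   terraform_data.node[0] -> "node-a"  (unchanged)
#   terraform_data.node[1] -> "node-c"  (changed: was "node-b")
#   terraform_data.node[2]              (destroyed)
```

The intent was to remove only node-b, yet the plan alters the instance at index 1 and destroys the instance that originally was node-c. Terraform's for_each meta-argument addresses instances by key rather than by position and does not shift in this way, although whether it suits a given module is a separate question.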

Multiple clusters of a service are consolidated within a single Terraform state, organized by factors such as tribe, provider, region, and environment. One state can encompass several dozen clusters, each with up to several hundred nodes, so a single Terraform state may manage thousands of instances.

This aggregated approach introduces several challenges. Firstly, each operation takes significantly longer than necessary because of the extensive terraform init and terraform refresh operations required. Additionally, issues within one cluster can ripple out to the other clusters in the same state, worsening the impact of Terraform failures.

Furthermore, this setup constrains the Terraform provider versions that can be used. Terraform must select a provider version that satisfies every version constraint across the clusters in the state, which limits flexibility and can hinder the adoption of newer features or improvements.

Code Quality and Security Concerns

The absence of static analysis tools hurts our code quality, security posture, and overall development efficiency. Without automated tests, bugs and invalid code can only be found through manual testing, so all testing has to be done by hand.

Code formatting is not enforced either, which makes uniform formatting practically impossible to maintain and leads to chaos over time. Enforcing formatting with automated checks such as terraform fmt is essential.

Furthermore, a module that is technically valid may still misbehave on a particular provider, so testing against each provider is essential to ensure it functions properly.

Dependency Management

The absence of the .terraform.lock.hcl file is another significant concern. This dependency lock file records the exact provider versions that were selected, which keeps dependency resolution consistent between runs and prevents compatibility and versioning issues.

For example, after a new version of the Terraform vcd provider was released, Terraform began using it because our version constraints for vcd were broad. While the majority of our vCloud installations were unaffected, one installation lagged behind by a minor version. The new provider version was incompatible with that installation, and Terraform runs against it failed because of this discrepancy.
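
How a broad constraint allows this kind of silent upgrade can be sketched as follows. The version numbers are illustrative assumptions, not the ones involved in the incident.

```hcl
terraform {
  required_providers {
    vcd = {
      source = "vmware/vcd"

      # A broad constraint such as the one below lets terraform init select
      # any newer release as soon as it is published, when no lock file
      # pins the previous selection.
      # version = ">= 3.0"

      # A pessimistic constraint only admits patch releases of one minor
      # line (here 3.10.x).
      version = "~> 3.10.0"
    }
  }
}
```

Committing the .terraform.lock.hcl file that terraform init generates additionally records the exact selected version and its checksums, so later runs reuse that version until terraform init -upgrade is run deliberately.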

Documentation Challenges

The lack of an examples directory hinders our ability to offer clear usage examples and reference implementations for our Terraform modules. Providing such examples would significantly aid users in comprehending how to utilize our modules effectively.

Additionally, the documentation has been outdated for a while, leading to inaccuracies that may misguide users. Ensuring the documentation remains current with new features or changes is crucial. However, the lack of automated documentation generation complicates the task of maintaining its accuracy.

Acknowledgments

Special thanks to Mustafa Karakaya and Tuan Susam for their valuable contributions and feedback on this article.

Conclusion

In conclusion, our journey to improve Terraform infrastructure management has been filled with challenges. From dealing with versioning complexities to managing resource creation and state, ensuring code quality, handling dependencies, and documenting effectively, each hurdle has taught us valuable lessons.

Despite the difficulties, our dedication to improvement has paid off. We’ve rebuilt our infrastructure, making it more efficient and reliable. Looking forward, we’re committed to refining our processes and embracing new technologies to keep improving.

Stay tuned for future installments in our series as we delve deeper into our experiences, insights, and lessons learned in enhancing Terraform infrastructure management. Thank you for accompanying us on this transformative journey.
