Cloud infrastructure and workflow automation: a perfect marriage

A brief description of a solution to cloud infrastructure management through an automated pipeline.

Federico Sala
Sep 14, 2020 · 8 min read

Managing our cloud infrastructure appropriately is essential for maintenance and reliability. In this blog post, I will describe how the SRE team at Quantyca automated the company's cloud infrastructure management by exploiting the Infrastructure as Code approach on Google Cloud Platform.

Photo by Joshua Earle on Unsplash

🛫 A new era of infrastructure management

In modern IT environments, where cloud computing is spreading more and more, there's a constant need to manage and maintain the entire cloud infrastructure efficiently. Changes to the infrastructure must be delivered quickly, while guaranteeing the consistency, reliability, and reproducibility of the whole system.

In the old, traditional IT world, servers and infrastructures were altered in place by manually applying updates or package upgrades. To give an example, suppose you have a perfectly working V1 web server in production and you need to deploy the new V2 of that server.

How would you do this? Would you upgrade the V1 server or set up a completely new V2 server? Would you do it by hand or exploiting automation tools?

Good questions… I bet you are wondering how to achieve an efficient, simple, and automated process. The answer is… (drum roll 🥁)… Infrastructure as Code!

Infrastructure as Code (IaC) is a process by which SRE or DevOps teams automatically provision and manage an IT infrastructure, avoiding any manual configuration of devices or operating systems. In this vision, the hardware infrastructure becomes, counterintuitively, a software-defined and programmable object.

Indeed, the concept of IaC is very similar to scripting: you simply write your infrastructure definition with a purely declarative approach, and dedicated tools provision the infrastructure for you automatically. IaC can manage any type of IT resource, such as bare-metal servers, virtual machines, databases, and all the associated configuration.

One of the greatest benefits introduced by IaC is that the infrastructure definition code can be put into a Version Control System (like Git) that tracks every single change made to the infrastructure environment. Moreover, a VCS allows you to automate the infrastructure provisioning process via CI/CD: commits on a specific branch may trigger an automated workflow that deploys infrastructure objects.

Last but not least, IaC is a good way to implement the Immutable Infrastructure paradigm, which is based on the idea that infrastructure is never modified in place: if something needs to change, new objects are created from scratch to replace the old ones.

Figure 1: Immutable Infrastructure paradigm at a glance.

This was a brief introduction to Infrastructure as Code and infrastructure management. We are now ready to see how we automated the provisioning process of our cloud infrastructure at Quantyca!

🎢 The need for an automated pipeline

Quantyca is a rapidly growing company: more and more people are joining us, and more and more applications are being developed. In parallel, technologies are evolving at a very fast pace. As you can imagine, there is an essential need to define guidelines and reference documents, and to guide all team members toward a correct development process. For this reason, our SRE team is constantly seeking solutions that simplify application deployment and cloud infrastructure management.

Before introducing the Infrastructure as Code approach, we used to manage our Google Cloud infrastructure manually, through the Cloud Console or via the Cloud Shell. This was a very tedious process, carried out by just one or two people in the whole company.

We found out the process had many drawbacks: endless manual operations, infrastructure maintained by hand, no tracking of infrastructure changes, responsibility centralized on a few people, no knowledge sharing; all of this leading to entropy and confusion.

That’s why the SRE team decided to explore the world of IT automation.

As I mentioned before, there are many tools for infrastructure provisioning via IaC. Among them, we picked Terraform, by HashiCorp.

Terraform lets you write declarative configuration files for your infrastructure, evolve and version them in a VCS, and finally automates the provisioning. You can manage the full lifecycle of the infrastructure: creating new resources, updating existing ones, and destroying those no longer needed.

The focal point of Terraform is that you get a fully reproducible infrastructure, from staging up to production. When you need a new environment similar to an existing one, you just execute the Terraform script and the tool provisions the entire infrastructure for you.
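
For instance, the same configuration can serve multiple environments by parameterizing it with input variables. Here is a minimal sketch (the variable, resource, and naming scheme are hypothetical, just for illustration):

```hcl
# Hypothetical input variable selecting the target environment.
variable "env" {
  type    = string
  default = "staging"
}

# The same resource definition serves every environment:
# only the variable value changes between runs.
resource "google_storage_bucket" "artifacts" {
  name     = "my-company-${var.env}-artifacts"  # hypothetical naming scheme
  location = "EU"
}
```

Running Terraform with `-var="env=production"` would then provision an identical bucket for the production environment.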

Another plus of Terraform is that it is fully integrated with the most common cloud providers (Google Cloud Platform, AWS, Azure, and more), so you can build pipelines that run infrastructure provisioning directly on cloud services.

Just to show how easy it is to use, here is an example of MySQL database provisioning on Google Cloud using a Terraform script.
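
A minimal sketch of such a script could look like this (project ID, instance name, and database name are hypothetical placeholders):

```hcl
# Configure the Google provider; project and region are placeholders.
provider "google" {
  project = "my-gcp-project"   # hypothetical project ID
  region  = "europe-west1"
}

# Provision a Cloud SQL instance running MySQL.
resource "google_sql_database_instance" "main" {
  name             = "example-mysql-instance"  # hypothetical name
  database_version = "MYSQL_5_7"
  region           = "europe-west1"

  settings {
    tier = "db-f1-micro"  # smallest machine type, fine for testing
  }
}

# Create a database inside the instance.
resource "google_sql_database" "app_db" {
  name     = "app_database"  # hypothetical name
  instance = google_sql_database_instance.main.name
}
```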

Going deeper, when you run a Terraform script, Terraform basically executes three steps:

  1. terraform init: initializes the working directory containing Terraform files and Terraform state;
  2. terraform plan: creates an execution plan and determines what actions are needed to achieve the desired infrastructure state;
  3. terraform apply: applies the execution plan to the infrastructure so as to reach the desired state.

Furthermore, there’s a command to tear down the entire infrastructure: terraform destroy. As the name suggests, it destroys the whole infrastructure and updates the Terraform state accordingly, allowing you to start again from scratch.

🌉 Pipeline architecture

Well, we are now ready to discuss the infrastructure provisioning pipeline we built. Check out the architecture schema below; we will then analyze each block separately.

Figure 2: pipeline architecture for infrastructure provisioning on Google Cloud.

The entry point of the pipeline is a git push of Terraform code to a BitBucket repository. To prevent unsupervised infrastructure changes, we organized the deployment process via GitOps as follows:

  • Developers or SRE team members push their Terraform code for the infrastructure definition to a branch we call the infrastructure update branch. When they are satisfied with their changes, they open a Pull Request to merge into the master branch.
  • BitBucket sends an email to the master branch supervisors (e.g. the SRE team master) asking for a code review.
  • When the supervisor approves the Pull Request, the code is merged into the master branch, a “push” commit lands on master, and a trigger fires.
Figure 3: git repository management and trigger firing only after a Pull Request acceptance.

The trigger is configured on the Google Cloud Build service: whenever there is a “push” commit on the master branch, Cloud Build executes a pipeline. First, at the security level, you have to allow Google Cloud Build to make infrastructure changes. In short, you have to grant the Cloud Build service account the permissions to run Terraform scripts and to manage the other Google Cloud resources inside the project.

The pipeline definition lives inside the Git repository itself and it is configured using the cloudbuild.yaml file. This file contains the workflow to be executed in the Cloud Build environment. In our particular case:

  1. Download the Terraform image from a container registry
  2. Launch terraform init
  3. Launch terraform plan
  4. Launch terraform apply
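
The steps above can be sketched in a cloudbuild.yaml like the following (the Terraform image tag is a hypothetical example; step 1 happens implicitly, since Cloud Build pulls the image referenced by each step's name field):

```yaml
# cloudbuild.yaml - sketch of the Terraform pipeline steps
steps:
  - id: 'terraform init'
    name: 'hashicorp/terraform:1.0.0'   # hypothetical image tag
    args: ['init']

  - id: 'terraform plan'
    name: 'hashicorp/terraform:1.0.0'
    args: ['plan']

  - id: 'terraform apply'
    name: 'hashicorp/terraform:1.0.0'
    args: ['apply', '-auto-approve']    # no interactive prompt in CI
```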

To get the cloudbuild.yaml file and the Terraform scripts, Google Cloud Build clones the Git repository locally. All the steps defined in the YAML file are then carried out automatically, without any manual intervention. During terraform init, the Terraform state is initialized inside a bucket on Google Cloud Storage. Based on this state, terraform plan decides which infrastructure objects need to be created, changed, or destroyed. Finally, terraform apply applies the inferred changes. At the end of the job, Terraform automatically updates the remote state in the Cloud Storage bucket so that it reflects the newly provisioned infrastructure.
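
The remote state setup mentioned above can be expressed with a GCS backend block in the Terraform configuration; a minimal sketch, with a hypothetical bucket name, could be:

```hcl
# Store the Terraform state in a Google Cloud Storage bucket,
# so every Cloud Build run shares the same state.
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"  # hypothetical bucket name
    prefix = "env/test"                   # hypothetical state path
  }
}
```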

That’s it! 🎉

This approach provisions the infrastructure elements only, not the applications. Our infrastructure includes a Kubernetes cluster for deploying containerized web applications, a Cloud SQL database for data management, Cloud Functions and Cloud Pub/Sub queues for exposing internal services, and all the necessary network connections.

All the provisioned services run in a VPC, protected by security policies controlled through Cloud IAM and Identity-Aware Proxy (IAP). The SRE master uses IAM and IAP to govern people's authorizations on the Google Cloud project, its resources, and the exposed applications.
Without IAM policies, anyone with write access to the project could launch Terraform commands from the Cloud Shell and change the infrastructure environment. Imagine if someone ran terraform destroy against your infrastructure… Boom! Your infrastructure is gone!
But never mind: with this approach, repeatability lets you fully reproduce your infrastructure from scratch and restart in minutes!

Our infrastructure comprises two different environments, test and production, each with a dedicated Google Cloud project. Since each pipeline is tied to a single Google Cloud project, we decided to manage each environment separately. This means we have two BitBucket repositories, each linked to its own Google Cloud project.

Keeping the configurations separate gives us more control over the processes and lets us assign different responsibilities depending on the environment (e.g. test is monitored by senior engineers, production by SRE masters).
Moreover, the separation allows for slight differences between the two environments. Although this is not recommended, in some cases we want to save costs, so we deploy cheaper cloud resources in the test environment.

🏁 Conclusion

This was a short description of how we automated the process of infrastructure provisioning on Google Cloud.

Infrastructure as Code is a simple and effective approach, and Terraform is a really interesting tool that deserves further study. IaC is only a medium, so you keep full control over the provisioned resources. You are free to decide, for example, whether or not to apply the immutability paradigm!

In our case, however, there’s still a lot of work to be done. Our dream is to fully manage the entire cloud resources via GitOps and IaC, even network connections and security configurations. We wish to have every operation tracked on Git: the BitBucket repository must become a single collection point where the whole infrastructure is described in a way that everyone can understand.

Thanks a lot for reading, I hope this was interesting! Stay tuned for the next episode!
In the meanwhile, take a look at our LinkedIn profile!

Quantyca — Data at Core