Scaling Infrastructure as Code Culture in Xendit

How we are using Terraform and Atlantis to implement Infrastructure as Code and where we are heading next

Published in

Xendit Engineering

8 min read · May 23, 2022

Orchestrating Xendit Infrastructure using Terraform

Xendit’s vision is to build digital infrastructure in the Southeast Asia (SEA) region. We started in Indonesia and are expanding to other SEA countries. Digital infrastructure is the bedrock of our online lives. It gives end-users a seamless payment experience for activities such as buying daily necessities and paying for tickets. In this post, we talk about how we use Infrastructure as Code (IaC) to manage the underlying IT infrastructure that enables Xendit’s digital infrastructure, and where we are heading next.

What is Infrastructure as Code?

Modern IT infrastructure provides an Application Programming Interface (API) for users to interact with it. This API enables various ways of managing IT infrastructure, typically through tools that call the API on our behalf. There are two paradigms for such tools. In the declarative paradigm, we give the tool the intended state that we want. In the imperative paradigm, we give the tool a set of instructions that it needs to perform.

Infrastructure as Code (IaC) is a way of managing infrastructure by combining the API of modern IT infrastructure with the declarative paradigm. In IaC, we use code to declare to the system the end state of our infrastructure. The system figures out the step-by-step actions required to bring the current infrastructure to the intended state. The system in IaC is smart in the sense that it can calculate the necessary actions and then carry them out.

The other paradigm of managing infrastructure is imperative. In this style, we provide the system with the step-by-step actions that it needs to carry out to achieve the intended state. The system in the imperative paradigm will happily accept instructions and perform them without needing to know the intended state.
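To make the declarative style concrete, here is a minimal sketch in HCL. The bucket name, region, and tags are assumptions chosen for illustration and are not taken from Xendit's repositories.

```hcl
# Minimal sketch of a declarative configuration.
# The bucket name, region, and tags are illustrative assumptions.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "ap-southeast-1" # assumed region
}

# We only state the end result we want: one bucket with these tags.
# Terraform compares this with what already exists and works out whether
# it has to create, update, or leave the bucket alone.
resource "aws_s3_bucket" "reports" {
  bucket = "example-reports-bucket" # hypothetical name

  tags = {
    team        = "infrastructure"
    environment = "staging"
  }
}
```

An imperative equivalent would be a script that calls the create-bucket API and then the tagging API, in that order. Running such a script twice would try to create the bucket twice, whereas a second terraform apply on an unchanged configuration finds nothing to do.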

IT Infrastructure in Xendit

Xendit uses modern IT infrastructure ranging from on-premises data centers to cloud providers such as AWS, GCP, and Alibaba Cloud. We use multiple infrastructure providers to meet compliance requirements from standards such as PCI DSS and ISO 27001 and from regulators such as BI/OJK (the banking and payment regulatory bodies in Indonesia). Meeting the compliance requirements is not enough. We also need to move fast for the business to survive and thrive. All this infrastructure from diverse providers is chiefly owned and managed by infrastructure engineers/SRE under the Infrastructure Engineering organization. At Xendit, infrastructure engineer, infra engineer, and SRE are the same role. We use the terms interchangeably in this article.

Infrastructure as Code in Xendit

Terraform is the main IaC tool we use to manage IT infrastructure at Xendit. In Terraform, you write the intended state for your infrastructure in a declarative language called HashiCorp Configuration Language (HCL). We currently have more than 1,000 Terraform projects owned and managed by the infrastructure team. We expect this number to grow, since Xendit is still growing and we need more IT infrastructure to support the growth of the business.

All those Terraform projects are stored in the infrastructure team’s git repository on GitHub. We turn to Atlantis to help us orchestrate the execution of Terraform projects. The infrastructure team uses GitHub Pull Requests (PRs) to ask for code reviews from other infrastructure engineers. Atlantis brands itself as Terraform Pull Request automation, which fits perfectly with our workflow. The picture below illustrates our GitHub and Atlantis workflow.

We have dozens of private Terraform modules used internally at Xendit. Terraform modules are reusable components that we use to provision standardized IT infrastructure. For example, we have an internal Terraform module to provision database clusters. In a microservice architecture, it’s desirable to give each microservice its own database. Having a Terraform module for this specific use case enables us to provision databases faster and in a standardized manner.
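As a rough sketch of what calling such a module could look like, consider the block below. The module source, its version tag, and every input variable are hypothetical, since the interface of the real internal module is not shown in this post.

```hcl
# Hypothetical call to an internal database module.
# The source URL, version tag, and all inputs are assumptions for illustration.
module "payment_service_db" {
  source = "git::https://github.com/example-org/terraform-database-cluster.git?ref=v1.0.0"

  name           = "payment-service" # one database per microservice
  environment    = "staging"
  instance_count = 2
}
```

The point of the design is that each new microservice needs only a short block like this one, while the details that should be identical everywhere live inside the module.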

Journey of infrastructure changes

Having a safe and secure workflow to deliver infrastructure changes into the production environment is important for all information technology companies. As an IT company that offers financial services, we need to abide by various compliance regulations. We won’t go deeper into each of the rules, but we will focus on the following two principles:

The intent and reason behind an infrastructure change must be captured in a long-term written format
Changes must be reviewed and approved by another person (a.k.a. the two-person rule)

Changes can originate from many sources, such as an infrastructure roadmap, a conversation in Slack, or action items from post-mortem meetings. The intent and reasoning for the changes are captured in Jira tickets. Once a ticket is vetted and approved, it is assigned to the corresponding infra engineer. The engineer implements the changes in the Terraform configuration on their local machine. Once that is done, the engineer pushes the changes to the GitHub repository and creates a new PR from them. Atlantis receives a notification when a PR is created or updated. It runs the terraform plan command and updates the PR with the result. See the screenshot below for an example.

GitHub PR planning Terraform via Atlantis

The PR description contains a link to the corresponding Jira ticket to help reviewers check the intent and reason behind the change. Reviewers also use the plan output from Atlantis to make sure that the implementation is sound. At this point, the changes can’t be delivered yet, since Atlantis is configured to check whether the PR has been approved. Once reviewers are satisfied with the PR, they approve it. The PR review process is then complete and the infrastructure changes are ready to be delivered. The engineer instructs Atlantis to apply the changes by posting a comment, such as atlantis apply, on the PR page. Atlantis reports back the result once it has completed the process. The last step of the workflow is cleaning up. The engineer merges the PR into the main branch and deletes the old branch. See the screenshot below for an example.

GitHub PR applying Terraform via Atlantis

Simplifying access management for joiner/mover/leaver

The number of engineers is growing fast in Xendit. When a new engineer joins Xendit, Infra engineers need to provide access to various parts of our internal systems for the new engineer. There are dozens of internal systems we use to support engineer productivity. For example, we need to give engineers access to our Kubernetes environment for them to start development. Another example is managing access to Datadog (monitoring platform) and PagerDuty (alerting platform).

The details of the access vary depending on the product team the engineer joins. We codify the access permissions for each product team in that team’s own Terraform projects. So when an Infra engineer receives a ticket to onboard a new engineer to a certain team, they look up the Terraform projects for that team and append the engineer’s info there. The changes go through a PR before they get applied via Atlantis. The workflow is depicted below.

A similar process happens when engineers move to another product team or when they leave Xendit. Instead of adding the engineer’s info to the Terraform projects, Infra engineers remove it from them.

Codifying the access permissions for each product team in Terraform makes it easy to onboard engineers onto product teams and off-board them again. We can be sure that we are not missing any system when granting or revoking access. It also takes Infra engineers less time to manage access control this way than to configure it through each system’s administrator user interface.
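A per-team project of this kind might be sketched as follows. The team name, email addresses, module path, and module inputs are all hypothetical; the post does not show how the real projects are laid out.

```hcl
# Hypothetical per-team project. The member list is the only thing an infra
# engineer edits when someone joins, moves, or leaves.
locals {
  team_name = "payments"

  members = [
    "alice@example.com",
    "bob@example.com", # appended in the onboarding PR
  ]
}

# Assumed internal module that fans the member list out to each system
# (for example Kubernetes, Datadog, and PagerDuty). Its path and inputs
# are illustrative only.
module "team_access" {
  source = "../modules/team-access"

  team    = local.team_name
  members = toset(local.members)
}
```

With a layout like this, onboarding is a one-line addition and off-boarding is a one-line removal, and the terraform plan output in the PR shows reviewers exactly which access would be granted or revoked.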

Standardizing basic microservice monitoring

Xendit uses a microservice architecture. We have a lot of services, which can generally be categorized into three types: web servers, queue runners, and cron jobs. A web server receives HTTP requests from a client and returns HTTP responses, while a queue runner receives its input from a message broker. A cron job is a task that runs periodically in the background. For each type of service, we define the basic aspects that we want to monitor, such as service health, availability, and reliability. We observed that the basic monitoring needs are similar across product development teams.

We use Datadog as our monitoring system. Operational metrics for the microservices are sent to Datadog. We have an internal Terraform module that creates standardized basic monitoring for the microservices. Each product development team uses the Terraform module to expedite building standard monitoring for their service. This allows them to focus on adding monitoring specific to their needs. Having an extensive monitoring system is important for product teams, since they own the development and operation of their microservices. The workflow to set up the basic monitoring for microservices is illustrated below.

Provisioning basic monitoring for microservices
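The following sketch shows how a product team might combine the standard module with a monitor of its own. The module path, its inputs, the service name, and the metric name are assumptions made for illustration; only the datadog_monitor resource type and its name, type, message, and query arguments come from the public Datadog provider. Provider configuration is omitted.

```hcl
# Hypothetical call to the internal basic-monitoring module.
# The path and inputs are illustrative, not the real module interface.
module "basic_monitoring" {
  source = "../modules/basic-monitoring"

  service_name = "payment-api"
  service_type = "web_server" # assumed values: web_server, queue_runner, cron_job
}

# Team-specific monitor added on top of the standard set.
# The metric name and threshold are made up for this example.
resource "datadog_monitor" "queue_depth" {
  name    = "payment-api queue depth is high"
  type    = "metric alert"
  message = "Queue depth is above the expected level."
  query   = "avg(last_5m):avg:example.queue.depth{service:payment-api} > 1000"
}
```

Splitting the configuration this way keeps the standard monitors identical across services while leaving each team free to add what only they need.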

What’s next?

We are not done yet with Infrastructure as Code at Xendit. An Infrastructure as Code culture changes the way we see infrastructure work. It has already given us a lot of benefits and we want to push it further. We are working hard to make it the default way of provisioning infrastructure across Xendit engineering. Expanding IaC coverage, in terms of both the number of use cases and the engineering teams involved, is the natural progression for us. One such use case is compliance audits. Many activities before, during, and after audit days are performed manually. We are thinking of applying IaC tools and practices to automate many aspects of the audit.

Terraform and Atlantis are the heart of our IaC workflow. We are open to exploring other tools that complement our existing IaC tooling, because we understand that every tool has its limits. IaC practices are evolving and we need to keep an open mind to new trends. One of those trends is the GitOps style of managing resources in Kubernetes. We are evaluating the trend and adapting our tools and practices to reap the full benefit so we can pass it on to our customers. While it’s important to keep track of trends, we need to stick to our own vision, which is:

Infrastructure should only be one Pull Request away from any engineer at Xendit

Conclusion

This post shows how we use Terraform and Atlantis to provision infrastructure following the Infrastructure as Code principle at Xendit. We have more than 1,000 Terraform projects that govern various parts of Xendit infrastructure. We showed a few use cases where Terraform solves big problems for us: faster provisioning of database clusters, less time spent onboarding and off-boarding engineers, and standard monitoring for services. It’s safe to say that IaC is the default way of provisioning infrastructure for Infra engineers at Xendit. We are not done yet, though. We want IaC to be the default way of managing any infrastructure across Xendit engineering. Any engineer should be able to deliver infrastructure changes safely and securely. Infrastructure should only be one Pull Request away from them.

If this kind of work inspires you, join us. We have a lot of job openings at https://www.xendit.co/en/careers/job-opening/