Our Journey in Adopting CI/CD on Multiple AWS Accounts

Rizki M
Traveloka Engineering Blog
Jun 6, 2022
Isometric illustration vector created by macrovector — www.freepik.com

Push Forward is one of Traveloka’s 10 guiding principles, and Rizki and his team have lived it throughout their three-year journey of relentlessly improving Traveloka’s systems: from migrating the backend codebase (single repo to multi-repo) to enhancing service deployment and automation through CI/CD with a multi-account approach.

Rizki Muhammad is Engineering Manager for the Private Accommodation team overseeing Traveloka’s Holiday Stays product.

Being an organization in a fast-paced industry, we need to adapt constantly to changes in our internal and external environments. As one of Southeast Asia’s unicorns, we’ve come a long way in terms of technology, and it shouldn’t stop there. To keep our system performing at its best, we need to continuously improve how we design, develop, monitor, and maintain it. If that sounds like DevOps culture, it is: that is exactly what we need to do to ensure our system is stable, scalable, and maintainable.

Figure 1. DevOps Culture [1]

Our entire backend codebase used to live in one big monolithic repository, where each build took more than 30 minutes to complete for every Pull Request (PR). That’s inefficient, right? Hence, in late 2018, we started to adopt a multi-repo approach with clear boundaries across teams, resulting in up to 10x faster build times.

As one problem was solved, another emerged: our deployments frequently hit the service limits of our Amazon Web Services (AWS) account, to the point that we had to ask other teams to delete their unused resources before we could deploy our changes. Although we could have worked around the problem by requesting limit increases from AWS, that would have been a temporary fix that wouldn’t scale with our process. The long-run solution? We started the AWS multi-account initiative, in which every Product Domain (PD) is given three AWS accounts, one per environment: development, staging, and production.

By the time we completed both the multi-repo and multi-account initiatives in mid-2019, we had found yet another problem, this time related not to the codebase or the infrastructure but to our deployment process. Our services were released to production on a weekly or biweekly cycle, which turned out to be quite a hassle when we needed to release multiple services, each with its own release cycle. Because a PD usually manages more services than it has software engineers, manually releasing all those services took time, and we saw room for improvement. According to the book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, four key metrics capture the performance of a software development team: Change Lead Time, Deployment Frequency, Change Failure Rate, and Mean Time to Recover. Measured against those metrics, our lead time was 1–2 weeks at the very least, because we were still on weekly/biweekly release cycles. Continuous Integration/Continuous Delivery (CI/CD) could improve those metrics a lot because it speeds up the deployment process: instead of waiting for the weekly/biweekly cycle, we can create a release on any day we see fit.
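To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record shape, the field names, and the choice of the median for lead time are assumptions made for illustration; they are not how Traveloka collects these numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import List, Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was merged
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Summarize the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_secs = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    recovery_secs = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        # Median is an assumption; a mean or percentile would also be reasonable.
        "change_lead_time": timedelta(seconds=median(lead_secs)),
        "deployments_per_day": len(deployments) / period_days,
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recover": (
            timedelta(seconds=mean(recovery_secs)) if recovery_secs else None
        ),
    }


# Two releases in a 14-day window: one weekly, one biweekly (made-up data).
history = [
    Deployment(datetime(2022, 1, 1), datetime(2022, 1, 8)),
    Deployment(
        datetime(2022, 1, 2),
        datetime(2022, 1, 16),
        failed=True,
        restored_at=datetime(2022, 1, 16, 2),
    ),
]
metrics = four_key_metrics(history, period_days=14)
```

With weekly or biweekly releases like the made-up ones above, the lead time can never drop below about a week, which is exactly the ceiling that daily releases remove.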

At the beginning of our CI/CD implementation, our main pain points were the effort and time needed to release multiple services, because engineers needed to be on standby in case a deployment went wrong. The concept of CI/CD itself is quite simple; the tricky part is that, with our multi-account setup, every release candidate has to be deployed to Staging and pass automated tests before being deployed to Production. We were using Terraform not only to provision infrastructure and release new service versions, but also to implement CI/CD through an internal solution built on Terraform. Although that automated the deployment process, we weren’t sure it was the best solution for our needs, because Terraform currently doesn’t offer granular change and deployment restrictions that would prevent our engineers from unintentionally deploying catastrophic configurations to our infrastructure and service versions. After some online research, we found a native AWS solution (sample) that could solve this problem, although with the limitation that it is best suited for a single-AWS-account configuration.

As software engineers, we can treat that limitation as good motivation to explore creative solutions within all the known variables and constraints. Remember that the initial goal of implementing CI/CD was to reduce the effort and time our engineers spend manually releasing services into production. Hence, automation is key. By automating the release process, we can trigger multiple releases with confidence and only need to monitor the reports (e.g. the automated sanity testing report). As the adage goes, “with great power comes great responsibility”: being an engineer at Traveloka means that we have great power, and therefore great responsibility, over whether our services run smoothly (or not).

Back in 2019, there were not many references on adopting AWS CodePipeline with multi-account configurations, so we had to make do with the tools (and the knowledge) that we had. As shown in Figure 2 below, our goal was to automate the deployment process by following certain procedures (staging deployment and a pre-production deployment approval mechanism) while also collecting the four key metrics mentioned earlier. One of the perks of AWS CodePipeline, AWS’s managed CI/CD service, is that individual steps can be adjusted flexibly. For example, we can run automated tests before requesting deployment approval to ensure that existing capabilities are not broken and that our changes are stable enough to be released to production.

Figure 2. Simplified Deployment Pipeline Flow
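The gating logic of such a flow can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the stage names and the `run_pipeline` helper are hypothetical, and nothing here calls the real AWS CodePipeline API. It simply shows that a release reaches production only if every earlier stage succeeds.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(release: str, stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action(release):
            break
        completed.append(name)
    return completed


def make_stages(tests_pass: bool, approved: bool) -> List[Stage]:
    """Build the simplified flow: staging, tests, approval, production."""
    return [
        ("deploy-staging", lambda release: True),
        ("automated-tests", lambda release: tests_pass),
        ("manual-approval", lambda release: approved),
        ("deploy-production", lambda release: True),
    ]


# Hypothetical release identifiers, for illustration only.
blocked = run_pipeline("release-1", make_stages(tests_pass=False, approved=True))
shipped = run_pipeline("release-2", make_stages(tests_pass=True, approved=True))
```

In the first run the failing tests stop the release right after staging, so the approver is never asked; in the second run all four stages complete.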

Looking back at the three-year journey, we went through multiple phases of adopting AWS CodePipeline with different architectures: gathering requirements (mostly about the pain points in our engineers’ release procedure), researching and analyzing the available options (carefully considering their pros and cons), building a Proof of Concept (PoC), collecting feedback on the idea (and the PoC), enabling automated releases into the staging environment for some of our services, improving the process based on what we learned from those services, enabling automated releases into the production environment (with an approval mechanism to prevent unintended releases), improving the production process with automated sanity testing, and so on. I believe there’s no one-size-fits-all solution. Hence, we have to remember our goals and adapt the solutions accordingly (every service has its own infrastructure and characteristics).

For the past four years (I joined Traveloka back in 2018), we have been constantly improving our development and deployment processes, such as migrating our repository from one big monolith to multiple repos and moving our deployment automation from Ansible, Jenkins, and Terraform to, as of today, a native AWS solution (CodeBuild + CodeDeploy + CodePipeline). Push Forward is one of Traveloka’s ten principles, and it’s in our DNA: whatever we have achieved, we’re always looking forward to the next improvement to our system. Two of the several potential improvements to our deployment pipeline are enabling Integration and Load Testing, to ensure every change is safe to release and scalable (a passing load test would also confirm that our Auto Scaling Policies are effective), and handling our services’ multi-load-balancer applications.

As software engineers, it’s our responsibility to solve problems by applying our technical expertise (e.g. automating processes to save time in the long run). As software architects, it’s our duty to ensure that our system or solution is both scalable and maintainable. And as engineering managers, it’s our job to make sure that our team performs well. So, to my fellow engineers, architects, and managers: we should never settle or be easily satisfied with our current system, because there’s always room for improvement given enough time. Remember that our job is to create great solutions that solve problems efficiently, not the other way around. Sometimes we don’t realize that we have a problem because it has become part of our routine, even though we could solve it with automation and make our (and others’) lives easier in the long run. Hence, it’s always good to be intentionally and acutely aware of our surroundings.

If you’re interested in continuously challenging yourself while improving our customer and developer experience, come and join us.
