Improving the CI/CD pipeline for mobile app development

6 key measures Coupang took to improve the existing pipeline

Coupang Engineering
Coupang Engineering Blog
8 min read, Jul 15, 2022

By Johnson Li

Illustration of the infinity symbol which commonly represents the CI/CD pipeline

This post is also available in Korean.

Continuous integration (CI) and continuous delivery (CD) help teams deliver products to market efficiently without compromising quality. By eliminating manual tasks, they also let teams focus on more important work such as fulfilling business requirements, code quality, and security. In this post, we share how Coupang improved its existing CI/CD pipeline for mobile app development to enhance engineering productivity, focusing on the benefits of migrating from physical machines to AWS EC2.

Initial CI/CD pipeline of Coupang

We first designed the mobile CI/CD pipeline for the Coupang e-commerce app in 2020. We used feature-driven development: each feature was developed on an isolated branch and then integrated into the main branch. This encapsulation enabled multiple full-stack teams from different regions to develop new features concurrently. (Coupang’s full-stack teams are made up of members from different domains and regions, with different skill sets, who work together to scale an end-to-end business vertical.) The general overview of the pipeline is shown below.

Diagram showing an overview of the Coupang’s mobile CI/CD pipeline from writing code to deploying code
Figure 1. Overview of the mobile CI/CD pipeline

In our initial pipeline, engineers built features on a feature branch in a personal repository forked from the release candidate (RC) branch. We then manually ran QA testing and opened a pull request. Once integration testing and code review had been manually conducted and passed, the code was merged into the RC branch on the production repository for deployment. The pipeline relied on Mac minis as dedicated physical servers.

All mobile teams adhered to the unified workflow to ensure that tasks were completed in a validated and systematic way. However, the pipeline had many manual steps that could not scale to support services other than the e-commerce app.

What did we need?

We quickly realized that the pipeline needed to improve because of how frequently our engineers across the globe pushed updates: we were running more than 3,000 builds per day and releasing a new version of the app every week.

We needed to reduce the time spent downloading code from the production repository. The Android and iOS code bases were large, and hundreds of engineers worked on them every day.

We needed to increase efficiency. Our CI process involved too many steps that had to be started manually, such as checking out the code, compiling, conducting static analysis, unit testing, and even automation testing. Our CI/CD process also needed to be better integrated with other systems like the device farm, monitoring system, and A/B testing platform so that it made the larger workflow more efficient.

We needed scalability. Because our CI/CD was based on a fixed number of physical machines, we could not scale our operations automatically.

We needed to minimize potential security issues. Due to the enhanced security policy of Coupang, we had various issues we needed to address. The main issue was the need to strictly distinguish the office network environment and the production environment.

We needed to reduce maintenance costs. The costs associated with maintaining the physical machines were high. In addition, as we transitioned to remote work, it was becoming increasingly strenuous for engineers to conduct regular and urgent on-site checkups.

Mobile CI/CD 2.0

At the beginning of 2021, we decided to build version 2.0 of the system. Here are the key measures we took to upgrade it.

1. Use the Git reference repository to reduce download time

The first step in the initial CI pipeline was downloading a specific version of the code to a personal repository. This consumed a lot of network traffic and time because of the volume of code, which was projected to grow even further. To avoid downloading the entire repository each time and instead retrieve only the incremental changes, we introduced the Git reference repository by creating a local mirror Git repository on each physical build machine.
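As a rough illustration of the idea (not Coupang’s actual tooling), the sketch below builds the Git commands a build machine might run: one pair to create and refresh a local mirror, and one to clone a workspace that borrows objects from that mirror through `git clone --reference`. The function names, paths, URL, and branch name are all hypothetical.

```python
from typing import List


def mirror_commands(remote_url: str, mirror_dir: str) -> List[List[str]]:
    """Commands that create the local mirror once and refresh it later."""
    return [
        # One-time setup: a bare mirror of the remote on the build machine.
        ["git", "clone", "--mirror", remote_url, mirror_dir],
        # Periodic refresh so the mirror stays close to the remote.
        ["git", f"--git-dir={mirror_dir}", "remote", "update", "--prune"],
    ]


def clone_command(remote_url: str, mirror_dir: str,
                  workspace: str, branch: str) -> List[str]:
    """Clone a workspace that reuses objects already held in the mirror."""
    return [
        "git", "clone",
        "--reference", mirror_dir,  # borrow existing objects from the mirror
        "--branch", branch,
        remote_url, workspace,
    ]


if __name__ == "__main__":
    # Hypothetical values, printed rather than executed.
    url = "https://git.example.com/mobile/app.git"
    for cmd in mirror_commands(url, "/var/ci/mirror/app.git"):
        print(" ".join(cmd))
    print(" ".join(clone_command(url, "/var/ci/mirror/app.git", "/tmp/ws", "rc")))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. With `--reference`, objects already present in the mirror are reused locally, so only the missing objects have to come over the network.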

2. Introduce incremental static analysis

We used GitFlow as a branching model and conducted static analysis when a pull request was made for code developed in the engineer’s personal repository. However, if static analysis detected a large number of issues or blocker issues, fixing them on time was an arduous task, especially given the short release interval and the code freeze period.

To allow engineers to check for issues during development, we created dedicated server-side branches for feature development that can be used instead of a personal repository. We then set up a dedicated multi-branch CI pipeline triggered by Git commits. As a result, even before a pull request is submitted, any change to a branch of the repository automatically triggers the pipeline to conduct static analysis and send the result to the engineer.
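One way such a commit-triggered job could keep analysis cheap is to look only at the files a branch has changed. The sketch below is an assumption about how that selection might work, not a description of Coupang’s pipeline: `diff_command` builds a `git diff --name-only` call against the branch’s base, and `files_to_analyze` filters the result. The suffix list and all names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical: file types the analyzers are assumed to handle.
ANALYZABLE_SUFFIXES = {".kt", ".java", ".swift", ".m"}


def diff_command(base_ref: str, head_ref: str = "HEAD") -> List[str]:
    """git command listing files changed since the branch left base_ref."""
    return [
        "git", "diff", "--name-only",
        "--diff-filter=ACMR",          # added, copied, modified, renamed
        f"{base_ref}...{head_ref}",    # three dots: diff from the merge base
    ]


def files_to_analyze(changed_paths: Iterable[str]) -> List[str]:
    """Keep only changed files that the static analyzers can process."""
    return sorted(
        path for path in changed_paths
        if PurePosixPath(path).suffix in ANALYZABLE_SUFFIXES
    )


if __name__ == "__main__":
    changed = ["app/Main.kt", "README.md", "ios/View.swift"]
    print(files_to_analyze(changed))
```

Deleted files are excluded by the diff filter because there is nothing left to analyze in them.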

3. Organize the pipeline to enable parallel tasks

Many of our jobs were heavy and ran sequentially, so we reorganized the CI pipeline to run tasks in parallel when they did not depend on one another. This increased the throughput of the CI pipeline and the accuracy of the build, and it helped generate analysis reports faster.
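To make the idea concrete, here is a minimal sketch that groups tasks into stages whose members can run in parallel, using the standard library’s `graphlib.TopologicalSorter` (Python 3.9+). The task names echo the job types mentioned earlier, but the dependency graph itself is an invented example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; tasks within a stage have no mutual dependency."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as finished
    return stages


# Hypothetical pipeline: each task maps to the tasks it depends on.
PIPELINE = {
    "checkout": set(),
    "compile": {"checkout"},
    "static_analysis": {"checkout"},
    "unit_test": {"compile"},
    "report": {"static_analysis", "unit_test"},
}

if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(PIPELINE), start=1):
        print(number, stage)
```

In this example, `compile` and `static_analysis` land in the same stage because both depend only on `checkout`. A real scheduler could go further and start each task as soon as its own dependencies finish instead of waiting for the whole stage.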

4. Migrate physical machines to AWS for scalability and stability

Considering the stability, scalability, and maintenance cost of physical machines, we decided to migrate our CI/CD pipeline to the AWS production (prod) zone. AWS did not originally offer macOS on EC2, but it announced macOS-based EC2 instances in 2020. This allowed us to migrate the entire pipeline for Android and iOS to AWS and retire the physical machines in 2021.

Deploying the system to the prod zone addressed several of our needs: it enabled rapid scaling and failover of services, reduced the maintenance costs of the Mac mini server farm, and removed the security concerns related to communication between the office environment and the prod environment.

Diagram showing an overview of the Coupang’s CI/CD pipeline on AWS
Figure 2. Overview of the CI/CD pipeline on AWS

5. Build a monitoring system for CI/CD

Once we had migrated to AWS EC2, we built a monitoring system that runs health checks on all build agents to prevent and detect incidents.

We required various types of metrics. For hardware, these included CPU utilization, IOPS, free disk space, network latency, and load. For software, they included key health metrics like the number of online and offline nodes, the number of free or used executors, the duration of jobs or job queues, the current queue size, and the ratio of failed jobs. We also required other metrics like HTTP 2xx rate (%), HTTP 4xx rate (%), and HTTP 5xx rate (%). We registered these metrics with a monitoring tool that collects them, and used a separate tool to visualize the ingested metrics, configure alerting rules, and send notifications to a chosen channel. With the newly built monitoring system, engineers could observe the pipeline and take immediate action when necessary.
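The post does not name the tools involved, so the following is only a sketch of how a few of these metrics could be derived and checked against alerting rules. The `AgentSnapshot` type, the function names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentSnapshot:
    """Hypothetical point-in-time health record for one build agent."""
    name: str
    online: bool
    free_disk_gb: float


def failed_job_ratio(failed_jobs: int, total_jobs: int) -> float:
    """Share of jobs that failed; 0.0 when no job has run."""
    return failed_jobs / total_jobs if total_jobs else 0.0


def http_rates(status_counts: Dict[int, int]) -> Dict[str, float]:
    """Percentage of responses in the 2xx, 4xx, and 5xx classes."""
    rates = {"2xx": 0.0, "4xx": 0.0, "5xx": 0.0}
    total = sum(status_counts.values())
    if total == 0:
        return rates
    for status, count in status_counts.items():
        key = f"{status // 100}xx"
        if key in rates:
            rates[key] += 100.0 * count / total
    return rates


def alerts(agents: List[AgentSnapshot], failed_jobs: int, total_jobs: int,
           max_failed_ratio: float = 0.2,      # assumed threshold
           min_free_disk_gb: float = 10.0) -> List[str]:  # assumed threshold
    """Evaluate simple alerting rules and return human-readable messages."""
    messages: List[str] = []
    offline = sum(1 for agent in agents if not agent.online)
    if offline:
        messages.append(f"{offline} build agent(s) offline")
    low_disk = sum(1 for agent in agents
                   if agent.online and agent.free_disk_gb < min_free_disk_gb)
    if low_disk:
        messages.append(f"{low_disk} build agent(s) low on disk")
    if failed_job_ratio(failed_jobs, total_jobs) > max_failed_ratio:
        messages.append("failed job ratio above threshold")
    return messages
```

In practice the rules would live in the alerting tool’s own configuration; the Python form here just makes the arithmetic behind each metric explicit.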

6. Apply auto-scaling controls to CI pipeline

Once we had migrated to the cloud, we could scale the CI pipeline on demand by using the EC2 Auto Scaling feature. We added container lifecycle hooks that deploy new build agents and register them with, or unregister them from, the build master as needed.

Diagram showing an overview of auto-scaling for CI pipeline at Coupang
Figure 3. Overview of auto-scaling for CI pipeline
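The registration flow can be pictured with a small sketch. It assumes lifecycle events shaped like EC2 Auto Scaling lifecycle notifications, with `LifecycleTransition` and `EC2InstanceId` fields; the `BuildMaster` class is a hypothetical in-memory stand-in, not the API of any real build server.

```python
from typing import Dict, Set

# Transition names used by EC2 Auto Scaling lifecycle hooks.
LAUNCHING = "autoscaling:EC2_INSTANCE_LAUNCHING"
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


class BuildMaster:
    """Hypothetical stand-in for the build master's agent registry."""

    def __init__(self) -> None:
        self.agents: Set[str] = set()

    def register(self, instance_id: str) -> None:
        self.agents.add(instance_id)

    def unregister(self, instance_id: str) -> None:
        self.agents.discard(instance_id)  # no error if already removed


def handle_lifecycle_event(master: BuildMaster, event: Dict[str, str]) -> str:
    """Register or unregister a build agent based on the transition type."""
    instance_id = event["EC2InstanceId"]
    transition = event["LifecycleTransition"]
    if transition == LAUNCHING:
        master.register(instance_id)
        return "registered"
    if transition == TERMINATING:
        master.unregister(instance_id)
        return "unregistered"
    return "ignored"
```

A real handler would also wait for running jobs to drain before unregistering an agent and would then signal the auto-scaling group that the lifecycle action is complete.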

To decide when to scale, we applied several auto-scaling controls based on data such as the idle duration of build agents and the assessment of metrics. For example, the throughput of the CI system is calculated using the following formula.

Throughput = Number of processed jobs / Unit of time

When throughput falls below a threshold, it means that the CI system has reached a bottleneck and needs to be scaled out. Our scaling strategy also factors in other aspects, such as periods when more engineers around the world are working at the same time.
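The throughput rule, together with the idle-duration control mentioned earlier, can be sketched as a small decision function. The thresholds are hypothetical, and the check that jobs are actually queued before scaling out is an added assumption, included so that a quiet period with little work is not mistaken for a bottleneck.

```python
def throughput(processed_jobs: int, window_minutes: float) -> float:
    """Throughput = number of processed jobs / unit of time (jobs per minute)."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return processed_jobs / window_minutes


def scaling_decision(processed_jobs: int, window_minutes: float,
                     queued_jobs: int, idle_agent_minutes: float, *,
                     min_throughput: float, max_idle_minutes: float) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for the current window."""
    rate = throughput(processed_jobs, window_minutes)
    # Assumption: low throughput only signals a bottleneck if work is waiting.
    if queued_jobs > 0 and rate < min_throughput:
        return "scale_out"
    # Assumption: agents idle for too long with an empty queue can be removed.
    if queued_jobs == 0 and idle_agent_minutes >= max_idle_minutes:
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    # 30 jobs in 10 minutes is 3 jobs/min, below the assumed floor of 5.
    print(scaling_decision(30, 10, queued_jobs=12, idle_agent_minutes=0,
                           min_throughput=5, max_idle_minutes=15))
```

Time-based factors, such as known peak hours across regions, could be layered on top by adjusting the thresholds on a schedule.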

Expanding to family services

Coupang started as an e-commerce platform, but we are actively expanding our family services to food delivery, streaming, and more. Despite the differences between these business domains, we made the mobile CI/CD pipeline available to other services so that we can share resources, enhance scalability and flexibility, and spread the benefits of CI/CD.

  • Code repository: We share the same standards, practices and conventions for branch management, code synchronization, and commit messages.
  • Workflow: The CI pipeline has an environment for development and a separate one for production. Family services can choose to use both environments or just the development environment.
  • Deployment: Family services share the deployment environment where each service is treated as a tenant. To reduce the interaction between the services, we use a separate auto-scaling group for each service.
  • Utilities: We share configuration templates and utilities with family services so that any service can use the CI/CD pipeline.

Achievements

One month after the launch of CI/CD 2.0, we discovered from our analytical dashboard that we:

  • saved 5% of time on collaboration for 18 full-stack teams.
  • reduced the build time by 45%, and reduced the time spent on static analysis by 77%.
  • reduced the maximum wait time of CI jobs from 1 hour to less than 15 minutes.
  • saved 2 hours on the time spent for adding a new build agent.
  • reduced AWS cost by 16% due to autoscaling.
  • can now support 9x traffic.
  • can now support any new product within 2 man-days.
  • have eliminated security risks related to accessing the production environment from the office environment.

What’s next?

Mobile CI/CD is a critical part of the mobile development workflow. Continuous improvement is an ongoing task for us to adapt, meet demands, and sustain success. We still have several areas we want to improve to better serve our mobile teams and raise productivity and usability.

Introduce auto-scaling for iOS

Even though we have migrated to AWS, we manually scale macOS build agents due to technical challenges. We are currently in the process of testing a proof of concept for iOS.

Improve productivity further

The core value of mobile CI/CD is to improve team productivity and product quality through full automation. A large part of CI/CD hinges on automation, but we still have various manual tasks that can be automated. To keep pace with the expansion of our business and team size, we must continually improve the efficiency of the pipeline, eliminate bottlenecks, and help engineers be more productive and deliver products of the highest quality.

Develop more tools

We have already built a lot into CI/CD, such as automatic code review, static analysis, incremental static analysis, checks on A/B test modifications, and more. But we want to go above and beyond our customers’ expectations. We are developing further tools like automatic clean-up of completed A/B tests, automatic detection of code conflicts between features, and forced closure of A/B tests in case of a crash. Integrated with mobile CI/CD, these tools will help teams execute processes to the standards we have defined and greatly simplify development.

Did you enjoy reading the post? We are growing fast and always looking for talented individuals to grow with us. Check out our open positions for information about our new and exciting opportunities.
