Integrating Coupang platform services to a microservice architecture

Part II of a two-part series about the transition from a monolithic architecture to a microservice architecture

Coupang Engineering
Coupang Engineering Blog
17 min read · May 25, 2018


By Jaehoon Jeong

This post is also available in Korean.

The microservice architecture at Coupang enables our engineers to independently deploy and test the hundreds of new features that are developed every day. However, it is a complex structure that is difficult to manage and operate efficiently.

In this post, we detail the challenges we faced in operating a microservice architecture, and the platform services we built to address these challenges.

Configuration Management Database

When we transitioned to a microservice architecture, services that were previously tightly coupled were split into hundreds of loosely coupled microservices. Moreover, we developed new services every month to meet our growing business needs. Eventually, we were using more than 10,000 servers to maintain our microservice architecture. As the number of services exploded, we needed a system to manage the services and the resources they used. To this end, we built the Configuration Management Database (CMDB) in 2015.

The CMDB is a metadatabase that stores information about our internal services and the assets each service uses. It stores metadata as a collection of key-value pairs with hierarchical relations.

Figure 1. A look at how metadata relationships are mapped in the CMDB

The metadata relationships are mapped as shown in Figure 1. For example, a service named Member has many components, including member-api, member-front, member-db, and more. Each component also has a physical instance, which carries its own metadata such as service configurations, a Domain Name System (DNS) entry, a Git repository, and more.
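The hierarchy described above can be sketched as a small tree of key-value records. This is a minimal illustration, not the actual CMDB schema: the `Node` class, the `kind` labels, the instance name, and the example DNS and Git values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """One hypothetical CMDB entry: a name, key-value metadata, and child entries."""
    name: str
    kind: str  # "service", "component", or "instance" (labels are assumptions)
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child entry and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        """Yield this entry and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find(self, name: str) -> Optional["Node"]:
        """Return the first entry with the given name, or None."""
        return next((n for n in self.walk() if n.name == name), None)


# Service -> component -> instance, mirroring the Member example in the text.
member = Node("Member", "service")
api = member.add(Node("member-api", "component",
                      {"git": "git@example.invalid:member/member-api.git"}))
api.add(Node("member-api-001", "instance",
             {"dns": "member-api-001.example.invalid"}))
member.add(Node("member-db", "component"))

# List every physical instance that belongs to the Member service.
instances = [n.name for n in member.walk() if n.kind == "instance"]
```

Walking the tree from a service node answers questions such as "which instances belong to Member?", which is the kind of lookup that automation between platform services would rely on.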

Not only did the CMDB aid resource management, but it also proved useful when Coupang transitioned to a cloud environment in 2017. In a cloud environment, server resources are replaced dynamically and have short lifecycles, so a system is needed to track and manage them. The CMDB met this need. The CMDB and its RESTful API were especially handy for building automation between platform services for cloud servers. When new features are deployed in a cloud environment, existing instances are dynamically reconfigured. The CMDB provides automatic alarms, enabling engineers to monitor servers with ease.

Figure 2. CMDB in a cloud environment

Coupang Deployment System

Figure 3. The lifecycle of a Coupang service

We deploy around 100 to 200 microservices each day, creating more than 2,000 new instances in the process. To expedite the service lifecycle at our scale, we developed a robust deployment system.

To meet our requirements, we built a cloud-based online deployment system that follows a blue-green deployment strategy. Our deployment system has the following functions:

  • Deployment authority control
  • Automatic configuration of service resource stack
  • Quick and cost-efficient deployment
  • Rollback within 10 seconds of deployment in the case of an incident
  • Efficient use of resources using Auto Scaling groups with target tracking
  • Graceful shutdown support
  • Health check and service warm-up support
Figure 4. Our deployment pipeline consists of three phases: stage, canary, and all. The confidence system automates the release and merge processes between the release/rc and the release/master.
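To illustrate why a blue-green strategy makes fast rollback possible, here is a minimal sketch. It is not the Coupang deployment system: `BlueGreenRouter`, its methods, and the `healthy` callback are hypothetical names introduced for the example. The idea is that the new version is installed on the idle environment, traffic is switched only after a health check passes, and rollback is just switching the pointer back to the environment that is still running.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    """Tracks which of two environments receives live traffic (illustrative only)."""

    def __init__(self, blue_version: str) -> None:
        # An empty string means nothing has been deployed to that side yet.
        self.versions: Dict[str, str] = {"blue": blue_version, "green": ""}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, healthy: Callable[[str], bool]) -> bool:
        """Install `version` on the idle side; switch traffic only if it is healthy."""
        target = self.idle
        self.versions[target] = version
        if not healthy(version):
            return False  # live traffic never reached the new version
        self.live = target
        return True

    def rollback(self) -> None:
        """Point traffic back at the other environment, which is still running."""
        if not self.versions[self.idle]:
            raise RuntimeError("nothing to roll back to")
        self.live = self.idle


router = BlueGreenRouter("v1")
router.deploy("v2", healthy=lambda version: True)  # traffic now goes to green
router.rollback()                                  # traffic back on blue (v1)
```

Because rollback only flips a pointer and does not rebuild anything, it can complete in seconds, which is consistent with the rollback requirement listed above.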

A/B tests

Service improvements are gradually released in a microservice environment. To efficiently determine which new features should be rolled out, we conduct A/B tests. An A/B test is an experiment where the existing feature, or the A feature, is shown to the A group and the new feature, or the B feature, is shown to the B group. We run the experiment for a certain period of time and then compare the performance metrics of each group to conclude whether the new feature B is better than the existing A feature.
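As a rough sketch of the mechanics, the snippet below assigns users to the A or B group deterministically and computes a conversion rate per group. The function names, the hashing scheme, and the `exposure` parameter are assumptions for illustration and do not describe Coupang's experiment platform.

```python
import hashlib


def assign_group(user_id: str, experiment: str, exposure: float = 0.5) -> str:
    """Deterministically place a user in group "A" or "B".

    `exposure` is the fraction of users who see the new feature (B).
    Hashing the experiment name together with the user ID keeps a user's
    group stable within one experiment but independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "B" if bucket < exposure else "A"


def conversion_rate(orders: int, visitors: int) -> float:
    """Orders divided by visitors; 0.0 when the group had no visitors."""
    return orders / visitors if visitors else 0.0


# The same user always lands in the same group for a given experiment.
group = assign_group("user-42", "new-checkout-button")

# Compare the two groups after the experiment period (numbers are made up).
lift = conversion_rate(130, 1000) - conversion_rate(100, 1000)
```

A real comparison would also test whether the difference is statistically significant before declaring a winner; that step is omitted here.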

Since converting to a microservice architecture, we have developed an A/B test experiment platform to verify every new feature. As shown in Figure 5, users can control what platform or device to run the A/B test on, how often it should be exposed, and the duration of the test. For each test, users can see a variety of performance metrics, such as gross merchandise volume and conversion rates. The experiment platform has been at the forefront of our data-driven business decisions.

Figure 5. A partial look at our internal experimentation platform

Coupang API Gateway

Due to the loosely coupled nature of the microservice architecture, each service has its own separate API. Although the customer only sees a single, integrated service on the Coupang app or website, internally, there are over a hundred services and over 10,000 APIs. It is challenging to manage the myriad of service APIs because the level of access granted to each consumer may differ or an API may have a variety of different versions.

Above all, the largest challenge we faced occurred when an API was altered. With each modification, the individual developer had to notify all the API users and manually identify possible errors. As the number of users and services grew, it became unproductive for one developer to tackle this task. To automate this process and efficiently manage the APIs, we built the Coupang API Gateway.

Figure 6. A screenshot of the Coupang API Gateway

Confidence System

Incidents occur for three main reasons: code bugs, performance issues, and hardware failures. Internal reviews revealed that at Coupang, incidents were largely caused by code bugs or performance issues. To automatically detect and contain these two types of incidents, we created the Confidence System.

Figure 7. A visual representation of the Confidence System at work

Our deployment pipeline is made up of three phases: stage, canary, and all. During the stage phase, we test the new feature without any customer traffic. In the canary phase, we roll out the new feature to one server and thus to a limited number of customers. Finally, in the all phase, the new feature is deployed to all the servers.

The Confidence System kicks in during the canary phase of the deployment process. It monitors the metrics of the existing servers and the canary server to determine the stability of deployment. If there is an issue with the canary phase, the system automatically rolls back the new feature and blocks its rollout to all the servers. In actual production settings, the Confidence System prevented multiple incidents from occurring and dramatically enhanced service uptime.
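The canary check described above can be sketched as a comparison between baseline and canary metrics. The post does not say which metrics or thresholds the Confidence System uses, so the `Metrics` fields, the `canary_verdict` function, and both threshold values below are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def canary_verdict(baseline: Metrics, canary: Metrics,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.2) -> str:
    """Return "promote" or "rollback" (thresholds are illustrative assumptions).

    The canary is rolled back if its error rate exceeds the baseline by more
    than `max_error_delta`, or if its p99 latency is more than
    `max_latency_ratio` times the baseline's.
    """
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


existing = Metrics(error_rate=0.001, p99_latency_ms=100.0)
verdict = canary_verdict(existing, Metrics(error_rate=0.05, p99_latency_ms=100.0))
```

Comparing the canary against servers running the existing version at the same time, rather than against fixed limits, helps separate problems caused by the new code from normal traffic fluctuations.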

Circuit breaker system

Figure 8. A high-level look at the circuit breaker system

If the Confidence System is an error prevention measure before deployment, a circuit breaker system is a fault-tolerance measure after deployment. In our architecture, numerous microservices communicate with each other in a complex network of interdependence.

To prevent one service failure from cascading to other services, we developed an internal circuit breaker system named Valve. Integrating Valve into our operations has been instrumental in improving service stability. Valve is a centralized circuit breaker system that can sync with the various internal platform services while also controlling infrastructure and services. For instance, if one server in operation fails, Valve will automatically remove the problematic server and add servers to minimize business disruption. If the error is tied to a specific service, such as the shopping cart service, Valve will hide the shopping cart button on the product page to prompt the customer to place the order without going through the cart page.
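For readers unfamiliar with the pattern, the sketch below shows a generic, in-process circuit breaker with the usual closed, open, and half-open states. It is not Valve, which is described above as a centralized system; the `CircuitBreaker` class, its parameters, and the "hide-cart" fallback value are hypothetical and serve only to show how a failing dependency can be cut off and replaced by a degraded response.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Generic circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run `func`, or return `fallback()` when the circuit is open or `func` fails."""
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold; re-open if a half-open trial fails.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def load_cart() -> str:
    raise RuntimeError("cart service unavailable")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
responses = [breaker.call(load_cart, fallback=lambda: "hide-cart") for _ in range(3)]
```

After two failures the breaker opens, so the third call returns the fallback without calling the cart service at all, which is what stops one failing service from dragging down its callers.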

Conclusion

From its humble beginnings, the Coupang service architecture has evolved into a complex microservice system operating hundreds of microservices in a timely and efficient manner. Although we have implemented various platform services such as the CMDB and the experiment platform to aid microservice operations, we believe the Coupang architecture must continue to evolve to realize our business ambitions.

Every day, we face new challenges in operating our microservice architecture and constantly seek innovative ways to address them. Currently, we are working on improvements in latency visualization and monitoring, release engineering, complexity management, and error root cause analysis.

Series index

This is part 2 of a two-part series about our transition from a monolithic architecture to a microservice architecture.

Part 1 — How Coupang built a microservice architecture

Part 2 — Integrating platform services to a microservice architecture

If any of the above challenges excite you, come join our team!
