Data platform 2022: Transforming bytes to business insights

Part II of a two-part series about Coupang’s data platform, focusing on the analytics platform

Jun 24, 2022

By Youngwan Lim, Michael Sommers, Eddard (Hyo Kun) Park, Thimma Reddy Kalva, Ibrahim Durmus, Martin (Yuxiang) Hu, Enhua Tan

This post is also available in Korean.

In our mission to “Wow the customer,” all critical business decisions at Coupang are backed up by data and statistics. We let the data talk for the customer, whether it be about a minor feature change on the app or an important decision about the company’s operational strategy. If you’re interested in learning more about how we ingest data and use it to develop ML models and conduct A/B tests, check out part 1 of this series.

In this post, we will focus on our analytics platform and how it plays a critical role in transforming raw data into business insights.

Table of contents

· Analytics platform
· Redshift data warehouse
· Hadoop infrastructure and big data
· Customer Experience Analytics Platform
· Metadata management
· Conclusion

Analytics platform

Figure 1. The analytics platform is an integral component of our current data platform architecture

The analytics platform is an integral component of the data platform that allows our teams to ingest, discover, and analyze data to reveal actionable insights that inform business strategies and decisions. It provides the tools and foundational infrastructure that support experimentation metrics calculation, machine learning model development, operational and executive dashboarding, research, and ad-hoc analytics.

Today, we will be discussing how we improved our data warehouse, Hadoop infrastructure, customer experience analytics platform, and metadata management strategies.

Redshift data warehouse

On our analytics platform, Amazon Redshift serves primarily as the data warehouse for sales and financial data, which require sophisticated access control and analytics to support critical financial reports and sales-related metrics. We also use Redshift in other critical domains such as product catalog and marketing.

Challenges

Although Redshift was a great starting solution for analytics, with minimal maintenance and quick onboarding, we began to face several problems as both data volume and user queries grew significantly.

  • Limited concurrency and queuing. Redshift is a massively parallel processing (MPP) platform, where data and processing power are split among several nodes. As we onboarded new users and use cases to the Redshift cluster, we ran into issues such as lock contention and performance degradation.
  • Lack of data sharing. Because data is not shared across clusters in MPP, we had to maintain numerous copy jobs across clusters. These jobs placed a heavy load on each cluster, not only taking away resources needed to process and query data and causing delays, but also becoming difficult to manage across numerous source and destination clusters.
  • Low scalability. Redshift is not as flexible as Hadoop in terms of scaling in and out. On-call engineers spent more time on user communications and health checks during scaling than on actual engineering.
  • Lack of support for unstructured data. Commonly used semi-structured and unstructured data required additional processing before it could be integrated into Redshift.
  • High operational costs. Cost was also a key factor that prevented us from spinning up larger clusters to meet growing processing and storage demands.

Solution

Figure 2. The Redshift data warehouse was split up into processing clusters and customer query clusters.

To resolve these scalability issues, we split the Redshift architecture into three processing clusters and five customer query clusters. Each cluster is tailored to specific load patterns, usage patterns, and access controls, which allows it to focus on one type of query and reduces contention from different kinds of jobs competing for resources.

The three processing clusters are split into one critical processing cluster and two common processing clusters. The two common processing clusters are used for raw data extraction and load via the ingestion platform and some light and less critical transformations. The critical processing cluster runs critical and heavy transformations from raw data into the data mart.

The five customer query clusters are sandboxes for each domain to conduct lightweight transformations and analytic queries. For example, these queries can be used for domain-specific ad-hoc analysis, additional transformations needed within the domain, and recurring queries that support dashboarding.

In addition, all clusters run as Redshift RA3 clusters to enable Datashare, which provides instant, granular, and high-performance access to data across clusters without the need to copy or move it manually. Although Datashare introduces additional costs, we decided these costs were more reasonable than physically copying data between clusters and adding latency for our users.
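As a concrete illustration, the sketch below wires a datashare from a processing cluster to a query cluster using the redshift_connector Python driver. The hostnames, credentials, schema names, and namespace GUIDs are hypothetical placeholders, not our actual configuration.

```python
# Hypothetical datashare setup between a producer (processing) cluster and a
# consumer (query) cluster; all identifiers below are illustrative placeholders.
import redshift_connector

# On the producer (processing) cluster: create and populate the share.
producer = redshift_connector.connect(
    host="processing-cluster.example.redshift.amazonaws.com",
    database="prod", user="etl_admin", password="...",  # placeholder credentials
)
producer.autocommit = True
cur = producer.cursor()
cur.execute("CREATE DATASHARE sales_share")
cur.execute("ALTER DATASHARE sales_share ADD SCHEMA sales_mart")
cur.execute("ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales_mart")
# Grant the consumer cluster's namespace read access to the share.
cur.execute(
    "GRANT USAGE ON DATASHARE sales_share "
    "TO NAMESPACE '11111111-aaaa-bbbb-cccc-222222222222'"  # consumer GUID
)

# On the consumer (customer query) cluster: surface the share as a database,
# so analysts can query it without any copy jobs.
consumer = redshift_connector.connect(
    host="query-cluster.example.redshift.amazonaws.com",
    database="prod", user="analyst_admin", password="...",
)
consumer.autocommit = True
consumer.cursor().execute(
    "CREATE DATABASE sales_from_share FROM DATASHARE sales_share "
    "OF NAMESPACE '33333333-dddd-eeee-ffff-444444444444'"  # producer GUID
)
```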

Lastly, we added support for Redshift Spectrum. Spectrum allows us to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Redshift tables.
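For example, a Spectrum setup amounts to mapping an AWS Glue Data Catalog database to an external schema once, after which S3-backed tables can be queried like ordinary tables. In this minimal sketch, the Glue database, IAM role, and impressions table are illustrative assumptions.

```python
# Hypothetical Spectrum example; the catalog database, IAM role, and table
# names are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="query-cluster.example.redshift.amazonaws.com",
    database="prod", user="analyst", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# One-time setup: expose a Glue catalog database as an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS clickstream_ext
    FROM DATA CATALOG DATABASE 'clickstream'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# Scan Parquet files in S3 directly; nothing is loaded into Redshift storage.
cur.execute("""
    SELECT event_date, COUNT(*) AS impressions
    FROM clickstream_ext.impressions
    WHERE event_date >= '2022-06-01'
    GROUP BY event_date
    ORDER BY event_date
""")
print(cur.fetchall())
```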

Next steps

As next steps, we hope to migrate Redshift tasks to EMR-based queries to take full advantage of the scalability options provided by EMR. Our current Redshift architecture has been designed not only to address scalability and efficiency, but also to smoothly onboard new datasets and migrate Redshift processes to Spark-based operations on EMR.

To facilitate migration, we are holding a series of training and onboarding sessions for Redshift users. As we move forward with our EMR-based data mart, existing Redshift use cases remain supported with data stored in S3, and existing Hive processes will also be migrated to Spark to gain the benefits of the Spark ecosystem.
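To give a flavor of what a migrated job might look like, here is a minimal PySpark sketch of a batch transformation that could previously have run as a Hive query. The table and S3 path names are hypothetical.

```python
# Hypothetical Hive-to-Spark migration sketch; table and bucket names are
# illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-sales-mart")
    .enableHiveSupport()  # reuse tables already registered in the Hive metastore
    .getOrCreate()
)

# Read a Hive-registered raw table, aggregate it, and write the result back
# to S3 as a partitioned Parquet data mart.
orders = spark.table("raw.orders")
daily = (
    orders
    .where(F.col("order_date") >= "2022-06-01")
    .groupBy("order_date", "category")
    .agg(F.sum("amount").alias("gross_sales"))
)
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/mart/daily_sales/"
)
```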

Hadoop infrastructure and big data

The Hadoop ecosystem has been an undeniable trend in big data over the last several years. On our data platform, Hadoop runs on EMR and supports data from search ranking and recommendations, machine learning feature stores, experimentation metrics, and more.

Challenges

In building a robust Hadoop infrastructure, we faced several challenges:

  • Strong SLA requirements. EMR spot nodes are recommended for cost savings. However, at peak usage times we ran into EMR scaling problems with high rates of spot interruptions, which we could only partially estimate under our contract. As a result, our enterprise data warehouse could not meet its 99.9% SLA on landing time. The impact on use cases was significant, forcing either large backfills or decisions made on stale datasets.
  • Limited support for upsert and deletion. Upserts and deletions are widely required by users at Coupang, who want to conduct analytics on the most recent, deduplicated, high-quality dimensional data. However, a generic Hadoop system did not support upserts and deletions well enough to meet all our needs.
  • Incompatibility with unstructured data. Initially, we used Hive as our primary platform on top of EMR. However, Hive was not an ideal option for unstructured data processing in the use cases mentioned above, and Spark became the more prevalent solution.
  • Low scalability. Logging data tripled year-over-year (YoY), while datasets across storage systems and the number of processing jobs doubled. Because our platform ran several large Hadoop clusters, each on different versions of Spark and other technologies, every cluster had its own challenges, making a single scaling solution impossible. We also wanted to scale while keeping costs relatively flat.

Solution

Figure 3. Integrating the Hadoop abstraction layer made data analysis easier for end users.

After analyzing our users' needs, we adapted an open-source project as our Hadoop abstraction layer.

The abstraction layer provides a simple interface for EMR job submissions and auto-scales for job requirements through a centralized monitoring and management system. With it, users can perform large Hive and Spark jobs in a matter of minutes through either the Airflow operator, Python, or REST API, without worrying about Hadoop or EMR maintenance. Users can take advantage of our Hadoop infrastructure more efficiently than before but with even less engineering effort.
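Because the adapted open-source project is not named here, the following REST submission is a purely hypothetical illustration of the interface described above; the endpoint, payload fields, and job names are invented.

```python
# Hypothetical job submission through the abstraction layer's REST API; the
# gateway URL and payload schema are invented for illustration.
import requests

job = {
    "name": "experiment-metrics-daily",
    "type": "spark",                                # or "hive"
    "main": "s3://example-bucket/jobs/metrics.py",
    "args": ["--date", "2022-06-24"],
    "resources": {"driver_memory": "4g", "executors": "auto"},  # auto-scaled
}
resp = requests.post("https://hadoop-gateway.example.internal/v1/jobs", json=job)
resp.raise_for_status()
print("submitted:", resp.json()["job_id"])
```

The same submission could equally go through the Airflow operator or a Python client; in every case the layer decides which EMR cluster runs the job and how it scales.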

The Hadoop abstraction layer improved our workflow in three ways: simplifying big data EMR job submissions, abstracting the physical details of data clusters for users, and allowing us to scale without increasing the number of Hadoop clusters.

Customer Experience Analytics Platform

A sizable portion of the raw data ingested on a day-to-day basis is user behavior data, such as impressions and conversions. To extract insights from this raw data, the Data Platform team used to manually build custom pipelines that transformed data points into answers to critical business questions, such as which parts of the product are used most often and which types of customers have high retention rates.

Challenges

However, with increasing use cases and data volumes, building customized pipelines on a team-by-team basis took too much time. We felt a need to shift to a self-serve model.

Our goal was to boost the productivity of our internal users, reduce time-to-insight, and reduce the need to write SQL for both surface-level and complex analytics. Surface-level analytics includes standard reports such as impression counts and clickthrough metrics, while complex analytics refers to customer journey analysis and conversion or clickthrough funnels.

Solution

Figure 4. CXAP visualization showing a Sankey diagram of user journeys from the intro page to the home_today_recommendation page and the remaining three levels

To this end, we built the Customer Experience Analytics Platform (CXAP). CXAP is a self-serve analytics portal that helps internal users perform various analyses on client-side event log data. With a simple interface and several visualization options, users can track customer journeys, funnels, and trends in a few clicks.

Furthermore, to reduce the wait time for an analysis from hours to minutes, we migrated the CXAP query backend to ClickHouse. The previous system refreshed on a one- to four-hour schedule, whereas ClickHouse ingests data within tens of seconds once output data is ready. To improve data visualizations, we integrated Apache Superset, which supports a wide range of charts with low rendering latencies.
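As an example of the interactive analysis this enables, the sketch below runs a funnel query with ClickHouse's windowFunnel aggregate through the clickhouse-driver Python client. The events table, its columns, and the page names are hypothetical, not CXAP's actual schema.

```python
# Hypothetical funnel query against a ClickHouse events table; the table,
# columns, and page names are illustrative placeholders.
from clickhouse_driver import Client

client = Client(host="clickhouse.example.internal")

# windowFunnel counts how far each user advances through an ordered chain of
# events within a one-hour (3600 s) window.
rows = client.execute("""
    SELECT level, count() AS users
    FROM (
        SELECT user_id,
               windowFunnel(3600)(
                   event_time,
                   page = 'intro',
                   page = 'home_today_recommendation',
                   page = 'product_detail',
                   page = 'purchase_complete'
               ) AS level
        FROM events
        WHERE event_date = '2022-06-24'
        GROUP BY user_id
    )
    GROUP BY level
    ORDER BY level
""")
for level, users in rows:
    print(f"reached step {level}: {users} users")
```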

Since the upgrade, we've seen a 40% increase in adoption of the self-serve tool, a strong signal that the CXAP customer experience has improved.

Metadata management

As the business grew rapidly, a vast number of tables were being created across various systems. As a result, data analysts and engineers struggled to find the metadata they needed, creating demand for a system that could search across all of Coupang's data tables.

Challenges

Although there was an existing system that allowed users to search metadata, it was developed in the early stages of our business, and its architecture did not reflect the characteristics of our new systems.

While a general RDBMS only needs to expose a table's schema, Hive manages schema information through the Hive metastore and stores the data itself in a distributed file system. To use Hive efficiently, users wanted to search and view additional information, such as the data's physical location and file format, alongside the table schema.
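For instance, assuming a Spark session with Hive support, the storage metadata users were after can be pulled with a DESCRIBE FORMATTED statement, as in this minimal sketch (the table name is illustrative):

```python
# Hypothetical lookup of a Hive table's storage metadata; the table name is
# a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# DESCRIBE FORMATTED returns the schema rows followed by detailed storage
# metadata such as the physical location and the input/output formats.
for row in spark.sql("DESCRIBE FORMATTED raw.orders").collect():
    if row.col_name in ("Location", "InputFormat", "OutputFormat"):
        print(row.col_name, "->", row.data_type)
```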

Solution

Figure 5. The architecture of our data discovery tool that enables easy metadata management and search.

The Data Platform team re-launched the data discovery tool to fulfill the above user requirements. This tool enables Coupang’s data analysts and engineers to quickly discover metadata, data lineages, freshness, and landing time. In addition, this data discovery tool automatically collects metadata of various systems. If the system information is correctly registered, its metadata is automatically collected and provided to users within the hour.

This new tool has become one of our most popular features, supporting about a hundred daily users who search for information across more than a thousand tables.

Conclusion

The improvements we made to the analytics platform have made it possible for us to support more data analytics, more quickly than ever before. Hundreds of users access the analytics platform to transform raw data into valuable insights that drive our rocket-speed business growth.

To read about how we ingest data, develop machine learning models, and conduct A/B tests on our data platform, check out part 1 of this series about our data platform.

We’re always looking for creative engineers who can help us solve complex big data issues to join us at Coupang.
