Big Data Platform: Evolving from start-up to big tech company
How the data platform evolved at Coupang and the roadmap ahead for future investments
Coupang aims to revolutionize the e-Commerce market by reimagining the mobile shopping experience through innovative services such as last-mile delivery. To achieve this, we rely heavily on data-driven decisions at every step of the way, from the beginning of the customer journey to optimizing our fulfillment centers to even finding bottlenecks in our pipeline. Accordingly, the Coupang Big Data platform is continually evolving to keep up with the constant demand for excellence in scale, availability, latency, concurrency, and fast data growth.
This article outlines the evolution of the data platform at Coupang and the roadmap ahead for future investments.
Table of contents
Phase I: Relational databases (2010–13)
Like many other start-up companies, we relied on relational databases (MySQL and Oracle) to store, process, and retrieve data during the first few years. With the relatively limited amount of data we had at this time, these traditional relational databases were effective for both our engineering and business teams.
However, when the Data Science & Platform team was formed to find data trends and run data science experiments that would push business growth, the number of business intelligence queries and database users increased rapidly. Eventually, speed and scale became issues and we had to look beyond relational databases.
Data as a discipline began to gather more attention around this time, largely due to growing investments in data made by Big Tech companies. Data came to be viewed as an integral resource, not only to traditional business models but also to new data-based applications and services that were becoming widely popular. At Coupang, we also looked to Big Data platforms as the answer to our problems.
Phase II: The beginning of big data (2014–16)
As existing relational databases became inadequate for the explosive amount of data we were collecting, Coupang invested in Big Data infrastructures such as Hadoop and massively parallel processing (MPP) databases.
To effectively use the data for business decisions, data-based applications were developed, and necessary departments were formed. For instance, there was a need to log and collect customer interaction data that would later be processed in the Big Data infrastructure and used in search, recommendation, and customer analytics. To this end, the Log Collection team was formed. The Log Collection team ultimately moved the needle on Coupang’s Big Data infrastructure by ensuring customer data was centralized and accessible. Data became an even more essential part of our business than ever before, whether it be for weekly business reviews (WBRs) or for measuring weekly top line metrics of each department.
However, as the number of daily active users and queries grew yet again, systems such as MPP with limited concurrent query capabilities became significant bottlenecks in the data pipeline. Hive and MPP would hang from 8 AM until midnight due to such limitations.
Another bottleneck occurred with hardware used for various data needs, including operational reporting, business intelligence, ETL, and data science. One hardware became almost the single point of contention for all users’ data retrieval, eventually leading some frustrated users to run their most important queries either early in the morning or late at evening.
Furthermore, this infrastructure operated on on-premise labs, making it difficult to dynamically add nodes to clusters and scale. Although a wide range of engineering and process-based techniques were applied, most provided only temporary relief and the data infrastructure was constantly loaded.
In a fast-paced startup, new cases for data science, business intelligence, and data-based applications were all waiting for a new and more effective version of the Data Platform. As a data team, we saw many opportunities for growth when encountered with such interesting problems.
Phase III: Rebuilding and migrating for the long term solution (2016–17)
To address such shortcomings, we migrated to cloud infrastructure in late 2016 to 2017. Our team re-architected the platform and rebuilt several layers to scale to cloud and to handle ten to twenty times more growth from all business dimensions. The following served as motivating factors to prepare for the long term:
- Growing traffic. Most traffic on eCommerce platforms are driven through search, which requires data infrastructures to collect and process customer interactions with transactional, catalog, pricing, and experimentation data. To ensure search wasn’t disrupted during peak traffic hours, we needed a robust and reliable data infrastructure that could handle our rocketing user growth.
- Increasing data users. As the amount of data we acquired grew, the number of business members who wanted to run queries also grew. For instance, the Experimentation and Pricing platform needed Big Data; Finance & Global Operations businesses planned to join heterogenous data; marketing teams needed social media data integration. Our data platform needed to cater to such differing needs with hundreds of users running thousands of queries every day.
With these two specific needs in mind, we built a data platform that was scalable and easier to use, minimizing customer and user friction. We will discuss some of the data platform layers in more detail.
Big data infrastructure
Using Hadoop and other container-based Big Data tools, we scaled the data platform to our increasing computing needs, enabling users to process large volumes of data.
Collecting customer interaction data on mobile apps and websites are common data practices in the modern world. At Coupang, there was a focus on building the next phase of this instrumentation framework so that user action logs could be collected and processed for various customer insights.
Enterprise Data Warehouse
The Enterprise Data Warehouse (EDW) was rebuilt as star schemas onto cloud-based MPP systems. There were three main types of data warehouse clusters:
- Data Acquisition Platform (DAP): Transactional data used for ad hoc exploration, operational reporting, and system applications that require source datasets were moved to the cloud data warehouse solution named DAP.
- Sandbox: Since numerous teams wanted to conduct ad hoc analysis by creating their own tables or staging datasets, we provided them with focused sandboxes containing the datasets required by each specific team.
- Reporting: These user-facing clusters held all production data and were used in data queries.
Overall, our new Big Data platform was easily scalable and accessible, serving the needs of both our customers and business users.
However, the data platform still had some shortcomings in data concurrency, MapReduce and DW scaling, storage decoupling, and logging data quality:
- Big data operational challenges. Each team wanted to own their own Hadoop cluster and as the number of clusters grew, operational difficulties arose due to inefficient practices in writing data processing jobs, idle clusters etc.
- Compromised trust on logging data quality. Because of the rapid speed of business growth and app and web development, some logs did not have clear specification in the legacy instrumentation platform. Unnoticed changes and undetected bugs frequently harmed consumer metrics and threatened data quality confidence. A new instrumentation framework had to be designed from scratch to earn the trust of producers and consumers with reliable delivery and concrete metadata.
Phase IV: Big data as a service (2019- Present)
By 2019, the Data Platform evolved to serve Coupang’s multiple business uses and scenarios to scale. We implemented various new notable policies and practices to the existing Data Platform and EDW that addressed operational difficulties that arose due to rapid expansion of business. In addition, we developed a brand-new instrumentation web logging framework to mitigate issues of trust previously discussed. The new platform not only increased efficiency but also improved productivity for all our engineering and operations teams. We will discuss some of the major changes we made below.
Big data infrastructure
Our Big Data platform uses several types of large Hadoop clusters. Over the past few years, we had to revise our cluster management policies and deployment strategies to support our explosive business growth. To provide a stable and scalable platform, we made improvements in the following areas:
- Cluster lifecycle: We manage Hadoop clusters with different lifecycles based on the user’s workload. The lifecycle of a cluster is tightly controlled with regard to cost efficiency and business workloads. Different types of clusters have access to shared Hive Metastore and cloud storage so that all users can consistently use the same Hive tables. With the new lifecycle policy, we improved user productivity and cost effectiveness.
- Scaling policy: In the past, we used auto-scaling provided by the cloud service, but it failed to meet the practical needs of data consumers. Consequently, we additionally implemented a schedule-based scaling feature that scales up in advance after analyzing traffic concentration times. The mixed use of schedule-based scaling and auto-scaling significantly improved the platform experience.
- Backed machine image: We needed to install various softwares on a computing instance for a Hadoop cluster, including OS, Hadoop and ecosystem, and monitoring and security agents. We created a virtual machine image with the required software and various plug-ins. The image was managed using Packer and several different virtual machine images for the consumer’s workload was prepared. After introducing the backed machine image, cluster installation times were reduced by more than 60%.
Web Logging Platform
Since Coupang’s early days, the instrumentation platform that gathered customer interaction data was built upon an external solution, which was defective and lacking features we required. As a result, many domain teams used yet another external service to calculate metrics and visualize data. To solve this problem from the ground up, a whole new platform was designed and implemented. After a long and difficult migration and verification process, the new platform completely replaced the old one.
Here’s a simple journey of the logs. First, producers register schemas to the metadata service. Usually, they generate (static-typed) codes from the schemas and automatically inject the generated code into the corresponding app or website to minimize human errors. When released, clients produce real logs to the collection pipeline, where the collection servers receive the logs and produce messages to the message queue. Then, the data loaders consume the messages to save into cloud storage. As the first data consumers, session batch jobs create session data with additional attributions for all batch data consumers.
Let’s discuss some of the Web Logging Platform’s key components.
- Collection pipeline (collection servers, message queue, data loaders): Using managed MQ service from the Platform Service team, our new logging platform provides real-time data stream for real-time data consumers and near-realtime data tables to batch consumers without loss, duplication, or corruption. Users can write their own loaders with their own SLA & ETL logics, using the queues provided by WLP.
- Metadata service: In the new platform, every log data must have a registered schema with an owner and registered consumers who review and subscribe the schema change alerts. This single source of truth on the log data structure is the foundation of other services such as producer UI codes and consumer queries.
- Validation service: Without interfering with data delivery, validation service checks every single log in the pipeline with the schema in the metadata service. Results are periodically saved and reported to the producers and consumers of the logs, while triggering real-time alerts.
- Test & monitoring service : The new platform provides a service with web-based UI to track and validate the logs of any given user or device for both of QA testing and production in real-time. This service also provides scenario-based validation, checking the semantics and syntax for QA testing.
Enterprise Data Warehouse
Our data platform’s master data warehouse environment consists of ORC files, accessible through Hive/Hue and Presto/Zeppelin. While MPP based DW environments (Sandbox) continue to be a tool offered to EDW users, they are only a small subset of EDW. Their primary function is to provide an environment where users can build pre-production sandbox tables to manage their domain’s business and serve as short-term reporting if needed. For long term reporting or sharing, users are encouraged to move user-owned or managed tables to file based tables in cloud storage for enhanced scalability and reliability.
In addition to improvements in the big data platform and the logging platform, some other important new features to the Coupang’s data platform include:
Data quality checks
The Data Team has built a framework which compares row counts and entire rows of data using the row HASH to ensure data accuracy. As part of technical testing, we also run DQ checks such as primary key, null values, and more. The framework also allows the individual developer to plug-in business-related SQL statements to check data accuracy against the real world. In addition, we are leveraging open source frameworks for constraints or threshold-based data checks especially for Big Data tables.
Data abnormality notifier
In a data-driving company like Coupang, timely notifications of data abnormalities allow us to act fast. The Data Notifier alerts us of data abnormalities, almost as soon as the data is written. For example, imagine a new Android version of Coupang was released last month with a logging bug that dropped data. In the past, it took us three days to notice such data loss because we had to wait for users to start installing the app. With Data Notifier, such abnormalities are caught within two hours of the app release.
Service Level Agreement
With our new data platform, we provide daily email notifications to users at 9AM KST letting them know whether their mart tables are ready or not. In addition, we are working on building an online Service Level Agreement (SLA) report system available to all users to further SLA transparency.
Data Discovery Tool
Through the Data Discovery Tool, users can submit tags and descriptions for tables or columns in the data platform. These tags and descriptions can be seen and searched by other users, creating an open platform that can grow organically on its own. This tool enables self-service of data discovery for all users, making data exploration easy and improving the productivity of hundreds of data users at Coupang.
EDW Management System
EDW Management System (EMS) is a framework to create and manage data pipelines, support automation of data acquisition, and automate Airflow DAG generation using metadata. This framework also supports monitoring, backfill, and downstream dependency features for data engineers, significantly increasing their productivity. EMS also provides early SLA detection feature to help on-call engineers.
With the latest additions to our Big Data platform in 2019, we were able to provide our users with a more efficient and robust system for data management and discovery. However, at Coupang, we never stop trying to improve. Below, we discuss some of the work underway to advance our Big Data platform to the next level.
Hadoop Abstraction Layer
We are working on providing a Hadoop Abstraction layer that will simplify the user job submission process and Hadoop cluster management. From the end-user perspective, a Hadoop Abstraction Layer erases the physical details of various Hadoop resources and provides a simple job execution interface (Airflow Operator, Python, Rest API) for submitting Hive and Spark jobs. It can also monitor Hadoop cluster resources and allocate the correct amount of resources to each user job, improving system efficiency.
Log quality monitoring
An app’s logging behavior can change after release for various reasons, including due to a feature working later through A/B testing or changes in the domain API that can affect clients. For these reasons, QA testing during the release phase is not enough to ensure data quality. For fail-proof QA, we are developing a full-featured log quality monitoring service that runs all important log testing on real devices with full automation throughout the lifecycle of the released apps. Moreover, the quality monitoring system will check the quality of the attributions from our log pipeline, not produced by the client side, to enhance data confidence.
Metadata service is the single source of truth for all schemas of all logs. A log schema is like a contract between the producers and consumers that establishes trust on log data. Therefore, all consumers should be notified of any changes to a log’s schema and any newly registered schemas not yet in production. Through such notifications, data consumers can review the changes and prepare their downstream jobs. We are working on an automatic schema discovery tool on the logging platform that will collect and analyze queries from consumers to find all related schemas and automatically update their subscriptions without any direct user input.
Coupang’s Big Data platform is the result of cooperation between multiple teams across the globe, including teams such as Big Data Platform, Enterprise Data warehouse, Web Logging Platform and Technical Program Management.
Our vision for Coupang’s Big Data platform is to provide intuitive and reliable tools to manage, process, and visualize data at scale. We hope to develop tools that will empower our internal users to build large-scale data applications with ease and make data-driven decisions for the long and short run.
If you’re interested in developing the next phase of Coupang’s Big Data platform with us, you can find opportunities here.