Evolving the Coupang Data Platform

Coupang is revolutionizing the e-commerce market of South Korea with last-mile delivery and by rethinking product discovery on a mobile-first platform. Our mission is to create a world in which Customers ask, “How did I ever live without Coupang?”

As a data-driven company, we rely heavily on data for decisions of every kind, from the customer journey to the best algorithm for optimizing space in a fulfillment center. We use data to find bottlenecks at each step of the process and move faster as a company. To keep up with the constant demand for excellence in scale, availability, latency, concurrency, and fast data growth, we must constantly evolve our Data Platform.

This article describes the journey of the Data Platform at Coupang and gives a glimpse of future investments.

Phase I — Early years

During the first few years (2010–13), like many startups, we relied on relational databases (MySQL, Oracle) to store, process, and retrieve data. One of the early investments came when Coupang’s founder, Bom Kim, decided to invest in a Data Science & Platform team. This team was chartered to find data trends, run data science experiments, and help the business move fast. This worked well for small data sets and business reporting.

This was also the time when data as a discipline was gaining more and more popularity, thanks to huge investments in data by Big Tech companies; as a result, data-informed business decisions as well as many data-based applications and services gained popularity.

However, as the number of business intelligence queries and users increased, speed and scale became an issue, and we had to look beyond relational databases.

Phase II — Era of On-premise Hadoop, Hive & MPP systems

With the fast growth of the company, Coupang invested in well-known data infrastructure choices: Hadoop for big data and a Massively Parallel Processing (MPP) system for the data warehouse. This was also the era (2014–2016) when many new departments in the company came into existence, and demand for data took a logical evolution from business reporting to inputs that let many other departments measure their top-line metrics on a weekly basis. The culture of Weekly Business Reviews (WBRs) prospered, and data-based applications were developed. This is significant, as both the production of data and the need for it were increasing across the company’s landscape. There was a need to log and collect customer interactions on Coupang’s app and website, as well as for big data infrastructure to process them and use them as signals in search, recommendations, and customer analytics. A log collection team was formed and began collecting customer interaction data, which helped move the needle.

However, as daily active users and query volumes increased tenfold again, systems like the MPP, with its limited concurrent query capability, became a big bottleneck. The Hive and MPP systems would hang during business hours, between 8 AM and 12 PM. Another fundamental bottleneck was that one set of hardware served every data need: operational reporting, business intelligence, ETL, and data science. It became a single point of contention for everyone in the company trying to get data. While many users grew frustrated, some took the approach of running their most important queries either early in the morning or late in the evening. As a data team, we saw many opportunities and interesting problems.

In our on-premises labs, it was also difficult to add nodes to clusters. A lot of engineering and process-based techniques were applied, some of which provided temporary relief, but the data infrastructure remained under constant load. In a fast-paced startup, new use cases for data science, business intelligence, and data-based applications were all waiting for a new version of the Data Platform.

Phase III — Re-architect & Migrate — Long Term Solution

From late 2016 to 2017, the team re-architected the platform, rebuilding its various layers to scale on cloud infrastructure. We migrated to the Cloud in this phase and built the base components so that we could handle 10X or 20X growth from every perspective.

At this point, we would like to mention a few important data platform needs in the company, which were also a huge motivation to prepare for the long term.

  • Most traffic on an e-commerce platform is driven through search, which requires a robust and reliable data infrastructure to collect customer interactions and process them along with transactional, catalog, pricing, and experimentation data.
  • Hundreds of business members want to run thousands of queries per day
  • The Experimentation and Pricing platforms need big data
  • Finance & Global Operations businesses need to join heterogeneous data
  • Marketing teams need social media data integration

A brief overview of the different layers:

Log Collection: Collecting customer interaction data from mobile apps and the web is common practice in the modern data world. There was a huge focus on building the next phase of the instrumentation framework so that user action logs could be collected and processed for various customer insights.

BigData Platform: Hadoop, together with other container-based big data tools and components, enables teams and users to process large volumes of data and allows us to scale to increasing compute needs.

Enterprise Data Warehouse (EDW): The EDW was rebuilt with proper data modeling patterns, such as star schemas, on a cloud-based MPP system. There are three main types of data warehouse clusters:

Data Acquisition Platform (DAP): Source datasets of transactional data, used for ad hoc exploration, operational reporting, and system applications, are moved into this cloud data warehouse solution.

Sandbox: Many teams want to do ad hoc analysis by creating their own tables and staging data sets, so we provide them with sandboxes focused on the datasets each team requires.

Reporting: These are user-facing clusters for querying data. All production data is available in these clusters.

The new platform was scalable and easier to use, and in many areas there was almost no customer friction. It still had some gaps, though: concurrency of user queries, number of connections, data copies, scaling MapReduce and the DW, decoupling storage and compute, and trust in logging data quality.

BigData challenges: Too many Hadoop clusters (every team wants to own one), operational difficulty due to inefficient practices in writing data processing jobs, idle clusters, and so on.

Trust in logging data quality: Because of the rapid growth of the business and the agility of our app/web UI development, the apps and web pages produced many kinds of logs without any clear specification in the legacy instrumentation platform. Unnoticed changes and undetected bugs frequently harmed consumers’ metrics and threatened trust in data quality. The new instrumentation framework had to be designed from scratch to earn the trust of producers and consumers, with reliable delivery and concrete metadata.

Phase IV — BigData as a service, EDW on cloud storage, and a brand-new instrumentation framework

By 2019, we had evolved our understanding of how to scale the Data Platform to support multiple business use cases and scenarios.

BigData Platform: We have used several types of large Hadoop clusters. However, we had to revise our cluster management policies and deployment strategy entirely to support the explosive growth of the business. We made improvements in a variety of areas, including baked machine images, optimizations for the characteristics of computing resources, flexible scaling policies, and cluster abstraction layers, to provide a stable and scalable platform for our customers.

● Cluster Lifecycle: We manage Hadoop clusters with different life cycles based on users’ workloads. The life cycle of a cluster is tightly controlled with regard to cost efficiency and business workloads. Different types of clusters have access to a shared Hive Metastore and cloud storage so that all users can work with the same Hive tables consistently.

● Scaling Policy: Most cloud platforms provide auto-scaling based on system metrics. We used the auto-scaling provided by our cloud service as well, but it couldn’t meet practical customer needs. We then implemented a schedule-based scaling feature that scales up in advance, based on an analysis of traffic concentration times. The mixed use of schedule-based scaling and auto-scaling improved our users’ platform experience significantly (see the sketch after this list).

● Baked Machine Image: We need to install various software on each computing instance of a Hadoop cluster, including the OS, Hadoop and its ecosystem, and monitoring and security agents. We created virtual machine images with the required software and various plug-ins. We manage the images using Packer and prepare several different virtual machine images for customers’ workloads. After introducing the baked machine image, cluster installation time was reduced by more than 60%.
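To make the mixed scaling policy concrete, here is a minimal sketch in Python. The traffic schedule, node counts, and utilization thresholds are hypothetical; the real feature analyzes historical traffic and drives the cloud provider’s scaling APIs.

```python
from datetime import datetime

# Hypothetical schedule derived from analyzing traffic concentration times (KST).
# Maps a (start_hour, end_hour) window to the node floor to pre-provision.
SCHEDULED_MINIMUMS = [
    ((8, 12), 80),   # morning business-hour peak
    ((12, 18), 50),  # afternoon
    ((18, 8), 20),   # night and early morning baseline (wraps past midnight)
]
MAX_NODES = 120

def scheduled_minimum(now: datetime) -> int:
    """Return the pre-provisioned node floor for the current hour."""
    hour = now.hour
    for (start, end), nodes in SCHEDULED_MINIMUMS:
        in_window = start <= hour < end if start < end else (hour >= start or hour < end)
        if in_window:
            return nodes
    return 20

def desired_nodes(now: datetime, current_nodes: int, avg_utilization: float) -> int:
    """Combine the schedule-based floor with metric-based auto-scaling."""
    target = current_nodes
    if avg_utilization > 0.80:      # reactive scale-out under load
        target = int(current_nodes * 1.5)
    elif avg_utilization < 0.30:    # scale-in when mostly idle
        target = int(current_nodes * 0.7)
    # Never drop below the schedule floor; cap at the cluster maximum.
    return max(scheduled_minimum(now), min(target, MAX_NODES))
```

Schedule-based scaling handles the predictable morning surge before it happens, while the metric-based rule still reacts to unplanned load.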

Web Logging Platform:

Since the early years of Coupang, the instrumentation platform for gathering customers’ interaction data had been built on an external solution, with many defects and missing features. Many domain teams had to use yet another external service to calculate and visualize their own metrics. A whole new platform was designed and implemented to solve the problem from the ground up. After a long, hard period of migration and data verification, the new platform completely replaced the old one.

Here is a simple journey of the logs. Before anything else, producers register schemas with the Metadata Service. Usually, they generate statically typed code from the schemas (sketched below) and embed the generated code into the app or web pages to prevent human errors. Once released, clients send real logs to the Collection Pipeline. In the pipeline, collection servers receive all the logs and produce messages to a message queue. Then, data loaders consume the messages and save them into cloud storage. As the first consumer of the data, Session Batch Jobs create session data with additional attributions for all batch consumers.
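As a rough illustration of the schema-first flow, the sketch below shows what generated, statically typed logging code might look like in Python. The event name, fields, and transport stub are hypothetical; the real platform generates code from schemas registered in the Metadata Service.

```python
import json
import time
from dataclasses import dataclass, asdict

def send_to_collection_server(message: str) -> None:
    # Stub: in the real pipeline this posts to the collection servers.
    print(message)

# Hypothetical class generated from a registered schema. Static typing keeps
# producers from sending malformed or misnamed fields.
@dataclass(frozen=True)
class ProductImpressionLog:
    SCHEMA_ID = "product_impression.v3"  # schema version pinned at generation time

    member_id: str
    product_id: int
    page: str
    position: int

    def emit(self) -> None:
        """Serialize and hand the log to the Collection Pipeline."""
        payload = {
            "schema_id": self.SCHEMA_ID,
            "timestamp_ms": int(time.time() * 1000),
            "attributes": asdict(self),
        }
        send_to_collection_server(json.dumps(payload))

# Producer code in the app simply fills in the typed fields:
ProductImpressionLog(member_id="m-123", product_id=987, page="search", position=4).emit()
```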

● Collection Pipeline (collection servers, message queue & data loaders): Using a managed MQ service from Coupang’s Platform Service team, the new platform provides a real-time data stream for real-time consumers and near-real-time data tables for batch consumers, without loss, duplication, or corruption. Consumers can write their own loaders, with their own SLAs and ETL logic, using the queues that WLP provides.

● Metadata Service: In the new platform, every log must have a registered schema with an owner and registered consumers, who review and subscribe to schema change alerts. This single source of truth on log data structure is the foundation of the other services, of producers’ UI code, and of consumers’ queries.

● Validation Service: Without interfering with data delivery, the Validation Service checks every single log in the pipeline against its schema in the Metadata Service (a sketch follows this list). Results are saved and reported to the producers and consumers of the logs periodically, while alerts are triggered in real time.

● Test & Monitoring Service: The new platform provides a service with a web-based UI to track and validate the logs from any given user or device in real time, both for QA testing and in production. This service also provides scenario-based validation, checking semantics as well as syntax during QA testing.
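For intuition, here is a minimal sketch of schema validation in Python. The in-memory registry and field rules are hypothetical stand-ins for the Metadata Service; the real Validation Service runs alongside the pipeline without blocking delivery.

```python
from typing import Any

# Hypothetical stand-in for the Metadata Service: schema_id -> field names and types.
SCHEMA_REGISTRY: dict[str, dict[str, type]] = {
    "product_impression.v3": {
        "member_id": str,
        "product_id": int,
        "page": str,
        "position": int,
    },
}

def validate_log(log: dict[str, Any]) -> list[str]:
    """Check one log against its registered schema; return a list of violations."""
    schema = SCHEMA_REGISTRY.get(log.get("schema_id", ""))
    if schema is None:
        return [f"unregistered schema: {log.get('schema_id')!r}"]
    errors = []
    attributes = log.get("attributes", {})
    for field, expected_type in schema.items():
        if field not in attributes:
            errors.append(f"missing field: {field}")
        elif not isinstance(attributes[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors

# Violations are reported to producers and consumers and can trigger real-time
# alerts, but the log itself continues through the pipeline either way.
bad_log = {"schema_id": "product_impression.v3", "attributes": {"member_id": 42}}
print(validate_log(bad_log))
```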

Enterprise Data Warehouse: The Data Platform’s master data warehouse environment is ORC files, accessible through Hive/Hue and Presto/Zeppelin. While MPP-based DW environments (Sandbox) continue to be offered to EDW users, they are only a subset of the EDW. Their primary function is to provide an environment where users can build pre-production sandbox tables to manage their domain’s business. The environment can also serve short-term reporting if user sandbox tables are needed in reports. For long-term reporting or sharing, users are encouraged to move user-owned and user-managed tables to file-based tables in cloud storage.

Sub Components

Some other important features of Coupang’s data platform include:

Data Quality

The data team has built a framework that compares row counts, and entire rows using a hash of each row, to ensure data is accurate. As part of technical testing, we also run DQ checks such as primary key and null-value checks. The framework lets each developer plug in business-related SQL statements as well, extending data accuracy coverage to the real world. We are also leveraging open-source frameworks for constraint- and threshold-based data checks, especially for big data tables.
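The row-hash comparison can be sketched as follows. The helper and the in-memory rows are hypothetical; in practice the framework would push hashing into SQL so entire rows never leave the warehouse.

```python
import hashlib

def row_hash(row: tuple) -> str:
    """Hash an entire row so source and target copies can be compared cheaply."""
    canonical = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def compare_tables(source_rows, target_rows, key_index: int = 0) -> dict:
    """Compare two result sets by row count and by per-key row hashes."""
    source = {str(r[key_index]): row_hash(r) for r in source_rows}
    target = {str(r[key_index]): row_hash(r) for r in target_rows}
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_in_target": sorted(set(source) - set(target)),
        "mismatched_rows": sorted(
            k for k in set(source) & set(target) if source[k] != target[k]
        ),
    }

# In-memory rows standing in for query results from source and target systems:
src = [(1, "shoes", 29900), (2, "socks", 4900)]
tgt = [(1, "shoes", 29900), (2, "socks", 5900)]  # price drifted during the copy
print(compare_tables(src, tgt))  # reports key '2' as a mismatched row
```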

Data Abnormality Notifier

In this fast-moving industry, we must act fast. The Data Abnormality Notifier gives us the earliest possible detection, as soon as the data is written. For example, imagine that last month a new Android version was released with a logging bug that caused data to drop. In the past, it took us three days to notice, because we had to wait for users to start installing the app; that is how long it took for the data loss to become visible. With the Data Notifier, this is noticed within two hours of the app release.
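A minimal sketch of the idea, comparing fresh log volumes against a recent baseline as soon as data lands; the thresholds and the seven-day baseline are hypothetical.

```python
from statistics import mean

def check_abnormality(baseline_counts: list[int], latest_count: int,
                      drop_threshold: float = 0.5) -> str | None:
    """Alert if the newest hourly log volume drops sharply below the baseline.

    baseline_counts: volumes for the same hour over recent days.
    latest_count:    volume for the hour that was just written.
    """
    baseline = mean(baseline_counts)
    if baseline > 0 and latest_count < baseline * drop_threshold:
        return (f"log volume dropped to {latest_count} "
                f"(baseline {baseline:.0f}); possible logging bug")
    return None

# Same hour across the last 7 days versus the hour just written:
alert = check_abnormality([10200, 9800, 10500, 10100, 9900, 10300, 10000], 4200)
if alert:
    print(alert)  # fires within hours of a bad release, not days
```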

SLA (Service Level Agreement)

On the new data platform, we provide daily email notifications that let users know whether all mart tables are ready by 9 AM KST. In addition, to give Data Platform users more transparency into data SLAs, we are building an online report that shows easy-to-read SLA information.

Data Discovery Tool

A platform for submitting tags and descriptions for tables and columns in the Data Platform, which other users can then see and search. This creates an open platform that grows organically. Data Discovery enabled self-service data discovery for all data users in Coupang, made life easier for hundreds of data explorers, and improved their productivity.

EDW Management System (EMS)

A framework to create and manage data pipelines, supporting automated data acquisition and automated Airflow DAG generation from metadata. The framework also supports monitoring, backfill, and downstream dependency features for data engineers, saving the data engineering teams many hours. EMS also provides an early SLA detection feature to help on-call engineers.
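To show what metadata-driven DAG generation can look like, here is a sketch assuming Airflow 2.x; the pipeline metadata, table names, and shell commands are hypothetical, not EMS’s actual format.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical metadata records like those a framework might keep per mart table.
PIPELINES = [
    {"table": "orders_daily", "depends_on": [], "load_cmd": "load.sh orders_daily"},
    {"table": "sales_mart", "depends_on": ["orders_daily"], "load_cmd": "load.sh sales_mart"},
]

def build_dag(metadata: list[dict]) -> DAG:
    """Generate one Airflow DAG from pipeline metadata, wiring dependencies."""
    dag = DAG(
        dag_id="ems_generated_mart_load",
        start_date=datetime(2020, 1, 1),
        schedule_interval="0 1 * * *",  # daily, well ahead of the 9 AM KST SLA
        default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
        catchup=False,
    )
    tasks = {}
    for spec in metadata:
        tasks[spec["table"]] = BashOperator(
            task_id=f"load_{spec['table']}", bash_command=spec["load_cmd"], dag=dag
        )
    for spec in metadata:  # second pass: connect upstream dependencies
        for upstream in spec["depends_on"]:
            tasks[upstream] >> tasks[spec["table"]]
    return dag

dag = build_dag(PIPELINES)  # Airflow picks up module-level DAG objects
```

Because the DAG is generated, adding a new mart table becomes a metadata change rather than hand-written pipeline code, which is where the productivity savings come from.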

Future Improvements

Hadoop Abstraction Layer

We will provide a Hadoop abstraction layer to simplify users’ job submission and Hadoop cluster management. From the end user’s perspective, it abstracts away the physical details of the various Hadoop resources and provides a simple job execution interface (Airflow Operator, Python, REST API) for submitting Hive and Spark jobs. It also allows users to manage their Hadoop clusters, monitors Hadoop cluster resources, and allocates the right amount of resources to user jobs.
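From a user’s point of view, job submission through such a layer might look like the following sketch; the endpoint, payload shape, and job states are hypothetical, using only the standard `requests` library.

```python
import time
import requests

# Hypothetical endpoint of the Hadoop abstraction layer's job service.
JOB_API = "https://hadoop-gateway.example.internal/api/v1/jobs"

def submit_spark_job(app_path: str, args: list[str]) -> str:
    """Submit a Spark job; the layer chooses the cluster and resources."""
    resp = requests.post(JOB_API, json={
        "type": "spark",
        "application": app_path,
        "args": args,
        # Note: no cluster name. Physical placement is the layer's concern.
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    """Poll until the job reaches a terminal state."""
    while True:
        state = requests.get(f"{JOB_API}/{job_id}", timeout=30).json()["state"]
        if state in ("SUCCEEDED", "FAILED", "KILLED"):
            return state
        time.sleep(poll_seconds)

# job_id = submit_spark_job("s3://jobs/session_batch.py", ["--date", "2020-01-01"])
# print(wait_for_job(job_id))
```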

Log Quality Monitoring

For many reasons, an app’s logging behavior can change even after release. For example, a feature may only start working later via A/B testing, a change in a domain API can cause side effects in clients, and views implemented with web technology may be redeployed at any time. To ensure the quality of log data, QA testing at the release phase is not enough. We will provide a full-featured log quality monitoring service, which runs all important log tests on real devices, fully automated, throughout the lifecycle of the released apps. Moreover, the quality monitoring system will also check the quality of the attributions produced by our log pipeline, not just those produced by the client side, to build even more confidence in the data.

Schema Discovery

The Metadata Service is the single source of truth for the schemas of all logs. A log schema is like a contract between producer and consumers, and that contract is what creates trust in log data. Therefore, all consumers should know every log’s schema so they can be notified of any changes, and they should also know about newly registered schemas not yet in production, so they can review them and prepare their downstream jobs. The logging platform will provide an automatic schema discovery tool. It will collect queries from consumers and analyze them to find all the schemas they need, then update their subscriptions without any user interaction.
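One way to discover the schemas a consumer depends on is to extract table references from their queries and map those back to schemas. The regex-based sketch below is a deliberate simplification with hypothetical table-to-schema mappings; a production tool would use a proper SQL parser.

```python
import re

# Hypothetical mapping from warehouse tables to the log schemas behind them.
TABLE_TO_SCHEMA = {
    "logs.product_impression": "product_impression.v3",
    "logs.search_click": "search_click.v2",
}

def referenced_tables(sql: str) -> set[str]:
    """Naively extract table names that follow FROM or JOIN keywords."""
    pattern = r"\b(?:from|join)\s+([a-z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

def discover_subscriptions(consumer_queries: list[str]) -> set[str]:
    """Map a consumer's queries to the log schemas they should subscribe to."""
    schemas = set()
    for sql in consumer_queries:
        for table in referenced_tables(sql):
            if table in TABLE_TO_SCHEMA:
                schemas.add(TABLE_TO_SCHEMA[table])
    return schemas

queries = [
    "SELECT page, count(*) FROM logs.product_impression GROUP BY page",
    "SELECT c.*, p.title FROM logs.search_click c JOIN dim.products p ON c.pid = p.id",
]
print(discover_subscriptions(queries))
# -> {'product_impression.v3', 'search_click.v2'}
```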

Authors:

Narendra Parihar, Matthew (Jae Hwa Jung), Joong (Joong Hoon Kim)

Appendix:

The Data Platform organization at Coupang consists of multiple areas that make this giant, complex platform work companywide: BigData Platform, Enterprise Data Warehouse, Web Logging Platform, and Technical Program Management.

Our vision for the Data Platform is to provide reliable, scalable, and worry-free tools to manage, process, and visualize data at scale. Those tools should be adopted by internal Customers voluntarily and with trust, empowering them to build large-scale data applications with ease and to make data-driven decisions, including time-sensitive ones.

We are hiring; you can find opportunities here.

This is the Coupang technology blog, where we share stories of innovation through Coupang’s technology and our developer culture.

Coupang Technology Blog Team

Written by

쿠팡 기술블로그 — Coupang Technology Blog