Fueling the Coupang search engine
How we designed the indexing platform fueling our search engine
By Winter Wang
Search opens the customer journey at Coupang, which is why the search experience is critical to our mission to ‘Wow the customer’. The goal of the Coupang search engine is simple: given a user search keyword, provide the best results. Unlike traditional search engines where relevancy is the determining factor, the Coupang search engine must also consider product ratings, customer reviews, prices, brands and more.
In this post, we will first provide a general overview of how the Coupang search engine works and then dive into details about our search indexing platform, a key component of our search engine.
Search engine architecture
First, let’s discuss the overall architecture of the Coupang search engine. Given a search request, the Coupang search engine takes the following steps:
- Query understanding. Determine the intent of the query and annotate it with attributes such as category or brand.
- Retrieval. Find candidates based on the search query and textual product information.
- Ranking. Apply a ranking function or an ML model to rank the candidates. The ranking function usually uses ranking signals to return a score for each candidate and should run in fast iterations.
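The three steps above can be sketched as a small pipeline. This is only an illustrative toy, not Coupang's implementation: the catalog, the brand list, the function names, and the scoring rule (rating plus a brand boost) are all assumptions made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Product:
    product_id: int
    title: str
    brand: str
    rating: float  # average customer rating, 0 to 5


# Hypothetical toy catalog and brand dictionary (not real data).
CATALOG = [
    Product(1, "wireless mouse", "acme", 4.5),
    Product(2, "wired mouse", "acme", 3.9),
    Product(3, "wireless keyboard", "zenith", 4.8),
]
KNOWN_BRANDS = {"acme", "zenith"}


def understand_query(query: str) -> dict:
    """Query understanding: split into terms and annotate a brand if present."""
    tokens = query.lower().split()
    brand = next((t for t in tokens if t in KNOWN_BRANDS), None)
    terms = [t for t in tokens if t != brand]
    return {"terms": terms, "brand": brand}


def retrieve(parsed: dict) -> list[Product]:
    """Retrieval: keep products whose title contains every query term."""
    return [
        p for p in CATALOG
        if all(term in set(p.title.split()) for term in parsed["terms"])
    ]


def rank(parsed: dict, candidates: list[Product]) -> list[Product]:
    """Ranking: score each candidate from simple signals, best first."""
    def score(p: Product) -> float:
        brand_boost = 1.0 if parsed["brand"] == p.brand else 0.0
        return p.rating + brand_boost

    return sorted(candidates, key=score, reverse=True)


def search(query: str) -> list[Product]:
    parsed = understand_query(query)
    return rank(parsed, retrieve(parsed))


print([p.product_id for p in search("acme mouse")])
```

Keeping the three stages as separate functions mirrors the separation described above: retrieval only decides what is eligible, while ranking is free to combine any number of signals.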
The ranking algorithm is instrumental in improving the search experience. The foundation of the ranking algorithm lies in reliable data sources and ranking signals. At Coupang, we have millions of products and petabytes of customer behavior data. How can we efficiently leverage the data at our fingertips to derive robust signals? The answer for us: the indexing platform.
The indexing platform is the foundation and factory that provides our engineers with everything they need to develop effective retrieval and ranking systems. It aims to be a scalable platform that collects data from various ground truth tables, denormalizes the data keyed by product-query, and builds the search engine index. In addition, it should enhance the productivity of our ranking engineers.
In the next sections, we will discuss how our indexing platform evolved to meet the above goals and our expanding business needs.
Stage 1: Indexing Platform 0.1
During its early stage, Coupang only had a small number of products. Most of the ground truth product data, such as price, brand, and category, was stored in relational databases. The first version of the indexing platform leveraged the SQL engine of relational databases to join multiple tables by product key and then merged them into a single denormalized table. Next, the indexing platform fed the data from the denormalized table to the online serving search cluster, which built the index on the fly. Then, the search engine used the index to provide retrieval candidates based on textual product information and a few simple ranking signals.
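As a rough illustration of this kind of denormalization, the sketch below joins a few tables by product key into one wide table. It uses an in-memory SQLite database purely for convenience; the table names, columns, and rows are hypothetical and do not reflect Coupang's actual schema or database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical ground truth tables, one per attribute, keyed by product_id.
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE price (product_id INTEGER, price INTEGER);
    CREATE TABLE brand (product_id INTEGER, brand TEXT);
    INSERT INTO product VALUES (1, 'wireless mouse'), (2, 'usb cable');
    INSERT INTO price VALUES (1, 19900), (2, 4900);
    INSERT INTO brand VALUES (1, 'acme');
""")

# Join every table on the product key and materialize one denormalized table.
# LEFT JOIN keeps products that are missing an attribute (here, a brand).
conn.execute("""
    CREATE TABLE product_denormalized AS
    SELECT p.product_id, p.title, pr.price, b.brand
    FROM product p
    LEFT JOIN price pr ON pr.product_id = p.product_id
    LEFT JOIN brand b ON b.product_id = p.product_id
""")

rows = conn.execute(
    "SELECT product_id, title, price, brand "
    "FROM product_denormalized ORDER BY product_id"
).fetchall()
for row in rows:
    print(row)
```

With three small tables this is trivial, but every additional attribute adds another join to the same statement, so the cost of the query grows with both the number of tables and their size.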
But as the number of products quickly grew, this simplistic indexing platform was unable to withstand the increase in data volume and many of its existing problems were amplified. Some of the difficulties we faced are summarized below.
- Overly simplistic. At this stage, the indexing platform could hardly be considered a platform. It was merely a connector between the ground truth datasets and the online serving search cluster, and it was not sophisticated enough to meet the needs of our search engine.
- Poor scalability. The indexing platform was not able to scale out to handle larger amounts of data. For example, when a replica of the online serving search cluster was built, it had to re-calculate the same index from scratch, a waste of resources that hurt scalability.
- Inefficiency and poor robustness. Many phases of the platform were slow and error prone. For instance, building the index on the online serving search cluster was extremely slow and expensive. The SQL engine of the relational database was slow and unstable when joining dozens of large tables. It was almost impossible for ranking engineers to add retrieval and ranking signals within a few days.
Stage 2: Indexing Platform 1.0
To enable growth at accelerated speed, we decided to rearchitect the indexing platform into a distributed service platform that could easily scale out.
First, we reorganized our data sources. The new platform was rebuilt on Hadoop, a widely used open source platform for distributed data processing. Since the relational database was a major bottleneck, all the ground truth data was replicated to Hive. In addition, the ranking signals were generated as Hive tables that could be merged into the platform. As a result, all the data we needed to build the search index now sat in our Hadoop platform.
Then, we designed ways to merge and store our data efficiently. The index merger was developed as a Spark job to merge all the product data, which was first stored in a Protobuf structure named ProductJoin and then eventually in HBase.
Finally, we developed the index builder, which was another Spark job that consumed data from ProductJoin to generate the search index. The output of the index builder was stored in distributed storage, which was directly accessed by the online serving cluster for service. In this system, the index could be used by all the replicas without having to be rebuilt.
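The merge-then-build flow can be sketched in a few lines. This is a single-process toy under stated assumptions: plain dictionaries stand in for the Protobuf ProductJoin records and for HBase, a simple inverted index stands in for the real search index, and `merge_products`, `build_index`, and `Replica` are hypothetical names. The actual platform runs these stages as Spark jobs.

```python
from collections import defaultdict


def merge_products(sources):
    """Index merger: fold per-source records into one record per product."""
    merged = defaultdict(dict)
    for source in sources:
        for product_id, fields in source.items():
            merged[product_id].update(fields)
    return dict(merged)


def build_index(product_join):
    """Index builder: map each title term to a sorted list of product ids."""
    postings = defaultdict(set)
    for product_id, record in product_join.items():
        for term in record.get("title", "").lower().split():
            postings[term].add(product_id)
    return {term: sorted(ids) for term, ids in postings.items()}


class Replica:
    """Serving replica that loads a prebuilt index instead of building its own."""

    def __init__(self, index):
        self.index = index

    def lookup(self, term):
        return self.index.get(term.lower(), [])


# Hypothetical source tables keyed by product id.
titles = {1: {"title": "Wireless Mouse"}, 2: {"title": "Wireless Keyboard"}}
prices = {1: {"price": 19900}, 2: {"price": 39900}}

product_join = merge_products([titles, prices])
index = build_index(product_join)               # built once, offline
replicas = [Replica(index) for _ in range(3)]   # every replica reuses it

print(replicas[0].lookup("wireless"))
```

The point of the split is visible in the last two lines: the index is computed once, and adding a replica costs a load rather than a rebuild.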
A similar pipeline was built for query data to better support query understanding.
Stage 3: Indexing Platform 2.0
Another major responsibility of an indexing platform is unleashing the ranking engineer’s productivity during signal development. Although our Indexing Platform 1.0 solved performance and scalability issues, it did little to help our engineers develop signals faster.
Previously, our ranking engineers had to spend a great amount of time finding data sources and scheduling workflows. To build a new signal, a ranking engineer needed to set up an entire pipeline, from parsing the raw log to building the signal, which required many hours of developing Hive queries and Spark jobs. Furthermore, we had around 70 such separate pipelines, which were almost impossible to maintain, not to mention a huge waste of resources.
To address these engineering inefficiencies, we built Indexing Platform 2.0 as an Indexing Bus that lets each ranking engineer focus on the logic of the signals without worrying about contaminating the data source, developing an entire data pipeline, or scheduling workflows.
First, raw log parsing is provided by the Session Log parser, a Spark job that parses and merges customer behavior data to create the Session Log dataset. Furthermore, the Ground Truth Merger, another Spark job, merges all product data so that our engineers can easily access raw data about our products. We used Spark jobs for implementation because they are easily scalable.
Another important change in our Indexing Platform 2.0 is that ranking signals are generated directly on the platform, instead of being merged from external Hive tables, as in Indexing Platform 1.0. Because both the ground truth product data and session logs are easily accessible, we developed ProductJoiner and QueryJoiner to derive signals based on these datasets straight on the indexing platform.
The beauty of the framework is that all the datasets, both product and query data, are used during the generation of either the product or query signals. This exchange of information leads to extremely powerful signals. For instance, price is an example of product data from ProductJoiner that can be used to build query rankings in QueryJoiner. Engineers can build a signal that boosts or demotes products based on the query and product price distributions. Something as simple as a product price is converted into a powerful ranking signal.
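To make the price example concrete, here is one way such a signal could look. It is a minimal sketch under our own assumptions, not the signal Coupang uses: for each query we summarize the prices of purchased products by mean and standard deviation, then score a product by how far its price sits from that typical price. The function names, the data, and the `1 / (1 + z)` formula are all hypothetical.

```python
import statistics


def query_price_profile(purchases):
    """Map each query to (mean, stdev) of the prices purchased after it."""
    prices_by_query = {}
    for query, price in purchases:
        prices_by_query.setdefault(query, []).append(price)
    return {
        query: (statistics.mean(prices), statistics.pstdev(prices))
        for query, prices in prices_by_query.items()
    }


def price_affinity(profile, query, price):
    """Score in (0, 1]: 1.0 at the typical price, lower as the price drifts away."""
    if query not in profile:
        return 1.0  # no purchase history for this query: stay neutral
    mean, stdev = profile[query]
    spread = max(stdev, 1.0)  # floor avoids dividing by zero
    z = abs(price - mean) / spread
    return 1.0 / (1.0 + z)


# Hypothetical (query, purchased price) pairs taken from session logs.
purchases = [("mouse", 10000), ("mouse", 20000), ("mouse", 30000)]
profile = query_price_profile(purchases)

for candidate_price in (20000, 25000, 100000):
    print(candidate_price, round(price_affinity(profile, "mouse", candidate_price), 3))
```

A ranking function could multiply or add this score to boost products priced near what customers typically buy for the query and demote outliers.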
Signal generation process
Let’s go through the steps the QueryJoiner takes to generate a signal:
- Load source data, which includes both product ground truth data and parsed session logs.
- Generate built-in aggregates of raw customer behavior data, such as impressions, clicks, and purchases, at the query level and product level.
- Run all the signal processors to generate signals.
- Store the generated signals in HBase.
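The four steps above can be sketched as follows. This is an illustrative toy, not the real QueryJoiner: a dictionary stands in for HBase, the data is hard-coded, the aggregation is keyed by (query, product) pair for brevity, and `SignalProcessor`, `ClickThroughRate`, and `ConversionRate` are hypothetical names.

```python
from collections import Counter

# Maps a raw event type to the name of the counter it increments.
EVENT_COUNTERS = {"impression": "impressions", "click": "clicks", "purchase": "purchases"}


class SignalProcessor:
    """Base class: a new signal only has to implement `process`."""
    name = "base"

    def process(self, stats, products):
        raise NotImplementedError


class ClickThroughRate(SignalProcessor):
    name = "ctr"

    def process(self, stats, products):
        return {k: s["clicks"] / s["impressions"]
                for k, s in stats.items() if s["impressions"] > 0}


class ConversionRate(SignalProcessor):
    name = "cvr"

    def process(self, stats, products):
        return {k: s["purchases"] / s["clicks"]
                for k, s in stats.items() if s["clicks"] > 0}


def load_source_data():
    """Step 1: load product ground truth and parsed session logs (toy data)."""
    products = {1: {"price": 19900}, 2: {"price": 4900}}
    session_log = (
        [{"query": "mouse", "product_id": 1, "event": "impression"}] * 4
        + [{"query": "mouse", "product_id": 1, "event": "click"}] * 2
        + [{"query": "mouse", "product_id": 1, "event": "purchase"}]
        + [{"query": "mouse", "product_id": 2, "event": "impression"}] * 2
    )
    return products, session_log


def aggregate(session_log):
    """Step 2: count impressions, clicks, and purchases per (query, product)."""
    stats = {}
    for event in session_log:
        key = (event["query"], event["product_id"])
        stats.setdefault(key, Counter())[EVENT_COUNTERS[event["event"]]] += 1
    return stats


def run(processors, store):
    products, session_log = load_source_data()
    stats = aggregate(session_log)
    for processor in processors:                                   # step 3
        for key, value in processor.process(stats, products).items():
            store.setdefault(key, {})[processor.name] = value      # step 4
    return store


store = run([ClickThroughRate(), ConversionRate()], store={})
print(store)
```

Because loading and aggregation are shared, adding a signal means adding one small class; each processor can also be unit tested by passing it a handcrafted `stats` dictionary.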
The Indexing Platform 2.0 consolidated log parsing and raw signal generation directly on the platform, making building a new signal as easy as implementing a class and adding some logic to process the data. For details on how our new platform improved efficiency, refer to Table 2.
Overall, the Indexing Platform 2.0 lets our ranking engineers build high-quality ranking signals within a few hours by eliminating operational issues and by providing a way to write unit tests and integration tests. All of these benefits come at a lower cost, as the platform uses our cluster resources efficiently.
This is certainly not the end of the story. In fact, it is only the beginning. We established a system that derives the relationship between product and query as the first step of our indexing platform. In the future, we want to build a platform that leverages similar relationships between customers, products, queries, brands, and more. We believe that a robust indexing platform can also support other areas of our business, such as recommendation and advertising. The ultimate goal of the indexing platform is to become the brain behind the Coupang business.
If you are interested in working in a fast-paced environment with engineers who are passionate about solving difficult problems such as improving a search engine’s indexing platform, you can find opportunities here.