Rebuilding the SSENSE E-Commerce Product Catalog Using Elasticsearch

An incremental migration of the highest throughput microservice at SSENSE to Elasticsearch with zero downtime

Avi Sharma
SSENSE-TECH
10 min read · Oct 23, 2020


Introduction

This SSENSE-TECH article will examine how the SSENSE ‘Discovery’ squad successfully merged two microservices powering our product catalog into one, and how we rebuilt that one service to be future-proof.

Understanding Why the Project Was Necessary

In e-commerce, building reliable and scalable microservices is critical to providing customers with a consistently optimal shopping experience and to reducing the complexity of shipping new features. From time to time, high-interest events on SSENSE, such as hype sneaker launches, would increase traffic to the website by 10–20x and test the limits of our frontend microservices.

In summer 2019, with these high traffic periods in mind, the SSENSE Discovery squad had a growing problem on our hands.

Time and time again, these anticipated spikes in traffic revealed several issues, and missed opportunities, in the two frontend microservices powering our product catalog, all of which affected our customer experience:

  1. Failure under load. Neither of the microservices could survive the volume of activity on ssense.com during high-traffic periods — leading to a domino effect of website downtime, missed transactions, and a poor customer shopping experience. Further, downtime in these services brought down the entire purchase funnel across all three sales channels at SSENSE — the website, the iOS app, and our proprietary retail technology.
  2. Duplication of effort in feature development and maintenance. Two services powering the product catalog and the discovery experience resulted in double the effort for net-new features and ongoing maintenance.
  3. High infrastructure costs. Each of the two services incurred significant infrastructure costs, and these costs only increased as the volume of traffic and the breadth of our catalog grew year over year.
  4. Unrealized opportunities. The two services differed in extensibility, so new feature ideas were often deemed infeasible because of the limitations of one service or the other. For example, personalization on the website was long blocked by the limitations of one microservice, while the iOS app was able to introduce a personalized experience by relying solely on the other.
  5. Delays in updates to the catalog. High-volume updates, such as rearranging products on the website (40K+ updates) or repricing (60K+ updates), could take up to 6 hours to process on high-traffic days. In addition to blocking any other updates to the catalog (inventory, new product uploads, etc.), this led to a substantial revenue impact.

The state of the frontend product catalog was a looming problem — the opportunity cost of continuing with the status quo would only grow with the SSENSE catalog and the overall business. Engineers were frustrated with spending time fixing never-ending bugs, product management was frustrated with the capabilities of the existing tech and the pace of feature development, and the business was frustrated by the tech issues that surfaced every time high levels of traffic hit the website.

The fact that something stops working at significantly increased scale is a sign that it was designed appropriately to the previous constraints rather than being over designed (source)

In 2017, these services had been built to solve the then-current problems of enabling search and product recommendations on the website. SSENSE had simply outgrown that context; tech built for the SSENSE of 2017 would inevitably fail the SSENSE of the 2020s if it did not evolve.

How could SSENSE sustain its hypergrowth if the core website technology was on its last legs?

The answer was simple — yet daunting — we needed to merge the two microservices powering the product catalog into one and rebuild that one service to be future-proof.

Done right — this would be a high investment project yielding high ROI: enabling emerging opportunities, paying off years of accumulated technical debt, and saving on labor and infrastructure costs.

Done wrong — this would be a long technical project, with unsalvageable results.

Challenge Accepted.
(The Office. Deedle-Dee Productions and Universal Media Studios, 2005.)

Defining a North Star for the merger project

To drive focus on what was undoubtedly going to be a long, challenging project, the team came up with high-level design tenets that would shape our design choices and execution strategy.

The future product catalog service would…

  1. have a rich, future-proof data model
  2. be extensible to new features
  3. be performant and reliable
  4. be resilient against voluminous customer traffic and product updates
  5. be built using safety-first product development
  6. eliminate duplication of development work and maintenance
  7. have perfect inventory, stock, and price consistency

With these design tenets in mind, the team defined an 8-step plan to execute on the merger project:

  1. Select a database technology to power the “New Catalog”
  2. Select a migration strategy, to move from two “Old Catalogs” to one “New Catalog”
  3. Drive clarity on the architecture of the “New Catalog”
  4. Construct the update (write) pipeline to the “New Catalog”
  5. Build into the “New Catalog” the capability to serve all of the features currently served by the “Old Catalogs”, and ensure that these features were functionally covered by automated testing
  6. Ensure data parity between the “Old Catalogs” and the “New Catalog”
  7. Progressively switch website traffic, feature by feature, from the “Old Catalogs” to the “New Catalog”
  8. Deprecate the “Old Catalogs”

The R&D phase: building an implementation strategy

Step 1: Technology Selection

Inspired by Project Spryker, the team took a product-centric approach to the “New Catalog”, which pointed towards a NoSQL, document-oriented database to house the product catalog. The “New Catalog” would be a collection of products, with each product carrying all of its relevant data within a single document.

Decision 1: The new product catalog would be powered by Elasticsearch.
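
For a concrete, though purely illustrative, picture of what a product-centric document could look like in the “New Catalog”, here is a minimal sketch. The field names and values are hypothetical and are not SSENSE’s actual schema.

```typescript
// Hypothetical shape of a product document in the "New Catalog".
// Field names are illustrative only; the real SSENSE schema is not public.
interface CatalogProduct {
  productId: string;
  name: string;
  brand: string;                       // designer, used for designer listing pages
  category: string[];                  // category hierarchy, e.g. ["Shoes", "Sneakers"]
  gender: 'womenswear' | 'menswear';
  description: string;
  price: { amount: number; currency: string };
  inventory: { sku: string; size: string; quantity: number }[];
  sortRank: number;                    // ordering on product listing pages
  isListed: boolean;                   // listed vs. unlisted/historical catalog
}

const example: CatalogProduct = {
  productId: '123456',
  name: 'Leather High-Top Sneakers',
  brand: 'Example Designer',
  category: ['Shoes', 'Sneakers'],
  gender: 'menswear',
  description: 'High-top sneakers in black leather.',
  price: { amount: 495, currency: 'CAD' },
  inventory: [{ sku: '123456-41', size: '41', quantity: 3 }],
  sortRank: 42,
  isListed: true,
};
```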

Step 2: Migration Strategy

The “Old Catalogs” were split up as:

  • ms-product powered the womenswear/menswear, designer, and category product listing pages on the website and iOS app, in addition to the wishlist, cart, and checkout.
  • ms-search powered the search experience on the website and iOS app, product recommendations, and the personalized homepage on the iOS app.

There were 3 viable high-level options for the migration strategy:

  1. Build a New City: Deprecate the 2 “Old Catalogs” (ms-search, ms-product) and build their functionalities into a brand new service to power the “New Catalog”
  2. Upgrade the Core Citadel: Deprecate one of the 2 “Old Catalogs” ms-search, and transform ms-product into the “New Catalog”
  3. Fortify the Crumbling Castle: Deprecate one of the 2 “Old Catalogs” ms-product, and transform ms-search into the “New Catalog”

We then analyzed the code quality and complexity of the 2 “Old Catalogs”, and came to the conclusion that the ms-product code was well written and extensible enough to absorb the features of ms-search, while the same could not be said for ms-search. Creating a brand new service was indeed a viable option, but it would not benefit from the well-written ms-product codebase and the safety of continuing to maintain it.

Final Decision: We chose option #2 — enhance ms-product and upgrade the core citadel.

To achieve this, we would need to:

  • Enhance ms-product to be able to power search, recommendations, and personalization both on the website and the iOS app
  • Build an Elasticsearch database for ms-product

Step 3: Architectural Design

We chose to divide the architecture of the “New Catalog” into:

  • the Write pipeline, which facilitates the flow of updates into the catalog
  • the Read API endpoints, which enable consumers to request information about product(s)

Step 4: Constructing and Optimizing the Write Pipeline

The write pipeline to the “New Catalog” receives 1.5 to 3 million updates per week, ranging from targeted changes to inventory, pricing, and sort ranking, to generic product updates that require verification against a back-office source of truth (SoT) system.

The catalog also often receives a large burst of updates (50K–100K) all at once.

Given this context, it was important for the “New Catalog” to be able to process updates quickly and reliably.

We opted to proceed with an AWS SQS — Worker architecture for the write pipeline. Essentially, any and all updates to the product catalog are inserted into a queue, where they pile up first-in-first-out (think of the lineup outside a store), until the worker (think of a store employee letting customers in at the door) pulls updates from the front of the queue and applies them to Elasticsearch (the product catalog database).
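
A minimal sketch of that queue-and-worker loop is shown below, assuming the AWS SDK v3 SQS client and the official Elasticsearch JavaScript client (v7-style API). The queue URL, index name, and message shape are placeholders, not SSENSE’s actual configuration.

```typescript
// Minimal sketch of the SQS -> worker -> Elasticsearch flow described above.
// Queue URL, index name, and message shape are assumptions for illustration.
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';
import { Client } from '@elastic/elasticsearch';

const sqs = new SQSClient({ region: 'us-east-1' });
const es = new Client({ node: 'http://localhost:9200' }); // v7-style client assumed
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/catalog-updates'; // hypothetical

async function pollOnce(): Promise<void> {
  // Long-poll the queue for up to 10 messages at a time
  const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20,
  }));

  for (const message of Messages) {
    if (!message.Body || !message.ReceiptHandle) continue;
    const update = JSON.parse(message.Body);

    // Apply the update to the product catalog index
    await es.index({
      index: 'products',
      id: update.productId,
      body: update, // v8 clients use `document` instead of `body`
    });

    // Remove the message from the queue only after a successful write
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: message.ReceiptHandle,
    }));
  }
}
```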

We also identified an opportunity to further optimize the write pipeline. Generally, updates to the product catalog can be classified into 2 types:

  • Full messages (message = update), which contain enough information to directly update the relevant product(s) in the catalog
  • Skinny messages, which require a call to the SoT system to fetch product data, which is then used to create or update the relevant product(s) in the catalog

The key difference between the two is that skinny messages depend on the SoT system to provide the relevant information, which incurs additional latency. By binning messages by type, we could process each type in parallel and use bulk requests to optimize interactions with the back-office SoT system.
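
The sketch below illustrates that binning idea. The message shape and the fetchProductsFromSoT and bulkIndex helpers are hypothetical placeholders, not SSENSE’s actual code.

```typescript
// Sketch of binning catalog updates by type and bulk-resolving skinny messages.
// The message shape and both helpers are hypothetical placeholders.
interface CatalogMessage {
  productId: string;
  type: 'full' | 'skinny';
  payload?: Record<string, unknown>; // present on full messages only
}

// Placeholder for one batched call to the back-office SoT (instead of one call per product)
async function fetchProductsFromSoT(ids: string[]): Promise<Record<string, unknown>[]> {
  return ids.map((id) => ({ productId: id }));
}

// Placeholder for a bulk write into the Elasticsearch product index
async function bulkIndex(docs: Record<string, unknown>[]): Promise<void> {
  console.log(`indexing ${docs.length} documents`);
}

async function processBatch(messages: CatalogMessage[]): Promise<void> {
  // Bin messages by type so each branch can be processed in parallel
  const full = messages.filter((m) => m.type === 'full');
  const skinny = messages.filter((m) => m.type === 'skinny');

  await Promise.all([
    // Full messages already carry everything needed to update the catalog
    bulkIndex(full.map((m) => ({ productId: m.productId, ...m.payload }))),
    // Skinny messages are resolved with a single bulk call to the SoT before indexing
    skinny.length > 0
      ? fetchProductsFromSoT(skinny.map((m) => m.productId)).then(bulkIndex)
      : Promise.resolve(),
  ]);
}
```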

Visualizing the Full Message flow:

Processing Full Updates Directly into the “New Catalog” database

Visualizing the Skinny Message flow:

Skinny Messages Don’t Have Enough Information to be Self-Sufficient Updates
Skinny Messages Depend on the Back-Office Source of Truth (SoT) for Comprehensive Information
Full and Skinny Messages are Processed in Separate, Parallelized Updates

Step 5: Enabling Information Retrieval from the Product Catalog

Frontend consumers, such as the website and iOS app, request information from the product catalog.

To support information retrieval from the “New Catalog”, we progressively implemented API endpoints enabling a number of retrieval patterns that power different features across consumer channels (a sketch of one such query follows the list):

  1. Retrieving data for one product
  2. Retrieving data for many products
  3. Custom search query for many products (ex: cart, wishlist)
  4. Aggregated descriptor lists (ex: lists of categories, designers, sizes) for a refined group of many products
  5. Product recommendations (One to Many), based on matching brand or category
  6. Autocomplete suggestions on the search bar
  7. Personalized recommendations on the iOS app homepage
  8. Searching for products based on text input, and returning relevant products
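
As referenced above, here is a sketch of what one such retrieval could look like, combining a text search (pattern 8) with an aggregated designer list (pattern 4). It assumes the hypothetical field names from the document sketch earlier and a v7-style Elasticsearch JavaScript client; it is not the actual ms-product query.

```typescript
// Sketch of retrieval patterns 4 and 8: text search plus an aggregated designer list.
// Index and field names follow the hypothetical document shape shown earlier.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // v7-style client assumed

async function searchProducts(text: string, gender: string) {
  const response = await es.search({
    index: 'products',
    body: {
      size: 60, // one listing page of products
      query: {
        bool: {
          must: [{ multi_match: { query: text, fields: ['name', 'brand', 'description'] } }],
          filter: [{ term: { gender } }, { term: { isListed: true } }],
        },
      },
      aggs: {
        // Descriptor list: designers present in the refined result set
        designers: { terms: { field: 'brand.keyword', size: 100 } },
      },
    },
  });
  return response.body; // v8 clients return the body directly
}
```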

With the write pipeline constructed and information retrieval enabled, the data flow for the “New Catalog” looked like this:

Step 6: Ensure data parity between the “Old Catalogs” and the “New Catalog”

Once the write pipeline was up and running, the 2 “Old Catalogs” and the “New Catalog” all contained the entire history of the SSENSE catalog, from inception to the currently listed products.

Fun Fact: There have been roughly 450,000 listed products over the course of SSENSE history

The next step was to programmatically measure data parity between the “Old Catalogs” and the “New Catalog”, and verify that the databases matched both for the listed catalog (important) and for the unlisted/historical catalog (less important).
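
A parity check of this kind can be sketched as below. The two fetch helpers are hypothetical stand-ins for reads against each store, and hashing the fields that matter (price, stock, listing status) is an assumption for illustration, not SSENSE’s actual verification tooling.

```typescript
// Sketch of a programmatic parity check between the "Old Catalogs" and the "New Catalog".
// Both fetch helpers are hypothetical placeholders for reads against each store;
// each returns productId -> hash of the fields that matter (price, stock, listing status).
async function fetchListedFromOldCatalog(): Promise<Map<string, string>> {
  return new Map();
}
async function fetchListedFromNewCatalog(): Promise<Map<string, string>> {
  return new Map();
}

async function checkParity(): Promise<void> {
  const [oldCatalog, newCatalog] = await Promise.all([
    fetchListedFromOldCatalog(),
    fetchListedFromNewCatalog(),
  ]);

  const missing: string[] = [];
  const mismatched: string[] = [];
  for (const [productId, oldHash] of oldCatalog) {
    const newHash = newCatalog.get(productId);
    if (newHash === undefined) missing.push(productId);
    else if (newHash !== oldHash) mismatched.push(productId);
  }
  console.log(`missing: ${missing.length}, mismatched: ${mismatched.length}`);
}
```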

Step 7: Progressively switch website traffic, feature by feature, from the “Old Catalogs” to the “New Catalog”

Following the design principle of safety-first product development (Airbnb — Building Services at Scale) was of paramount importance at this step. We needed to verify that:

  1. [Reliability] The “New Catalog” would be able to handle the volume and diversity of real world information retrieval requests (request traffic)
  2. [Performance] The “New Catalog” would be able to respond to different types of requests with an equal or lower latency
  3. [Safety] The “New Catalog” would be able to add new features or modify existing features, without downtime
  4. [Safety] Request traffic could be switched over to the “New Catalog” with zero downtime

There were a few different deployment strategies available to us:

Source: Six Strategies for Application Deployment (TheNewStack.io)

We chose to pursue the Shadow Traffic option as our deployment strategy, so that we could safely verify the above requirements and then deploy the “New Catalog” to serve real users.

To accomplish the Shadow Traffic setup, we launched the “New Catalog” as a ‘ghost service’ in Production — it would receive a copy of every update and information retrieval request that the “Old Catalogs” received. This allowed both the Full message and Skinny message flows in the “New Catalog” to be tested and benchmarked against real traffic.
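
One simple way to picture the read side of that shadow setup: the user-facing response still comes from the “Old Catalog”, while a copy of the request is fired at the ghost service and never awaited on the hot path. The sketch below uses hypothetical internal service URLs and is not SSENSE’s actual routing layer.

```typescript
// Sketch of shadowing read traffic: the old catalog serves the response,
// while a copy of the request goes fire-and-forget to the "New Catalog".
// Service URLs are hypothetical placeholders.
const OLD_CATALOG_URL = 'http://ms-product.internal';
const NEW_CATALOG_URL = 'http://ms-product-new.internal';

async function handleRequest(path: string): Promise<unknown> {
  // Mirror the request to the ghost service; never await it on the hot path
  fetch(`${NEW_CATALOG_URL}${path}`).catch(() => {
    /* shadow failures are monitored, never surfaced to the user */
  });

  // The user-facing response still comes from the "Old Catalog"
  const response = await fetch(`${OLD_CATALOG_URL}${path}`);
  return response.json();
}
```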

The shadow deployment enabled parallelized monitoring on identical update & request traffic patterns

Next up was a period of monitoring, where we tracked metrics, benchmarked performance, and verified the above 4 requirements.

The Write pipeline into the “New Catalog” was 10x faster than the “Old Catalogs”

Processing updates into the product catalog more quickly enabled it to be more “real time”

Information Retrieval from the “New Catalog” was 50% faster than the “Old Catalogs”

The “New Catalog” seemed to be reliable, performant, and safely deployable — all of this had been verified while keeping it in the shadow of the “Old Catalogs”.

It was go time — after 6 months of work, we were ready to go live with the “New Catalog”.

The SSENSE Discovery squad in pre-COVID times showcasing some smiles and frowns

We switched each of the 8 features (listed above under Step 5: Information Retrieval) over to the “New Catalog”. No errors, no negative impact on page load time, no more data discrepancies.

Since then, we’ve used our “New Catalog” to enable a number of features, most of which will see the light of day later this year — stay tuned! Of note, we enabled Continuous Experimentation in the sorting mechanism on product listing pages, allowing our Data Science team the freedom to introduce multiple sorting algorithms across our e-commerce channels.

Editorial reviews by Deanna Chow, Liela Touré, & Gregory Belhumeur.

Want to work with us? Click here to see all open positions at SSENSE!
