Scaling Product Information in a Distributed E-Commerce Platform

Alex Blundell
Published in THG Tech Blog · 6 min read · Oct 3, 2019

At the inception of our e-commerce platform, the system comprised two main components: the frontend application that served pages to customers and received orders, and the backend application that processed orders. Each depended on its own relational database instance, which was used directly to serve information. This worked well when our customer and order counts were modest; however, substantial growth over the last 15 years meant we quickly outgrew this architecture. Over the years various improvements were made to the platform, including:

  • Aggressive caching of shared data
  • Sharding of Web Servers
  • Resource increase of various components (Web Servers, Databases, etc.)

These two monolithic applications also couldn't accommodate the hundreds of developers working on the platform today. In this post I'll describe how we moved away from this initial architecture, with particular emphasis on the product domain.

Microservices

Due to the growth of THG, we've invested a lot of time in breaking down our e-commerce platform into smaller components. We've gone from two main components to hundreds, with each set maintained by a focused team of engineers improving them and adding new features. This is great for engineering at scale, but by itself it isn't necessarily the most efficient or reliable way to serve customers on demand.

As an example of how these microservices are split, let’s take a look at a product page.

A Product Page, broken down into sections that could be powered by individual microservices.

In our example, sections have been highlighted that show data stored within various different microservices we have. There are basic details about the product, such as its title or flavour, that are retrieved from the Product Catalogue Service. More editorial content such as the Product Overview or Details come from the Product Content Service. The list goes on — Pricing Service, Reviews Service, Stock Service (for availability), etc.
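As a sketch of the fan-out this implies, a page render under this approach has to call every section service and merge the results. The service names and fields below are illustrative stand-ins, not our actual APIs:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real microservice calls; in production each
# of these would be an HTTP/RPC request to the corresponding service.
def fetch_catalogue(product_id):
    return {"title": "Whey Protein", "flavour": "Vanilla"}

def fetch_content(product_id):
    return {"overview": "Editorial overview...", "details": "Details..."}

def fetch_pricing(product_id):
    return {"price": 19.99, "currency": "GBP"}

def fetch_reviews(product_id):
    return {"average_rating": 4.5, "count": 1234}

def fetch_stock(product_id):
    return {"in_stock": True}

SECTION_FETCHERS = {
    "catalogue": fetch_catalogue,
    "content": fetch_content,
    "pricing": fetch_pricing,
    "reviews": fetch_reviews,
    "stock": fetch_stock,
}

def assemble_product_page(product_id):
    """Fan out to every section service in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SECTION_FETCHERS)) as pool:
        futures = {name: pool.submit(fetch, product_id)
                   for name, fetch in SECTION_FETCHERS.items()}
        return {name: future.result() for name, future in futures.items()}

page = assemble_product_page("12345")
```

Even with the calls made in parallel, the page is only as fast as the slowest section, and it fails if any section fails, which is exactly the problem described below.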

You could argue that splitting the Product model into so many parts causes more headaches than it solves; however, the complexity of each category of data warrants this level of separation. If we take the problem of product reviews, for example, there are various actors and processes involved in storing this information:

  • Review Submission (by customers)
  • Automatic Review Moderation (another microservice)
  • Manual Review Moderation (by our moderation team)

Other categories of product information have similar complexities that aren't immediately obvious when you look at the problem from the perspective of a user on our websites. Representing them as microservices rather than one large monolithic application lets each domain store its data in a database technology that suits its data model.

The Problem

Having so many microservices power a single page brings a few challenges though.

Downtime

A bad release: we've all caused (at least) one, and we dread it when it happens. In an architecture where your microservices are depended upon so heavily, one bad release of a critical-path microservice can take down your entire platform. When multiple microservices power a single page, the risk increases further.

Even more critical is when a main component shared between your microservices has an issue, for example a shared SQL Server that stores all your information. The high-availability strategy of traditional relational databases such as Microsoft SQL Server is a master-slave pair, where promoting a slave node to master can take on the order of minutes before it is ready to serve requests. Whilst we do employ some strategies to reduce the impact of this level of downtime, such as read-only replicas and Always On availability groups, these aren't perfect solutions. That isn't great in an environment where you depend on this kind of database to serve requests with 100% availability.

Slow Loading Speeds

Your microservices might be really fast in the average case, but it's easy to forget the high percentiles (p95, p99, etc.). These can easily reach 500ms–1s even for a basic service that normally responds in tens of milliseconds. When you need to make multiple requests for a single page, these variances in response time accumulate and can result in a catastrophic overall response time in the worst case. It's well understood that high response times in e-commerce lead to a high bounce rate, so this is something we want to avoid.
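The compounding effect can be seen with a little arithmetic. Assuming each of n services independently responds under some latency threshold 99% of the time (its p99), the probability that every call needed for a page stays under that threshold is 0.99^n:

```python
def p_all_fast(n_services, p_fast=0.99):
    """Probability that all n independent calls beat the latency threshold."""
    return p_fast ** n_services

# With 10 services, roughly 1 page load in 10 is slowed by at least
# one tail-latency hit; with 20 services it is nearly 1 in 5.
for n in (1, 5, 10, 20):
    print(n, round(p_all_fast(n), 3))
```

The independence assumption is a simplification (real services share infrastructure), but it illustrates why tail latency dominates fan-out architectures.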

Testing

If we needed to test the product page while it depended on many microservices, it would be difficult to mock each dependency so that the data represented a valid product. The cases we test for would also likely become invalid over time, as we would be tied to the specifics of each service. This stems from a lack of cohesion between the various services that ultimately represent a single entity: a product.

Our Solution

We wanted to unify all information relating to a product into one service, whilst still benefiting from the advantages that a microservices architecture gives us. This led to the implementation of a new service — which we’ll call Product Reader — whose sole purpose is to quickly serve product information without being the source of truth for that information.

Architecture

The simplest way to achieve the speed we needed for the volume of information we wanted to serve was to use a NoSQL database: Couchbase. We store the entire representation of a product in this database, so serving all of a product's information is as fast as a single read. Using Couchbase also means we can scale horizontally and maintain 100% uptime through the use of replicas.
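To make the "one read" point concrete, here is an illustrative shape for such a denormalised product document (the field names are assumptions for this sketch, not our actual schema), with a plain dictionary standing in for the key-value store:

```python
# Everything a product page needs lives under a single key, so rendering
# the page is one key lookup rather than a fan-out of service calls.
product_document = {
    "id": "12345",
    "catalogue": {"title": "Whey Protein", "flavour": "Vanilla"},
    "content": {"overview": "Editorial overview...", "details": "Details..."},
    "pricing": {"price": 19.99, "currency": "GBP"},
    "reviews": {"average_rating": 4.5, "count": 1234},
    "stock": {"in_stock": True},
}

# A toy in-memory key-value store standing in for Couchbase.
store = {product_document["id"]: product_document}

def read_product(product_id):
    """Serve the full product representation with a single read."""
    return store.get(product_id)
```

The trade-off is duplication: the same facts live both in the source services and in this document, which is why the document must never be treated as the source of truth.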

However, in order to populate this database in a way that doesn’t make it the source of truth, we needed to create another application that can read from the product microservices and write the product entity into the NoSQL database ready to be read by the Product Reader. Here, we’ll call this Product Writer. Below is a simplified diagram of the information flow in our system:

Product Information is read from the various services and is stored in a NoSQL database.

But how do we know when to write a new product, or update an existing one, in the database? Luckily our system is designed around an event-driven architecture. Whenever a price changes, new stock is booked in, or really anything else changes in the system, we provide a mechanism for interested parties to subscribe to those changes and handle them accordingly. In this case, the Product Writer subscribes to any product-related changes and uses them to read the updated information from the microservices before updating the representation in the NoSQL database. Once completed, the changes are immediately available for the Product Reader to serve to the website when rendering a product page for a customer. If one of the services is unavailable, we simply retry the update later; the Product Reader continues to serve the last representation. (This may not be ideal in every application, but it works well for us.)
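The writer loop described above can be sketched as follows. This is a minimal illustration under assumed names (the real Product Writer consumes events from a message bus and calls live services); the key behaviour shown is that a failed rebuild is requeued while the last good document keeps being served:

```python
from collections import deque

def rebuild_document(product_id):
    """Stand-in for reading the full product back from the source services."""
    if product_id == "down":
        # Simulate one of the source microservices being unavailable.
        raise ConnectionError("pricing service unavailable")
    return {"id": product_id, "version": "latest"}

def handle_events(events, store):
    """Apply product-change events; requeue any that can't be rebuilt yet."""
    retry_queue = deque()
    for event in events:
        try:
            store[event["product_id"]] = rebuild_document(event["product_id"])
        except ConnectionError:
            # Don't touch the stored document: the reader keeps serving
            # the last good representation until the retry succeeds.
            retry_queue.append(event)
    return retry_queue

store = {"down": {"id": "down", "version": "stale"}}
retries = handle_events([{"product_id": "100"}, {"product_id": "down"}], store)
```

This is an eventual-consistency trade: writes are decoupled from reads, so a slow or failing source service degrades freshness rather than availability.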

The Result

Looking back at the product page from earlier: instead of individual elements being provided by each of the product services, there is now only one request, made to the Product Reader, which returns all of the information required to render the page. Most of these requests are answered in under 5 milliseconds, meaning we can render the product page much faster than we otherwise could. For context, the Product Writer can take anywhere up to 5–10 seconds to collect the information required to store in the NoSQL database.

Historically we’ve suffered website outages caused by (scheduled or unscheduled) database maintenance or failovers on SQL Server; however, product pages — and subsequently any other part of the website — are now not directly reliant on that database and can sustain traffic throughout. This allows us to take orders throughout the year, even through our busiest trading periods.
