How we handle Black Friday traffic in Motive Commerce Search

Francisco García López
Published in Motive.co
5 min read · Nov 25, 2022

Black Friday is one of the biggest days of the year for online shoppers, and even more so for shop owners, who see a dramatic rise in visits to their shops and in interactions with their catalogues. The idea of the shop going down on this particular day is daunting.

It is the day when your shop’s infrastructure is shaken up and tested against the maximum stress it was built for. Since search is part of your shop, Black Friday is also the day when you find out whether everything you’ve been preparing during the year is enough for occasions like this.

Although it is a first for Motive, we have a long history of dealing with Black Friday stress at our sister company Empathy.co, which handles this intense period for many of the big brands it provides a search solution to. However, the Motive ecosystem has its particularities, and although we can apply what we learnt at Empathy, there are some things we’ve had to face for the first time ourselves.

Forecasting what Black Friday will be like

The starting point is to understand how our system is likely to behave under high pressure, so that we know the risks it may face during Black Friday. Checking how the system scales is something we do regularly to ensure it is robust and ready for unexpected stress, but it is especially important for a day like this.

It was vital to know how to scale our system to handle the upcoming traffic. For this, we estimated not only the increase in traffic, but also the growth in the information managed by our infrastructure, which also affects how our system performs.

Therefore, we scale our regular conditions by an estimate of how information is expected to grow, preserving the current relationship between traffic and data. This way, we can observe how the system behaves, understand its limitations and risks, and identify where bottlenecks may occur.
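A minimal sketch of this kind of projection, assuming a per-shop baseline of traffic and catalogue size. The `ShopLoad` type, its field names and the growth factors are hypothetical, chosen only for illustration. Every shop is scaled by the same factors, so each one keeps its share of the total:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShopLoad:
    name: str
    requests_per_min: float  # search traffic (hypothetical unit)
    catalogue_docs: int      # amount of data indexed for the shop


def project(baseline, traffic_factor, data_factor):
    """Scale every shop by the same factors, so each shop keeps its share."""
    return [
        ShopLoad(
            s.name,
            s.requests_per_min * traffic_factor,
            round(s.catalogue_docs * data_factor),
        )
        for s in baseline
    ]


def traffic_share(loads):
    """Fraction of the total traffic that each shop accounts for."""
    total = sum(s.requests_per_min for s in loads)
    return {s.name: s.requests_per_min / total for s in loads}
```

The projected loads can then be replayed against a test environment, and comparing `traffic_share` before and after confirms that the distribution across shops is unchanged.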

Getting down to work: Chaos testing

Once we had an environment that behaved as closely as possible to how it would during Black Friday, we needed a clear process for chaos testing possible failures. This lets us make sure our functionalities behave as they’re supposed to on D-day.

For this, we randomly generated failures in every part of the system that the critical functionalities rely on, to see how the system would behave in each scenario. There were several steps we went through to ensure everything was rock-solid in preparation for Black Friday:

  1. Carrying out a prior analysis of the key functionalities.
  2. Establishing the dependencies between the technology and the functionalities.
  3. Running failure experiments against those dependencies.
  4. Seeing how the system responds to the experiments.
  5. Making informed decisions based on what the experiments taught us.
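The steps above can be sketched as a small experiment loop. This is a toy model, not our actual tooling: `Dependency`, `injected_failure` and `run_experiments` are hypothetical names, and a real experiment would break the dependency in the test environment rather than flip a flag in memory.

```python
import random
from contextlib import contextmanager


class Dependency:
    """Stand-in for a real dependency (cache, search engine, ...)."""

    def __init__(self, name):
        self.name = name
        self.failing = False

    def call(self):
        if self.failing:
            raise ConnectionError(f"{self.name} is down")
        return "ok"


@contextmanager
def injected_failure(dependency):
    """Break one dependency for the duration of an experiment."""
    dependency.failing = True
    try:
        yield
    finally:
        dependency.failing = False  # always restore it afterwards


def run_experiments(dependencies, features, rng=None):
    """Fail each dependency in random order and record how features respond."""
    rng = rng or random.Random()
    report = {}
    for dep in rng.sample(dependencies, k=len(dependencies)):
        with injected_failure(dep):
            for feature_name, feature in features.items():
                try:
                    feature()
                    report[(dep.name, feature_name)] = "ok"
                except Exception:
                    report[(dep.name, feature_name)] = "failed"
    return report
```

The resulting report maps each (dependency, feature) pair to an outcome, which is the raw material for step 5: deciding which dependencies need a fallback or extra capacity.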

Establishing the relationships between features, services and the technology in use is vital, as we need to understand which parts are critical for Black Friday. In other words, if something were to fail, which failure would be the least damaging for our customers and their shoppers’ experience?

As an example, the control panel our shop owners use to configure their search (the Motive Playboard) is unlikely to see overloads, but the shops’ data is expected to grow significantly. Also, if one search feature has to be sacrificed so that others stay up and running, which one should it be? Deciding priorities was a key part of this risk prevention, so that those decisions are based on facts.
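One way to make such priorities explicit is to write them down as data and let a simple rule decide what to switch off first. The sketch below is only an illustration of that idea: the `Feature` type, the priorities and the cost figures are all hypothetical, and the article does not claim this is how features are actually disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    priority: int  # 1 = most critical
    cost: float    # share of capacity the feature consumes (hypothetical)


def shed(features, capacity):
    """Drop the least critical features until the rest fit in `capacity`."""
    active = sorted(features, key=lambda f: f.priority)  # most critical first
    dropped = []
    while active and sum(f.cost for f in active) > capacity:
        dropped.append(active.pop().name)  # the last item is the least critical
    return [f.name for f in active], dropped
```

Agreeing on the ordering in advance means nobody has to debate it in the middle of an incident.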

Lessons learnt

Now that we have a general understanding of the process we followed, let’s get to the specifics. Some of the lessons are very relevant, and some we actually learnt well before Black Friday. For us, constant observability is key not only for Black Friday but also for providing a consistent experience to our customers all year round.

As part of our findings, we identified the cache as a Single Point of Failure (SPOF): its time-out was set too high, so when the cache was down, the services waiting on it failed and some functionalities became unresponsive. We mitigated this cache time-out issue and checked that it didn’t happen again.
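One common way to stop a slow or unavailable cache from taking a service down with it is to give the cache lookup a short time-out and treat a time-out as a cache miss. The sketch below shows that pattern under assumptions: `get_with_cache`, its parameters and the 50 ms default are hypothetical, and this is not necessarily the mitigation we applied.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def get_with_cache(key, cache_get, load_from_source, timeout_s=0.05):
    """Read from the cache, falling back to the source if it is slow or down."""
    future = _pool.submit(cache_get, key)
    try:
        # Wait only briefly: a slow cache must not block the request.
        value = future.result(timeout=timeout_s)
    except Exception:
        # Time-out or cache error: treat both as a cache miss.
        # Note that the abandoned lookup keeps running in its worker thread.
        value = None
    if value is None:
        value = load_from_source(key)
    return value
```

With this shape, a cache outage costs at most `timeout_s` per request instead of making the feature unresponsive, at the price of extra load on the source while the cache is unavailable.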

When it comes to the scalability of the system, we’ve established plans for what to scale if we get even more traffic than we projected. We need a contingency plan for uncertainty, i.e. to know what to scale if any of the scenarios we observed takes place.

Among those changes, we increased the number of Elasticsearch nodes to have a margin in case we needed to scale the search capabilities. We learnt that, with our current configuration, increasing the number of search pods without also increasing the number of Elasticsearch nodes would not really scale the system in an emergency.
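The reasoning can be illustrated with a deliberately simplified model in which the throughput of the search path is capped by its slowest tier. The function name and the per-pod and per-node figures are hypothetical; real capacity depends on query mix, shard layout and much more.

```python
def max_throughput(search_pods, per_pod_rps, es_nodes, per_node_rps):
    """Requests per second the search path can sustain in this toy model."""
    pod_capacity = search_pods * per_pod_rps
    node_capacity = es_nodes * per_node_rps
    return min(pod_capacity, node_capacity)  # the slowest tier sets the limit
```

In this model, doubling the pods while the Elasticsearch tier is already the limit leaves the result unchanged, which matches what we observed.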

One of the reasons for this Elasticsearch upgrade is the very nature of our system: it serves over 300 customers and therefore handles a large amount of isolated information. This makes an overload caused by the number of Elasticsearch shards more likely than an overload caused by traffic per se.
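A back-of-the-envelope calculation shows why the shard count grows with the number of customers rather than with traffic. Only the figure of 300 customers comes from the text above; the indices per customer, the shard settings and the per-node shard budget in the usage note are hypothetical.

```python
import math


def total_shards(customers, indices_per_customer, primaries_per_index, replicas):
    """Each primary shard is stored once, plus one copy per replica."""
    return customers * indices_per_customer * primaries_per_index * (1 + replicas)


def nodes_needed(shards, max_shards_per_node):
    """Minimum number of nodes to stay within a per-node shard budget."""
    return math.ceil(shards / max_shards_per_node)
```

For example, 300 customers with 2 indices each, 1 primary shard and 1 replica already give `total_shards(300, 2, 1, 1) == 1200` shards, before a single search request has been served. With an assumed budget of 500 shards per node, that means at least `nodes_needed(1200, 500) == 3` nodes.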

Another conclusion we reached was to prepare Kafka with a greater number of partitions for critical topics. This way, we are ready to scale by making use of those partitions if we need to during Black Friday.
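The reason partitions matter for scaling is that, within a consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle. The sketch below illustrates this with a simplified round-robin assignment. It is not Kafka's actual assignor, and the consumer names are hypothetical.

```python
def assign(partitions, consumers):
    """Spread partition ids over consumers, one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partition and therefore do no work."""
    return [c for c, owned in assignment.items() if not owned]
```

With 3 partitions and 4 consumers, one consumer is idle no matter how many more are added. With 6 partitions, all 4 have work, which is the headroom that extra partitions buy in advance.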

Ready for today

All in all, improving observability is the way to leave nothing to improvisation: our plan of action has to be well drawn and based on facts. We need detailed, real-time visibility of everything that is going on in our system. For that, we improved our tracking metrics in Grafana, our alert system and the observability of the system as a whole.

This clear view of the facts is key to focusing the team on which aspects of the system they need to watch during Black Friday, the potential risks that may arise and how to mitigate them. We’re now ready for today, and know how to react to whatever may happen.
