Preparing for (a) Black Friday

Jan-André le Roux
Tech@Travelstart
Nov 27, 2017

Black Friday is exciting for any e-commerce company, and for Travelstart this was our third one. We learned a lot from the previous two, so this year we were determined to get it right. Here’s what we did.

The few months before

We learned quite a number of lessons from the previous Black Friday and scheduled some fixes to address these issues:

  • Issues with our CMS — luckily, these were resolved quickly because they came down to a configuration issue (which is why periodic load testing is a good idea). However, the CMS could only scale up to a certain number of users. We offloaded images onto S3 and made them available via CloudFront; this change also enabled us to provision more instances, since the previous application architecture stored images on the file system. We added a caching layer, which dramatically improved response times and the number of concurrent requests we could handle. We made these changes incrementally and verified each one with a load test, and during the process we also discovered OS-level bottlenecks that needed tuning.
  • The platform backend issued a lot of identical queries to the database for data that almost never changes — we started caching those results in a distributed cache, which also means that a change is reflected across the cluster. The screenshot below shows the effect. For every cache hit we save a database connection allocation, a network hop, a (possible) disk read and some CPU time on the db box, which is great for scalability. (See the cache-aside sketch after this list.)
  • Issues on the previous BF with the database that keeps track of all requests/responses to third parties. Although INSERTs were already bundled and then pushed to the database asynchronously, we were bundling in parallel (multiple bundles of bundles), which meant the db would be sent a bunch of bundled INSERTs per connection per backend. That was really bad for scaling — the db was totally overloaded. So we thought about how to “move” the problem from the database to the backend itself (we can provision backends at will, but not the database easily): we now queue up each entry and send it to a single bundler instead of several. This means the number of connections to the db stays relatively constant and the load sits on the backend, not the database. See the screenshot below: the number of connections went from 27 to 42 (1.6x) under heavy load, but the number of queries went up by a factor of 6 and (highly compressed) network traffic by a factor of 4. On the previous BF we had 500+ connections in use, which resulted in really bad performance. (A sketch of the single-bundler approach follows this list.)
  • Reviewing the number of “static” requests that the website was making to its API — we realised that those were 50% of all API requests! We introduced a caching layer in front of the API and reduced the number of calls by 50%!
  • Deployment: previously we maintained a static list of deployed components (backends, APIs, etc), i.e. the lists of upstream servers were not dynamic, which sometimes resulted in deploy bugs: we provisioned new server instances, but the deployment process did not know about them, resulting in “weird” behaviour (old code showing up). We changed all configuration to be dynamic: every list of “nodes” is now generated on the fly. This is useful when you need to ramp up quickly and eliminates time wasted on bug hunting.
  • Implement a Change, Review and Apply philosophy: make a change to a small part of the system and monitor it — if all is well, apply it everywhere. Have proper monitoring tools. Keep all configuration (including infrastructure) in version control — it makes rollbacks easy and shows who changed what. Reverts should be easy.
  • Since we rely heavily on message queues, we fell into the trap of just increasing the number of concurrent consumers (to scale) without noticing that the actual processing time of each message went up after the change; if processing a message depends on an external dependency (like a database or an external service) with a certain SLA — be careful! We have since reviewed and reduced the number of concurrent consumers for certain queues from 10 to 1 per backend and got this nice surprise:
Message processing time reduced by more than 50%
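
To illustrate the query-caching point above, here is a minimal cache-aside sketch in Java. It is not our actual code: the article only says we use a distributed cache, so the in-process map below stands in for whatever cache cluster is in play (Hazelcast, Redis, etc.), and the loadFromDatabase method and the TTL value are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal cache-aside sketch for data that almost never changes.
// A real setup would use a distributed cache so that changes are visible
// across the whole cluster; the map below just stands in for it.
public class ReferenceDataCache {

    private record Entry(String value, Instant loadedAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl = Duration.ofHours(1); // hypothetical TTL

    public String get(String key) {
        Entry entry = cache.get(key);
        if (entry == null || entry.loadedAt().plus(ttl).isBefore(Instant.now())) {
            // Cache miss (or stale): pay the db round trip once, then reuse it.
            String fresh = loadFromDatabase(key);
            cache.put(key, new Entry(fresh, Instant.now()));
            return fresh;
        }
        // Cache hit: no connection allocation, no network hop, no disk read.
        return entry.value();
    }

    // Placeholder for the real query; hypothetical.
    private String loadFromDatabase(String key) {
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        ReferenceDataCache cache = new ReferenceDataCache();
        System.out.println(cache.get("airport:CPT")); // first call hits the db
        System.out.println(cache.get("airport:CPT")); // second call is served from cache
    }
}
```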
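
And a rough sketch of the single-bundler idea for the request/response log: request-handling threads drop entries onto a queue, and one bundler thread drains them into batches so the database sees a small, steady number of connections. The class and method names (RequestLogBundler, flushBatch) and the batch size and flush interval are illustrative, not our production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of funnelling third-party request/response log entries through a
// single bundler, so the db gets large INSERT batches over few connections.
public class RequestLogBundler {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final int maxBatchSize = 500;      // illustrative
    private final long flushIntervalMs = 1000; // illustrative

    // Called by request-handling threads: cheap and non-blocking for the caller.
    public void log(String entry) {
        queue.offer(entry);
    }

    // The single bundler loop: drain the queue and flush in batches.
    public void runBundler() throws InterruptedException {
        List<String> batch = new ArrayList<>(maxBatchSize);
        while (true) {
            String first = queue.poll(flushIntervalMs, TimeUnit.MILLISECONDS);
            if (first != null) {
                batch.add(first);
                queue.drainTo(batch, maxBatchSize - batch.size());
            }
            if (!batch.isEmpty()) {
                flushBatch(batch); // one bundled INSERT over one pooled connection
                batch.clear();
            }
        }
    }

    // Placeholder for the real bundled INSERT (e.g. a JDBC batch).
    private void flushBatch(List<String> batch) {
        System.out.println("Flushing " + batch.size() + " log entries in one batch");
    }

    public static void main(String[] args) throws InterruptedException {
        RequestLogBundler bundler = new RequestLogBundler();
        Thread bundlerThread = new Thread(() -> {
            try { bundler.runBundler(); } catch (InterruptedException ignored) { }
        });
        bundlerThread.setDaemon(true);
        bundlerThread.start();

        for (int i = 0; i < 1000; i++) {
            bundler.log("request/response #" + i);
        }
        Thread.sleep(2000); // give the bundler time to flush
    }
}
```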

The prep the week before

The first thing we did was run a load test at night to see if our CMS (the content for essentially all our sale pages) would be able to hold up to a sudden surge in traffic — and to our surprise we discovered that we hit bandwidth limits at the data centre! Luckily, the test was done early, so we could resolve it in time. It also helps a lot if load tests can be initiated from a console, for consistent testing.

In terms of the office network: since we were going to deal with a lot more calls, we increased our internet bandwidth for both voice and data for the week to handle the extra load.

We also created BF-specific dashboards to be shown on screens, especially for third parties (response times, volumes), as we depend on them for e.g. flights and payments — during a big sale some services go down very quickly, and it is often easier to spot problems early with a visualisation that everyone can see. It is also a good idea to have a chat with your most problematic service suppliers beforehand and agree on how to communicate when things go TITSUP (Total Inability To Support Usual Performance). We assembled a team of core people in a “War Room” for the day and put the dashboards up on screens.

Alerting is something we already had in place, but we reviewed all alert policies, like response time for certain key transactions, e.g. BookFlight. If, like Travelstart, you depend on a lot of third parties for critical services, also set up alerting for those services so you are able to deal with operational issues.

Then provisioning of extra resources (“servers”, “instances”): we looked at normal load (1x), our last sale’s load (Birthday sale, 5x) and last Black Friday’s load (25x). This is where you need to know your application architecture well — just adding more servers does not necessarily mean your apps will scale with the number of servers added (Travelstart’s platform is a heavily SEDA-based architecture). Since we use NewRelic, we could pull some useful stats to get a feeling for how many instances of each component to deploy (backends, APIs, etc). We have not yet implemented auto-scaling, but the same thing applies: your architecture must cope with an increased number of instances.
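
As a back-of-the-envelope illustration of that provisioning exercise: only the 1x/5x/25x multipliers come from our numbers, the throughput figures below are made up. If you know a normal-day peak and roughly how much one instance can handle, the instance count falls out directly.

```java
// Back-of-the-envelope capacity estimate. The multipliers (1x, 5x, 25x) are
// from the article; the throughput figures are purely illustrative.
public class CapacityEstimate {

    static int instancesNeeded(double peakRequestsPerSecond,
                               double multiplier,
                               double perInstanceCapacity,
                               double headroom) {
        double expected = peakRequestsPerSecond * multiplier;
        return (int) Math.ceil(expected * headroom / perInstanceCapacity);
    }

    public static void main(String[] args) {
        double normalPeak = 100;  // requests/second on a normal day (made up)
        double perInstance = 60;  // requests/second one backend handles (made up)
        double headroom = 1.3;    // 30% safety margin

        System.out.println("Birthday sale (5x):  "
                + instancesNeeded(normalPeak, 5, perInstance, headroom) + " instances");
        System.out.println("Black Friday (25x): "
                + instancesNeeded(normalPeak, 25, perInstance, headroom) + " instances");
    }
}
```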

On a more technical front, we looked at our external caches — in this case, our flight cache, which holds frequently searched flight data. This is quite complicated, since dates and availability play a special role in flight pricing — not your straight-up cache strategy. On Black Friday, searches for routes (with accompanying dates) are very specific, so we created a separate cache instance and changed the flight cache rules to use a very short TTL (usually we have a dynamic TTL that is calculated for every search). The rationale is that many users will search for the same flights (route + date) within a very short time, and this alleviates that sudden pressure on our side, and on our suppliers as well.
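
A hedged sketch of that TTL switch: the real flight cache rules are more involved (dynamic, per-search TTLs), so this only shows the idea of falling back to a short fixed TTL during the sale. The class name and all values are illustrative.

```java
import java.time.Duration;

// Illustrative TTL rule: normally the TTL is computed per search (route,
// departure date, availability); during Black Friday we pin it to a short,
// fixed value because many users search the same route + date within minutes.
public class FlightCacheTtlRule {

    private final boolean blackFridayMode;

    public FlightCacheTtlRule(boolean blackFridayMode) {
        this.blackFridayMode = blackFridayMode;
    }

    public Duration ttlFor(String route, int daysUntilDeparture) {
        if (blackFridayMode) {
            return Duration.ofMinutes(2); // short, fixed TTL for the sale (illustrative)
        }
        // Normal mode: a (much simplified) dynamic TTL — searches far in the
        // future can be cached longer than searches for imminent departures.
        return daysUntilDeparture > 30 ? Duration.ofHours(6) : Duration.ofMinutes(30);
    }

    public static void main(String[] args) {
        System.out.println(new FlightCacheTtlRule(false).ttlFor("JNB-CPT", 45)); // PT6H
        System.out.println(new FlightCacheTtlRule(true).ttlFor("JNB-CPT", 45));  // PT2M
    }
}
```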

Review of scheduled jobs — since BF starts at midnight, we had to reschedule some jobs that run around that time. The same goes for database maintenance tasks — you don’t want to put extra load on the database at what is supposed to be a quiet time. We also looked at jobs that run very often and asked whether they really need to run every minute. We discovered a few of those and changed them to run every 15 minutes — it took quite a bit of load off the system overall!

Changing the job schedule meant a query ran less frequently, resulting in less load on the db
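
A small sketch of that frequency change, using a plain ScheduledExecutorService. The article doesn’t say what we actually use for scheduling, and the job itself is hypothetical; this just makes the “every minute vs every 15 minutes” point concrete.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Same job, two schedules: running it every 15 minutes instead of every minute
// removes ~14 of every 15 executions (and the queries each run fires at the db).
public class JobScheduleExample {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

        Runnable reconciliationJob = () ->
                System.out.println("Running reconciliation job (hypothetical)");

        // Before: every minute.
        // scheduler.scheduleAtFixedRate(reconciliationJob, 0, 1, TimeUnit.MINUTES);

        // After: every 15 minutes, with an initial delay to shift it away from midnight.
        scheduler.scheduleAtFixedRate(reconciliationJob, 5, 15, TimeUnit.MINUTES);
    }
}
```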

Another big thing we checked was database connection pool configuration — since we were scaling up, the number of database connections would also increase. E.g. if you have 10 backend instances and each backend has 10 connections, and you then scale to 25 backend instances, you can suddenly have up to 250 connections instead of 100. Either reduce the number of connections per backend instance (a good idea — databases don’t scale as easily as your backends) or give your database a higher connection limit. We did the same review for connections to our message queue.
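
Here is a sketch of the first option, deriving the per-backend pool size from the database’s connection limit instead of letting the total grow implicitly. It assumes a HikariCP pool, which the article does not name; the JDBC URL, credentials and headroom factor are placeholders.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Sketch of capping the per-backend connection pool so the total number of
// database connections stays within the db's limit as backends are added.
// HikariCP is assumed; the JDBC URL and credentials are placeholders.
public class PoolSizing {

    // 25 backends x 10 connections = 250; better to derive the per-backend
    // pool size from the db's limit than to let it grow implicitly.
    static int poolSizePerBackend(int backendInstances, int dbConnectionLimit) {
        // keep ~20% headroom for admin, monitoring and maintenance connections
        return Math.max(2, (int) (dbConnectionLimit * 0.8) / backendInstances);
    }

    static HikariDataSource buildDataSource(int poolSize) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://db.example.internal:3306/platform"); // placeholder
        config.setUsername("app");       // placeholder
        config.setPassword("secret");    // placeholder
        config.setMaximumPoolSize(poolSize);
        config.setConnectionTimeout(2_000); // fail fast rather than queue forever
        return new HikariDataSource(config);
    }

    public static void main(String[] args) {
        // 10 backends, 250-connection limit -> 20 connections per backend
        // 25 backends, same limit           -> 8 connections per backend
        System.out.println(poolSizePerBackend(10, 250));
        System.out.println(poolSizePerBackend(25, 250));
    }
}
```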

Timeouts: tuning timeout configuration for external services. On the website, total search response time is only as good as the slowest external service we use to get flight availability. We shortened the timeouts for some of our more problematic suppliers so that we don’t appear “slow” to our customers.
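
A sketch of per-supplier timeouts using the JDK’s HttpClient — our actual HTTP stack and the specific values are not in the article, and the supplier URLs and time budgets below are placeholders. The point is that a slow supplier should cost a bounded amount of time, not the whole search.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative per-supplier timeout: a slow availability supplier should cost a
// bounded amount of time, not the whole search. URLs and values are placeholders.
public class SupplierTimeouts {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))   // how long to wait to connect
            .build();

    static String fetchAvailability(String supplierUrl, Duration budget) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(supplierUrl))
                .timeout(budget)                      // total time we are willing to wait
                .GET()
                .build();
        try {
            return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException | InterruptedException e) {
            // Timeout or failure: degrade gracefully (e.g. drop this supplier's fares)
            // instead of letting one slow supplier make the whole search look slow.
            return null;
        }
    }

    public static void main(String[] args) {
        // Problematic suppliers get a tighter budget than well-behaved ones.
        fetchAvailability("https://slow-supplier.example/availability", Duration.ofSeconds(3));
        fetchAvailability("https://fast-supplier.example/availability", Duration.ofSeconds(8));
    }
}
```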

In short:

  • Load test — before and after changes
  • Monitor — visually, but also alert via Slack and text message (SMS)
  • Provision capacity backed by stats — know your application architecture!
  • Review scheduled jobs
  • Review caching strategies
  • Review database connection pool settings
  • Review the number of message queue consumers
  • Have a Change, Review and Apply philosophy
  • Review timeout config: you are only as “fast” as your slowest external service provider

Conclusion

A lot of these points are quite obvious, but their effect can be quite dramatic once implemented, and in our experience it was the little changes (mostly config) that had the biggest impact. We also learned to review more often, and not just before big sale days. Metrics are your guiding points. Don’t make changes without measuring the impact.
