Written by Gargi Dasgupta
What is the worst nightmare for an e-commerce company operating at more than double its usual load during the peak holiday season?
While you can think of many, the foremost horror for an engineer would certainly be production bugs!
Although it’s quite normal for any microservice under continuous development to produce the occasional bug, there are two prominent ones that haunt me to this day. Both occurred in the Marketing Promotions platform. For background, this platform creates personalized marketing promotions, distributes them, makes customers aware of the promotions they are eligible for, and finally validates the promotion code during checkout.
In November 2018, with peak holidays around the corner, the Promotions engineering team successfully delivered the new promo awareness feature. The objective of this feature was to make customers aware of the promotion codes (promo codes) available to them through in-app messages. In addition, it would pass along the promo code’s redemption count for the last 24 hours (hrs) as well as the time remaining for the promotion, to create urgency based on the promo code’s popularity.
However, because Groupon is a multinational e-commerce company, the requirements differed by region. In North America, the objective was to highlight both the remaining time and the last 24 hrs redemption count, while the EMEA application was required to show only the last 24 hrs redemption count and hide the remaining time.
This feature was a huge success, and led to a sales lift of over $2 million!
Everything was working smoothly until the peak holiday season the following year, when something unexpected happened. In EMEA, instead of the last 24 hrs redemption count, the time remaining for the promotion started showing up.
The obvious question was: what had suddenly gone wrong, given that there had been no release for this feature in over 6 months?
After a team of four engineers spent hours investigating the issue, they found that another feature, which went live in mid-2019, also incremented the hourly redemption counts in an underlying function, which in turn doubled the value of the hourly redemption counts (the total count was still tracked correctly). What made the problem tricky was how the last 24 hrs redemption count was derived: if the promotion had been running for less than 24 hrs, it was set to the total redemption count (which was tracked differently), and after 24 hrs it was computed by a formula based on the hourly counts.
Since the hourly counts were doubled, after the promotion ran for over 24 hours the redemption count became negative as per the formula above. Further investigation revealed the web and mobile application logic in EMEA defaulted to showing the time remaining for the promotion if the redemption count was 0 or less.
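The display behavior described above can be sketched as follows. This is a minimal illustration, not the actual client code: the `PromoBanner` type, the `build_banner` function, and the region strings are hypothetical names, and the upstream formula that produced the negative count is deliberately not reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromoBanner:
    """What the app renders; None means the field is hidden (hypothetical type)."""
    redemptions_24h: Optional[int]
    hours_remaining: Optional[int]


def build_banner(region: str, redemptions_24h: int, hours_remaining: int) -> PromoBanner:
    """Sketch of the per-region display rules described in the text."""
    if region == "NA":
        # North America highlights both values.
        return PromoBanner(redemptions_24h, hours_remaining)
    # EMEA is meant to show only the redemption count, but the client
    # falls back to the time remaining when the count is 0 or less.
    if redemptions_24h <= 0:
        return PromoBanner(None, hours_remaining)
    return PromoBanner(redemptions_24h, None)
```

With a healthy count, `build_banner("EMEA", 120, 5)` hides the remaining time as required. Once the upstream count goes negative, the same call silently switches to showing the remaining time, which is the symptom seen in EMEA even though the client code itself had not changed.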
Going back a few months, in mid-2019, the team had witnessed another unusual issue, this time resulting from a regular deployment. Ten hours after the deployment, the promotions on-call engineer received a page that the error rate had increased by 7%.
Upon investigation, it was found that the errors were coming from only one application (app) server out of 30. It turned out that this app server’s deployment was corrupt. Although running a smoke test was a standard part of the release, it was run only twice: once after the first application host was deployed, and then at the load balancer level once the complete deployment had finished. Hence, the corruption was not identified until later, when peak traffic started.
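One way to close this kind of gap is to smoke test every host as it is deployed, rather than only the first host and the load balancer. The sketch below illustrates that idea under stated assumptions and is not necessarily what the team adopted: `rolling_deploy` is a hypothetical helper, and the `deploy` and `smoke_test` callables stand in for whatever the real release tooling provides.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> List[str]:
    """Deploy to hosts one at a time and smoke test each host directly.

    Stops at the first host whose smoke test fails, so a single corrupt
    deployment is caught before it can hide behind the load balancer.
    """
    verified: List[str] = []
    for host in hosts:
        deploy(host)
        if not smoke_test(host):
            raise RuntimeError(f"smoke test failed on {host}; halting rollout")
        verified.append(host)
    return verified
```

A check at the load balancer can pass even when one host is broken, because most requests are routed to healthy hosts. Checking each host directly trades a slightly longer rollout for catching the failure at deploy time rather than hours later under peak traffic.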
While both incidents were resolved quickly, they made us realize that, had we been more proactive in identifying the gaps in the system, we could have averted the obvious ones. This led to the conception of a robust quality framework to improve reliability and availability. An additional aim of the framework was to speed up the delivery of high-quality products by reducing the time spent on operations.
The goals of the framework
Elements of the framework
Over time, once we had rigorously followed the above standards and best practices, we saw considerable differences in our promotions platform, as noted below:
Ultimately, engineers are now happier, as they spend fewer stressful days and nights attending to alerts and outages, which leaves them free to be more innovative. Internal and external customers are also delighted with new products, such as “Click to Apply” promotions much further up the purchase funnel, and the automated creation of promotion campaigns.