Why should an engineering team invest in a Quality Framework?

Gargi Dasgupta
Nov 23, 2020 · 5 min read

What is the worst nightmare for an e-commerce company operating at more than double its usual load during the peak holiday season?

Many answers come to mind, but for an engineer the first and foremost horror is certainly production bugs!

Although it’s quite normal for any microservice under continuous development to produce a few bugs from time to time, there are two prominent ones that haunt me to this day. These bugs occurred in the Marketing Promotions platform. For background, this platform creates personalized marketing promotions, distributes them, makes customers aware of the promotions they are eligible for, and finally validates the promotion code during checkout.

Nightmare 1

In November 2018, with the peak holidays around the corner, the Promotions engineering team successfully delivered the new promo awareness feature. Its objective was to make customers aware of the promotion codes (promo codes) available to them through in-app messages. To create urgency based on a promo code’s popularity, the message would also show how many times the code had been redeemed in the last 24 hours (hrs), as well as the time remaining for the promotion.

Figure: Promo awareness in North America

However, Groupon is a multinational eCommerce company, and the requirements differed by region. In North America the objective was to highlight both the remaining time and the 24hrs redemption count, whereas the EMEA version of the application was required to show only the last 24hrs redemption count and hide the remaining time.

Figure: Promo awareness in EMEA

This feature was a huge success, and led to a sales lift of over $2 million!

Everything was working smoothly until the peak holiday season of the following year, when something unexpected happened. In EMEA, the time remaining for the promotion started showing up instead of the last 24hrs redemption count.

Figure: Promo awareness in EMEA when the expiry time started showing instead of the 24hrs redemption count

The obvious question was: what had suddenly gone wrong, given that there had been no release for this feature in over 6 months?

Root cause

After a team of 4 engineers spent hours investigating the issue, they identified that another feature, which went live in mid-2019, also incremented the hourly redemption counts in an underlying function, which in turn doubled the hourly redemption counts (the total count was still tracked correctly). To make things trickier, the last 24hrs redemption count was set to the total redemption count (which was tracked separately) if the promotion had been running for less than 24hrs, and after 24hrs the formula was:

Figure: formula for the last 24hrs redemption count

Since the hourly counts were doubled, the redemption count computed by the formula above became negative once the promotion had run for over 24 hours. Further investigation revealed that the web and mobile application logic in EMEA defaulted to showing the time remaining for the promotion if the redemption count was 0 or less.
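To make the failure mode concrete, here is a minimal sketch of the display fallback described above, together with a sanity check that could have flagged the bad value before it reached customers. The function names and banner wording are hypothetical, invented for illustration; the only behavior taken from the incident is that the EMEA clients fell back to the time remaining when the count was 0 or less.

```python
from typing import List


def emea_banner(redemptions_24h: int, hours_left: int) -> str:
    """Pick the EMEA banner text (wording is hypothetical).

    Mirrors the behavior described above: a count of 0 or less
    silently falls back to showing the time remaining.
    """
    if redemptions_24h <= 0:
        return f"Ends in {hours_left} hours"
    return f"{redemptions_24h} redeemed in the last 24 hours"


def check_redemption_count(redemptions_24h: int, total: int) -> List[str]:
    """Return a list of invariant violations (empty if the value is sane).

    Assumed invariants: a 24-hour count can never be negative and can
    never exceed the total number of redemptions.
    """
    problems = []
    if redemptions_24h < 0:
        problems.append("negative 24h count")
    if redemptions_24h > total:
        problems.append("24h count exceeds total")
    return problems
```

With a negative input such as `emea_banner(-120, 5)`, the banner quietly switches to the time remaining, which is exactly what made the bug hard to notice. A check like `check_redemption_count` turns the same input into an explicit signal that can be logged or alerted on.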

Nightmare 2

Going back a few months, in mid-2019 the team had witnessed another unusual issue, this time resulting from a regular deployment. Ten hours after the deployment, the Promotions on-call engineer was paged because the error rate had increased by 7%.

Root cause

Upon investigation, it was found that the errors were coming from only one application (app) server out of 30. It turned out that this app server’s deployment was corrupt. Although running a smoke test was a standard part of the release, it was run only once after the first application host was deployed, and then again at the Load Balancer level at the end of the complete deployment. Hence, the corruption was not identified until later, when the peak traffic started.
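One way to close this gap is to smoke test every host directly as it is deployed, rather than only the first host and the Load Balancer. The sketch below is an illustration under assumptions, not the team's actual tooling: `rolling_deploy`, `deploy`, and `smoke_test` are hypothetical names, and the two callables stand in for whatever deployment and health-check steps a real pipeline uses.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> List[str]:
    """Deploy host by host, smoke testing each host before moving on.

    Halts at the first host whose smoke test fails, so a single corrupt
    deployment is caught immediately instead of hiding behind the
    Load Balancer. Returns the hosts that were verified healthy.
    """
    healthy: List[str] = []
    for host in hosts:
        deploy(host)
        if not smoke_test(host):
            raise RuntimeError(f"smoke test failed on {host}; halting rollout")
        healthy.append(host)
    return healthy
```

Testing each host individually costs a little extra time per release, but with 30 app servers a check at the Load Balancer level can easily pass while one host is serving errors.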

While both incidents were resolved quickly, they made us realize that had we been more proactive in identifying the gaps in the system, we could have averted the obvious ones. This led to the conception of a robust quality framework that would improve reliability and availability. An additional aim of the framework was to shorten the delivery time of high-quality products by reducing the time spent on operations.

The goals of the framework

Figure: goals of the framework

Elements of the framework

Figure: elements of the framework

The Conclusion

Over time, once we had rigorously followed the above standards and best practices, we saw considerable differences in our promotions platform, as noted below:

Figure: results observed on the promotions platform

Ultimately, engineers are now happier, as they spend fewer stressful days and nights attending to alerts and outages, which allows them to be more innovative. Internal and external customers are also delighted with new products, such as “Click to Apply” promotions placed much further up the purchase funnel, and the automated creation of promotion campaigns.

Groupon Product and Engineering

All things technology from Groupon staff
