A/B Testing 101: Building a Robust Testing Framework for Home Improvement Giants

Monis Khan
May 26, 2024


A Leading Home Improvement Retailer, Who Shall Remain Anonymous

When I started working with Mu-Sigma, my first project involved building an A/B testing framework. While I can’t disclose the client’s identity, I can say it was one of the world’s largest home improvement retailers. They regularly conducted events (campaigns and other changes) and were eager to understand their impact. Their existing method involved testing these events in a few select stores, chosen to be representative of stores nationwide. If these test stores showed improved performance post-event, the initiative was rolled out nationwide.

However, the pre-post analysis they used had its limitations. External factors like holidays, weather fluctuations, or news coverage could artificially boost sales. In such cases, the pre-post analysis would misleadingly attribute the sales jump to the event, leading to disappointing nationwide rollouts. Therefore, they approached us for a better method.

We proposed selecting a set of control stores for each test store. These control stores would be similar to the test store in both store characteristics and historical performance but would not implement the event. By comparing the performance of the test stores to their controls during the event period, we could confidently attribute any differences to the event itself. This approach also gave us greater confidence that the effect would replicate at a national scale.

The next challenge was selecting these control stores. After much deliberation, we settled on a four-step funnel approach to find the stores most similar to the test stores.

  1. Geographical Vicinity
  2. Demographics and Store Attributes
  3. Correlation
  4. Euclidean Distance-based Similarity

Here’s a detailed description of the steps taken:

Data Ingestion & Preparation

We filtered to the latest 52 weeks of data for both the test stores and the control pool, and excluded new stores (those less than a year old) from the control pool, as sketched below.
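To make this concrete, here’s a minimal sketch in pandas. The file names (daily_sales.csv, stores.csv) and columns (store_id, date, units_sold, open_date) are hypothetical stand-ins for the client’s actual data:

```python
import pandas as pd

# Hypothetical inputs: one row per store per day, plus a store master table.
sales = pd.read_csv("daily_sales.csv", parse_dates=["date"])
stores = pd.read_csv("stores.csv", parse_dates=["open_date"])

# Keep only the latest 52 weeks of history.
cutoff = sales["date"].max() - pd.Timedelta(weeks=52)
sales = sales[sales["date"] > cutoff]

# Drop stores less than a year old from the candidate data
# (test stores are assumed to be mature).
one_year_ago = sales["date"].max() - pd.DateOffset(years=1)
mature_ids = stores.loc[stores["open_date"] <= one_year_ago, "store_id"]
sales = sales[sales["store_id"].isin(mature_ids)]
```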

Data Cleaning

The data was aggregated at the store-date level. Missing values were replaced by the mean of the lag and lead values. Outliers were detected and treated using z-scores and business insights.
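Continuing the sketch, the imputation and outlier treatment might look like this. The z-score threshold of 3 is an assumed value; the actual cutoffs were informed by business insights:

```python
import pandas as pd

def clean_store_series(s: pd.Series, z_thresh: float = 3.0) -> pd.Series:
    # Fill a missing value with the mean of its lag and lead neighbours.
    s = s.fillna((s.shift(1) + s.shift(-1)) / 2)
    # Flag outliers via z-scores and re-impute them the same way.
    z = (s - s.mean()) / s.std()
    s = s.mask(z.abs() > z_thresh)
    return s.fillna((s.shift(1) + s.shift(-1)) / 2)

sales = sales.sort_values(["store_id", "date"])
sales["units_sold"] = (
    sales.groupby("store_id")["units_sold"].transform(clean_store_series)
)
```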

Scaling

The daily sales data of test and control stores, measured in units sold, were scaled using standard scaling methods.
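As a sketch, continuing from the cleaned frame above (scaling per store, rather than globally, is an assumption on my part; it puts high- and low-volume stores on the same footing):

```python
# Standard-scale each store's series to zero mean and unit variance,
# so stores of very different sizes become comparable.
sales["units_scaled"] = (
    sales.groupby("store_id")["units_sold"]
         .transform(lambda s: (s - s.mean()) / s.std())
)
```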

De-Noising

The scaled data was de-noised using a moving average. De-noising was essential because noise could diminish the effectiveness of both correlation-based and Euclidean distance similarity matching methods.
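A centred moving average is straightforward in pandas; the 7-day window below is an assumed choice, picked to absorb day-of-week fluctuations:

```python
# Smooth each store's scaled series with a centred 7-day moving average.
sales["units_smooth"] = (
    sales.groupby("store_id")["units_scaled"]
         .transform(lambda s: s.rolling(7, center=True, min_periods=1).mean())
)
```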

Filtration Steps

Geographical Vicinity

Our client had divided the US into 19 geographical sub-regions. We ensured that a test store from a given sub-region had a control pool from the same sub-region.
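As a sketch (the sub_region column is a hypothetical stand-in for however the client encoded its 19 sub-regions):

```python
import pandas as pd

def regional_pool(test_id: str, stores: pd.DataFrame) -> list:
    # Limit the candidate pool to stores in the test store's own sub-region.
    region = stores.loc[stores["store_id"] == test_id, "sub_region"].iloc[0]
    mask = (stores["sub_region"] == region) & (stores["store_id"] != test_id)
    return stores.loc[mask, "store_id"].tolist()
```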

Demographics

For a given test store, we chose the top 100 stores most similar in terms of store attributes like size, age, and type, as well as customer demographics.
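One way to implement this, assuming numeric attribute columns (the names below are hypothetical), is a Euclidean nearest-neighbour search in standard-scaled attribute space:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical attribute columns; the real project combined store
# attributes (size, age, type) with customer demographics.
ATTRS = ["size_sqft", "age_years", "median_income", "population"]

def top_by_attributes(test_id, pool_ids, stores, k=100):
    frame = stores.set_index("store_id")
    X = pd.DataFrame(StandardScaler().fit_transform(frame[ATTRS]),
                     index=frame.index)
    # Euclidean distance in scaled attribute space: test store vs. the pool.
    dists = np.linalg.norm(X.loc[pool_ids].values - X.loc[test_id].values,
                           axis=1)
    return pd.Series(dists, index=pool_ids).nsmallest(k).index.tolist()
```

A categorical attribute like store type would need one-hot encoding before scaling; I’ve used only numeric columns here to keep the sketch short.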

Correlation

From these top 100 stores, we selected the top 30 with the highest correlation in sales, measured using Pearson’s correlation coefficient.
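Continuing the sketch, pandas gives Pearson correlation out of the box once the smoothed series are pivoted into a date-by-store matrix:

```python
import pandas as pd

# Pivot the smoothed series into a date × store matrix.
wide = sales.pivot_table(index="date", columns="store_id",
                         values="units_smooth")

def top_by_correlation(test_id, candidate_ids, wide, k=30):
    # Pearson is pandas' default correlation method.
    corr = wide[candidate_ids].corrwith(wide[test_id])
    return corr.nlargest(k).index.tolist()
```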

Euclidean Distance

We de-trended the data from the top 30 candidates and the test store, then calculated the Euclidean distance between each candidate and the test store. The 10 stores with the smallest Euclidean distances were chosen as our control stores.
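Here’s a sketch of this final step, plus the four funnel steps chained together. Removing a fitted linear trend is one assumed way to de-trend; the original project may have used another:

```python
import numpy as np
import pandas as pd

def detrend(s: pd.Series) -> np.ndarray:
    # Remove a fitted linear trend from the series.
    t = np.arange(len(s))
    slope, intercept = np.polyfit(t, s.values, 1)
    return s.values - (slope * t + intercept)

def top_by_euclidean(test_id, candidate_ids, wide, k=10):
    # Assumes the cleaned series are complete and date-aligned.
    test = detrend(wide[test_id])
    dists = {c: np.linalg.norm(detrend(wide[c]) - test)
             for c in candidate_ids}
    return pd.Series(dists).nsmallest(k).index.tolist()

# Chaining the four funnel steps for one hypothetical test store ID:
pool = regional_pool("T-001", stores)
top_100 = top_by_attributes("T-001", pool, stores)
top_30 = top_by_correlation("T-001", top_100, wide)
controls = top_by_euclidean("T-001", top_30, wide)
```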

This meticulous approach ensured that the control stores closely mirrored the test stores, making the analysis robust and reliable. The result was a much more accurate measure of the event’s impact, giving the client confidence to roll out successful initiatives nationwide.
