What we have learned about AB testing from our APP catalog page redesign

Rolling out a new version of the catalog design in the app might look like an easy win from the outside. But it took us a year to confirm that our vision was adding value for our customers.

Lucas Von Dorpp
Vestiaire Connected
9 min read · Feb 17, 2023


Co-written with Nensi Kopellaj and Adriana Colomer.

Catalog listing pages | Vestiaire Collective app

The Vision

A year ago, the product design team realized that our catalog listing page (PLP) experience was cluttered and heavier than that of any of our e-commerce competitors.

Showing a lot of detail on our product tiles could be helpful for some customers, but it also diminished the appeal and inspirational mission of the PLP. Some of our competitors among fashion marketplaces were able to display 4 to 6 items per screen view, while we only displayed 2. It was also clear that we were focusing too much on text attributes rather than product visuals.

This was when the Catalog Product Team realized that there was an opportunity to give our users a more inspiring and attractive catalog while displaying more products per screen. This is also where the fun began 🥳.

In this article, we’ll go over all the challenges we faced, how we overcame them, the final AB test results, and what we learned along the way.

Challenges

Challenge #1 — We had too many open questions

When the project was pitched to stakeholders, the initiative was supported by solid data proxies, contrasting benchmarks, and insightful user interviews. The solution was also seen as not too difficult to implement: a quick, easy win!

By focusing on the information most relevant to users, we identified a few pieces of information that could be removed (seller badge, like counter, shipping option) or optimized (moving the like button and hiding filters when scrolling). We then saw an opportunity to show 4 full items per screen view instead of 2. We were confident that, since our catalog contains millions of unique items, allowing users to browse products faster would positively impact our catalog’s business KPIs.

Catalog listing pages: our starting point and end vision | Vestiaire Collective app

Yet, even though user interviews told us that the attributes we planned to remove were not considered essential, the redesign raised many questions. Removing all the attributes at once was considered too risky: What if it looks good overall, but we missed a key change in user behavior? What if customers like the fact that we removed attribute A from the product tile, but not attribute B?

To answer these questions and many others, we decided to AB test removing one attribute at a time (4 in total) with the idea of answering them all later.

Challenge #2 — We expected too much

In the first 4 AB tests, the expectation was that by showing more product tiles per screen view, along with relevant information on each tile, users would click on more products, but also that the change would generate more likes, more orders, more of everything.

By choosing a long list of KPIs that had to be statistically significant, we set the bar very high for the new design. This caused some test iterations to fail simply because they could not deliver such a broad impact.

Like any other feature, the new design could only impact a certain number of KPIs closely related to its place in the funnel. As the new design stood in the first part of the funnel, it could affect product clicks, product impressions, filters, or other interactions that could be made directly from the PLP.

Its impact on orders further down the funnel would always be more marginal or invisible, even if the new design turned out to be better for users on the PLP.

Challenge #3 — We did not spend enough time planning the test

The first step in the planning process should be to choose a clear, simple hypothesis to test and to link your success KPI closely to that hypothesis. In our first AB tests, we got lost in a list of questions to be answered after the test was launched.

The second important thing to do during test preparation is to list the steps of the funnel that you do not want to negatively impact. For example, when launching the test for the new design, we did not want it to negatively impact orders, orders from the PLP, or GMV.

The final step after defining the hypotheses and KPIs is to preemptively discuss the possible outcomes of the test and the steps to be taken for each. Overall, there are four major combinations of outcomes to plan for (a minimal sketch of such a plan follows the list):

  • Positive impact on success KPIs, no negative impact on guardrails.
  • No impact on success KPIs, no negative impact on guardrails.
  • Positive impact on success KPIs, negative impact on guardrails.
  • Negative impact on success KPIs, negative impact on guardrails.
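
To make this step concrete, here is a minimal sketch of what such a pre-agreed plan could look like in code. The hypothesis, KPI names, and decisions are illustrative placeholders, not our actual configuration.

```python
# Illustrative pre-launch test plan; metric names and decisions are
# placeholders, not Vestiaire Collective's real configuration.

TEST_PLAN = {
    "hypothesis": "Users engage more with a decluttered PLP",
    "success_kpi": "share of PLP users with at least one product click",
    "guardrail_kpis": ["orders", "orders from PLP", "GMV"],
}

def decide(success_impact: str, guardrail_breached: bool) -> str:
    """Map the outcome combinations listed above to a pre-agreed decision.

    success_impact is one of "positive", "flat" or "negative".
    """
    if success_impact == "positive" and not guardrail_breached:
        return "Ship the new design"
    if success_impact == "flat" and not guardrail_breached:
        return "Keep the control, iterate on the design"
    if success_impact == "positive" and guardrail_breached:
        return "Deep dive on the guardrail before deciding"
    return "Roll back and rethink the feature"

if __name__ == "__main__":
    print(decide("positive", False))  # -> Ship the new design
    print(decide("negative", True))   # -> Roll back and rethink the feature
```

Agreeing on this mapping before launch is what prevents the “let’s look at one more KPI” spiral once results start coming in.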

Challenge #4 — We tested variations that were too small

The first few iterations of the PLP test were done on minimal variations of the product tile design. Removing one attribute of the product tile at a time (e.g., removing the seller badge) proved time-consuming for the data analysts, as the impact observed in the results varied from test to test.


After going through the deep dive analysis exercise, we realized that because the design changes were small, their impact was difficult to detect and quantify. We were effectively close to running an A/A test, and the results were closer to randomness than to a true impact of the change.
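
A quick power calculation before launch makes this pitfall visible: the smaller the expected lift, the more users the test needs before it can separate signal from noise. Below is a rough sketch using statsmodels; the baseline click rate and lift values are invented for illustration and are not our real numbers.

```python
# Rough pre-test sample-size check: how many users per variation are needed
# to detect a given relative lift on a conversion-style KPI?
# The baseline rate and lift values are invented for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.30  # assumed share of PLP users with at least one product click
power_calc = NormalIndPower()

for relative_lift in (0.01, 0.02, 0.05):  # +1%, +2%, +5% relative lift
    treated = baseline * (1 + relative_lift)
    effect = proportion_effectsize(treated, baseline)  # Cohen's h
    n = power_calc.solve_power(effect_size=effect, alpha=0.05,
                               power=0.8, alternative="two-sided")
    print(f"+{relative_lift:.0%} lift: ~{int(n):,} users per variation")
```

A change so small that it would need months of traffic to reach significance is usually a sign to test a bigger increment instead.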

Challenge #5 — Our choice of count metrics biased the AB test results

In GrowthBook, our preferred tool for AB test analysis, every new KPI created to measure the results of an experiment must be assigned a metric type. We primarily use the following types:

  • Count metrics are those that measure multiple conversions per user, such as page views per user.
  • Binomial metrics are a simple yes/no conversion, such as users who viewed at least one page.

One of the biggest limitations of count metrics is that they can include outliers (and/or heavy users), and these can skew the results of our test, leaving them highly biased.

In the early stages of the PLP test, we observed a large volume of extreme values in metrics such as Catalog Views and UPPVs (Unique Product Page Views) coming from non-logged traffic. Most of the metrics were of the count type. Therefore, it took a lot of work to evaluate whether large outliers had an outsized impact on our experiment results.

The most common solution to this limitation is to use a capped value for count metrics. This means that if our normal volume of page views per user is 20, everything above that value is capped at 20. Outliers are accounted for, but their impact on the results is mitigated.

An alternative is to create the binomial version of the metric. Since binomial metrics are a binary yes/no KPI, they don’t have this limitation, and extreme values are eliminated entirely.
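
As a rough illustration of both workarounds, here is what capping and the binomial conversion look like on raw event data. The DataFrame, column names, and the cap value are assumptions made for this example; in GrowthBook, capping can typically be configured on the metric itself.

```python
# Illustration of the two workarounds for outlier-heavy count metrics.
# Data, column names, and the cap value are assumptions for this example.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3] + [4] * 500,  # user 4 is an extreme outlier
    "event": ["product_page_view"] * 506,
})

views_per_user = events.groupby("user_id").size()

# Capped count metric: the outlier still counts, but only up to the cap.
CAP = 20
capped_views = views_per_user.clip(upper=CAP)

# Binomial metric: did the user view at least one product page (yes/no)?
viewed_at_least_one = (views_per_user >= 1).astype(int)

print(f"raw mean views per user:    {views_per_user.mean():.1f}")
print(f"capped mean views per user: {capped_views.mean():.1f}")
print(f"share with >= 1 view:       {viewed_at_least_one.mean():.2f}")
```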

Challenge #6 — Too many deep dives that didn’t yield valuable insights

When our selection of KPIs shows performance that is contrary to our expectations, we naturally want to understand why this is happening. To do this, we should follow a plan. Because we didn’t have one, we were drawn into a never-ending spiral of questions instead of finding answers to our initial concerns. We should only conduct deep dives when the results show a real difference between control and treatment. If that’s the case, here’s what we should do:

  • Make sure our KPIs are close to the new feature: the closer they are, the more directly the results will relate to the feature, and the higher the chance of observing significance.
  • Break down the results into different dimensions to understand if a region, user segment, or device is performing better/worse and can explain our AB test results (see the sketch after this list).
  • Recognize that the feature has a very small impact and therefore doesn’t really make a difference. In this case, we should rethink the feature itself, work on applying some changes, and test again in the next iteration.
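
For the dimensional breakdown, a simple pivot of the per-variation results is usually enough to spot a segment that behaves differently. The table below is a hypothetical export of experiment results; the dimension, column names, and numbers are invented.

```python
# Hypothetical breakdown of experiment results by one dimension (device).
# Column names and numbers are invented for illustration.
import pandas as pd

results = pd.DataFrame({
    "variation": ["control", "treatment"] * 3,
    "device": ["ios", "ios", "android", "android", "web", "web"],
    "users": [50_000, 50_100, 40_000, 39_900, 10_000, 10_050],
    "users_clicked": [15_000, 15_900, 11_800, 12_300, 2_900, 2_850],
})

results["click_rate"] = results["users_clicked"] / results["users"]

pivot = results.pivot(index="device", columns="variation", values="click_rate")
pivot["relative_lift"] = pivot["treatment"] / pivot["control"] - 1
print(pivot.round(4))
```

If one segment drives the whole effect (or cancels it out), that is usually the answer to why the topline KPI behaved unexpectedly.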

Solutions

By breaking this initiative into too many little chunks and looking at too many details, we were missing the big picture: the initial hypothesis that users would be more engaged with a decluttered PLP. Removing one attribute at a time was not adding value for our customers. Showing a cleaner, more attractive catalog page might.

We then switched to a simpler, more straightforward AB test strategy, addressing each of the pain points mentioned above in turn.

  • Focus on answering our initial question: Will users engage more with a decluttered PLP?
  • Let “percentage of users with at least one product click” be our main success metric, with a KPI breakdown to understand the share of users who engage a little more versus those who engage much more (see the sketch after this list). Consider all other KPIs as control KPIs.
  • Spend enough time planning the test, and approve the A/B test strategy with everyone involved in the project before starting.
  • Test a large enough increment by showing all changes at once to make sure we can analyze the impact, or limit the test users to those who are likely to notice minimal changes.
  • Make sure to control for outliers by looking at our KPI with a binomial approach: we looked at “at least one” instead of “all”.
  • Commit to the timeframe. As we targeted two weeks to make a decision, all additional questions were to be listed and addressed later in an ad hoc analysis.
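
To show what the success metric and its breakdown mean in practice, here is a minimal sketch on per-user click counts. The data, column names, and bucket boundaries are illustrative, not our production definitions.

```python
# Sketch of the main success metric and its breakdown; data, column names,
# and bucket boundaries are illustrative only.
import pandas as pd

clicks = pd.DataFrame({
    "user_id": range(8),
    "variation": ["control"] * 4 + ["treatment"] * 4,
    "product_clicks": [0, 2, 7, 20, 1, 4, 12, 30],
})

# Main success metric: share of PLP users with at least one product click.
success = (clicks.assign(clicked=clicks["product_clicks"] >= 1)
                 .groupby("variation")["clicked"].mean())

# Breakdown: how many users fall into each engagement bucket per variation.
buckets = pd.cut(clicks["product_clicks"],
                 bins=[0, 1, 5, 15, float("inf")], right=False,
                 labels=["0 clicks", "1-5", "5-15", "15+"])
breakdown = (clicks.assign(bucket=buckets)
                   .groupby(["variation", "bucket"], observed=False)["user_id"]
                   .count())

print(success)
print(breakdown)
```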

Results

With a strong success metric and the appropriate breakdown, we were able to easily validate that users were more engaged with the tested version.

Success metric

  • PLP users with at least one product click: +

Success metric breakdown

  • Average number of product clicks per PLP user: +++
  • PLP users with 1–5 product clicks: +
  • PLP users with 5–15 product clicks: ++
  • PLP users with 15+ product clicks: ++++

Most control KPIs showed a positive trend, with the exception of filter usage. This would have been considered a no-go in the previous AB tests, but was now treated as a to-do for a later ad hoc analysis.

  • ATC (add-to-cart), orders, likes and impressions… were also in the green!
  • Filter usage was slightly down.
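
GrowthBook did the statistical heavy lifting for us, but for readers who want to sanity-check a binomial success metric outside the tool, a two-proportion z-test is a reasonable stand-in. The counts below are made up and do not reflect our real traffic or results.

```python
# Illustrative significance check for a binomial success metric
# ("users with at least one product click"); the numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

users = [120_000, 119_500]      # control, treatment
clickers = [36_000, 37_300]     # users with at least one product click

stat, p_value = proportions_ztest(count=clickers, nobs=users)
print(f"conversion control:   {clickers[0] / users[0]:.3f}")
print(f"conversion treatment: {clickers[1] / users[1]:.3f}")
print(f"z = {stat:.2f}, p-value = {p_value:.4f}")
```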

We then shipped the end goal vision to 100% of our app users!

Do’s & Don’ts

Feel free to try these tips to improve your future AB testing experiments.


Do’s ✅

  1. Focus only on answering your original question.
  2. Find the one success metric that addresses your initial question.
  3. Allow enough time to plan & track your A/B testing.
  4. Test a big enough increment. You don’t want to spend time analyzing something that doesn’t have enough impact. Otherwise, limit the users in the test to those who could notice the impact (e.g., top users).
  5. Make sure you can remove the outliers from the KPI you are looking at.
  6. Share a clear timeline for the next steps in an A/B test.
  7. When making a final decision, determine whether your A/B test will require another similar test in the future.
  8. Perform an AB quality assessment when the experiment is over.

Don’ts ❌

  1. List multiple success metrics.
  2. Have multiple hypotheses to test for a feature.
  3. Choose a KPI that is not directly impacted by your feature, even if it looks like the easiest solution to communicate to your stakeholders.
  4. Come up with a waterfall KPI strategy (i.e., “let’s see if orders increase, if not, let’s look at ATC, if not, let’s look at clicks, if not…”).
  5. Assume you’ll find the most insightful KPIs after the fact (i.e., “let’s track everything and see what’s improving or declining to make a decision”).
