One sunny fall day you walk into Nordstrom on a mission for boots. Before long, you’re greeted by a salesperson who notices you’re browsing boots, and asks whether you’re finding everything alright. You say yes, thank you, and continue browsing until the salesperson happens to catch your eye again. You suddenly notice they’ve rearranged all the nearby displays to include every boot you’ve considered since walking through the door, mixed with several similar boots you hadn’t noticed yet, plus newly arrived sweaters in the same brand you bought last weekend.
This scenario might be on the extreme side of customer service, even for Nordstrom. But it’s become the standard for online shopping, where personalized recommendations now function as an extension of search and site navigation. The mission of the Personalization team at Nordstrom is to support this basic functionality and continuously find new ways to improve it. We do this by constantly exploring and questioning different approaches to anticipating our customers’ wants and needs.
More concretely, we support the systems that generate recommendations (including the models and data pipelines) as well as the services that provide a renderable set of results for the web, apps, and email. We build and support an evolving assortment of algorithms that do everything from simply reminding you of what you’ve been browsing, to recommending products and brands using a combination of offline models and near-real-time personalization.
But how do we know if what we’re doing is actually having a positive impact? That’s where online experimentation comes in. It’s not a new concept, and it’s one that the Personalization team has used heavily since its inception to optimize our algorithms and other customer-facing features. Experimentation — specifically A/B testing — lets us focus on building what really matters to customers. It’s also fun and gratifying to put your work out into the wild and see how real people respond to it!
There are already many good resources available that cover the foundations of experimentation (a few links are included at the end of this article), so here we focus on some important questions we ask ourselves when designing and building a test. This isn’t an exhaustive technical ‘how-to’ but more of a collection of interesting rabbit holes and potential blind spots we’ve encountered along the way, as a team that includes a mix of software engineers, data scientists, and product managers. This article covers some things that may apply during the ‘before’ stages of an online experiment, and in future blog posts we’ll move on to the ‘during’ and ‘after’ stages. Some of these questions will apply to multiple stages of the process.
How does an online experiment work? Revisiting the definition.
At the core of an experiment is a hypothesis that clarifies the key assumptions behind what we’re testing, and what we expect to learn. We start from the null hypothesis — that our ‘treatment’ makes no difference directly attributable to the new experience we’ve created. Then we collect and compare data between randomly assigned groups, using statistical tools to judge how likely the observed difference would be if the null hypothesis were true, and therefore whether we can reject it.
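As a toy illustration of that comparison (with made-up numbers, not real Nordstrom data), a common tool is a two-proportion z-test on conversion rates between the control and treatment groups. A minimal pure-Python sketch:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z_statistic, p_value). A small p-value means the observed
    difference would be unlikely if the null hypothesis (no difference)
    were true.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 520 conversions out of 10,000 customers in control,
# 610 out of 10,000 in treatment.
z, p = two_proportion_z_test(520, 10_000, 610, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) would lead us to reject the null hypothesis. In practice we lean on our experimentation platform and statisticians rather than hand-rolled tests, but the underlying comparison looks much like this.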
If that all seems like a convoluted way to describe “testing if it works or not,” it’s because we tend to respect the testing practices that originated in medicine and other empirical sciences. Although the consequences of ‘freestyle experimentation’ (or misinterpreting a p-value) might not be as disastrous with a recommendations algorithm as with a new drug, many of the same best practices still apply. After all, what we’re really attempting to do is study human behavior, which comes loaded with pitfalls and cognitive biases. So we may as well be extra careful if we want to come to accurate conclusions.
Experimentation as a shared responsibility
In Personalization, both engineering and data science take an active role in this process by collaborating directly with the business: mapping out solutions to known customer problems together, informing and adapting around our technical constraints, and pushing to better understand the “Why?”, which helps us better match expectations (and also helps to refine the hypothesis).
But even with a solid hypothesis and a well-defined test, there’s a lot more to consider when running a successful experiment. Success doesn’t just mean moving the needle up or down — it also means knowing we asked the right questions. Time is a resource with a hard limit, and it can take a lot of it to gather enough data from an experiment to form any kind of conclusion.
Have you seen it through the customers’ eyes?
The first question to ask yourself is: Would you appreciate this new feature showing up unannounced? Would it be a pleasant surprise, or would it get in your way? This is where user studies and other research come in handy — and thanks to our UX team, we don’t have to answer these questions on our own. But it’s still important for us to consider the impact of what we’re building. This is especially true when the changes we’re testing are more subtle and contextual, such as algorithm improvements. For example, we might ask ourselves:
- Will we recommend a lot of jeans in an overly limited size range if we make a similarity model more accurate to a product’s cut or style, vs. what customers naturally browse together?
- If we over-optimize for recent browsing patterns, will we recommend too many Gucci sneakers for any one human to own? In other words, are we paying too much attention to what you happened to click on a few times out of curiosity, and how can we tell the difference or mitigate that situation so it doesn’t get annoying over time?
- If we over-filter on gendered product tags, are we really helping customers discover what speaks to them, or are we limiting their options?
The effects of these changes might not always be detected by the metrics we’re measuring directly in an experiment, but it’s essential to get a feel for what it would really be like to use the feature you’re building.
Is it different enough to make a difference? And which version(s) should you test?
Along with thinking about how the test might affect customers outside of the ‘global’ signals, you also want to know that what you’re testing is likely to make a detectable difference. This can be tricky to figure out, especially when it comes to algorithm improvements, which won’t affect the whole population in the same way. Some products will be more popular and have more data, while some will be brand new and have no data at all.
Add to that the uncertainty of how customers will respond to seemingly tiny adjustments, and you might wonder if throwing a dart is a better approach. We recommend thinking a little harder than that about what your customers are known to care about, and to use whatever analytical resources you have to make an educated guess.
To this end, my team is working on expanding our tools for doing offline evaluation, which is one way to help ourselves determine which changes might present the most opportunity to make a measurable, positive difference. This deserves its own article in the future. For now, we do have some means of understanding how different our experiences will actually be, mostly by comparing the outputs of multiple algorithms side by side.
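As a simple illustration of that side-by-side comparison, one coarse signal is how much two algorithms’ top-k recommendation lists overlap. The product IDs below are hypothetical:

```python
def jaccard_overlap(recs_a, recs_b):
    """Jaccard similarity between two recommendation lists, treated as sets."""
    a, b = set(recs_a), set(recs_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top-5 results from two similarity models for the same product.
current = ["boot-101", "boot-205", "boot-317", "boot-402", "boot-550"]
candidate = ["boot-101", "boot-317", "boot-777", "boot-888", "boot-550"]

print(jaccard_overlap(current, candidate))  # 3 shared items out of 7 total
```

Very high overlap suggests the change may be too subtle to detect online; very low overlap is a prompt for a closer qualitative look before committing test time.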
Do you know the scope of the technical debt you’re incurring, and how/when it will be addressed?
Major tech debt can be the curse of a winning experiment. To avoid paying too much up front for an experiment, you’ll probably incur some kind of technical debt. Its nature and size can vary from “we need to clean up this code” to “we should make this work faster” to “we need to trash this and rewrite it completely.”
Some key steps here are:
- Knowing what that debt is — i.e. recognizing it and being realistic about its scope. Ideally this should be a collaborative exercise. Different people will have blind spots or biases around how much work something will be to fix or optimize down the road.
- Communicating the costs and implications clearly to your stakeholders. This requires understanding the customer impact of the debt, both internal and external. For example, will this new experience make certain pages load more slowly? Will it require running a new ETL process that adds operational overhead for the team when it breaks? Be considerate to your future selves.
- Planning ahead about how and when you will pay it down if your experimental feature shows positive results. It’s like any other kind of debt: you probably can’t buy a house or go to college without it, but you have to start with a plan for how you can realistically manage it over time. Uncertainty about the outcomes of experiments makes it hard to plan ahead, so having some way of documenting those “if/then” decisions when planning is essential.
- And finally, considering how future refactoring may affect the validity of your original results, and whether or not you will need to plan for another test. If you run an experiment with code or a data source that will be fundamentally altered, further testing is required to validate the change.
If you decide against rolling out a feature, it becomes a cleanup problem, where the code might not be easy to detangle from the rest of your app. One concrete way we try to reduce the footprint of uncertain code is to rely on feature flags that live in a separate location from production settings. The goal is to avoid mixing the settings that might get deleted with the ones that are safe to reference elsewhere. We effectively quarantine experimental settings until we know they’re permanent.
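A minimal sketch of that quarantine idea, with hypothetical module and flag names:

```python
# production_settings.py -- stable configuration, safe to reference anywhere.
PRODUCTION = {
    "recs_per_row": 12,
    "cache_ttl_seconds": 300,
}

# experiment_flags.py -- quarantined settings. Anything here may be deleted
# when the experiment concludes, so no code outside the experiment itself
# should reference this module.
EXPERIMENT_FLAGS = {
    "use_realtime_reranker": False,  # flipped to True for the treatment group
}

def get_flag(name, default=False):
    """Look up an experimental flag, falling back to a safe default."""
    return EXPERIMENT_FLAGS.get(name, default)

if get_flag("use_realtime_reranker"):
    ...  # treatment experience
else:
    ...  # control experience
```

Because nothing outside experiment code imports from the quarantined module, deleting a failed experiment’s flags can’t break unrelated settings lookups.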
What will the weather be like? I.e. what variables outside of your control might change during your experiment?
When we design an experiment, we intend to observe the effect of changing an experience (x) on a measurable outcome (y). The experience we’re controlling is the independent variable, and the measurable outcome is the dependent variable, assuming a possibility of cause and effect between the two.
In reality, we can’t keep everything else constant. The literal and metaphorical ‘environmental conditions’ based on the state of the world today are guaranteed to be different tomorrow, so it helps to be specific about our assumptions related to timing and other context.
Some things we’ve had to consider include:
- Is there a sale event happening that will overlap with the experiment?
- What other experiments are other teams planning to run at the same time?
- How might the content of the product catalog change? Will some product categories be getting more new or replenished items than others?
We can’t predict precisely how these will affect the results, but we try to remain aware and adjust our plans accordingly. What we choose to adjust depends on what we are trying to learn. Some predictable seasonal events might give us a unique opportunity to test what works and what doesn’t in those scenarios. At the very least, it makes sense to consider these variables in the post-test analysis.
Who do you have in mind? Will there be enough data?
It helps to think about which customers your experiment will serve when designing it. We often target a subset of customers by regional location or product category to get a clearer signal regarding what works in a specific context. In other words, we try to only expose customers to experiences that we think will be relevant to them. Another big consideration is new vs. repeat customers. Sometimes we improve the experience more for new customers than existing ones, and if that’s a recurring theme, then we may want the ability to explicitly target new customers.
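For the random assignment itself, one common approach (sketched here with hypothetical names, not our actual platform) is to hash a stable customer ID together with the experiment name, so a customer sees a consistent experience across visits and assignments stay independent across experiments:

```python
import hashlib

def assign_variant(customer_id, experiment_name,
                   variants=("control", "treatment")):
    """Deterministically assign a customer to a variant.

    Hashing the ID together with the experiment name keeps assignment
    stable across visits while decorrelating it between experiments.
    """
    key = f"{experiment_name}:{customer_id}".encode()
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("customer-42", "boot-recs-v2"))
```

Salting with the experiment name matters: reusing the same buckets across experiments would correlate their treatment groups and muddy both results.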
Pre-segmenting the population also affects how much data we can realistically collect in a given time frame. When we’re specific about which population is exposed to a treatment, we may not get enough data to reach an accurate conclusion within two weeks, or even a month. The details of determining appropriate sample sizes are outside the scope of this article, but you should ask your friendly neighborhood statistician to help you know what to expect. It’s better to get a rough idea of a ‘sufficient sample size’ ahead of time than to find out after spending weeks of valuable testing time.
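For that rough idea, the standard two-proportion power calculation fits in a few lines (the baseline rate and lift below are invented for illustration):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect an absolute
    lift of `mde` over a baseline conversion rate `p_base`
    (two-sided test, standard two-proportion formula)."""
    p_alt = p_base + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a half-point lift on a 5% baseline conversion rate:
print(sample_size_per_group(0.05, 0.005))
```

Note that halving the detectable lift roughly quadruples the required sample, which is exactly why a narrowly targeted segment can turn a two-week test into a two-month one.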
Should you really run this as an A/B test?
A/B tests are not the only way to learn what works for customers. There are other valid approaches to exposing people to different experiences and acting upon the results. Examples include:
- Multi-armed bandits: systems that decide between multiple options by ‘exploring’ (exposing different options) and gradually favoring the best-performing ones. One way a bandit might do this is by randomly selecting between options and then reducing the randomness over time using a reward function.
- Interleaving: a method that some companies use as a ‘combined ranking’ approach, where ranking algorithms are mixed and measured as part of a single experience. It can be a faster way to determine the best candidates between various ranking algorithms, which can then lead to a traditional A/B test between those candidates.
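To make the bandit idea concrete, here is a toy epsilon-greedy simulation (the algorithm names and click-through rates are invented). It explores at random with a probability that decays over time, and otherwise exploits the arm with the best observed mean reward:

```python
import random

random.seed(7)  # fixed seed so the simulation is reproducible

# Hypothetical click-through rates for three candidate algorithms.
TRUE_CTR = {"algo_a": 0.02, "algo_b": 0.08, "algo_c": 0.03}

rewards = {arm: [] for arm in TRUE_CTR}

def choose_arm(step, min_epsilon=0.05, decay=0.001):
    """Epsilon-greedy: explore with a decaying probability,
    otherwise exploit the arm with the best observed mean reward."""
    epsilon = max(min_epsilon, 1.0 / (1.0 + decay * step))
    if random.random() < epsilon:
        return random.choice(list(rewards))  # explore
    return max(rewards,  # exploit (untried arms default to 0.0)
               key=lambda a: sum(rewards[a]) / len(rewards[a])
               if rewards[a] else 0.0)

for step in range(10_000):
    arm = choose_arm(step)
    clicked = random.random() < TRUE_CTR[arm]  # simulated customer response
    rewards[arm].append(1.0 if clicked else 0.0)

most_served = max(rewards, key=lambda a: len(rewards[a]))
print(most_served)  # the bandit should converge on the best-performing arm
```

Unlike a fixed-split A/B test, the bandit shifts traffic toward the winner while the “experiment” is still running, trading some statistical cleanliness for less exposure to losing variants.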
We won’t go deeper on the details of these approaches here, but I encourage you to explore them and understand why and when you might want to pursue them as alternatives or supplementary options to A/B tests. We experiment with both of these methods in Personalization, and hope to share more about our experiences with them in the future.
This is not an exhaustive list of experimental considerations. We encourage anyone interested to investigate other resources related to experimentation (see “Further Reading” below), and to form and share their own informed opinions based on personal experience and research. We also plan to share even more on this topic in the coming months — so stay tuned!
Further Reading

- https://exp-platform.com/ was a project headed by Ronny Kohavi at Microsoft and includes a lot of useful papers, articles, and talks about “accelerating software innovation through trustworthy experimentation.”
- https://conversionxl.com/blog/ab-testing-guide/ is a very comprehensive guide to all stages of an A/B test. It also includes its own list of resources for further reading at the end.
- https://www.statisticsdonewrong.com/ is a website and book that isn’t specifically about online experimentation, but includes a lot of common misconceptions that do apply in any type of results interpretation.
(Special thanks to Morgan Weaver for editorial assistance on this article)