How to Swiftly A/B Test Recommendations From the Inside, Rocket-Style
By Nelson Gomes — Software Engineer and Paulo Ventura — Senior Software Engineer — .NET
The main goal of any recommendations engine, particularly our own Inspire by FARFETCH, is to, well, inspire. To achieve that, we need to adapt and evolve our product (you can read all about that here!). We must make decisions based on data and understand how our customers’ behaviour changes and how they immerse themselves in this luxury experience.
We start the process of designing specific recommendations for a touchpoint by running experiments that provide insights for the next product iteration. This approach helps us measure the success of our hypotheses, enabling us to launch a new feature/update with more confidence, to increase the accuracy of our actions and to estimate the impact of the feature towards our mission as a recommendations engine.
So, we use A/B Testing to compare two versions of a feature: testing the user’s response to a variant A against variant B, commonly known as control and alternative, and determining which of these variants is more effective. In this process, we split users into two groups: one group experiences the original version without changes (control version), and the other group experiences the new feature/update under launch consideration (treatment or alternative version).
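The split itself is typically deterministic: hashing a stable user identifier decides which group the user falls into, so the same user always sees the same variant on every visit. Here is a minimal sketch of that idea (an illustration only, not FABS’s actual algorithm; the function name and hashing scheme are our own assumptions):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into 'control' or 'alternative'.

    Hashing the user id together with the experiment name keeps the
    assignment stable across requests, while keeping different
    experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "control" if bucket < split else "alternative"
```

Because the assignment is a pure function of the user and experiment, no per-user state needs to be stored to keep the experience consistent.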
Let’s say we want to start A/B Testing the best algorithm for a given touchpoint: simply test recommendation A against recommendation B. Simple, right?! Well… Actually, it’s a slow process that requires perfect alignment between multiple teams. Each test must be wired into the user touchpoints/use cases where it will run, and the experience teams that manage these touchpoints on the website, mobile apps, and email campaigns are separate from our recommendation engine team, and from each other.
Because we operate our Inspire engine as an independent, standalone service, we need to coordinate with these teams to request client-side development changes that enable our new A/B tests. This includes discussions about backlog prioritization, adjusting to different working methodologies across teams, and coordinating around different release windows. Given how our teams divide responsibilities in a large, complex organization, this can create a lot of friction to deliver value.
The challenge — Freedom!
Our challenge and our dream is to be autonomous and independent. Our response to this challenge was simple: we needed to take ownership of the process we depended on. We needed to:
- Establish the autonomy we need to A/B test internal features
- Easily and seamlessly start a new A/B test when needed
- Be touchpoint-agnostic, so we can A/B test every recommendation use case automatically (from the very beginning)
- Keep our awesome recommendations performance
- Improve our ability to adapt and move quickly to deliver more value
Beginning with Questions
At the beginning of our quest (to achieve freedom!), we raised the question, “What do we really need to space rocket the A/B testing process for our recommendations?”. Actually, that’s quite simple… we place an A/B Test engine inside Inspire. Our next valuable question was, “Should we build one from scratch?” Well, we don’t need to reinvent the wheel (Time-To-Market, remember?!).
Guess what?! FARFETCH has a core Experimentation Team, entirely dedicated to supporting our self-service experimentation needs across the company. Yes, we have an internal A/B testing engine with that expertise to support this need from the platform-side, called the Farfetch A/B Service (FABS)!
What about integrating Inspire with an A/B testing engine?
Generating amazing recommendations is our area of expertise. It’s where we excel! To apply A/B testing techniques that test and measure our impact on user engagement, it helps to engage our experimentation experts so that we can stay focused on the quality of our recommendations service.
So… let’s work together and do what’s never been done in a recommendations engine!
The integration of FABS will bring multiple advantages, as it provides a robust A/B testing API proven to hold up to our recommendations workload as the full recommendations provider for FARFETCH. FABS already integrates with FARFETCH’s core analytics tools, allowing us to practice continuous measurable improvement through our test metrics and measures. This also enables us to take greater safe-to-fail risks with new features and improvements.
Bringing FABS closer into the recommendations engine operations gives us freedom: freeing us to be agnostic about testing at any touchpoint. Now we just need to develop new hypotheses, configure FABS experiences and test it — with far fewer dependencies on external implementations and specific platforms.
Our time-to-market just received a major boost, rocket-style! Just get in and enjoy the ride.
For the consumer experience applications leveraging our recommendations engine, it’s a very smooth experience because they don’t need to care about whether we are A/B testing or not. We do everything on our own, we control everything, we manage their experiences consistently, and we are free to explore our hypotheses!
How did we do it? Here’s the winning recipe:
For each recommendation request, Inspire directly asks the FABS Service for all eligible experiments and their alternatives on behalf of the customer (we call this Predict). This ensures a consistent experience every time the user lands on the touchpoint.
With the prediction for the specific request in hand, Inspire reconfigures the recommendation setup to use the configuration of the selected alternative (Setup), registers the user’s participation with FABS as predicted (Participate) and generates the final recommendation (Recommend).
Predict
When the user encounters a recommendations touchpoint, we trigger a recommendation request to the engine. If this touchpoint is part of a running experiment, we predict the alternative the user would be bucketed into by calling the FABS service on the user’s behalf. Predict allows us to conveniently pre-fetch potential experimentation details while minimizing disruption to the user’s flow or experience.
This step is particularly important because it lets us consistently provide the same experience/alternative to the user across the entire FARFETCH platform, even when requests arrive through different applications or services.
Setup
This step configures the Inspire system based on the predicted alternative assigned to the user for a given touchpoint.
Participate
With the configuration loaded, we register the user’s participation in the experiment via a separate request to the FABS Service. This separate “participate” step is important because it registers the user as an active participant in the experiment, whereas Predict only anticipates a user’s potential participation before they reach the touchpoint. Our back-end analytics then follow the A/B testing process for later deep dives.
Recommend
Generate the final recommendation, polishing the diamond to inspire our customer.
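Putting the four steps together, the flow can be sketched as follows. This is a minimal sketch with hypothetical callables standing in for the FABS client and the recommendation engine; none of these names come from the real APIs:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Prediction:
    experiment: str
    alternative: str  # e.g. "control" or "B"

def recommend_with_ab_test(
    user_id: str,
    touchpoint: str,
    predict: Callable[[str, str], Optional[Prediction]],
    participate: Callable[[str, Prediction], None],
    configs: dict,
    recommend: Callable[[str, str, dict], list],
) -> list:
    """Sketch of the flow: Predict -> Setup -> Participate -> Recommend."""
    # 1. Predict: which alternative would the user be bucketed into?
    prediction = predict(user_id, touchpoint)

    # 2. Setup: pick the configuration for the predicted alternative,
    #    falling back to the default when no experiment is running.
    key = prediction.alternative if prediction else "default"
    config = configs.get(key, configs["default"])

    # 3. Participate: register the user as an active participant.
    if prediction:
        participate(user_id, prediction)

    # 4. Recommend: generate the final recommendation with that config.
    return recommend(user_id, touchpoint, config)
```

Keeping Predict and Participate as separate calls means a user is only counted as a participant once they actually reach the touchpoint.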
A picture is worth a thousand words!
This was the original recommendations flow with A/B Testing:
And this is our dream come true, the process that accelerates our A/B Test recommendations from the inside, rocket-style:
And what about performance and quality…?!
Performance is a major focus area within FARFETCH, and Inspire has been a great example of that. We improved the integration with the FABS API by avoiding multiple calls for the same user: instead of making a request for every experiment prediction, we now batch predictions in a single request, enabling us to cache all the potential experiments and chosen (i.e., predicted) variations for a user along their recommendations journey. Meanwhile, we continue to register user activation for each experiment with separate participation requests.
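The caching idea can be sketched like this (a hypothetical `fetch_all` callable stands in for the single batched FABS predict request, returning a mapping of experiment to predicted alternative; the class and its interface are our own assumptions):

```python
from typing import Callable, Dict, Optional

class CachedPredictions:
    """Batch-fetch all predictions for a user once, then serve from cache."""

    def __init__(self, fetch_all: Callable[[str], Dict[str, str]]):
        self._fetch_all = fetch_all
        self._cache: Dict[str, Dict[str, str]] = {}  # user -> {experiment: alternative}

    def predict(self, user_id: str, experiment: str) -> Optional[str]:
        if user_id not in self._cache:
            # One request covers every eligible experiment for the user.
            self._cache[user_id] = self._fetch_all(user_id)
        return self._cache[user_id].get(experiment)
```

A production version would also need cache expiry and invalidation when experiments change, which this sketch omits.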
Managing the A/B Test process inside the recommendation engine also reduced the points of failure from many to just one. So, when something goes wrong, it will be easier to know where to fix it immediately, because we have ownership and we are focused on it.
To ensure consistent and smooth transitions between experiences across different platforms (for example, someone who adds an item to their wishlist from their web browser but later responds to an in-stock notification about it on their iPhone), we chose our most comprehensive identifier of the user. This way, our predictions are idempotent for the majority of the scenarios we deal with today.
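In code, picking the most comprehensive identifier might look like the following sketch (the identifier names and precedence order are illustrative assumptions, not the actual scheme):

```python
from typing import Optional

def canonical_user_id(
    account_id: Optional[str] = None,
    email_hash: Optional[str] = None,
    device_id: Optional[str] = None,
) -> str:
    """Return the most comprehensive identifier available.

    Feeding the same identifier into prediction on every platform
    means the same person lands in the same alternative on web,
    app, and email.
    """
    for candidate in (account_id, email_hash, device_id):
        if candidate:
            return candidate
    raise ValueError("no user identifier available")
```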
And that’s pretty much it! Now we have full A/B testing capabilities embedded directly in a platform product, making use of our own internal tools without compromising on performance.
What we’ve learned on this journey
Becoming more autonomous with our A/B testing process came with a lot of learnings and advantages. These include the following:
- Simplification of our recommendations A/B testing process: We only need to configure the FABS experiences and, internally, the alternative configurations in the service.
- Continuous testing approach: By not depending on specific implementations and releases on the client-side, we can internally configure new A/B tests on a daily basis.
- Faster Time-To-Market: Instead of waiting for meetings and coordination between schedules and priorities, we spend our time configuring the alternatives to be tested inside the recommendations engine and independently launch the experience on FABS Service.
- Fewer points of failure: Our centralized implementation allows us to focus on better solving issues and recovering faster if needed. It also simplifies testing.
- More confidence to start new tests in parallel: With fewer points of failure, we have more confidence to start multiple concurrent tests at different touchpoints.
- Start testing new platforms: Email campaigns can now be included in a simple A/B testing process just like other platforms.
- Flexibility: With more consolidated code to support our A/B tests, it is easier to adopt another A/B Test platform or add multiple A/B Testing platforms as needed.
- Scalability: We can scale up to more variants per experiment when needed.
- Reduced implementation costs: We no longer need specific implementations per platform on the user experience side.
At the end of the day, regardless of what touchpoint we are inspiring, we want to be sure that we keep doing an amazing job! Now we are ready and excited to launch new A/B tests managed from inside Inspire’s recommendations engine and to work on its improvement on a regular basis.
Inspect and adapt, the sky’s the limit!
We have some key aspects we want to dive into with A/B testing inside Inspire.
We want to adopt more A/B Test platforms, accurately monitor the new recommendation alternatives, and improve our monitoring tools to keep an eagle eye on our performance and quality of service. We want to go beyond metrics and discover more ways to improve our recommendation engine’s performance, delivering an even better experience.
For each recommendation experience we launch, we want to keep accurately monitoring specific service metrics such as the following:
- Availability of the new feature;
- Error rate;
- Calls to Inspire that might be missing parameters;
- Correct assignment to the alternatives we cover;
- Response time per recommendation alternative; and
- Impact on the business.
In the end, we are embracing innovation on a daily basis, working in constant teamwork with brilliant and curious-minded coworkers. Now we have an amazing set of opportunities to work on, and shape the future of, recommendations with Inspire by FARFETCH. The countdown has begun… Let’s see what’s next!
… And don’t forget to keep doing what’s never been done. #BeRevolutionary
Originally published at https://www.farfetchtechblog.com on January 30, 2021.