Oda’s online experimentation journey: Lessons learned and best practices

Xavier Gumara Rigol
Published in Oda Product & Tech · Jun 3, 2024 · 9 min read

This article is based on the presentation of the same title that I gave at the Data Innovation Summit in Stockholm in April 2024. I went through the slides, wrote out what I said over each of them, edited the text, and added some of the most relevant slides between paragraphs.

Introduction

In this article, I will take you through Oda’s experimentation journey, some numbers about our experimentation program for 2023, and five takeaways we’ve learned along the way.

My goal is to inspire you with the practices and lessons we’ve learned through iteration and that have proven highly effective for us. Most of the findings in this article will be most relevant for companies of similar size and maturity: scale-ups with around 300 employees in Product & Tech roles.

I’m looking forward to hearing from you in the comments if you have experienced similar challenges and have taken a different route to mature your experimentation program!

Realizing the potential in data at Oda

Oda is the largest purely online grocery service in Norway. Our mission is to develop the world’s most efficient retail system, allowing you to reclaim the time spent planning meals and navigating supermarket aisles for more enjoyable activities.

Our success is underpinned by years of data-driven innovation and experimentation. We’ve outlined six principles on how we create value from data: The six principles for how we run Data & Insight at Oda. In this article, I’m going to deep dive into the principle of “Impact through exploration and experimentation”.

What is an online experiment?

In an online experiment, or A/B test, users are randomly assigned to one of two groups. Each group gets its own version of the user experience: the control group typically gets the current state, and the treatment group gets the new version we want to test. This setup, followed by a rigorous analysis using the scientific method, allows us to determine which version performs better based on predefined business metrics.
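To make the random assignment step concrete, here is a minimal sketch (not GrowthBook’s or Oda’s actual implementation, and the experiment key is made up): hashing a stable user identifier together with the experiment key gives each user a consistent, pseudo-random bucket.

```python
import hashlib

def assign_group(user_id: str, experiment_key: str, traffic_split: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id together with the experiment key yields a stable,
    pseudo-random value, so the same user always sees the same variant.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < traffic_split else "control"

# Hypothetical experiment key, loosely inspired by the cart example below
print(assign_group("user-123", "cart-total-in-header"))
```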

The following image shows an example of an A/B test in the context of Oda: the website header for both Control (or version A), which is the current version at the time of the experiment, and Treatment (or version B), which is the proposed change.

Example of an A/B test in the context of Oda

In this case, the difference between Control and Treatment is which information is shown in the top right corner of our webpage. As you can see, Control shows the number of items in the cart and Treatment shows the running total cost of the products you have in the cart.

We’ll come back to this example at the end of the article to illustrate some of the lessons we learned from it. For now, let’s just say that we offered the two versions to different, randomly assigned groups of users for the same period of time, analyzed conversions for both groups over that period, and used the findings to decide which version to roll out to all users.

Oda’s 2023 experimentation program in numbers

Oda’s 2023 experimentation program in numbers to get a sense of scale and maturity

At Oda, a trustworthy completed A/B test is defined as a finished experiment where the data has been utilized to either make or inform a decision, and the information regarding the experiment is accurate. In 2023, we completed 105 trustworthy A/B tests.

In reality, we conducted a total of 133 experiments, including A/A tests. This number also includes experiments that were terminated prematurely due to unreliable initial results or data quality issues.

It’s worth spending some time early on defining exactly how you’ll measure experiments so that you can use those numbers later when setting goals. For example, high win rates can tell you that there’s still room for further optimization of your product, while low win rates might indicate that you’ve reached a local maximum.

Also, if you break down these numbers by several dimensions (like departments, teams, source of the idea, etc.), they help you identify the areas that need more support on experimentation. More on this can be found in this article: How to Choose the Right Metrics for Your Experimentation Platform.

Oda’s experimentation journey

We can divide our experimentation journey into four phases. The years 2022 and 2023 were pivotal: we made substantial investments in our platform, people, and technology, and those investments yielded significant returns. We’ll also look at our pre-2022 practices and at how 2024 has started in comparison.

🐢 Before 2022: Manually coded tests and zero governance

We didn’t have an experimentation platform per se, and A/B tests were manually coded in the backend. Because of that, we lacked a generalized analysis engine and all analysis was handled manually.

There was limited comprehension of results, and the more motivated teams were driving the efforts, often optimizing for a single metric per experiment.

🧑‍🔬 2022: Investment in processes and platform brings first positive outcomes

At the beginning of 2022, we invested in GrowthBook, our current experimentation platform, and chose the Bayesian approach over the Frequentist one for its simplicity.
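To give a flavour of why we found the Bayesian approach simpler to communicate, here is a minimal sketch with made-up conversion numbers (not GrowthBook’s actual engine): instead of a p-value, you end up with a direct probability that Treatment beats Control.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical conversion counts, not real Oda data
control_conv, control_users = 480, 10_000
treatment_conv, treatment_users = 530, 10_000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior per group
post_control = rng.beta(1 + control_conv, 1 + control_users - control_conv, 100_000)
post_treatment = rng.beta(1 + treatment_conv, 1 + treatment_users - treatment_conv, 100_000)

prob_treatment_wins = (post_treatment > post_control).mean()
expected_lift = (post_treatment / post_control - 1).mean()

print(f"P(treatment beats control): {prob_treatment_wins:.1%}")
print(f"Posterior mean relative lift: {expected_lift:.2%}")
```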

We formed the Experimentation Platform team, responsible for the platform and for experimentation enablement, with activities like spreading the word on A/A testing and helping teams run A/A tests before their A/B tests.
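For illustration, an A/A test is analyzed exactly like an A/B test, and you expect two things: the traffic split matches what you configured, and any metric difference stays within noise. A rough sketch with hypothetical numbers (not our actual tooling):

```python
from scipy import stats

# Hypothetical A/A results: both groups received the identical experience
users = {"A1": 49_780, "A2": 50_220}
conversions = {"A1": 2_430, "A2": 2_455}

# 1) Does the observed traffic split match the intended 50/50?
_, p_split = stats.chisquare(list(users.values()))

# 2) Do conversion rates differ by more than noise? (They shouldn't.)
table = [
    [conversions["A1"], users["A1"] - conversions["A1"]],
    [conversions["A2"], users["A2"] - conversions["A2"]],
]
_, p_metric, _, _ = stats.chi2_contingency(table)

print(f"Traffic split p-value: {p_split:.3f}  (small values hint at an assignment bug)")
print(f"Metric diff p-value:   {p_metric:.3f}  (should usually be unremarkable in an A/A test)")
```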

Work on documentation, templates, and how-to guides really enhanced teams’ autonomy in testing, and bi-weekly community of practice sessions provided a safe space for teams to share lessons and learn from others.

🤝 2023: All teams experiment

Several tests run during winter 2022–23 increased the confidence senior management had in testing, which led to product managers adopting an “always experiment” policy in their teams.

We also transitioned to using business guardrail metrics for running and analyzing experiments, incorporating governed revenue, profitability, and short-term retention metrics.

Defining “risk profiles” based on potential revenue impact and profitability increased the frequency of low-risk experiments.

We started addressing increasingly complex technical challenges like pre-exposure analysis and pre-existing group biases.

🚀 2024: Incremental changes in product optimization

We used experimentation when transitioning Mathem over to the Oda platform to mitigate platform flip risks.

We also automated pre-exposure checks to identify issues before exposure, such as the slope of metrics and seasonality (same time last year).
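As a simplified illustration of what such a check can look like (hypothetical data, not our production implementation), you can compare the groups’ metric levels and slopes over the weeks before exposure; if they already diverge, the experiment starts with a pre-existing bias.

```python
import numpy as np

# Hypothetical weekly revenue per user for the six weeks *before* exposure
pre_control = np.array([102.0, 101.5, 103.2, 104.0, 103.8, 105.1])
pre_treatment = np.array([101.8, 101.9, 103.0, 104.2, 104.1, 105.3])
weeks = np.arange(len(pre_control))

# Pre-exposure level difference between the groups
level_gap = pre_treatment.mean() - pre_control.mean()

# Pre-exposure trend (slope) per group via a simple linear fit
slope_control = np.polyfit(weeks, pre_control, 1)[0]
slope_treatment = np.polyfit(weeks, pre_treatment, 1)[0]

print(f"Pre-exposure level gap: {level_gap:+.2f}")
print(f"Pre-exposure slope gap: {slope_treatment - slope_control:+.3f} per week")
# Large gaps on either check would flag a pre-existing group bias
# before any treatment effect could have kicked in.
```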

We also plan to use experiments for assortment optimization.

5 lessons learned

📕 Lesson #1: definitions matter

Counting experiments and setting yearly goals for the number of experiments you want teams to run is common for organizations in early stages of experimenting. It’s easy to agree that more experiments are better and it’s an easy-to-count metric.

Setting a goal for the number of experiments isn’t the same as setting a goal for the number of trustworthy completed A/B tests. Because of that, we’ve used the following diagram to structure our thoughts around experimentation metrics:

Experimentation metric tree

This breakdown isn’t the final step. Once you have the definitions, you can calculate baselines and set goals accordingly; if you set goals without baselines, your teams are going to suffer. Thanks to this, at Oda we know what our experimentation velocity is and we are content with it: it’s the most we can get given how we are set up and the maturity of our experimentation program.
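To make the definitions concrete, here is a toy sketch (made-up records, not our 2023 numbers) of how the metrics in the tree can be computed once every experiment record carries a few flags.

```python
# Toy experiment log; the flags mirror the definitions above
experiments = [
    {"finished": True,  "trustworthy": True,  "winner": "treatment"},
    {"finished": True,  "trustworthy": True,  "winner": "control"},
    {"finished": True,  "trustworthy": False, "winner": None},  # stopped early, bad data
    {"finished": False, "trustworthy": False, "winner": None},  # still running
]

started = len(experiments)
completed = sum(e["finished"] for e in experiments)
trustworthy_completed = sum(e["finished"] and e["trustworthy"] for e in experiments)
wins = sum(e["winner"] == "treatment" for e in experiments if e["trustworthy"])

print(f"Started: {started}, completed: {completed}, trustworthy completed: {trustworthy_completed}")
print(f"Win rate among trustworthy tests: {wins / trustworthy_completed:.0%}")
```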

In the future, we aim to break these statistics down by department, team, source of the idea, theme of the experiment, device, etc., so that we can double down on successful experiments rather than experimenting blindly everywhere. By doing that, we hope to eventually increase the number of trustworthy completed experiments we run.

If you’re interested in learning more, I wrote a Medium article on which metrics to focus on to measure the success of an experimentation platform you may find helpful: How to Choose the Right Metrics for Your Experimentation Platform.

🎚️ Lesson #2: A hybrid organizational structure maximizes success

Experimentation is not the responsibility of a single team at Oda, nor is it completely decentralized. We have three groups of people, each with its own mission, that maximize the success of our experimentation program:

  • The Core Services team (previously the Experimentation Platform team) owns the experimentation platform, ensuring data and metrics are centralized and the infrastructure is robust.
  • Departmental Lead Data Analysts and the Director of Data & Insight lead the experimentation program and act as a center of excellence for best practices, documentation, and community sessions.
  • Cross-functional product teams are responsible for running experiments within their domain (delivery, shopper experience, etc.). They have full autonomy in experiment selection and frequency. They adhere to and contribute to global experimentation practices for continuous improvement.

🚩 Lesson #3: Invest in capabilities to encourage teams to test every change

We made a massive investment in documentation, templates, and how-to guides in 2022, which enhanced teams’ autonomy in testing.

Our internal documentation portal for experimentation

We ran 15 community of practice sessions on experimentation in 2023 (12 in 2022), which provided a safe space for teams to share lessons and learn from others. Sessions covered topics such as how to run experimentation programs, Bayesian statistics, experiment planning, experiment analysis, and feature flagging, plus presentations about specific tests. We even had external speakers come and share experiences from other companies.

We implemented a high-touch approach to nurture a data-driven culture, partnering closely with selected teams to advance experimentation adoption over time.

And finally, some teams began using the experimentation setup to do safe feature rollouts to all users, which was a nice stepping stone from simply rolling out features safely to running actual A/B tests (see the sketch below).
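Conceptually, a safe rollout reuses the same assignment machinery as an A/B test, just with an exposure fraction that is ramped up over time while guardrail metrics are monitored. A generic sketch (not GrowthBook’s actual API, and the feature key is made up):

```python
import hashlib

ROLLOUT_STAGES = [0.05, 0.25, 1.0]  # illustrative ramp: 5% -> 25% -> 100%

def in_rollout(user_id: str, feature_key: str, rollout_fraction: float) -> bool:
    """Expose a stable fraction of users to a new feature; the same hashing
    trick used for experiment assignment keeps each user's exposure consistent."""
    digest = hashlib.sha256(f"{feature_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < rollout_fraction

# Guardrail metrics are checked at each stage before moving to the next
# fraction in ROLLOUT_STAGES; any regression stops the ramp.
print(in_rollout("user-123", "new-checkout-flow", ROLLOUT_STAGES[0]))
```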

📐 Lesson #4: Introduce guardrail metrics when assessing the success of tests

When reporting on experiments and making decisions, we wanted teams to go from taking a single metric into account to using business metrics (like revenue, profitability, and short-term retention) as guardrails when making go/no-go decisions.

Decision table used to decide whether or not to roll out a treatment

Using this table, if an experiment negatively impacts profitability metrics, we won’t roll out the treatment unless it has a very positive impact on retention or revenue, in which case it becomes a business call for an accountable lead to make.
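In pseudo-code, a simplified version of that decision logic (not the literal table we use) reads roughly like this:

```python
def rollout_decision(profitability: str, revenue: str, retention: str) -> str:
    """Toy version of the go/no-go logic described above.

    Each impact is 'positive', 'neutral' or 'negative'; the real table at
    Oda has more nuance than this sketch.
    """
    if profitability == "negative":
        if revenue == "positive" or retention == "positive":
            return "business call for an accountable lead"
        return "do not roll out"
    return "roll out"

print(rollout_decision(profitability="neutral", revenue="positive", retention="neutral"))   # roll out
print(rollout_decision(profitability="negative", revenue="positive", retention="neutral"))  # business call
```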

The table can also be extended to add more dimensions like customer experience or other secondary metrics. A great example of this can be found in this article: Lessons Learned From Running Web Experiments.

⚠️ Lesson #5: Introduction of risk profiles for tests

It has been important for us to be clear about the level of risk an experiment carries based on potential impact on revenue and profits.
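As an illustration only (the thresholds below are hypothetical, not the ones we actually use), a risk profile can be as simple as bucketing each experiment by its plausible worst-case impact on revenue and profitability:

```python
def risk_profile(worst_case_revenue_downside_pct: float,
                 worst_case_profit_downside_pct: float) -> str:
    """Bucket an experiment by its plausible worst-case downside.

    The thresholds are illustrative; the point is that the classification
    is agreed on before the experiment starts.
    """
    worst = max(worst_case_revenue_downside_pct, worst_case_profit_downside_pct)
    if worst < 0.5:
        return "low risk ('do no harm')"
    if worst < 2.0:
        return "medium risk"
    return "high risk"

# A small UI tweak with no plausible mechanism for a large downside
print(risk_profile(0.2, 0.1))  # -> low risk ('do no harm')
```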

For instance, in the experiment mentioned at the beginning, customers had long requested to see the total amount in their cart, but we were concerned about how it might affect our business metrics. While having the running total could benefit us compared to physical stores, we were very wary of making this change.

To address the issue, we ran an A/B test. Fortunately, nothing went wrong; in fact, the new feature even boosted revenue and profitability during the test. So we decided to fully roll it out, since it was a long-requested customer desire that did not harm our metrics.

This is an example of a ‘do no harm’ experiment, falling into the low-risk category. Addressing it this way reduced the fear of running these types of experiments and enabled us to increase the frequency with which our teams run them.

Final words

Our multi-year journey of maturing our experimentation program shows the potential of data-driven practices within our industry. Over these years, we went from a novice stage to an intermediate level, with some parts of our experimentation program and culture stretching into the advanced stage, as explained above.

We’re delighted that all these efforts have been recognized by the industry: we were selected as winners of the Experimentation Culture Awards 2024 in the organization-wide category! 🎉 It feels great to see our work recognized this way. 😊

--

Xavier Gumara Rigol
Oda Product & Tech

Passionate about data product management, distributed data ownership and experimentation. Engineering Manager at oda.com