10 Lessons from Building an Experimentation Platform

Sam Brilleman
Published in SEEK blog · 10 min read · Apr 16, 2023

At SEEK we strive to match more people with job opportunities than any other job-seeking platform. It is an ambitious goal, and one that requires ongoing development of new and existing products and services that improve our online platform.

Much of that innovation comes in the form of artificial intelligence (AI) driven services, be it personalised job recommendation algorithms or machine learning models for matching job seekers and employers. To this end, SEEK has a focused and growing team known as Artificial Intelligence & Platform Services (AIPS) that works on these solutions.

TL;DR — Just want the 10 lessons? Here they are:

  1. Ensure data is fit for purpose
  2. Keep complex data transformations outside the experimentation platform
  3. Start with simple statistical methodology
  4. Recognise that methods for understanding anomalies in the data are more important than advanced statistical techniques
  5. Consider the impacts of outliers / extreme observations
  6. Focus on early stopping techniques before variance reduction
  7. Consider scalability right from the start
  8. Cache EVERYTHING
  9. Use task parallelisation wherever possible
  10. Consider both scheduled and ad hoc analysis use cases — but don’t build one system for both

Of course, with an abundance of new ideas and innovation comes the risk of making incorrect decisions about which features and products to roll out. So, like any data-driven organisation, we aim to make the right decisions by testing out our new ideas.

Experimentation — or A/B testing — forms a big part of that decision making process.

Why didn’t you experiment, George? Source: giphy.

Why we started building an experimentation platform…

One way to facilitate experimentation at scale — within a large business with large data assets — is by centralising some of the infrastructure that can be used to deploy, monitor, and analyse experiments.

So, at SEEK we have been working on building an online experimentation platform that can be used to experiment on our AI products and services.

The overall objective of our experimentation platform is to reduce friction for teams that want to run A/B tests, whilst also improving the robustness, consistency, and efficiency with which we experiment.

Source: giphy.

An experimentation platform is nothing new, but the lessons might be

Building an experimentation platform is not particularly novel — many technology companies have centralised internal platforms for A/B testing — including some that have been written about publicly such as Netflix, Spotify, Uber, and more.

But many of those companies now have relatively mature platforms — some a decade or two since their inception — whilst at SEEK we are much earlier in our journey.

In late 2020, we started bringing together a collection of internal infrastructure and software — our “platform” — to help schedule, implement, track, analyse, and report on A/B tests at SEEK.

It’s now early 2023 and we can reflect on where we’ve got to. Some things worked, some things didn’t. We pivoted when necessary, and we rolled up our sleeves and persevered when it made sense.

Of course, there have been numerous learnings along the way — and in this post we share the ones we think are most useful for others starting a similar journey.

Here are our 10 lessons, George. Source: giphy.

1. Ensure data is fit for purpose

In a big organisation like ours — with several legacy systems and data warehouses — getting data centralised and consistent isn’t easy.

Think about the quality and organisation of data at your company. Is it centralised, consistently managed, and well documented? Is it built for analysts and data scientists to use? For example, is there a single table to access customer attributes, or is it split across several tables in a way that requires in-depth know-how?

If not, then try to advocate for data solutions that serve data scientists and analysts as consumers.

You said you want data suitable for data science? Source: giphy.

2. Keep complex data transformations outside the experimentation platform

Avoid coupling complex data engineering pipelines to your experimentation platform — ideally solve the data transformation problem outside of the experimentation platform.

Look at building data pipelines that produce core tables suitable for analysis, or perhaps a full-blown metrics repository like Minerva at Airbnb. But make sure that these are external to the data analysis components of the experimentation platform.

Analysis-ready “tidy data” should be accessible to the experimentation platform through simple SQL queries or perhaps a simple API to a metrics repository. If the data ingestion component of your experimentation platform is allowed to grow unchecked in complexity, it will almost certainly become a constraint on the platform’s scalability later on.

We started with Jinja-templated SQL queries for retrieving experiment data from our data platform, but that quickly became infeasible, so we ended up building a standalone metrics repository to manage metric definitions and make analysis-ready data retrievable via a simple API.
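
To make the contrast concrete, here is a minimal sketch of the two approaches: a Jinja-templated SQL query assembled inside the platform versus retrieving a pre-defined metric from a metrics repository. The table, column, metric names, and API endpoint below are hypothetical, not our actual implementation.

```python
# A sketch contrasting the two approaches; all names and endpoints are hypothetical.
from jinja2 import Template
import requests

# Approach 1: Jinja-templated SQL assembled inside the experimentation platform.
# Flexible at first, but every new metric adds SQL complexity to the platform itself.
SQL_TEMPLATE = Template("""
    SELECT variant, candidate_id, SUM(applications_started) AS applications_started
    FROM analytics.experiment_events
    WHERE experiment_id = '{{ experiment_id }}'
      AND event_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
    GROUP BY variant, candidate_id
""")
query = SQL_TEMPLATE.render(
    experiment_id="exp_search_ranking",
    start_date="2023-01-01",
    end_date="2023-01-14",
)

# Approach 2: the platform asks a standalone metrics repository for analysis-ready
# data via a simple API, and stays out of the transformation logic entirely.
response = requests.get(
    "https://metrics-repo.internal/api/v1/metrics",  # hypothetical endpoint
    params={
        "experiment_id": "exp_search_ranking",
        "metric": "applications_started",
        "start_date": "2023-01-01",
        "end_date": "2023-01-14",
    },
)
analysis_ready_rows = response.json()
```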

Complexities like removing bots, resolving tracking issues, and reconciling multiple data streams are all important — but they are not experimentation-specific, and therefore have the potential to draw you away from the core needs of an experimentation platform.

3. Start with simple statistical methodology

Keep the statistical methodology simple at the start of your journey — first, focus on the platform engineering and stakeholder communication. Once you have a solid foundation in those, then you can accelerate towards more advanced statistical techniques.

This approach allows you to build trust with sponsors and leaders and start delivering through the platform while you continue to build out new features.

Plus, it is extremely difficult to implement and communicate complex statistical methods when you don’t have a foundation in place for the simple stuff. Take the following examples:

  • Bayesian models that require Markov chain Monte Carlo (MCMC) methods are memory intensive, computationally expensive on large datasets, and often require additional validation tooling (e.g. assessing convergence)
  • Estimated treatment effects from generalised linear models (GLMs) with non-identity link functions can be difficult to communicate to stakeholders, especially if you haven’t yet worked out how to communicate treatment effects from plain linear models. It’s unlikely product managers will want to discuss odds ratios.

These sorts of engineering and communication challenges are surmountable, but it is easier to implement more complex modelling and statistical techniques once you have a foundation for scaling and communicating the simple stuff.
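
As an illustration of the “simple stuff”, here is a minimal sketch of a frequentist analysis of a conversion metric: a two-proportion z-test plus a Wald confidence interval for the absolute uplift. The counts are made-up example numbers, not real experiment data.

```python
# A minimal sketch of a simple frequentist analysis for a conversion metric;
# the counts below are made-up example numbers.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

def analyse_conversion(conversions_c, n_c, conversions_t, n_t, alpha=0.05):
    """Two-proportion z-test plus a Wald confidence interval for the absolute uplift."""
    _, p_value = proportions_ztest([conversions_t, conversions_c], [n_t, n_c])
    p_t, p_c = conversions_t / n_t, conversions_c / n_c
    diff = p_t - p_c
    se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = norm.ppf(1 - alpha / 2)
    return {
        "uplift": diff,
        "ci": (diff - z * se, diff + z * se),
        "p_value": p_value,
    }

print(analyse_conversion(conversions_c=4_820, n_c=100_000,
                         conversions_t=5_150, n_t=100_000))
```

An absolute uplift with a confidence interval is something most stakeholders can act on without a statistics glossary, which is exactly why it makes a better starting point than odds ratios or posterior distributions.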

Better build that trust before you start getting too fancy. Source: giphy.

4. Recognise that methods for understanding anomalies in the data are more important than advanced statistical techniques

You’ll want some degree of data quality checks, and ways to easily interrogate the data. These are essentially a prerequisite for implementing more advanced statistical techniques in the platform. Otherwise, it becomes hard to investigate surprising results in experiments.

When you see a particularly surprising result, do you perform deeper statistical analysis, or do you go back to the raw data and look for anomalies in the data tracking?

Many times, we’ve found that particularly large treatment effects in an A/B test are not a consequence of the intervention itself, but of “bugs” in the data tracking that could have been identified prior to analysis by suitable data quality checks. Reliable data quality checks prior to any analysis will save you time in the long run!
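
One cheap pre-analysis guardrail is a sample ratio mismatch (SRM) check, comparing observed assignment counts against the intended traffic split. A minimal sketch is below; the alerting threshold is illustrative rather than a recommendation.

```python
# A minimal sketch of one pre-analysis data quality check: a sample ratio
# mismatch (SRM) test comparing observed assignment counts to the intended split.
# The alerting threshold is illustrative, not a recommendation.
from scipy.stats import chisquare

def sample_ratio_mismatch(n_control: int, n_treatment: int,
                          expected_split=(0.5, 0.5), threshold: float = 0.001) -> bool:
    """Return True if the observed counts are implausible under the intended split."""
    total = n_control + n_treatment
    expected = [share * total for share in expected_split]
    _, p_value = chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return p_value < threshold

# A flagged experiment should be investigated for tracking or assignment bugs
# before anyone spends time interpreting its treatment effects.
print(sample_ratio_mismatch(n_control=101_000, n_treatment=99_000))
```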

5. Consider the impacts of outliers / extreme observations

Extreme observations — also known as outliers — can dramatically skew the results of an A/B test. Even a small number of outliers in a large dataset can have a non-trivial impact.

In online experimentation, bots and software bugs are common causes of outliers. So your experimentation platform will need some methods for identifying extreme observations and handling them — right from day one. Even if those methods are relatively simple or naïve, they can enhance the reliability of your results.
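
A naive but useful starting point is one-sided winsorisation: capping a metric at a high percentile. The sketch below uses synthetic data with one bot-like observation; the percentile chosen is illustrative.

```python
# A minimal sketch of a naive outlier treatment: cap a metric at a high
# percentile (one-sided winsorisation). The data and percentile are illustrative.
import numpy as np

def winsorise(values: np.ndarray, upper_percentile: float = 99.5) -> np.ndarray:
    """Cap extreme observations at the chosen upper percentile."""
    cap = np.percentile(values, upper_percentile)
    return np.minimum(values, cap)

# 10,000 plausible observations plus one bot-like extreme value
job_views = np.append(np.random.default_rng(42).poisson(3, size=10_000), 5_000)
print(job_views.max(), winsorise(job_views).max())  # the extreme value is capped
```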

6. Focus on early stopping techniques before variance reduction

From day one, stakeholders and users of your platform will want to reduce the run time (i.e. required sample size) for their experiments. So at some point you’ll want to start looking for statistical techniques that help with that.

We found that techniques that allow for interim analysis and early stopping (e.g. group sequential designs) provide more bang-for-your-buck than variance reduction techniques (e.g. covariate adjustment or CUPED).

This is because most variance reduction techniques require additional data wrangling, and in some cases data that is difficult (or impossible) to collect. Plus, there is often a learning curve in understanding how the modelling techniques apply to your data and context.

In practice, we’ve generally found the reduction in experiment run times from variance reduction can be somewhat marginal, because it’s hard to find decent covariates to use in the models! The best covariates are those that help explain the variability in the outcome you are interested in, but they are sometimes hard to come by, especially if your online platform has a lot of first-time users.
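
For reference, the core of a CUPED-style adjustment is only a few lines; the hard part is sourcing a pre-experiment covariate that is well correlated with the outcome. Here is a sketch on synthetic data, standing in for an outcome y and a pre-experiment covariate x measured on the same users.

```python
# A minimal sketch of the CUPED adjustment; the synthetic data stands in for
# an outcome y and a pre-experiment covariate x measured on the same users.
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(7)
x = rng.normal(size=50_000)                      # pre-experiment covariate
y = 2.0 + 0.8 * x + rng.normal(size=50_000)      # outcome correlated with x
print(np.var(y), np.var(cuped_adjust(y, x)))     # variance drops when x is informative
```

With weak covariates (e.g. lots of first-time users with no history), theta is close to zero and the variance barely moves, which is exactly the limitation described above.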

On the other hand, some new features (interventions) will produce larger effects than we assumed during the experiment design. In those cases, an interim analysis means you can potentially stop the experiment “early”.

For stakeholders, early stopping will always come as a welcome surprise. Plus, the average reduction in run times achieved using group sequential methods can be quite large in our experience — YMMV.
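
To give a feel for how group sequential designs work, here is a sketch of the Lan-DeMets O’Brien-Fleming-type alpha-spending function, which shows how little type I error is “spent” at early interim looks. Turning spent alpha into actual stopping boundaries requires accounting for the correlation between looks, so for real experiments use dedicated group sequential software rather than this sketch.

```python
# A sketch of the Lan-DeMets O'Brien-Fleming-type alpha-spending function,
# illustrating how little type I error is "spent" at early interim looks.
# Converting spent alpha into stopping boundaries requires dedicated group
# sequential software; this only illustrates the shape of the spending.
from math import sqrt
from scipy.stats import norm

def obrien_fleming_alpha_spent(information_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative type I error spent by a given information fraction (0 < t <= 1)."""
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / sqrt(information_fraction)))

for t in (0.25, 0.5, 0.75, 1.0):
    print(f"information fraction {t:.2f}: alpha spent = {obrien_fleming_alpha_spent(t):.5f}")
```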

Trust me: your colleagues will be sure to party harder when you stop an experiment early, than when you design one to run faster. Source: giphy.

7. Consider scalability right from the start

It’s worth planning for far more scale than you expect, right from the start of your build.

In our tech stack, we use Airflow for scheduling our experiment analyses. Initially we built Airflow DAGs dynamically from a single file, but ran into scaling issues on our Airflow server.
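
For context, the pattern we mean is roughly the following sketch (hypothetical experiment ids and analysis callable, not our actual DAG code): one Python file loops over experiments and registers a DAG per experiment, which is convenient until the scheduler has to keep re-parsing that single, ever-growing file.

```python
# A sketch of the dynamic single-file DAG pattern (hypothetical experiment ids
# and analysis callable, not our actual DAG code).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

EXPERIMENTS = ["exp_search_ranking", "exp_homepage_banner"]  # hypothetical

def analyse_experiment(experiment_id: str) -> None:
    # placeholder for submitting/running the daily analysis for one experiment
    print(f"analysing {experiment_id}")

# Every DAG lives in this one file, so the scheduler re-parses all of them
# together; fine for a handful of experiments, painful for hundreds.
for experiment_id in EXPERIMENTS:
    with DAG(
        dag_id=f"analysis_{experiment_id}",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_analysis",
            python_callable=analyse_experiment,
            op_kwargs={"experiment_id": experiment_id},
        )
    globals()[f"analysis_{experiment_id}"] = dag
```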

We also initially underestimated the number of virtual CPUs we might need in AWS Batch, which led to surprisingly hard-to-debug job failures.

Finally, we began by analysing metrics for a single experiment in a serial fashion since our earlier experiments had less data and fewer metrics to analyse. Later on, the time taken for analysis would become a blocker and force us to parallelise tasks, which massively increased the efficiency of our analyses on larger datasets.

These are all examples of issues that could have been easily avoided if we had factored in a much greater need for scalability than we initially expected — more experiments, more data, more complex analysis methods.

8. Cache EVERYTHING

When we started out, we avoided caching some things, e.g. interim datasets. Our main concern was storage costs, but over time, not having access to those objects made it harder to debug issues, slower to rerun things when jobs failed, and harder to reproduce historic results. Not to mention, the savings on storage costs likely pale in comparison to the engineering and analyst time spent trying to reproduce those results.

It is worth trying to cache everything at each step of your data and analysis pipelines — be it raw datasets, transformed datasets, analysis datasets, input/configuration parameters for analysis, results objects, error logs, and more.

You aren’t alone either — awesome tools like Metaflow are great at helping you do this.
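
To give a flavour of what that looks like in Metaflow, here is a toy flow (not our actual pipeline): every attribute assigned to self inside a step is persisted as an artifact, so interim datasets and results can be inspected, resumed, or reproduced later.

```python
# A toy Metaflow flow (not our actual pipeline) illustrating "cache everything":
# each attribute assigned to self in a step is persisted as an artifact.
from metaflow import FlowSpec, step

class ExperimentAnalysisFlow(FlowSpec):

    @step
    def start(self):
        # stand-in for retrieving raw experiment data
        self.raw_data = [1.2, 0.9, 1.1, 250.0]
        self.next(self.transform)

    @step
    def transform(self):
        # the interim (cleaned) dataset is cached too, which makes debugging
        # and re-running failed downstream steps much easier
        self.analysis_data = [x for x in self.raw_data if x < 100]
        self.next(self.analyse)

    @step
    def analyse(self):
        self.result = sum(self.analysis_data) / len(self.analysis_data)
        self.next(self.end)

    @step
    def end(self):
        print(f"mean outcome: {self.result:.3f}")

if __name__ == "__main__":
    ExperimentAnalysisFlow()
```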

Even if it feels excessive at the time, storing everything is likely to help you downstream. Source: giphy.

9. Use task parallelisation wherever possible

As the experimentation culture at an organisation matures, the complexity of the experiments being run inevitably increases. Teams start exploring more complex designs, and consequently the analysis requires more complex models. Plus, the number of metrics being tracked grows in order to answer an ever-increasing number of stakeholder questions.

With more metrics and more complex analysis, computation demands increase — and task parallelisation becomes more important.

To avoid refactoring at a later date, use task parallelisation as an optimisation technique right from the start.
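
A minimal sketch of the idea: when each metric can be analysed independently, fanning the work out across processes is an easy win. The metric names and data below are made up, and the “analysis” is just a placeholder computation.

```python
# A minimal sketch of per-metric task parallelisation; the metrics and data
# are made up, and the "analysis" is just a placeholder computation.
import statistics
from concurrent.futures import ProcessPoolExecutor

METRIC_DATA = {
    "applications_started": [0.12, 0.15, 0.11, 0.14],
    "job_views": [3.2, 2.9, 3.5, 3.1],
    "sign_ups": [0.04, 0.05, 0.03, 0.04],
}

def analyse_metric(metric_name: str) -> tuple[str, float]:
    # each metric's analysis is independent, so they can run in parallel
    return metric_name, statistics.mean(METRIC_DATA[metric_name])

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        for metric, estimate in executor.map(analyse_metric, METRIC_DATA):
            print(metric, estimate)
```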

10. Consider both scheduled and ad hoc analysis use cases — but don’t build one system for both

Both scheduled analysis (e.g. scheduled daily monitoring of an experiment) and ad hoc analysis (e.g. follow up analysis of an experiment after it has ended) are important.

When scheduled experiment analyses lead to surprising or unexpected results, e.g. a sample ratio mismatch, data scientists or analysts will often be tasked with digging into the experiment data further. So, you’ll want to enable ad hoc and interactive analysis use cases, whilst ensuring they use the same data, analysis methods, and codebase as the scheduled and automated analyses that produced the initial results.

Having said that, you want to avoid building both use cases into the core experimentation platform itself. Ad hoc use cases often require more flexibility for data scientists to tailor the analysis to their needs, so building that flexibility into the automated part of the platform can add unnecessary complexity. Plus, this delineation allows new approaches to be developed and battle-hardened before they are built into the core platform.

The lesson here is to build the experimentation platform from composable parts, e.g. a statistical toolkit, data transformation pipelines, job scheduling, and a UI or dashboards. For each of those parts, think about whether it has an automated and/or ad hoc use case and what that might need to look like.
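
One way to picture it, sketched below with a hypothetical module: the statistical toolkit is a plain library with a single entry point, and both the scheduled pipeline and an analyst’s notebook import that same function, so ad hoc deep dives reproduce exactly what the automated analysis did.

```python
# stats_toolkit.py -- a hypothetical shared module, sketched to illustrate the
# "composable parts" idea rather than to show our actual interface.
def analyse_experiment(experiment_id: str, metrics: list[str]) -> dict:
    """Single source of truth for how an experiment is analysed."""
    # stand-in: in practice this would pull analysis-ready data from the
    # metrics repository and run the statistical models for each metric
    return {metric: f"placeholder result for {metric}" for metric in metrics}

# The scheduled pipeline (e.g. an Airflow task) and an analyst's notebook both
# call this same function, so an ad hoc deep dive starts from identical results.
if __name__ == "__main__":
    print(analyse_experiment("exp_search_ranking", ["applications_started"]))
```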

Closing thoughts

At any web-focused company there are numerous benefits to building a centralised experimentation platform.

A central platform helps enable efficient, consistent, and reliable experimentation. It also helps empower teams to run experiments — avoiding situations where they might have otherwise been blocked by infrastructure or knowledge hurdles.

Nonetheless, building a mature experimentation platform is a sizeable undertaking. There are numerous components to consider, each with their own complexities. Hopefully these learnings from our journey at SEEK help others decide where to prioritise and how to avoid some pitfalls, making the path to success a little less random.

Sometimes the map needs a few additional notes to help us find the way. Source: giphy

Sam Brilleman

Principal Data Scientist @ SEEK. Data science, data modelling, experimentation, and statistical programming all fall within my sphere of interest.