The Modern Experimentation Workflow with Eppo + Snowflake
The practice of online controlled experiments, also known as A/B testing, was pioneered by tech giants such as Google and Airbnb and is now widely used across industries. The reason is sobering: only a small percentage (20–30%) of product changes improve business metrics, while another 30% actively cause harm. Multiple studies, including those from Airbnb, Netflix, Meta, Google, and Microsoft, confirm these rates. Blindly shipping products without measurement is a one-step-forward, one-step-back process, while consistent experimentation compounds success through improved metrics over time.
Tech giants understand the significance of experimentation and have invested heavily in building in-house A/B testing tools. Most other companies are left running their experiments via manual workflows in ad-hoc Jupyter notebooks or Excel sheets. These manual workflows lead to fragmentation and a lack of trust, with each team defining metrics differently and applying varying standards of statistical rigor. The result is often a failed experimentation program and incorrect inferences, costing companies thousands or even millions of dollars.
There is a better way. The emergence of the Data Cloud and the tools built around it provides a huge opportunity for better experimentation. Because the data warehouse holds the most accurate and up-to-date metrics, experimentation workflows should run natively on top of it. Doing so builds a foundation of trust that is crucial in a high-stakes environment, where even small metric impacts of 2–5% can have significant consequences: dismantling shipped code, getting a product leader promoted, or securing resources for a strategic initiative. Additionally, every experimentation analysis conducted on top of the warehouse leaves a clear paper trail, making it auditable and replicable. This promotes transparency and ensures that everyone involved understands exactly how the data was analyzed.
In this post, we’ll walk through how Eppo built a Snowflake-native experimentation platform with the following key principles:
- Standard, Reusable Data Wrangling Logic from Established Sources of Truth
- Democratized, Uncorrelated, and Reliable SDKs for Assignment
- Business-Approved Metric Repository
- Automated Data Quality Monitoring
- Modern Statistical Methodologies Minimizing Time-to-Insight
- Easily Digestible Reporting
Standard, Reusable Data Wrangling Logic from Established Sources of Truth
Even though warehouse queries leave an auditable paper trail, data wrangling on top of a data warehouse still leaves room for error, especially in experimentation. Each experiment requires specific SQL code, and without the right historical context, it's easy to query the wrong data. As the number of metrics within an experiment grows, so does the SQL code, which can quickly become complex and error-prone. For instance, consider a query (modified for privacy) that I once wrote at a previous job to format data before performing aggregations. With this type of hand-written query for each experiment, errors are nearly inevitable.
To mitigate this risk, define proper sources for each metric and establish standard logic for data wrangling that can be fully automated to reduce human error. Eppo achieves this through SQL definitions that tell the platform where to find the tables and columns required to wrangle data for any experiment. These definitions include four types: Assignment SQL, Fact SQL, Dimension SQL, and Entry Point SQL. At a bare minimum, only Assignment SQL and Fact SQL definitions are required for an experiment analysis.
Assignment SQL definitions are SQL snippets that point Eppo to the table in the data warehouse containing a log of user assignments for each experiment. Each query is annotated to tell Eppo which columns correspond to experiment subjects, their assignment timestamps, and the experiment and variation they were assigned to. With this information, Eppo can identify which users were assigned to which experiments and when, serving as the foundation for every experiment analysis that joins in metric event data.
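To make this concrete, here is a minimal sketch of what an Assignment SQL definition might look like. The schema (`analytics.experiment_assignments` and its columns) is hypothetical, and the annotation step is shown as a plain mapping rather than Eppo's actual configuration format:

```python
# Hypothetical Assignment SQL definition: the query selects the raw
# assignment log, and the annotations tell the platform what each column means.
assignment_sql = """
SELECT
    user_id,          -- who was assigned
    assigned_at,      -- when the assignment happened
    experiment_name,  -- which experiment
    variant_name      -- which variation they received
FROM analytics.experiment_assignments
"""

# Column annotations (illustrative structure, not Eppo's real config format):
assignment_annotations = {
    "subject": "user_id",
    "timestamp": "assigned_at",
    "experiment": "experiment_name",
    "variation": "variant_name",
}
```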
Eppo uses Fact SQL definitions and Facts to determine the data sources for each metric. A Fact SQL definition tells Eppo which table contains the events you want to build metrics from; the Facts within that definition tell Eppo which columns to aggregate when calculating those metrics.
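Continuing the sketch above, a Fact SQL definition for purchase events might look like the following; again, the table and columns are hypothetical:

```python
# Hypothetical Fact SQL definition pointing at a purchase event table.
fact_sql = """
SELECT
    user_id,       -- the entity each event belongs to
    purchased_at,  -- when the event occurred
    revenue        -- a numeric column to aggregate into metrics
FROM analytics.purchases
"""

# A Fact names the column that metrics will aggregate over
# (illustrative structure, not Eppo's real config format).
facts = [{"name": "Purchase Revenue", "column": "revenue"}]
```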
On each experiment result refresh, the assignment data for the specific experiment is joined with the event data from the corresponding Fact SQL definitions for its metrics. The same wrangling logic is applied every time, ensuring consistency across all analyses. The outcome is a clean, entity-level table, ready for metric aggregations.
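The wrangling step Eppo automates is, conceptually, a join of the two definitions above: assignments filtered to one experiment, facts attributed only after each subject's assignment, then aggregated to the entity level. A hand-written equivalent (hypothetical schema and experiment name, deliberately simplified) might look like:

```python
# Simplified sketch of the automated wrangling step: join assignments to
# facts that occurred after assignment, then aggregate to one row per entity.
wrangled_sql = """
SELECT
    a.user_id,
    a.variant_name,
    COALESCE(SUM(f.revenue), 0) AS purchase_revenue
FROM analytics.experiment_assignments AS a
LEFT JOIN analytics.purchases AS f
    ON f.user_id = a.user_id
   AND f.purchased_at >= a.assigned_at  -- only count post-assignment events
WHERE a.experiment_name = 'new_checkout_flow'  -- hypothetical experiment
GROUP BY a.user_id, a.variant_name
"""
```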
Democratized, Uncorrelated, and Reliable SDKs for Assignment
Tech giants like Google, Netflix, and Airbnb won their markets in part by cultivating a culture of experimentation: any engineer can set up an experiment easily and test impact without a political process. They run hundreds of experiments concurrently, backed by infrastructure that ensures correct setup and application integrity. Eppo's SDKs reflect the same architectural principles seen in Airbnb, Netflix, Spotify, and Booking's experimentation SDKs:
- Experiment configs are served via CDNs (content delivery networks), with latencies below 50ms for most of the world.
- Server-side applications and mobile devices poll the CDNs every few minutes and store configuration locally, allowing nearly instantaneous feature delivery.
- Feature exposure instrumentation is built into the experiment setup APIs, so no engineer has to remember to gather these events.
- Idempotent randomization guarantees that users on multiple devices are assigned to the same group. This is done via a hashing function (typically md5) applied to the concatenation of experiment and subject identifiers, as sketched after this list.
- Randomization methodology guarantees that experiment assignments are uncorrelated with each other, enabling concurrent experiments and preventing previous cohorts from influencing future ones.
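Here is that sketch, a minimal illustration of hash-based, idempotent randomization in the spirit of these SDKs; the 10,000-shard bucketing and the even variant split are illustrative assumptions, not Eppo's exact implementation:

```python
import hashlib

def assign_variant(experiment: str, subject_id: str,
                   variants: list[str]) -> str:
    """Deterministically assign a subject to a variant.

    Hashing the experiment name together with the subject id makes the
    assignment idempotent (same user, same group, on every device) and
    uncorrelated across experiments (a different experiment name yields
    an independent-looking hash).
    """
    digest = hashlib.md5(f"{experiment}-{subject_id}".encode()).hexdigest()
    # Map the first 8 hex digits onto 10,000 shards (illustrative choice).
    shard = int(digest[:8], 16) % 10_000
    # Even split across variants; real SDKs support weighted allocations.
    return variants[shard * len(variants) // 10_000]

# Same user always lands in the same group:
assert assign_variant("new_checkout_flow", "user_42", ["control", "treatment"]) \
    == assign_variant("new_checkout_flow", "user_42", ["control", "treatment"])
```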
This combination of edge-served configurations, built-in instrumentation, and hash-based randomization allows the most advanced technology companies in the world to run thousands of experiments concurrently, without compromising performance or causing interference between them.
Business-Approved Metric Repository
To ensure accurate and efficient analysis of experiment results, it is essential to have a standardized metric definition across all experiments that is approved by the business. Without a uniform metric definition, stakeholders may dispute the results of an experiment readout, leading to unproductive debates.
To address this issue, companies should establish a standard set of metrics with pre-approved definitions that are integrated into their experimentation platform. Eppo simplifies the process of creating metrics by allowing users to select the SQL Fact they want to aggregate on and choose from various aggregation methods.
For example, a metric like ‘Total Purchase Revenue’ is a simple Sum aggregation on top of a ‘Purchase Revenue’ fact. A more complex metric like ‘Users with ≥ $50 of Revenue within 7 Days of Assignment’ is also quite effortless and does not require the user to write any custom SQL code. By selecting the Threshold aggregation and entering several essential parameters, this metric can be created in a few clicks.
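Under the hood, a threshold metric of this kind corresponds to SQL along the following lines. This is a hand-written illustration against the hypothetical schema from the sketches above, not the query Eppo actually generates:

```python
# Illustrative equivalent of a threshold aggregation: flag each assigned
# user who accrued at least $50 of revenue within 7 days of assignment.
threshold_metric_sql = """
SELECT
    a.user_id,
    CASE WHEN COALESCE(SUM(f.revenue), 0) >= 50 THEN 1 ELSE 0 END
        AS hit_threshold
FROM analytics.experiment_assignments AS a
LEFT JOIN analytics.purchases AS f
    ON f.user_id = a.user_id
   AND f.purchased_at BETWEEN a.assigned_at
                          AND DATEADD('day', 7, a.assigned_at)
GROUP BY a.user_id
"""
```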
This flexible approach to metric creation helps ensure consistency and accuracy in experiment analyses while minimizing the potential for disagreements among stakeholders.
Once defined, metrics are saved to a shared repository, where they can be organized into collections for easier management. A collection is simply a logical grouping of metrics, and adding collections to an experiment is a straightforward process.
Automated Data Quality Monitoring
The saying “garbage in, garbage out” is particularly relevant in the context of experimentation. The quality of data being input into the experiment platform directly affects the output. This is why a reliable experimentation platform should automatically check for data quality issues and alert users if any are found.
An essential aspect of an experimentation platform is to ensure that users are being randomized correctly and as anticipated. Inadequate random assignment can introduce bias into experiment results and make any inferences drawn from them invalid.
Eppo achieves this by conducting sample-ratio mismatch tests on assignment data at both the aggregate and dimensional levels. If a statistically significant deviation from the expected traffic allocation is detected, an alert is generated and users are advised to postpone interpreting any experiment results until the issue is addressed.
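Conceptually, a sample-ratio mismatch check is a chi-square goodness-of-fit test on the assignment counts. The sketch below shows the core idea with made-up counts and scipy; Eppo's production diagnostics may differ in detail:

```python
from scipy import stats

# Observed assignment counts per variant (made-up numbers).
observed = [10_240, 9_760]   # control, treatment
intended = [0.5, 0.5]        # the configured traffic split
total = sum(observed)
expected = [p * total for p in intended]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
if p_value < 0.001:          # a common, deliberately strict threshold
    print(f"Possible sample-ratio mismatch (p = {p_value:.2e}); "
          "hold off on interpreting results.")
```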
Apart from detecting traffic imbalances automatically, other data quality tests are performed and reported within a separate Diagnostics tab. The tab lists any failed tests, and users are also alerted in the experiment report header if a diagnostic test has not passed. This prompts them to address the issue before analyzing their experiment results.
Modern Statistical Methodologies Minimizing Time-to-Insight
As Data Scientists, we understand that experiments take time: they need adequate statistical power to be valid. This is at odds with the business's need for agility, which comes from learning and iterating quickly. In this world, relying on simple t-tests and z-tests won't cut it. Fortunately, in recent years, advanced statistical methodologies have been developed that reduce experiment runtimes without sacrificing statistical rigor.
One such approach is Sequential Analysis, which lets users peek at results at any time during the experiment. If a statistically significant result is observed, the experiment can be terminated early without inflating the false-positive rate. Sequential Analysis achieves this by calculating confidence intervals more conservatively than traditional fixed-sample approaches, so results remain valid no matter how often they are checked.
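There are several ways to construct such always-valid confidence intervals. The sketch below implements one classic construction, the normal-mixture sequential probability ratio test of Johari et al.; it illustrates the technique rather than describing Eppo's exact method:

```python
import math

def always_valid_ci(mean_diff: float, sigma2: float, n: int,
                    alpha: float = 0.05, tau2: float = 1.0):
    """Confidence interval on a mean difference that stays valid under
    continuous peeking (normal-mixture mSPRT; tau2 is the mixture variance).
    """
    # Radius obtained by inverting the mixture likelihood-ratio test.
    v = sigma2 * (sigma2 + n * tau2) / (n**2 * tau2)
    radius = math.sqrt(v * math.log((sigma2 + n * tau2) / (alpha**2 * sigma2)))
    return mean_diff - radius, mean_diff + radius

# The interval shrinks as data accrues but stays wider than a fixed-sample CI:
print(always_valid_ci(mean_diff=0.3, sigma2=4.0, n=5_000))
```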
CUPED (Controlled Experiment Using Pre-Experiment Data) is another modern tool that can help reduce experiment runtimes. CUPED leverages regression analysis with both pre-experiment and within-experiment data to reduce the variance of experiment metrics. This reduction in variance leads to a quicker time-to-significance.
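In its simplest form, CUPED is a one-line covariate adjustment. The numpy sketch below shows the core idea on a single metric, with synthetic data standing in for real pre-experiment values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: pre-experiment revenue predicts in-experiment revenue.
pre = rng.gamma(shape=2.0, scale=10.0, size=10_000)
post = 0.8 * pre + rng.normal(0, 5.0, size=10_000)

# theta is the OLS slope of post on pre; subtracting theta * (pre - mean)
# removes the variance in `post` that pre-experiment data already explains.
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

print(f"variance before CUPED: {np.var(post):.1f}")
print(f"variance after CUPED:  {np.var(adjusted):.1f}")
```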
By combining both Sequential Analysis and CUPED, experiment runtimes can be significantly decreased, sometimes by 50% or more. Any experimentation platform that lacks these modern statistical tools unnecessarily hinders the agility of the company.
Easily Digestible Reporting
Although a good experimentation platform's statistical layer should feel like it was written by a PhD Statistician, its reporting layer should be understandable by everyone in the company, regardless of technical expertise. Experimentation involves many different roles with varying levels of technical ability: Designers, Product Managers, CEOs, and Data Scientists.
The platform should enable non-technical users to draw accurate insights from the results without needing to understand the underlying statistics. The platform should also inform users when an experiment is ready to be analyzed and when it is safe to draw conclusions from the final results.
For the first task, Eppo uses a progress bar that indicates whether the experiment has reached an adequate sample size, abstracting away the concept of statistical power. For more technical users, it also provides information on the specific criteria for determining an adequate sample size.
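One common criterion behind such a progress bar is the classical two-sample power calculation, sketched below; the exact criteria Eppo applies are its own, so treat this as the textbook version:

```python
import math
from scipy.stats import norm

def required_n_per_arm(sigma: float, mde: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sample z-test detecting an absolute
    lift of `mde` on a metric with standard deviation `sigma`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.ppf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sigma / mde) ** 2)

# e.g. detecting a $0.50 lift on a metric with sigma = $20:
print(required_n_per_arm(sigma=20.0, mde=0.5))  # roughly 25,000 users per arm
```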
For the second task, Eppo uses metric report cards with vibrant color schemes to convey whether an observed result is statistically significant. A metric currently showing a statistically significant positive effect is highlighted in green; a statistically significant negative effect is highlighted in red. Everything else is colored grey to indicate an insignificant result.
For slightly more technical folks, the confidence intervals for the estimated lift are displayed next to the observed results. This allows those users to quickly intuit just how wide or narrow a current estimate is for any given metric.
Conclusion
The Snowflake Data Cloud provides a huge opportunity for data and product teams to scale their experimentation practice and run high-impact experiments that are tied to business metrics, making it simpler to quantify ROI from both data investments and shipped code.
Eppo streamlines your feature flagging and experimentation workflow directly on top of the data warehouse, allowing you to run 10x more experiments with your current infrastructure without sending sensitive raw data to a third party.