Building an In-House Solution for AB Experimentation at Pluralsight

Co-authored by Levi Thatcher and Rich J Wheeler

TLDR: Pluralsight, a SaaS company, runs AB experimentation on an in-house platform built on user feedback and consistent, centralized metrics

Step 1: Start Analyzing Actions Scientifically (SaaS)

In a previous post, we wrote about our efforts to make Pluralsight product development more data-driven. For context, one critical highlight of that post is that we found we needed to establish better metric standards and a central repository for all of our AB experiments. Zooming out a bit, AB experiments have historically been built around the needs of e-commerce companies, which focus on event-based metrics like conversion (a user signs up for something, a user makes a purchase, etc.). At Pluralsight, since we’re a SaaS company, our product organization instead uses metrics like user retention. In other words, we want to know if a paying subscriber comes back within a certain time frame. Certainly, we have a substantial B2B user base and we’re laser focused on account renewals, but for the purposes of AB testing, one of the key metrics we focus on is user retention, partly because B2B account renewals happen on a much longer time frame and can’t be used to judge a typical AB test (which usually lasts just a few weeks).

There are myriad ways to calculate user retention. Because of the practicalities around test length and because of its relationship to long-term retention at Pluralsight, we use 2nd- and 8th-week retention as core metrics by which all tests are judged. In terms of the actual definition, this is a bracketed or classic sense of retention (see here), rather than a rolling or cumulative retention. We simply define 2nd-week retention as whether a user who landed in a test at time zero returns to the app at all within 168 to 336 hours (i.e., one to two weeks) of that first experiment visit.
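
To make that definition concrete, here is a minimal pandas sketch of the calculation. The DataFrame and column names (user_id, first_exposure_ts, visit_ts) are illustrative assumptions rather than our actual schema, and the window boundaries are one reasonable choice, not necessarily the engine's exact handling.

```python
# Minimal sketch of the bracketed 2nd-week retention definition described above.
# Column names (user_id, first_exposure_ts, visit_ts) are illustrative,
# not Pluralsight's actual schema; boundary handling is one reasonable choice.
import pandas as pd

def second_week_retained(exposures: pd.DataFrame, visits: pd.DataFrame) -> pd.Series:
    """Flag each exposed user as retained if any visit lands 168-336 hours
    after their first experiment exposure."""
    merged = visits.merge(exposures, on="user_id", how="inner")
    delta = merged["visit_ts"] - merged["first_exposure_ts"]
    in_window = (delta >= pd.Timedelta(hours=168)) & (delta <= pd.Timedelta(hours=336))
    retained_users = merged.loc[in_window, "user_id"].unique()
    return exposures["user_id"].isin(retained_users)
```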

With this metric in place, we had a rallying cry: gather all experiments around a single measure used to judge every test. At a minimum it would enforce a “Do No Harm” outcome; at its best it would highlight experiments that propelled user retention.

Step 2: A Stats Engine to Rule Them All

Product development at Pluralsight is based on bounded contexts, and product teams enjoy a great deal of autonomy, which historically meant that experimentation analysis happened at the team level. As our AB testing infrastructure has matured and growth has forced us to coordinate better, it’s become obvious that we need a central way to calculate metrics in a standard, reproducible manner. This ensures results are trusted across the organization and helps unify expectations around experimental outcomes.

After careful discovery (more on this below) and before starting a vendor search, we determined that we needed a few key elements in an AB testing solution: consistent metrics around retention, easily interpretable significance calculations, and a central home for all product experimentation documentation. We couldn’t find a vendor that fit those needs, largely because AB experimentation platforms don’t integrate well with clickstream data in the context of retention, so we decided to build in-house. To begin that journey, we first built an MVP stats engine that made a few practical decisions:

  • Uses Python, the language of data engineering
  • Brings engineering rigor via unit tests and Docker
  • Leverages frequentist statistics for now
  • Provides broad internal visibility via easy GitHub access

While we’re excited about the benefits of Bayesian methods of analysis, the cognitive cost of adopting that paradigm at this stage wasn’t worthwhile compared to other low-hanging fruit we’ve prioritized around spreading a culture of experimentation (when to test, how to test correctly, how to leverage results, etc.). A coalition of passionate data scientists and analysts has been mobbing, largely via the awesome Visual Studio Code Live Share, to establish these standards and push forward improvements.
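
For a sense of the kind of frequentist calculation involved, here is a hedged sketch of a two-proportion z-test on retention. It illustrates the general approach rather than the engine's actual implementation; the function name and the counts in the example are made up.

```python
# Illustrative two-proportion z-test for a retention AB test; a sketch of the
# kind of frequentist calculation described above, not the actual stats engine.
from math import sqrt
from statistics import NormalDist

def retention_z_test(control_retained: int, control_n: int,
                     test_retained: int, test_n: int):
    """Return (absolute lift, two-sided p-value) for test vs. control retention."""
    p_c = control_retained / control_n
    p_t = test_retained / test_n
    pooled = (control_retained + test_retained) / (control_n + test_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / test_n))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value

# Made-up example: 4,200 of 10,000 control users retained vs. 4,400 of 10,000 test users.
lift, p = retention_z_test(4200, 10000, 4400, 10000)
print(f"lift={lift:.3f}, p={p:.4f}")
```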

Step 3: Create a User Tool with Possibility

When creating an in-house, in-browser web app for experimentation, we were cognizant that we needed to approach this tool as a product. That meant doing deliberate discovery with stakeholders and prospective users from the beginning of development. We talked with Product Managers, Data Scientists, Product Analysts, and numerous product leaders. Each group had unique requirements, and their feedback was instrumental to the initial development of the tool. We also acknowledged that, like other software products, this tool would be developed in iterations. We adopted the MVP mindset of getting something into users’ hands quickly to accelerate the feedback cycle.

Based on feedback from stakeholders, we decided on a graphical interface coded in React, rather than (say) layering Tableau over the stats engine. This custom web app, built by the impressive Matt Adams, pulls in our data and calculations from the stats engine. It also adds context to experiments via metadata that is filled out when an experiment is set up.
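
As an illustration of the sort of context that setup-time metadata might capture, here is a hypothetical sketch. Every field name below is an assumption for illustration, not ExHub's real schema.

```python
# Hypothetical sketch of experiment metadata captured at setup time;
# all field names are illustrative assumptions, not ExHub's actual schema.
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class ExperimentMetadata:
    name: str
    owner_team: str
    product_feature: str
    hypothesis: str
    start_date: date
    status: str = "running"                      # e.g. running, concluded
    segments: List[str] = field(default_factory=lambda: ["B2B", "B2C"])
    notes_url: Optional[str] = None              # link to deeper documentation
```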

Meet Pluralsight’s ExHub:

ExHub offers many features for users, including:

1. Historical Record of Past and Present Experiments

We use ExHub as both a monitoring and a research tool. As Pluralsight and our experimentation culture grow, context around past research becomes more critical. ExHub helps Product teams improve their own experiments by analyzing what’s already been tried across the org.

2. Current Trended and Raw Retention Metrics

While 2nd-week retention is our primary metric for experimentation, we also monitor same-week and 8th-week retention, among other metrics. ExHub displays confidence intervals on the lift of the test group over control for each retention interval, along with trend-line charts showing how retention has changed over time.
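
A rough sketch of the kind of lift confidence interval such a display implies, using a normal approximation on the difference in retention rates. The retention rates and sample sizes below are made-up numbers, and the engine's actual method may differ.

```python
# Sketch of a normal-approximation confidence interval on absolute lift
# (test minus control retention rate) per retention window.
# Window names, rates, and sample sizes are illustrative, not real results.
from math import sqrt
from statistics import NormalDist

def lift_confidence_interval(p_control: float, n_control: int,
                             p_test: float, n_test: int, level: float = 0.95):
    """CI on absolute lift using the unpooled standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_test * (1 - p_test) / n_test)
    lift = p_test - p_control
    return lift - z * se, lift + z * se

for window, (p_c, p_t) in {"same-week": (0.80, 0.81),
                           "2nd-week": (0.42, 0.44),
                           "8th-week": (0.30, 0.31)}.items():
    low, high = lift_confidence_interval(p_c, 10_000, p_t, 10_000)
    print(f"{window}: lift CI [{low:+.3f}, {high:+.3f}]")
```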

3. Custom Filters

When researching past experiments, ExHub offers filters for experiment name, status, the product feature under test, and custom dimension segments. Current segmentation breaks results out by business type (B2B or B2C).

4. Links to Detailed Documentation

When a user is interested in going deeper than what is available in ExHub, we provide links to experiment notes. These notes act as a sandbox for documentation that goes beyond the metrics: written notes, ad hoc analysis performed in Adobe Analytics by a Product Analyst, or notebooks created by a Data Scientist.

The Work Never Sleeps

As it stands, ExHub serves as a single source of truth, with standardized stats calculations, creating experimentation transparency in an easy-to-use graphical interface. We’re not done yet, though. We’re already looking at better ways to ingest metadata around experiments and at additional types of segmentation. We envision a tool that will estimate how long a test will take to reach significance, allow point-and-click targeting of tests to particular user groups, and provide per-test discussion capability within ExHub. Additional features could highlight future opportunities for recommendations and personalization. In the meantime, we’ll keep experimenting and discovering ideas with our users, who ultimately decide what gets released.
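
For a rough sense of what that test-duration estimate could look like, here is a back-of-the-envelope sketch using the standard two-proportion sample-size formula. The baseline rate, minimum detectable lift, and weekly traffic figure are all hypothetical, and this is only one way such an estimate could be built.

```python
# Back-of-the-envelope sketch of a "how long until significance" estimate:
# a standard two-proportion sample-size formula. All inputs are hypothetical.
from math import ceil, sqrt
from statistics import NormalDist

def users_per_variant(baseline: float, min_lift: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant to detect min_lift over the baseline rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + min_lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / min_lift ** 2
    return ceil(n)

# e.g. detect a 1-point lift on a 42% 2nd-week retention baseline
n = users_per_variant(0.42, 0.01)
weekly_traffic = 20_000  # hypothetical users entering the test per week, per variant
print(f"~{n} users per variant, roughly {n / weekly_traffic:.1f} weeks")
```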
