The big lessons we learned from a failed hypothesis and a mighty A/B test

Richard Loa
Published in CBC Digital Labs
7 min read · Nov 3, 2017

Hi, I’m Richard Loa, product manager for CBC’s analytics and search team. In the last post I wrote for this blog, I talked about how our team shifted to product-based work. Today I’m going to expand on that idea, making the case for using the lean startup approach.

Can I get a hooray for testing?

My team, the folks responsible for analytics and search at the CBC, recently conducted an experiment that focused on changes we made to the search function of our website, CBC.ca.

Our hypothesis was ultimately invalidated but what we gained was hugely valuable — and that’s something to celebrate.

Let’s start at the beginning….

Michael Wekerle, of Dragons’ Den, giving the thumbs up

Our team uses the lean startup approach because it allows us to constantly ask the question, “Are we going in the right direction?” We are in continual pursuit of idea validation based on data and short feedback loops.

Working lean is driven by making assumptions and then validating those assumptions by making small, iterative changes. With each iteration, the impact is measured against the goal. The feature set of a product will be more focused and more narrow (compared to other approaches, including Waterfall and Agile) because there’s more iteration on each feature.

With the goal of optimizing the search function on our website, we decided to run an A/B test — presenting two variants, or user experiences, to Canadians. We focused on whether people read an article found on the search results page.

Our thinking was that the search function is more useful when people read an article that was discovered organically — i.e. when a user searches for a topic on their own — through CBC’s search page. (There are also cases where a user hits the back button after clicking a search result, which we believe produces a false positive.)

Using our six-step process, here’s how that work played out:

Step 1: State the hypothesis

We theorized that when using the CBC.ca global navigation search function — the search bar at the top of every page on our site — Canadians are more likely to read an article if the algorithm delivers results that consider both the search query term and a user’s previously viewed/read content (as compared to an algorithm that considers only the search query terms). This hypothesis is based on a parent hypothesis that recommendation/relevance will increase engagement.

Step 2: Define a controlled method of testing the hypothesis (aka the methodology)

To test this hypothesis, our methodology included guidelines and a funnel.

The guidelines:

  • The user experience of both search engines should be as close to identical as possible
  • Search algorithm A (uses only search query terms) will be found at http://www.cbc.ca/gsa/, and search algorithm B (using search query terms and a user’s viewed/read content history) will be found at http://www.cbc.ca/search/
  • We will evenly divide exposure to the two algorithms
  • The test will run to a minimum sample size of 10,000 exposures for each algorithm
Side-by-side search results pages from algorithms A & B, using the query “Justin Trudeau” — they look the same, right?
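The post doesn’t describe how the even split was implemented; a common technique is to deterministically hash a user or session identifier, so a returning user always lands in the same bucket. A minimal sketch — the function and experiment name here are hypothetical illustrations, not CBC’s actual implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "search-ab-test") -> str:
    """Deterministically bucket a user into variant A or B (even 50/50 split).

    Hashing the experiment name together with the user ID keeps buckets
    stable across sessions and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

Because the assignment is a pure function of the ID, no server-side state is needed to keep a user in the same bucket for the life of the test.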

The funnel we used to measure search utility:

  1. User searches a topic via the global navigation search tool
  2. User selects an article to read from the search results page
  3. User reads the article selected from that results page (without going into specifics, a set of signals is used to determine whether an article has been read)

(Note that the funnel presented above is the final version — it changed while iterating on the testing methodology. See more in the summary provided below.)
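To make the funnel concrete, here is one way session events could be rolled up into per-step counts. The event names and data shape are assumptions for illustration, not the team’s actual tracking schema:

```python
FUNNEL_STEPS = ["search", "select", "read"]

def funnel_counts(sessions):
    """Count how many sessions reached each funnel step, per variant.

    `sessions` is a list of (variant, steps_completed) pairs, where
    steps_completed is the set of funnel events observed in that session.
    A session only counts toward a step if it completed every earlier step.
    """
    counts = {"A": [0, 0, 0], "B": [0, 0, 0]}
    for variant, steps in sessions:
        for i, step in enumerate(FUNNEL_STEPS):
            if step not in steps:
                break  # the session fell out of the funnel here
            counts[variant][i] += 1
    return counts
```

For example, two A sessions (one that searched and selected, one that only searched) and one B session that completed all three steps roll up to `{"A": [2, 1, 0], "B": [1, 1, 1]}`.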

Step 3: Define the criteria for acceptance and rejection

Acceptance criteria: Canadians search, select and then read articles at least one per cent more often from algorithm B than they do from algorithm A.

Rejection criteria: If the acceptance criteria are not satisfied, we’ll go back to the drawing board and design a new algorithm to test against algorithm A.
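Interpreting “one per cent more often” as a one-percentage-point difference in the search-to-read rate (my assumption; the post doesn’t spell out the base), the decision rule can be sketched as:

```python
def funnel_rate(reads: int, searches: int) -> float:
    """Fraction of searches that end in a read: the funnel completion rate."""
    return reads / searches

def accept_hypothesis(a_reads, a_searches, b_reads, b_searches, threshold=0.01):
    """Accept if algorithm B completes the funnel at least `threshold`
    (one percentage point) more often than algorithm A."""
    return funnel_rate(b_reads, b_searches) - funnel_rate(a_reads, a_searches) >= threshold
```

A production decision rule would usually pair this with a statistical significance test; a raw threshold alone can be triggered by noise at small sample sizes.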

Step 4: Run the test, capture the findings

We ran our first test from June 13–27 of this year. Results showed that search algorithm B was more useful, with a two per cent increase in completing the desired funnel.

Step 5: Analyze the findings, identify key learnings

When we dug deeper into the results, we discovered that all Internet Explorer and Microsoft Edge browsers had been excluded from the test. Since a meaningful portion of CBC.ca traffic — about eight per cent — comes from IE/Edge browsers, we invalidated the test at this step.

Step 6: Draw a conclusion, repeat steps 1–5

At this point, our test was invalidated. In order to properly test our hypothesis, we needed to configure the test to include IE/Edge browsers and run the test again.

Iterate and iterate again… and again…

So that’s exactly what we did. We went through a series of iterations resulting in invalid tests before arriving at a test we could trust. Below are the summaries of our five test iterations:

Test #1 — ran from June 13 to 27

Search algorithm B outperformed search algorithm A by two per cent, when excluding IE/Edge browsers.

Recommendation: Include IE/Edge browsers in the test.

Test #2 — ran from June 28 to July 11

After adding IE/Edge browsers into the test, search algorithm A outperformed search algorithm B. We then learned about a bug in search algorithm B that didn’t allow certain versions of the IE/Edge browsers to track the testing funnel.

Recommendation: Fix the bug in order to validate the methodology of the test.

Test #3 — ran from July 20 to August 3

After fixing the IE/Edge bug, the sample size was too small to be conclusive. The team learned that there was a change to the user experience in search which removed 75 per cent of potential test participants.

Recommendation: Use the new search user experience to run the test on http://www.cbc.ca/search.

Test #4 — ran from August 14 to 28

Search algorithm B outperformed search algorithm A by 3.5 per cent.

In order to deliver utility, the team decided that the action of reading an article is what we wanted to optimize for (previously, the optimization event was selecting an article to read).

Recommendation: Add read as the last step in the test funnel.

Test #5 — ran from September 28 to October 11

Search algorithm A outperformed search algorithm B by 0.92 per cent. The team learned that there are search sessions where users select an article to read and are sent to a destination where the read detection code does not exist. On top of that, a scroll depth bug within the read detection code results in underreporting of read events in the testing funnel.

Recommendation: Fix the read detection bug and add the read detection code to more parts of CBC.ca.

Funnel Visualization

This graph illustrates how the product team is currently analyzing the test results from test iteration #5.

Bar graph showing results for all three stages of the funnel, comparing algorithm A and B side-by-side. The bars get shorter at each step as users fall out of the testing funnel after deciding to do something else.

What we found, by the numbers

  • In the case of search algorithm A, there were 15,312 opportunities to return search results; 5,039 (33 per cent) of those search results pages presented an article the user was interested in reading; and 1,415 of those articles were read — 9.24 per cent of all searches.
  • In the case of search algorithm B, there were 11,505 opportunities to return search results; 4,272 (37.1 per cent) of those search results pages displayed an article the user was interested in selecting; and 957 of those articles were read — 8.32 per cent of all searches.

Thus, articles are read 0.92 percentage points less often when results are served via algorithm B.
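The reported rates follow directly from the counts above; a quick check in Python (numbers copied from the bullets):

```python
# Funnel counts for each variant, as reported in the test #5 results.
funnel = {
    "A": {"searches": 15312, "selects": 5039, "reads": 1415},
    "B": {"searches": 11505, "selects": 4272, "reads": 957},
}

for variant, f in funnel.items():
    select_rate = f["selects"] / f["searches"]  # step 1 -> step 2
    read_rate = f["reads"] / f["searches"]      # step 1 -> step 3
    print(f"{variant}: selected {select_rate:.1%}, read {read_rate:.2%}")

diff = (funnel["B"]["reads"] / funnel["B"]["searches"]
        - funnel["A"]["reads"] / funnel["A"]["searches"])
print(f"B - A read rate: {diff:+.2%}")
```

This reproduces the read rates of 9.24 per cent (A) and 8.32 per cent (B), and the 0.92-point gap in B’s disfavour.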

Embracing the journey

As dictated by the lean startup approach, we took what we learned from our first series of tests, refined our approach and redeployed.

We ran the test again after the web presentation team rolled out a change that affected the search experience on CBC.ca. The test methodology was slightly different: both variants now live at http://www.cbc.ca/search. As with our first experiment, 50 per cent of searches receive results from search algorithm A and the other 50 per cent from algorithm B, and the look and feel of the search results from A vs. B are identical.

So far, our test results show that Canadians prefer the status quo, search algorithm A, to the alternative. (We weren’t too surprised — we are in the infancy of personalization at the CBC, and it’s a hard thing to do right.)

Since building products is a learning journey that never ends, we’ll continue to iterate and optimize for more useful results on the cbc.ca/search page.

Test, learn, repeat.

This is what I’m hoping to light — a big ol’ fire that challenges the way you work (Photo by: Richard Loa)

I hope I’ve managed to light a fire — or at the very least a spark — that got you thinking about idea validation and the ways in which you work.

I’m curious to know what others in the industry are doing to support their learning in product development, and would also appreciate feedback on how we’re using the lean startup approach. Whether the sentiment is neutral, negative or positive, please leave me a comment in the section below.


Richard Loa
CBC Digital Labs

Analytics and Search Product Owner, Digital Operations, CBC