Certainty Is a Luxury, Cutting Corners Is a Craft

When we data people take numbers and turn them into stories, we risk overlooking incorrect assumptions and getting things totally wrong. How can we hedge this risk so that we don’t suffer from analysis paralysis?

Yoav Arad
Simply
7 min read · Apr 22, 2021


Image: Shutterstock.

Working with data is fun: you take numbers and you tell stories. Sometimes these stories help you better understand what happened in the past, and sometimes they can help you predict what will happen in the future. Obviously there are other use cases, but for the sake of focus, let’s move forward with these two for this post.

When we take numbers and transform them into stories, we are adding a layer of interpretation. On the one hand, these interpretations provide the business value behind the data. On the other hand, these interpretations create uncertainty. Why do they create uncertainty, you ask? It’s all in the assumptions we’ve made about the data and how we chose to frame the problem in the first place.

Let’s explore this with an example.

Problem:

One of the questions we constantly ask ourselves is to what extent adding 100/1,000/10,000 more songs to our song library will increase the number of sessions our learners spend playing on Simply Piano or Simply Guitar. In other words, how do we estimate the impact of adding new content to the apps?

Framing:

One possible way to frame this problem is:

  1. How many learners have already completed all the songs in the library?
  2. How many learners are expected to complete them all?
  3. How many learners stopped playing songs (got bored?) and could possibly benefit from a wider variety of available songs?
  4. How much retention does each library song generate? (For simplicity’s sake, let’s say retention = average sessions per learner).

Data:

Assuming we have event-based data, we’ll be looking at something along these lines:
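For illustration, a hypothetical slice of such event data might look like this (the column names are made up for this post, not our actual schema):

```
learner_id  event_name    song_id  event_time
1001        start_song    s_042    2021-03-01 18:02:11
1001        finish_song   s_042    2021-03-01 18:09:47
1001        start_song    s_107    2021-03-01 18:10:30
1002        start_song    s_042    2021-03-02 07:45:03
...
```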

We can use this data to answer the above questions with some certainty. Let’s dive into the specifics and see how we can answer two of the listed questions.

1. How many learners have already completed all the songs?

I would tackle this question by looking at the number of distinct songs each learner completed.

If this number equals the number of songs in the library, then they’ve completed all the songs.
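Here’s a minimal pandas sketch of that check, assuming the illustrative schema above; the data and the library size are made up for this toy example:

```python
import pandas as pd

# Toy events table using the illustrative schema from above
events = pd.DataFrame({
    "learner_id": [1001, 1001, 1001, 1002, 1002],
    "event_name": ["start_song", "finish_song", "finish_song",
                   "start_song", "finish_song"],
    "song_id":    ["s_042", "s_042", "s_107", "s_042", "s_042"],
})

LIBRARY_SIZE = 2  # hypothetical library size for this toy example

# Distinct songs each learner has completed
completed = (
    events[events["event_name"] == "finish_song"]
    .groupby("learner_id")["song_id"]
    .nunique()
)

# Learners whose distinct completed-song count equals the library size
print(completed[completed == LIBRARY_SIZE])  # learner 1001 completed all songs
```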

But how certain are we about this answer? Let’s look at the assumptions we made:

  • We have full events data.
  • The finish_song event registers a song’s completion.
  • The library’s content didn’t change during a learner’s usage period, or if it did, the changes don’t affect the definition of completion.

In order to get the events into our database, we depend on network communication and some platform-dependent constraints. This means the first assumption can never be fully true, and it is therefore the first constraint on how certain we can be. Our benchmark is a loss of 2–3% of all events.
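To get a feel for how quickly a 2–3% loss compounds, here’s a back-of-the-envelope calculation (assuming, purely for illustration, that losses are independent across events and the library has 100 songs):

```python
# Probability that at least one finish_song event was lost for a learner
# who truly completed all songs, assuming an independent 2.5% loss per event
loss_rate = 0.025
library_size = 100  # hypothetical
p_undercounted = 1 - (1 - loss_rate) ** library_size
print(f"{p_undercounted:.0%}")  # ~92%
```

Under these (made-up) numbers, a strict equality check would misclassify almost every true completer, so comparing against a slightly relaxed threshold is one way to hedge.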

The second assumption is a matter of interpretation: if a learner only started a song but didn’t finish it, what does that mean about their intent to return to playing this song? Can we put a number on this?

The last assumption is easier to check. However, let’s say a learner finished all the songs that existed 6 months ago and hasn’t returned to the app since. Do we consider this learner someone who didn’t complete all the songs, just because more songs were added after their last session?

In this one small example we can see that there is a lot of uncertainty and many wrong conclusions we can reach through analysis of the raw data. As with many other analysis tasks, we can always dig deeper and search for more ways to frame the problem and more ways to validate our assumptions.

Our challenge and our craft as data people is knowing which assumptions make the most sense and which framing will provide a “good enough answer” in the least amount of time.

2. How much marginal retention (measured in sessions) does each additional library song provide?

Again, using the same event-based data we can tackle this question as follows:

a. Group events into sessions (using the standard 30-minute inactivity threshold to define the start/end of a session).

b. Calculate the average number of distinct new songs played* in a session.

*new songs from the learner’s perspective.
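A rough pandas sketch of steps (a) and (b), again using the illustrative schema (the 30-minute threshold comes from the definition above; the data and everything else is made up):

```python
import pandas as pd

# Hypothetical played-song events for one learner (illustrative schema)
events = pd.DataFrame({
    "learner_id": [1001] * 6,
    "song_id":    ["s_001", "s_002", "s_003", "s_001", "s_004", "s_005"],
    "event_time": pd.to_datetime([
        "2021-03-01 18:00", "2021-03-01 18:10", "2021-03-01 18:20",  # session 1
        "2021-03-02 09:00", "2021-03-02 09:10", "2021-03-02 09:20",  # session 2
    ]),
}).sort_values(["learner_id", "event_time"])

# (a) A new session starts after 30+ minutes of inactivity
gap = events.groupby("learner_id")["event_time"].diff()
events["session_id"] = (gap > pd.Timedelta(minutes=30)).cumsum()

# (b) Flag songs that are new from the learner's perspective,
# then average the number of new songs per session
events["is_new_song"] = ~events.duplicated(["learner_id", "song_id"])
new_per_session = events.groupby(["learner_id", "session_id"])["is_new_song"].sum()
print(new_per_session.mean())  # 2.5 for this toy data
```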

Leading us to something like this:

Average number of songs per session = 2.5.

Based on this simple calculation, we can say that every addition of 2.5 songs will result in an average of one more session.

Wait, wait, wait!!! Is it really that simple? And how much certainty do we have about this answer?

Obviously, this very naive approach doesn’t cut it here. Let’s look at the assumptions we made and see how reasonable they are:

  1. All sessions are identical, so taking the average number of songs describes them well.
  2. The causality between starting a new session and playing new songs is: learner has new songs to play -> they start a new session.
  3. Having more library songs to begin with will not change the average number of songs per session.

The first assumption is clearly not true. Earlier sessions are different from later sessions. Some songs take longer and are more satisfying to learn than others; weekday sessions are not the same as weekend sessions when learners have more free time; and generally assuming all learners behave the same is always a shortcut.

For simplicity’s sake, let’s say the only difference between sessions is their duration. We can estimate the uncertainty introduced by using the average by calculating the variance of session durations. If 95% of sessions are within ~10 seconds of each other (in terms of duration), and 10 seconds is small compared to a session’s average duration, it seems reasonable to assume that we are adding ~5% of uncertainty by using the average number of songs.
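One quick way to put a number on this, sketched with made-up durations, is a coefficient-of-variation check:

```python
import numpy as np

# Hypothetical session durations in seconds
durations = np.array([610, 595, 605, 600, 590, 615, 598])

mean = durations.mean()
std = durations.std()
print(f"mean = {mean:.0f}s, std = {std:.0f}s, CV = {std / mean:.1%}")
# A small spread relative to the mean suggests that treating sessions
# as interchangeable via the average is a defensible shortcut.
```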

The second assumption depicts a very problematic and common situation that we encounter in data analysis. It falls under the ‘correlation is not causation’ fallacy. While there may be ways to assert the direction of causality, the usual practice is to take a leap of faith. In our particular problem, it may make sense to design a specific test that validates our ‘new songs lead to new sessions’ hypothesis.

Our third assumption might not hold true due to the phenomenon of overchoice (meaning it’s difficult to decide when facing too many options). Still, in practical terms we can add the songs to our song library in a manner that reduces such behavior.

This example actually shows that our assumptions may be very much off and that using our ‘fast’ average calculation is probably totally wrong. This average does not tell the relevant story; it does not hold enough of the information we need to solve our business problem. Imagine coming to the conclusion that an addition of 1,000 songs will lead to 400 more sessions per learner (1,000 ÷ 2.5).

The two above examples showcase how business questions can be answered with data. In our breakdown we’ve seen that while the answers can be valuable and do correspond with reality on some level, we’ve added uncertainty by assuming things that might not be true.

So how can we know which corners to cut and which not to?

  • First, identify the assumptions we make.
  • Try putting a number on the amount of uncertainty they bring. This may be straightforward, as with lost events, or less trivial, as with folding variance into our calculations.
  • Think thoroughly about the causality in our assumptions. If an assumption is crucial to the analysis, consider whether testing it before moving forward is our next best move.
  • Sanity check the results! Do they make sense? If not, either we have a bug (np.median instead of np.mean?) or one of our assumptions doesn’t hold.
  • Discuss with others; explain the process and the results. This can be as simple as a five-minute ‘look at this graph’ discussion. If other professionals don’t call anything out, there’s a good chance nothing is off.

A few more words about A/B testing and why it has barely been mentioned here.

A/B testing serves as a hypothesis-testing tool, not a tool for estimations and general-purpose questions. A common practice is to use data analysis as a step toward coming up with hypotheses that are worth testing.

To sum things up:

“All you need in this life is ignorance and confidence, then success is sure.” — Mark Twain

In our fast-paced environment, velocity is crucial. The market is competitive, impact areas are huge and there are many decisions to be made. In order for us to succeed, we need to keep balancing between being certain and being fast. To do so, we need to have reliable data at hand, stay aware of the assumptions we make and always keep a healthy sense of criticism.

Check out additional posts that explore impact-driven data ownership, building data-driven product roadmaps, and tuning music note recognition algorithms, and have a look at our open data roles!
