Lean Backward Induction — A Pattern for Exploratory Data Analysis

Elisabeth Reitmayr
Published in ResearchGate
Dec 11, 2020 · 11 min read

Can you find interesting patterns in the data?

Data analysts often hear this question when it comes to exploratory analysis. It’s a tricky one. I have often seen people jump straight to writing code after hearing it. Diving into the data right after an interesting question comes up is tempting, but it can lead to a flawed approach: not all data and insights are suitable to inform the decision we are trying to make.

There are several interesting parallels between the processes of product design and data analysis (read e.g. this blog post about divergent and convergent phases of data analysis). In my experience, borrowing the lean approach from UX design and applying it to data analysis helps to ensure that the insights generated by the analysis are useful to the decision maker. By “lean”, I mean tackling the analysis by first generating preliminary insights, then gathering early feedback and iterating on it.

In this blog post, I describe a lean approach for exploratory data analysis that helps to ensure that the analysis provides decision-relevant insights. This pattern is derived from the idea of backward induction and builds on the principles of lean product development.

Backward induction: solve the problem in reverse

“Backward induction is the process of reasoning backward in time, from the end of a problem or situation, to determine a sequence of optimal actions.” (Wikipedia)

Backward induction is used, for example, in game theory or chess to define an optimal strategy for a game. We can apply similar reasoning to exploratory data analysis by defining the approach in reverse, starting with the desired outcome.

30% of the work happens before you start to write any code

Backward induction applied to exploratory analysis means that I start by sketching out the implications and the results of my analysis before writing any code. If I start to query data straight away without having a plan (or “research design”) for my analysis, I might end up spending a lot of time retrieving the wrong data. To prevent this, I use a pattern for exploratory analyses that requires me to do ~30% of the work before writing any code.

There are many ways to explore data, and approaches will differ depending on the problem. However, we can generalize the steps of defining the approach and identifying the relevant data, using backward induction as a mental model.

6 questions to answer before you write any code

I take the following steps to get from a business problem to a specification for data retrieval, starting with the outcome and working backward via the analysis approach to the data requirements:

  1. What is the action that will be taken based on the analysis?
  2. What is the research question?
  3. What are the hypotheses to answer the question?
  4. What kind of statistics, models or charts do I need to analyze these hypotheses?
  5. Is it efficient to put all data into one data frame, or rather several data frames?
  6. What does the schema of the required data frame look like?

Before moving on to querying the data, I usually ask for feedback on my approach. Four more steps complete the analysis, and these finally involve writing code:

  7. What are the sample size and time period I am interested in, and are any cut-off thresholds required?
  8. Query the data and ensure high data quality
  9. Analyze the data
  10. Document the results

Backward induction process for exploratory data analysis

Based on the feedback on the results, I usually do one or more iterations of this process, starting with defining new hypotheses. Sometimes it also helps to go back to step 2 and reframe the research question. Let me explain these steps in more detail below.

Backward induction for exploratory data analysis in more detail

Steps 1–3: Start with an action, translate it into hypotheses

Let’s use an example to go through the different steps of solving a product analytics problem via backward induction. ResearchGate is a platform that helps scientists discover relevant research and increase the visibility of their work. Let’s assume we want to analyze the early retention of scientists who sign up for ResearchGate.

Step 1: What is the action that will be taken based on the analysis?

What kind of follow-up action will we take after the analysis? This question is the first litmus test for whether we should work on a problem at all. If the insight will not have any consequences (where a consequence might also be not to act on something, given the data), we might as well not generate the insight. In our example, we want to redesign the signup flow in order to demonstrate the value of our product earlier to our users and to improve their early retention.

Step 2: What is the research question?

In the next step, we define a research question. In our example, we have the following research question: Which factors influence early user retention?

Step 3: What are the hypotheses to answer the question?

We want to identify relevant events in the early user journey that make a user more likely to stick with our platform. There might be many different factors influencing early user retention. However, we only have limited time for the analysis and want to identify the most important ones. Therefore, it is important to limit the solution space to a finite number of hypotheses we want to explore.

To identify the most relevant hypotheses, it can be helpful to brainstorm in a cross-functional group of domain experts. In our case, let’s gather input from the product manager, designer, front-end engineer and a member of the customer support team. We identify the following hypotheses to answer our research question:

  • Users who connect with other users early in their journey are more likely to be retained
    Rationale: social proof — seeing their peers on ResearchGate gives the user a signal that this product might be valuable for them as well
  • Users who add topics to their profile early in their journey are more likely to be retained
    Rationale: ResearchGate can make better content recommendations to users if they add their topics of interest, so they can discover more relevant content on the platform
  • Users who receive an answer to a question they posted early in their user journey on our Q&A product are more likely to be retained
    Rationale: Our user has a specific problem and is looking for a solution. Helping the user find the solution will demonstrate the value of the product

There are many more hypotheses on our list; however, we consider these the most relevant and therefore focus our first analysis on them. Remember, we want to take a lean approach. Other hypotheses can be explored in future iterations of the analysis.

Step 4: Sketch out the skeleton of your results

Step 4: What kind of statistics, models or charts do I need to analyze these hypotheses?

Let’s get creative: the next step involves pen and paper. Which types of statistics, models or charts can we use to evaluate the hypotheses defined in the previous step? When I work on a problem for the first time, I always start with descriptive statistics to get an overview of the topic.

To identify which type of charts and stats help to evaluate the hypothesis, I first sketch out what a potential answer to our research question could look like. In our example, we might get to an insight of the following shape:

Users who connect with at least 3 other users within their first week on ResearchGate are 10% more likely to be retained 30 days after signup.

We can answer this question with cohort retention curves, grouped by different segments (e.g. users who followed other users early in their journey vs. users who did not):

Example sketch for analysis output
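
To make this target shape concrete, here is a minimal sketch in R of how such a segment comparison could be computed. The mock table, the threshold of 3 connections and the column names (connections_w1, retained_d30) are assumptions for illustration, not real ResearchGate data:

```{r}
library(dplyr)

# Mock data: one row per user, with the number of users they connected
# with in their first week and whether they were still active at day 30.
users <- tibble::tibble(
  user_id        = 1:6,
  connections_w1 = c(0, 4, 1, 5, 3, 0),
  retained_d30   = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Compare 30-day retention between users with >= 3 early connections
# and everyone else.
users %>%
  mutate(segment = ifelse(connections_w1 >= 3, ">= 3 connections", "< 3 connections")) %>%
  group_by(segment) %>%
  summarize(retention_d30 = mean(retained_d30))
```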

Interlude: Get a second pair of eyes on your approach

It’s time for a review of your approach. Ask a fellow analyst or a data-savvy domain expert to challenge your research design before bringing it to life.

Steps 5–7: Translate the hypotheses into data requirements

Step 5: Is it efficient to put all data into one data frame, or rather several data frames?

We are close to specifying the required data. One more important question is whether we want to have all data in one data frame, or rather create several data frames. The answer to this question depends on the size and nature of the analysis. We might take completely different approaches to analyze different hypotheses — in this case, separate data frames might be helpful. In our example, we are analyzing hypotheses of the same type that require similar data — therefore we can gather all data in a single data frame.

Step 6: What does the schema of the required data frame look like?

With all the information generated in the previous steps, we can finally specify the schema of the required data frame. Again, pen and paper are helpful. In our example, we want to calculate the fraction of users who logged in on a given day after signup, among users with a certain value of the property we are interested in (e.g. a boolean variable that tells us whether a specific user followed another user within X days after signup). We need a boolean for each of the properties we are interested in so we can group the data by these dimensions. The boolean “answer” is only defined for users who signed up via the Q&A channel, so we can filter out users from other channels when analyzing this hypothesis. Finally, we need the total number of users in the cohort to calculate the fraction of users logging in on each day after signup.

Example data frame based on mock data
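
As a rough illustration, a mock version of this data frame could be built in R as follows; all values and the exact column names are assumptions for the sketch:

```{r}
library(tibble)

# One row per user per day after signup; the booleans mark whether the
# user performed the respective action within the early-journey cut-off.
df <- tibble(
  user_id          = c(1, 1, 2, 2),
  day_after_signup = c(0, 1, 0, 1),
  login            = c(1L, 0L, 1L, 1L),   # conceptually boolean, stored as integer
  follow           = c(TRUE, TRUE, FALSE, FALSE),
  topics           = c(FALSE, FALSE, TRUE, TRUE),
  answer           = c(NA, NA, NA, NA),   # only defined for Q&A-channel signups
  sample_size      = c(2L, 2L, 2L, 2L)    # total users in the cohort (denominator)
)
```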

Although the login variable is conceptually a boolean (did the user log in on a given day after signup or not?), we define it as an integer because this makes it easy to calculate and plot the fraction of users who logged in on a given day with a simple aggregation, e.g. in R using this code:

```{r}
library(dplyr)
library(ggplot2)

df %>%
  group_by(follow, day_after_signup) %>%
  # fraction of the cohort that logged in on this day after signup
  summarize(retained = sum(login) / first(sample_size)) %>%
  ggplot(aes(x = day_after_signup, y = retained,
             group = follow, color = follow)) +
  geom_line()
```

With this specification, we are ready to start querying our data and to finally analyze it.

Step 7: What are the sample size and time period I am interested in, and are any data cut-off thresholds required?

We are getting closer to the real data — this step finally involves writing some code. What kind of data do we need to generate the chart sketched above? How large should our sample be, and in which time period are we interested? Do we need any other additional checks before visualizing our data?

In our example, we first want to look at a generic user retention curve in order to determine at which point cohort retention flattens out. Therefore, we need to retrieve some initial data and generate a cohort retention chart like this:

Example cohort retention curve

The retention curve flattens out after the first 90 days after signup. Therefore, we can limit the analysis to this time period.
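
One rough way to double-check the flattening point is to look at the day-over-day change of the curve. A minimal sketch, assuming a hypothetical data frame retention with the columns day_after_signup and retained:

```{r}
library(dplyr)

# Days on which the retention curve has roughly flattened out, i.e. the
# day-over-day change drops below a small tolerance.
retention %>%
  arrange(day_after_signup) %>%
  mutate(delta = retained - lag(retained)) %>%
  filter(abs(delta) < 0.001)
```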

In addition, we need to define thresholds for the independent variables. Remember, these are the three hypotheses we want to analyze:

H1: Users who connect with other users early in their journey are more likely to be retained

H2: Users who add topics to their profile early in their journey are more likely to be retained

H3: Users who receive an answer to a question they posted early in their user journey are more likely to be retained

To analyze these hypotheses, we first need to check the distributions of follow events, topics added and answers received by day after signup, in order to determine a cut-off date.

Example distribution plot

The chart shows that most follow events happen in the first week after signup. Therefore, we can use the first week after signup as the cut-off for retrieving follow events. We also need to define a threshold for how many other users a “happy” user would follow within the first week. In a similar way, we can define thresholds for users adding topics to their profile and receiving an answer to their question. We want to limit the latter to users who signed up via the Q&A channel, since other users might not even have posted a question in their early user experience.
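
A minimal sketch of such a distribution check in R, assuming a hypothetical event-level data frame follow_events with one row per follow event:

```{r}
library(dplyr)
library(ggplot2)

# Distribution of follow events by day after signup, with a dashed line
# marking the candidate one-week cut-off.
follow_events %>%
  count(day_after_signup) %>%
  ggplot(aes(x = day_after_signup, y = n)) +
  geom_col() +
  geom_vline(xintercept = 7, linetype = "dashed")
```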

Steps 8–10: Analyze and document

Steps 8–9: Query and analyze the data

We have made our way backward to the data. Once we know what to query, we can start retrieving the final data set. After investigating the data quality, we can finally analyze it.
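
Before diving into the analysis, a few basic sanity checks help to ensure data quality. A sketch of what these could look like for the mock schema sketched above; which checks matter will depend on your data:

```{r}
# Basic sanity checks on the (mock) data frame defined earlier.
stopifnot(!any(duplicated(df[, c("user_id", "day_after_signup")])))  # one row per user and day
stopifnot(all(df$login %in% c(0, 1)))                                # login is a 0/1 indicator
stopifnot(all(df$day_after_signup >= 0 & df$day_after_signup <= 90)) # within the 90-day window
```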

Example cohort retention chart

Step 10: Document the results

Documentation is an often overlooked step, but I consider it highly important to document results properly. If an analysis is well documented, it can make an impact a second or third time. At ResearchGate, we collect all our analysis documentation in a repository on our wiki, and we often end up revisiting analyses from the past. You can reuse the problem definition, research question and hypotheses from steps 1–3 for your documentation. Do not invest too much time into polishing your documentation in the first iteration: it is supposed to be a lean prototype, and the main messages might change after your next iteration(s).

Gather feedback and iterate

Your first shot at the research question might not provide the expected insights. For example, we might reject all hypotheses about factors influencing early user retention and still not know how to adapt the user journey. This wouldn’t be unusual, particularly if you are working in a domain for the first time. Consider your first shot the “minimum viable analysis” for which you want to gather feedback early on. To make progress, you can iterate through steps 3–10 several times. I find it helpful to limit each iteration to a pre-defined number of hypotheses; otherwise, you might end up in an infinite feedback loop. The feedback from each iteration can be used to define hypotheses and sketch out the approach for the next iteration of the analysis.
