What’s Observational Data For?
The Limits of Non-Experimental Data
I dislike the phrase ‘directionally correct’. I often hear it thrown around in the context of analyses of non-experimental data, as a way of acknowledging that while correlation does not equal causation, it’s close enough for a rough estimate. This is, of course, not true at all.
Imagine for a minute that cigarettes cost $1,000 per pack. As curious scientists we have a hypothesis that smoking causes cancer and would like data to support or refute this idea. So we decide to measure all-cause mortality among those who do and do not smoke. Comparing the two groups, we fail to find evidence that smoking increases mortality, and in fact find the opposite: smokers actually have a LOWER mortality rate. While we didn’t have a true randomized experiment (after all, that would be unethical), we still figure our results are close enough and conclude that smoking does not cause cancer, and may even reduce mortality. Based on this, we develop a program to subsidize the cost of cigarettes so that less affluent people can afford to smoke and be just as healthy as their wealthy, smoking counterparts.
What went wrong here is probably obvious: if cigarettes were so expensive, only the very rich could afford to smoke with any regularity, so when comparing our two groups we measure not only the effect of smoking, but also the effect of being rich. If the effect of being rich on all-cause mortality is stronger than the effect of smoking and the two effects are pulling in opposite directions, our estimate of the effect of smoking is not even directionally correct — it is completely reversed.
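To make the reversal concrete, here is a minimal simulation of the thought experiment, with every number invented purely for illustration: smoking raises mortality, wealth lowers it more strongly, and almost no one but the rich smokes. The naive comparison flips the sign; holding wealth fixed recovers the harm.

```python
# A minimal simulation of the expensive-cigarettes thought experiment.
# All numbers are made up purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

rich = rng.random(n) < 0.10                           # 10% of people are rich
smokes = rng.random(n) < np.where(rich, 0.60, 0.01)   # only the rich can afford to smoke

# True causal structure: smoking raises mortality, wealth lowers it (and is the stronger effect)
p_death = 0.10 + 0.05 * smokes - 0.08 * rich
died = rng.random(n) < p_death

df = pd.DataFrame({"rich": rich, "smokes": smokes, "died": died})

# Naive comparison: smokers look *healthier* because they are mostly rich
print(df.groupby("smokes")["died"].mean())

# Holding wealth fixed recovers the true, harmful effect of smoking
print(df.groupby(["rich", "smokes"])["died"].mean())
```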
More generally, when we measure correlations in non-experimental data we have to deal with the fact that the ‘treatments’ we’d like to measure are self-selected, and therefore correlated with numerous other attributes and behaviors, making an unbiased estimate of their effects impossible. Random assignment in an experiment or A/B test solves this problem.
To give a more realistic and industry-friendly example, imagine you run a website that sells unlocked smartphones at a discount. At some point your team decides to launch a new feature that allows price comparisons across many different major smartphone retailers. However, you don’t want to dedicate the resources to run an A/B test, so instead you launch the feature and then compare the revenue generated per user between those who use the tool and those who don’t. Happily, you find that users who used the tool also generated more revenue, so you kick back and wait for your revenue to grow. But it never does.
What happened this time? This is something we run into all the time with software products: we launch a feature and want to understand if it affects our core metrics positively: revenue, subscribers, hours spent in product. But, as with the smoking example, the comparison between the users who used the tool and those who didn’t isn’t a fair one. There are intrinsic differences between the two groups, so comparing their revenue tells us more about the users’ inherent characteristics than it does about the causal effect of using our price comparison tool. In consumer software you will almost always find that your core metrics are higher among users who used Feature X than among those who did not, for the simple reason that more engaged users, who tend to score higher on core metrics, are more likely to have used any given feature.
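A small made-up simulation shows how stark this can be: below, the price comparison tool has no effect on revenue at all, yet the naive comparison between tool users and non-users shows a large apparent lift, which vanishes once we compare within engagement levels.

```python
# Illustrative-only simulation of feature self-selection: the price-comparison
# tool has *zero* true effect on revenue, yet the naive comparison shows a lift.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000

engaged = rng.random(n) < 0.30                         # latent "highly engaged" users
used_tool = rng.random(n) < np.where(engaged, 0.70, 0.10)
revenue = rng.gamma(shape=2.0, scale=np.where(engaged, 30.0, 10.0))  # engagement drives revenue, the tool does not

df = pd.DataFrame({"engaged": engaged, "used_tool": used_tool, "revenue": revenue})

# Naive comparison: tool users appear far more valuable
print(df.groupby("used_tool")["revenue"].mean())

# Within engagement strata the apparent "effect" of the tool disappears
print(df.groupby(["engaged", "used_tool"])["revenue"].mean())
```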
The obvious answer to this problem is to always do A/B tests for causal questions like these, and we should (and do) do that. But then what is the rest of our data for? Should 100% of our work as data scientists be analysis of experimental data?
There is a time and place for observational data, but it’s not for getting approximate estimates of causal effects, since, as shown above, observational data can’t deliver those without additional (and usually implausible) assumptions.
Here are what I see as the three primary uses of observational data in an analytics or data science setting:
Counting Things
The most straightforward use of observational data is to count it. As bland as that might sound, when launching a product, iterating on an existing feature, or just monitoring current performance, raw counts can be incredibly useful.
For example, when thinking about new features to launch, straight counts of data can tell us a lot about which features might be high impact. If we’re looking at building a feature that will exist in a sub-menu that only 1% of visitors access on a monthly basis, we might make a pretty reasonable guess that this is not a feature to prioritize highly. Similarly, a change to a feature that 90% of users touch daily can have a huge impact, positive or negative, and should be taken very seriously.
Counting data is also a great way to understand performance and track bugs in an existing feature. If we release a new version of a mobile product with no user-facing changes and yet notice that the volume of clicks on a particular page is down by 10%, there’s a reasonable chance that we have a problem. If we further find that all the missing clicks are on our Android platform, while iOS looks normal, then we know where to guide our engineers so that they can get a fix in.
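Neither of these examples requires anything more sophisticated than a few grouped counts. Here’s a sketch, assuming a hypothetical event log with user_id, event, platform, and app_version columns:

```python
# Sketch of the kind of counting queries described above; the file name,
# column names, and event names are hypothetical.
import pandas as pd

events = pd.read_csv("events.csv")  # hypothetical export of the last 30 days of events

# Feature reach: what share of monthly active users ever open the sub-menu?
mau = events["user_id"].nunique()
submenu_users = events.loc[events["event"] == "open_submenu", "user_id"].nunique()
print(f"sub-menu reach: {submenu_users / mau:.1%} of MAU")

# Bug monitoring: clicks on a page broken out by platform and app version
clicks = events[events["event"] == "click_page_x"]
print(clicks.groupby(["platform", "app_version"]).size().unstack(fill_value=0))
```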
These examples, while maybe not the most exciting analyses, are incredibly important uses of observational data for a product-focused data scientist.
Describing Behavior
Descriptive or exploratory analysis is what I like to think of as pre-experimental work. This is analysis that describes how users behave, and its greatest value is in helping data scientists, engineers, and product managers to generate hypotheses about what feature changes might have a positive impact.
This can get dangerous though. The temptation is to compare users who exhibit a certain behavior to users who don’t, and then draw causal conclusions about how that behavior affects core metrics. Of course, as with the smoker/non-smoker comparison outlined above, this is not an effective approach.
The more productive path is to deeply understand user behavior based on data, but then instead of looking for comparison points in observational data, imagine alternative products and how they might alter behavior. For example, if you notice that time spent on a particular page in your product is shorter than others, you could imagine a different version of that page designed to keep users engaged for longer, which, in turn, might yield an improvement in overall retention. Following that, an A/B test could provide evidence supporting your original hypothesis (that the new page will keep users engaged longer) or refuting it.
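The analysis of that follow-up test can be as simple as a two-proportion test on retention. A minimal sketch, with placeholder counts, using statsmodels:

```python
# A minimal sketch of evaluating the follow-up A/B test: did the redesigned page
# improve 7-day retention? The counts below are placeholders.
from statsmodels.stats.proportion import proportions_ztest

retained = [4_210, 4_420]   # users retained at day 7: [control, new page]
exposed = [25_000, 25_000]  # users assigned to each variant

stat, p_value = proportions_ztest(count=retained, nobs=exposed)
print(f"retention: control {retained[0]/exposed[0]:.1%}, "
      f"treatment {retained[1]/exposed[1]:.1%}, p = {p_value:.3f}")
```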
It’s worth noting that previous experiments should play a strong role in the hypotheses generated at this stage. In the example above, your hypothesis is a reasonable one if previous experimental data shows that retention increases as time spent in the app increases, whereas if an earlier experiment suggests no such relationship, you might want to look elsewhere.
The key is that the focus here should be on understanding behavior and generating hypothetical product changes that could alter that behavior, not on drawing causal connections between behaviors and outcome metrics.
Prediction
With prediction things get a little more interesting: rather than asking how many users do X, we ask how many users will do X. It might be leaving a platform, or clicking on a link, or signing up for a subscription. Accurate predictions can have a huge positive impact on a business, whether they’re used for recommending a new video to watch or for deciding when to send a customer an email. And, in general, we can build effective predictive models from non-experimental data.
Take, for example, a well-known computer vision problem: given an image of a handwritten digit, we want to build a model that predicts which digit the image represents. The input is a matrix of pixels, with a numeric value in each cell representing the intensity of that pixel on a grayscale; the desired output is a classification from 0 to 9. In this case we don’t actually care about causal inference: whether or not a pixel, or combination of pixels, ‘causes’ the image to be a particular digit is irrelevant. We just want to minimize our prediction error in classifying the images, and the data representation of the images lets us do that, no experiment necessary. But, as before, there are traps in building predictive models where it’s easy to draw causal conclusions without realizing you’re doing so.
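Before getting to those traps, here is a quick sketch of the digit task using scikit-learn’s small built-in handwritten-digits dataset (8x8 pixel images, classes 0–9), where prediction accuracy is the only goal and no causal interpretation is attached:

```python
# Digit classification as pure prediction: minimize error, make no causal claims.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # each row is a flattened 8x8 pixel matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5_000)     # we only care about held-out accuracy
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```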
Let’s say we’ve built, validated, and deployed a model that accurately predicts users who will leave our platform in the next 7 days, i.e. churn. The only step left is to find a way to convince those users to stick around instead of leaving. We notice that the strongest predictor in our model is whether or not the user has engaged with Feature Y in the last 30 days: users who did not engage with this feature have a 90% chance of leaving the platform. We develop a clever intervention that, after using our model to identify at-risk users, directs them to use Feature Y. While we see a huge lift in usage of Feature Y, our churn numbers remain the same.
Of course there’s a logical fallacy here: just because usage of Feature Y is predictive of retention doesn’t mean it causes retention. So while we’ve correctly identified the users who will churn, we have not yet determined the best way to stop them from leaving. At this point, the best course of action is to A/B test different interventions and see what works. Observational data lets us train an accurate predictive model, but evaluating the efficacy of any intervention should be left to A/B tests.
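As a sketch of where that line falls, here is how such a churn model might be trained and inspected (the data file and column names are hypothetical); the feature-importance ranking at the end tells us whom to target, not which levers to pull:

```python
# Sketch: a churn model's feature importances rank predictors, not causes.
# The file name and column names here are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

users = pd.read_csv("users.csv")                         # hypothetical snapshot of user features
features = ["used_feature_y_30d", "sessions_30d", "tenure_days"]
X_train, X_test, y_train, y_test = train_test_split(
    users[features], users["churned_7d"], random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")

# "used_feature_y_30d" may well top this list, but that only tells us whom to
# target, not that pushing users toward Feature Y will change the outcome.
print(pd.Series(model.feature_importances_, index=features).sort_values(ascending=False))
```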
I’m sure there are other uses of non-experimental data that I’ve missed here, but, in my experience, these three categories capture the primary use cases. As data scientists and analysts we are all asked to draw causal conclusions from observational data, often on a daily basis, so knowing its limits, and being able to communicate them, is important. Product decisions with real impact ride on how we interpret the data, so let’s take that responsibility seriously and use observational data responsibly.