Adam Kelleher has a physics PhD and is principal data scientist at BuzzFeed

Adam Kelleher on experiment design and observational analysis

Arjan Haring
Published in I love experiments
17 min read · Aug 20, 2016

Adam Kelleher is principal data scientist at BuzzFeed. He has a physics PhD from the University of North Carolina at Chapel Hill.

His main interests are in recommender systems, virality, and information diffusion. Kelleher also has strong side interests in causal inference and mathematical methods for large-scale social research.

On the side, he is working on building out a causal analysis package to fill a gap in the Python ecosystem (pip install causality).

Open source projects
- Fast visualization of large graphs with H3 layout:
http://www.github.com/buzzfeed/pyh3
- Tools for causal inference and causal effects estimation:
http://www.github.com/akelleh/causality

So, what actually is causality good for?

Causality is a basic requirement any time you’re trying to make a data-driven decision about a change to a system. Often, people try to use observational data to speculate about changes. You can do this if you’re very careful: there are ways to control for variables that cause bias, but none of them are perfect. Check out my second blog post if you’re interested in the details.

The result is that if you want to use observational data to speculate about what you should do, then you’re really leaving the result up to chance. You might get lucky, and there’s no bias, and so your correlative result is causal. The problem is that you just don’t know.

To make it concrete, I’ll modify an example from this paper. You observe that when people buy winter hats, half of them (to make up a number) also tend to buy gloves. You think: “Great, this gives me an idea: if I look for products with similar levels of co-purchasing, I can put all of my advertising dollars toward those, and get more sales for my money.”

Unfortunately, most of the correlation was due to a confounding variable: it’s winter, and so hat and glove sales go up independently of each other. The clicks from one to the other were purely out of convenience, and the customer would have found an alternative path to the gloves.

The paper estimates that the true causal effect is around 1/4 the size of the correlative estimate. A change that big, wiping out 75% of the apparent value, could easily turn the strategy into something too expensive to justify. A simple experiment to establish causality, in this example, is an important step toward checking feasibility. Even better, you could try using Sharma et al.’s observational strategy to do a quick sanity check!
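To make the mechanism concrete, here is a minimal simulation sketch (with made-up effect sizes, not the paper’s estimates) in which a shared cause, winter, inflates the naive co-purchase estimate well above the true causal effect of hats on glove sales:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical data-generating process: "winter" is a confounder that
# raises both hat and glove purchases; hats also have a small true
# causal effect (0.1) on glove purchases, the convenience clicks.
winter = rng.binomial(1, 0.5, n)
hat = rng.binomial(1, 0.1 + 0.3 * winter)              # P(hat) is higher in winter
glove = rng.binomial(1, 0.05 + 0.3 * winter + 0.1 * hat)

# Naive correlative estimate: P(glove | hat) - P(glove | no hat)
naive = glove[hat == 1].mean() - glove[hat == 0].mean()

# Adjusted estimate: average the effect within each stratum of the confounder
adjusted = np.mean([
    glove[(hat == 1) & (winter == w)].mean() - glove[(hat == 0) & (winter == w)].mean()
    for w in (0, 1)
])

print(f"naive estimate:    {naive:.3f}")     # inflated by the winter confounder
print(f"adjusted estimate: {adjusted:.3f}")  # close to the true effect, 0.1
```

Stratifying on the confounder recovers the true effect here only because winter is observed; the point of the example is that when the confounder isn’t observed, the naive number is all you see.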

Of course, there are even simpler examples. A standard method of establishing causality is an AB test: you’re testing to see what happens when you actually change the system, so you compare the changed system (the test) with the original system (the control). Establishing causality amounts to establishing what would really happen if you made a change.
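For concreteness, a minimal sketch of that comparison (the conversion counts are made up, and scipy’s chi-squared test stands in for whatever testing machinery you prefer):

```python
import numpy as np
from scipy import stats

# Made-up conversion counts from a hypothetical AB test
control_conversions, control_n = 480, 10_000   # original system
test_conversions, test_n = 540, 10_000         # changed system

# Compare the two proportions with a chi-squared test on the 2x2 table
table = np.array([
    [test_conversions, test_n - test_conversions],
    [control_conversions, control_n - control_conversions],
])
chi2, p_value, dof, expected = stats.chi2_contingency(table)

lift = test_conversions / test_n - control_conversions / control_n
print(f"absolute lift: {lift:.4f}, p-value: {p_value:.3f}")
```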

Why should businesses be busy with chasing causality? And how far should they take it?

This is actually the subject of my next post! It’s fast and easy to find a correlative, observational answer to a question and hope that it’s causal. What are the odds you’re right? You have to balance the cost (and chances) of being wrong with the cost of finding causal answers.

In the real world, we make decisions in the context of uncertainty and limited information all the time. How far you should take causality really depends on your appetite for risk. I’d recommend it in cases where you’re making a really big decision, or in cases where you have to make the same small decision over and over (so being wrong repeatedly adds up!).

With that said, the more you know about the causal relationships you’re working with all the time, the more that can inform your observational results. I know there is activity bias, and so I know that I can’t just use “people who have seen an ad” as my test group, and “people who haven’t” as a control.

That would be a quick, observational, correlative approach to analyzing ad effectiveness that I know is wrong. I know it’s wrong because someone has done the work to establish causality, and has compared it with an observational approach.

If I want to work with observational data, I at least need to try to control for activity bias. There may be other biases, but controlling for known ones can (hopefully) get me closer to the right answer. Establishing a quick experimental result pays off in the short term. Establishing bias (or not) by comparing it to a correlative one can keep paying off in the long term.
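A minimal sketch of that kind of adjustment, assuming a made-up data-generating process in which user activity drives both ad exposure and conversion:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 50_000

# Hypothetical setup: highly active users are both more likely to be
# exposed to the ad and more likely to convert anyway (activity bias).
activity = rng.exponential(1.0, n)                  # proxy for how active a user is
exposed = rng.binomial(1, np.clip(0.1 + 0.2 * activity, 0, 1))
converted = rng.binomial(1, np.clip(0.02 + 0.05 * activity + 0.01 * exposed, 0, 1))

df = pd.DataFrame({"activity": activity, "exposed": exposed, "converted": converted})

# Naive comparison vs. a regression that adjusts for the known confounder
naive = df.groupby("exposed")["converted"].mean().diff().iloc[-1]
adjusted = smf.ols("converted ~ exposed + activity", data=df).fit().params["exposed"]
print(f"naive: {naive:.4f}, activity-adjusted: {adjusted:.4f} (true effect: 0.01)")
```

The regression recovers the small true effect here because activity is the only confounder and the model is correctly specified; real ad data is rarely that kind.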

In one of your blog posts you write about the importance of having a clear picture of the system that you are examining. In an interview with Ronny Kohavi I discussed the ideal goal of building an empirical model of that system.

How should we go about building an empirical model?

This is a great question! I think it goes together with your second question: how far should we take it? I’m no expert on experiment design in social science (my background is actually in mathematical physics), but I have read enough about it to say something. This probably won’t be a standard answer.

You have a spectrum of options, depending on the investment you’re willing to make. You could go all-in, and try to develop a model of the system that is as complete as possible. You could be more conservative, and just try to discover the aspects of the system that cause correlative results to be biased (these include confounders and selection effects). You could also take the most conservative approach, and model only the experimental effects you’re interested in, limiting your model to the results of a small set of AB tests.

I like to imagine a world where it makes sense to build a complete Pearlian causal model to describe a system (something like the all-in approach). That would put social science on par with physics: these models would be the high-noise analog of physics equations.

Unfortunately, I think practical reasons might keep us far from this ideal: it’s not hard to imagine interventions that would change the edges in some causal models. Worse, if an edge is changed in the Pearlian framework, then there’s no reason to suspect that causal effects estimated in any other framework should be constant.

In other words, other factors aside, social science systems might have causes that change too dynamically for us to build a complete picture of the world. If you wanted to play with this approach, there’s a very interesting field that focuses on causal graph search. One of the most astounding things about Pearlian causal models is that there’s an observational, statistical test for genuine causality in the context of latent variables.

If nothing else, finding this kind of causality might be on par with a free AB test! I’ve implemented this algorithm in Python in my causality package on PyPI. Many more algorithms are implemented in the Tetrad package, from CMU.
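For a flavor of how that works, here is a usage sketch along the lines of the causality package’s README (the exact API may have changed since this interview):

```python
import numpy as np
import pandas as pd
from causality.inference.search import IC
from causality.inference.independence_tests import RobustRegressionTest

# Toy data with a known chain structure: x1 -> x2 -> x3
SIZE = 2000
x1 = np.random.normal(size=SIZE)
x2 = x1 + np.random.normal(size=SIZE)
x3 = x2 + np.random.normal(size=SIZE)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Mark each variable as continuous ('c')
variable_types = {"x1": "c", "x2": "c", "x3": "c"}

# Run the IC* search for causal structure consistent with the data
ic_algorithm = IC(RobustRegressionTest)
graph = ic_algorithm.search(X, variable_types)
print(graph.edges(data=True))  # edges carry orientation/marking info
```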

Sometimes it can be easy to build these models. You can write down much of the graph just using domain knowledge, and then test the relationships you’ve written down experimentally. If you start with a skeleton search to find a graph structure that’s consistent with your data (noting that some edges may be due to unobserved confounders), this ensemble of approaches makes the problem reasonably tractable.

It still seems like it’s going too far to throw out this approach. Many systems probably do have very static causes, and the ability to model interventions in the Pearlian framework is too powerful to pass up. With that said, it’s up to the scientist in each context to decide what level of investigation is appropriate.

They may stop at estimating the causal effect of an intervention before making a business decision (i.e. an AB test). They may go farther and start modeling confounders and selection effects. I think this paper deals with testable models in a way that gives a nice example of this approach.

I can’t imagine a world where we could possibly measure all the variables relevant to something like social influence, let alone build a model that incorporates them all. Still, finding the confounders (and selection effects) is sufficient for treating the rest as noise.

Finally, there’s the approach that people currently take every day, where they analyze the results of simple AB tests. The downside of this approach is that the results are pretty case-by-case. If you compare the experimental results to observational results, you can see whether your observational results are confounded.

Surely you don’t mind me inserting this selfie of Adam and me :)

Surely, in the case of advertising, marketing, political campaigning, and lots of policy decisions where changing human behavior is involved, the behavioral sciences will give us inspiration as to what kind of model(s) those could be, right?

That’s probably true! I’m not familiar enough with behavioral science models to weigh in heavily on that. To the extent that they provide testable causal assumptions, and functional forms for causal relationships, I think they have large potential impact.

It can be very difficult to bring even well-understood science into an industry setting, and I think the most immediate headway will be made less in discovering and testing new models, and more in transporting established models into an industry setting. There is an enormous amount of low-hanging fruit.

Auction theory is a good example of where models have made a nice impact in industry. I’d love to see more adoption of theoretical models like that, but that’s an easier one to justify in a business setting. It has a directly measurable impact on the financial bottom line.

In workshops I like to begin with the simplest model I know: Behavior = Motivation x Ability x Trigger (the Fogg Behavior Model), and from there we work towards more detailed models.

How do you see the vast body of behavioral science research benefiting experimenters?

I’m really not a behavioral science expert, and actually wouldn’t even consider myself an authority on experimentation. I’m at a weird place in between fields (maybe “mathematical data scientist” is apt? My training is in theoretical physics), so I’ll do my best to answer your question. This is just my (fact-based) opinion. I’ll caveat that I might not have a broad enough perspective to answer this question well.

There are parametric and non-parametric methods for estimating causal effects. Having a functional form that is correct or close to correct gives you a parametric model, and parametric models can be fit much more efficiently than non-parametric models. That’s a powerful advantage in itself: you can do more with less data, if you’re willing to make modeling assumptions.

There are some interesting corollaries. In introductory physics labs, we’re often just estimating parameters of equations experimentally, and we can use those parameters to make future predictions, possibly about completely different systems. The parametric model can also inform the experiment design: we realize that if we rearrange an equation in such-and-such a way, we can make it linear, and directly measure the parameter we’re interested in as (a function of) a linear regression coefficient.

A fun elementary example is measuring the charge-to-mass ratio of the electron by measuring its trajectory through a magnetic field. You can rearrange the force law so that the accelerating potential is linearly related to the square of the radius of the electron’s trajectory. Measuring the radius at different accelerating potentials then gives you the charge-to-mass ratio, simply as a regression coefficient!
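A minimal sketch of that analysis with simulated measurements: since V = (e/m) B² r² / 2, regressing the accelerating potential on the squared radius recovers e/m from the slope (the field strength and noise level below are made up):

```python
import numpy as np

# For an electron accelerated through potential V and bent into a circle of
# radius r by a magnetic field B: r^2 = 2mV/(eB^2), so V is linear in r^2
# with slope (e/m) * B^2 / 2.
E_OVER_M_TRUE = 1.758820e11  # C/kg, used only to generate fake data
B = 1.0e-3                   # tesla, assumed known field strength

rng = np.random.default_rng(0)
V = np.linspace(100.0, 500.0, 20)                     # accelerating potentials (volts)
r = np.sqrt(2 * V / (E_OVER_M_TRUE * B**2))           # ideal radii (meters)
r_measured = r + rng.normal(scale=1e-4, size=r.size)  # add measurement noise

# Linear regression of V on r^2; the slope encodes the parameter of interest
slope, intercept = np.polyfit(r_measured**2, V, deg=1)
e_over_m_estimate = 2 * slope / B**2
print(f"estimated e/m: {e_over_m_estimate:.3e} C/kg")
```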

You then have a quantity that’s useful in other contexts. You could combine it with a measurement of the charge (e.g. Millikan’s experiment) to make predictions in even more systems! I’m not sure how this can be applied to a social/behavioral science context, but there may be interesting ways in which it can. It depends on the models you have, and how well they work.

Talking about how science can help business experimenters and vice versa, what do you think about local evidence vs. scientific evidence?

I’m not a big fan of the wording. Local evidence, that is, evidence from individual experiments that are done in a very specific context, is still scientific. Scientific evidence, that is, evidence that is expected to be generalizable across contexts, is just a global (across contexts) version of local evidence.

I’m going to call “scientific evidence” “global (scientific) evidence” and “local evidence” “local (scientific) evidence”. Science reflects rigor in methods and a refusal to accept anecdote, and that is characteristic of both types of evidence.

I think the core of data science in business is to gather local evidence to inform everyday decisions. At the same time, I think an important supplement to that is to try to gather as much global evidence as you can. We don’t make product decisions out of nowhere — there’s some reasoning that underlies them, and that reasoning is often expressed as global knowledge.

An idea like “people like to feel a shared experience” is a global idea. If something is true in a general sense, it should be true across contexts — testing the effect of an abstraction (“shared experience”) in several local contexts is important support for the global idea. If we could find a way to quantify the global abstraction, say through building psychological constructs and tests, then we may advance to a point where we could apply local evidence to building a global model that could be used to produce local predictions.

This is largely how physics works: you experiment on a specific system to generate data sets, and then abstract those into physical laws. You refine your variable definitions over time (something like “material bulk” is refined to “mass”), and start establishing more general laws. Once you’ve found them, you can actually predict what will happen in another system.

AB testing is, I guess, a nice example of local evidence. You’re testing a small change to a specific system. You don’t generally make changes completely at random: you’re often testing potential product improvements. There must have been some reasoning behind the change, and that reasoning may have been global: “bigger buttons click better” is an idea that is generalizable across contexts. I think it’s important to abstract out as many of these as possible. Otherwise, you’re not really building a knowledge base. You’re just making decisions in the absence of general, rigorous knowledge.

You can try to build global knowledge by analyzing a sequence of AB tests. The approach to experiment design that a lot of people might take is often biased. We try to confirm, rather than refute, hypotheses. We seek to explain away bad results, and accept good ones with less questioning. What we need is a more rigorous framework for building global (potentially more abstracted) insights from local evidence. I think the Pearlian causal framework is a good step toward that. Bareinboim and Pearl’s recent work is very interesting for this reason: it’s meant to generalize data sets across contexts.

How scientific should experimenters be?

Again, I’m going to use the word “global” instead of scientific. Day to day, in an industry context, the priority is really to make local decisions. If a complete global understanding were a requirement for that to happen, then decisions, practically speaking, just can’t be made. Experiments are a rigorous way of finding the causal effect of small decisions. They’re extremely practical.

Unfortunately, I do think there’s a tendency to lose the global insights in the practical day-to-day decision making. When global insight isn’t the priority, it’s hard to get institutional focus on it. When people do try to extract global knowledge from experimental results, their methods lend themselves to a lot of cognitive biases.

I do think the experimenters should be the ones pushing for building global knowledge. At the very least, because this kind of knowledge tells you where biases are (selection and confounding), and so should inform experiment design. More to the point, global knowledge helps lead you to the more interesting questions, and potentially more powerful questions.

And how about other professionals, how well should they be able to judge evidence and add to the experimentation process?

Other professionals are enormously valuable for the experimentation process. There’s so much domain knowledge that’s specific to each role in a company that if an experimenter doesn’t draw from it, they’re putting themselves at a major disadvantage.

The value added ranges from conversations that help identify variables of interest, causal mechanisms, confounding variables, selection processes, and variables that can be controlled in experiments, to insights that help prioritize which experiments to run first. Product people and designers have valuable intuitions about user experience. It takes all kinds of people to form and test hypotheses.

Adam, I really like the way you think! So I will give it to you straight. I am in commercial business because I see how commercial interest can raise the professionalism of my field immensely, and I hope to take this knowledge and apply it in non-commercial fields.

Below is a rough sketch of my thinking:

In areas where the goal is to influence behavior (clicking on ads, sticking to a treatment, going to vote), these fields should be able to learn from each other through any global evidence that has been gathered.

And on another level I think experimenters should learn from each other, in the sense that together we can improve our experimentation strategies and toolbox.

Am I naive to believe this is worth pursuing?

That’s so hard! Aside from the ongoing development of methods to actually make data transportable across contexts, there are difficult business problems. Would competitors be willing to share insights, given that it would remove their competitive advantage, even though everyone is richer for it?

Would the existence of “free” insights remove the incentive to do “global science” research altogether, for all but the biggest companies? Would the increased commercialization of research and financial incentives associated with anything in a commercial context hurt academic integrity? These kinds of questions could be show-stoppers if any of them are affirmative, and I don’t really know the answers to them.

Still, people are very smart, and have an enormous capacity to solve problems. If any of these things are show-stoppers, I think someone out there might be smart enough to figure out how to get past them.

Another question I have is on “the gold standard” of evidence. This will be a hot topic in the years to come, I believe. And of course I am not eager to admit there is anything better than rigorous experiments ;)

Although I believe there are still a lot of organisations that apply a brown standard (= bullshit opinions), I am not sure what the pursuit of the gold standard will bring us.

Do you think it’s useful to have something like “a gold standard”?

I think the gold standard has the advantage of defining what “ground truth” is. I don’t think there’s much getting around that. Obviously, research that doesn’t achieve the gold standard is still useful. Correlative evidence points to possible causation (“There is no correlation without causation [somewhere]”). I think it would be even more useful if the media didn’t blow up correlative results so often.

I do think there might be an interesting in-between, but I’m not sure that our methods are up to it yet. In particular, I’m not sure that we have robust enough conditional independence tests, or methods for deriving psychological constructs that are “sufficient statistics” for the brain’s effect on behavior.

If you take a data set, you can test all of the conditional independence relationships to find the set of possible causal relationships. These represent all of the correlative relationships in the data that can’t be explained away by other observed variables. They’re either due to causal relationships, or due to bias from unmeasured variables.

As you start testing the relationships to find which are causal, the ones that aren’t point to areas to explore for latent variables. You can continue searching and measuring more variables, and refining the graph. The “gold standard” isn’t required at all to kick off this process, and serves more as a source of input data for this process, instead of being the whole process.
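For a flavor of the statistical machinery involved, here is a toy conditional independence test based on partial correlation (assuming roughly linear, Gaussian relationships; production implementations are more careful):

```python
import numpy as np
from scipy import stats

def partial_corr_independence_test(x, y, z, alpha=0.05):
    """Test X independent of Y given Z via the partial correlation of x and y
    given z, using a Fisher z-transform. Assumes roughly linear/Gaussian data."""
    # Residualize x and y on z (with an intercept), then correlate the residuals
    zs = np.column_stack([z, np.ones_like(z)])
    rx = x - zs @ np.linalg.lstsq(zs, x, rcond=None)[0]
    ry = y - zs @ np.linalg.lstsq(zs, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    # Fisher z-transform; the -1 accounts for the single conditioning variable
    n = len(x)
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 3 - 1)
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p > alpha  # True: consistent with conditional independence

# Chain a -> b -> c: a and c should be independent given b, but not a and b given c
rng = np.random.default_rng(1)
a = rng.normal(size=5000)
b = a + rng.normal(size=5000)
c = b + rng.normal(size=5000)
print(partial_corr_independence_test(a, c, b))  # expect True
print(partial_corr_independence_test(a, b, c))  # expect False
```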

CMU’s Tetrad package is an old but nice piece of software for playing with this sort of thing. It works well on some simulated and real data sets that I’ve played around with. I’ve started to implement these things in a Python package called “causality”, and have a basic walkthrough of this initial step on its GitHub page.

Still, I think we’re nowhere near the level of sophistication in building psychometrics or even doing statistical tests on the data that’s available to execute a procedure like this. It might work in certain closed systems with crisp relationships amongst the variables, but I don’t expect it to work in a context that relies on behavioral variables as proxies for psychometrics, like measuring social influence, responsiveness to ads, etc.

And finally, given that finding out why something happened is so hard, it’s reasonable that a lot of organisations don’t pursue analyzing their data any further, right? If the AB test generated additional revenue, we can keep the winning version and file it under “success”, case closed.

I mean, there is a trade-off between the time you spend analyzing and the number of additional valuable insights you are going to get. The additional money you might gain from more insights might not outweigh the cost of analyzing the data.

In science it makes sense to spend 20 years analyzing the same thing. But in business you need a really good reason to spend a week analyzing the data from your experiment.

Do you agree? And if so, is there a way to optimize this trade-off in a sensible way?

Yes and no. I think it’s hard to make the case for a deeper procedure, aimed at finding global, context-independent insights. That’s definitely not the primary goal of the experiment. I’ve blogged about this recently.

I tried to make the case that while it might be more expensive in the short run, it’s more valuable in the long run to develop these global insights.

Part of it is a question of resource allocation. It doesn’t make sense to block progress in product decisions because you’re trying to get global answers. You need analysis outside of the path of the day-to-day decision-making process.

One option might be for an individual or a small team to try to build a global picture by working together with more context-oriented teams. Maybe they can get answers by making small adjustments to how experiments are run, or by making sure one more column of data is collected or controlled for.

I’m really not sure what that collaboration would look like. Maybe the level of cross-team communication required would make it infeasible. I do know that sometimes you need accurate answers to hard questions quickly, and the stronger the base of knowledge you’ve built up (e.g. sources of bias you know to control for in observational data), the better prepared you are to answer those questions.

The other part of it is a question of approach. I think an independent team might not even be necessary if product people were all trained scientists. They generally operate with a nice intuitive understanding of how systems work. They have to build that intuition partially on anecdote, but the more they base it on scientific evidence, the closer their intuitions come to being potentially transportable scientific results.

By the way, do you have much optimism for global science results coming from industry? In the course of doing your interviews, is that something you’ve seen much of?

No, very few people have mentioned this idea. Both Alvin Roth and Ronny Kohavi did implicitly, when they drew parallels between medicine, online experimentation, engineering and other disciplines.

My former co-founder Maurits Kaptein has also been putting forward thoughts along those lines for some time now. And I am an optimist: I very much hope this is possible, until someone proves me wrong.

So it could be just me trying to get my hunch to become the latest fashion. Even then… I think Deirdre McCloskey put it nicely when she said:

“We want dynamic science, so we have to put up with business cycles in it — I’ve seen many, many fashions come and go in my career.”
