How to prepare for Interviews focused on Causal Inference Modeling and Online Experiments?

Shreya Bhattacherjee
14 min read · Apr 24, 2024


The goal of this article is to provide the reader with a comprehensive study plan for the second/onsite interview of a Data Science position in a typical Tech company. The article starts by discussing the fundamental concepts a candidate is expected to know in any standard Data Science role, and then lays out the resources that can help a candidate develop a deeper understanding of the various Causal Inference models used in the industry. The article includes some of my own observations and understanding of the Causal Inference literature, gathered during my long career in the field of Data Science and Econometrics.


1. Introduction

The purpose of this article is to help the reader prepare for Data Science interviews focused on Causal Inference Modeling at a standard Tech company. Preparing for Tech company interviews is a very time-consuming process, as the bar for hiring a candidate at a Tech company is very high. The candidate is required to provide the interview panel with a lot of data points in order to meet that high bar, so it helps to be very well prepared going into the interview. A large part of this preparation process is having a clear understanding of what is expected from a candidate at every round of the interview, and access to the right resources that can help the candidate meet those expectations as quickly as possible.

This article is written for those candidates who do not have any prior experience interviewing for a Causal Inference focused Data Science role and who are not from an Economics or Social Science background. I learnt Causal Inference Modeling from classroom lectures in graduate school, as it is an essential tool used by folks specializing in Development and Labor Economics. Back then it was mostly used in academia, and very few people in the industry had heard of it. Thanks to Amazon, the Tech industry saw the merits of the Causal Inference literature fairly early. After that, a class of job positions, mostly in Data Science, opened up where companies wanted rigorous and scientific methods to measure the impact of new product features. These methods are particularly useful in scenarios where A/B tests are not possible, or where A/B tests have failed to show a statistically significant impact.

In recent times, Causal Inference modeling has become a crucial part of Data Science positions involved with Product Research and Product Development. Please see my other article on the various job categories one can target in the Tech industry, to get a better sense of which Data Science roles are likely to have Causal Inference Modeling based questions. In this article I will list the resources one can use to master the various Causal Inference models that are now widely used in the industry, as well as the pre-requisite concepts necessary for understanding some of these causal models.

2. Pre-Requisites for Understanding Causal Modeling

The basic building block of the vast majority of Causal Models is Linear Regression. Although this is a topic covered in all Computer Science and Statistics courses, and the reader might be aware of these concepts in general, it still helps to refresh the fundamentals of Econometrics before starting this journey. Most of us who learnt Causal Modeling at school also learnt Econometrics through a series of courses across various semesters of our graduate and undergraduate lives. I have listed some good resources one can use to master the concepts required for a working knowledge of undergraduate Econometrics and Statistical Inference.

a. Introductory Econometrics by Jeffrey Wooldridge

This is a standard Econometrics textbook familiar to anyone with an undergraduate degree in Economics. It is probably the best resource one can use to develop a detailed understanding of the inner workings of a Linear/Logistic Regression model. These are also the fundamental building blocks of most Causal Inference models, which are often a key requirement for Data Science interviews.

This is a very long book with many chapters, usually completed by a student over the course of one or two semesters. For the purpose of understanding Causal Models, one can focus on Chapters 2, 3, 4 and 7, which mostly discuss:

  • Simple Linear Regression (Chapter 2)
  • Multivariate Linear Regression (Chapter 3)
  • Statistical Inference using Regressions (Chapter 4)
  • Dummy Variables (Chapter 7)

If the reader is interested in developing a more fundamental knowledge of statistics and Econometrics, then please see my other article for a list of resources I found to be very beneficial when I was preparing for DS interviews. After reading this book, one would develop a concrete idea of concepts like:

  1. What are the assumptions of a Linear Regression Model?
  2. How to test these assumptions, and what are the work-arounds if these assumptions are violated?
  3. What is standard error and why is it so important?
  4. How do dummy variables in an equation work?
  5. What are interaction terms and how can one mathematically interpret them? (See the short sketch after this list.)
  6. What is omitted variable bias?
  7. What is multinomial Logistic Regression?
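
To make items 4 and 5 concrete, here is a minimal sketch (my own illustration with simulated, hypothetical data and variable names, not an example from the book) of how a dummy variable and an interaction term enter a regression and how their estimated coefficients are read:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "tenure": rng.uniform(0, 10, n),      # continuous regressor
    "premium": rng.integers(0, 2, n),     # dummy: 1 = premium user, 0 = otherwise
})
# Simulated truth: premium users have a higher intercept AND a steeper slope
df["spend"] = (5 + 2 * df["tenure"] + 3 * df["premium"]
               + 1.5 * df["tenure"] * df["premium"] + rng.normal(0, 1, n))

model = smf.ols("spend ~ tenure + premium + tenure:premium", data=df).fit()
print(model.params)
# 'premium'        -> shift in the intercept for the premium group (about 3)
# 'tenure:premium' -> extra slope of tenure for the premium group (about 1.5)
```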

The entire Causal Modeling and A/B testing literature is focused on answering one question: is the new product feature really moving the key target metric (Y) that a company is interested in? The standard practice is to select a group of individuals, called the treatment population (T), expose them to the new product feature, and measure their Y against that of the control population, who did not get exposed to the new product feature. In terms of Econometrics, one can represent this problem using the following equation (referred to as Equation 1 in this article); this is also called an identification strategy.

Yi = α + β·Ti + εi    (Equation 1)

Let's assume that there are n individuals in the population who can get exposed to the new product. For an individual i in the population, the target metric Yi can be explained by the very simple equation stated above. Ti is a dummy variable (that's why Chapter 7 of Jeffrey Wooldridge's book is so important) that takes the value one if individual i belongs to the treatment population and zero if the individual belongs to the control population. εi captures the variation in Y across individuals in the population that is not explained by the treatment. If some assumptions (e.g. SUTVA) are satisfied and the treatment is randomly assigned, then the above equation is enough to estimate the average causal impact of treatment T on outcome Y in the population. An A/B test estimates exactly the same thing. If the reader is familiar with how dummy variables in a linear regression work, then the reader can see that estimating the β in the above equation is nothing but evaluating the mean Y across the treatment and control populations, taking their difference, and then evaluating whether that difference is statistically significant or not.
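
As a quick illustration of that equivalence, here is a minimal sketch (simulated data, my own example) showing that when Ti is randomly assigned, the OLS estimate of β in Equation 1 is exactly the difference in mean outcomes between the treatment and control groups:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000
T = rng.integers(0, 2, n)                 # random treatment assignment
Y = 10 + 0.5 * T + rng.normal(0, 2, n)    # simulated true lift of 0.5

ols = sm.OLS(Y, sm.add_constant(T)).fit()
beta_hat = ols.params[1]                  # estimated treatment effect (beta)
diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()

print(beta_hat, diff_in_means)            # identical up to floating point
print(ols.pvalues[1])                     # significance of the estimated lift
```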

However, if Ti is not randomly assigned, or some of the assumptions required for this framework do not hold, then an A/B test or the above equation cannot be used to measure the average impact of the treatment T on outcome Y across the population. To get a deeper understanding of these assumptions, and of how to start conceptualizing a Causal model when Ti is not randomly assigned, the reader needs a good grasp of the concepts of potential outcomes and DAGs.

b. Paul Goldsmith-Pinkham’s Lecture on Potential Outcomes and DAGs

The entire lecture series by Paul Goldsmith-Pinkham for his Applied Methods PhD course is available on his YouTube channel. It is very high-quality material, useful to anyone trying to get a deeper understanding of Causal Methods. The first lecture, which mostly covers the potential outcomes framework, is a must-watch for anyone trying to understand the statistical theory and assumptions behind A/B tests and Causal Modeling. Based on my experience, many interview questions can be easily tackled with the material covered in the first video of this lecture series. These are mostly questions designed to test the candidate's understanding of the assumptions required for an A/B test design to be valid.

Another essential pre-requisite for any professional working on Causal Models is a very strong knowledge of A/B tests: how they work, what their limitations are, and how Causal Models differ from a standard A/B test. Although A/B tests are conceptually very simple, there are still many nuances that often show up as trick questions in interviews. My recommendation would be to take a close look at the resources below to develop a mastery of the practical applications of A/B tests.

c. Udacity course on A/B testing by Diane and Carrie

This is an excellent refresher course for all those folks who are a little bit rusty on the basic statistical concepts used regularly in an A/B test. The course is offered by Diane and Carrie, two statisticians at Google with years of experience in this area. It is a very hands-on course using real data and is likely to provide the reader a very thorough understanding of all the tricks and nuances associated with A/B tests. This course would enable the reader to tackle questions like:

  • An A/B test is giving a neutral result. Should the Data Scientist recommend that the PM not launch the product? (Use CUPED methods to validate the results; see the sketch after this list.)
  • The results of A/B tests on sub-samples of a population are exactly opposite to the results derived when the test is run on the total population. How can a Data Scientist explain these results to a PM? (Simpson's Paradox.)
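
For the first question, here is a minimal sketch of the CUPED idea (my own simulated example, not part of the course material, assuming a pre-experiment covariate such as the same metric measured before the test): the treatment/control lift stays essentially the same, but its standard error shrinks.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
x_pre = rng.normal(100, 20, n)                 # pre-experiment value of the metric
T = rng.integers(0, 2, n)                      # random treatment assignment
y = x_pre + 0.3 * T + rng.normal(0, 10, n)     # in-experiment value of the metric

theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)   # CUPED coefficient
y_cuped = y - theta * (x_pre - x_pre.mean())              # variance-reduced metric

for name, m in (("raw", y), ("cuped", y_cuped)):
    lift = m[T == 1].mean() - m[T == 0].mean()
    se = np.sqrt(m[T == 1].var(ddof=1) / (T == 1).sum()
                 + m[T == 0].var(ddof=1) / (T == 0).sum())
    print(name, round(lift, 3), round(se, 3))   # same lift, much smaller SE
```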

The only caveat of this course is that it does not cover enough statistical theory to help the reader understand which assumptions are usually required for a standard A/B test to be valid, and what statistical problem A/B tests are actually trying to solve. If the reader has already watched the first video of Paul Goldsmith-Pinkham's lecture series, then he or she should not have any difficulty getting the big picture that this course is trying to paint.

If the reader is trying to crack an interview for a role specializing in A/B tests and online experiments, then he or she can also take a look at the series of blogs published by the Data Scientists at Netflix.

d. Netflix Tech Blogs

Although the blog posts are centered on decision making at Netflix, the knowledge gained by reading them can be applied in many other industries, such as ride-sharing and food/grocery delivery, where the observations in an experiment are not independent of each other. One can get a sense of the various work-arounds that can be implemented in situations where standard A/B tests fail. The methods and techniques explained in the Netflix Tech blogs represent some of the cutting-edge work currently being done in the Tech industry in the area of online experiments.

The first three resources displayed in this section should be enough to help a candidate face a standard Data Science interview that is not specifically focused on Causal Inference Models. However, if one is trying to get a good grasp of the various Causal Inference models used in the industry, then one can start utilizing the resources mentioned below.

3. Major Causal Inference Models

The Causal Inference literature will arm the reader with a wide variety of techniques and methods that can be used when standard A/B tests fail. In my experience, most Causal Inference interviews start with a hypothetical business scenario where a treatment was implemented (not necessarily randomly) and one has to evaluate the impact of that treatment on a specific metric. In such situations, the first question one needs to investigate is:

  • At what level of the data was the treatment administered?

Was it administered at a group level (geographical unit, product category, etc.) or at an individual level (event, user, etc.)? Depending on the answer to that question, the candidate is expected to choose the appropriate causal method, which can then be utilized to carry out the impact evaluation exercise.

There is a very good blog post written by one of the Data Scientists at Uber that helps people develop a flow chart of which causal methods apply in which situations. In general, if the treatment is implemented at an individual level, then one can start with matching-based techniques like Propensity Score Matching (a minimal sketch follows below). On the other hand, if the treatment is implemented at a group level, then the candidate can test the effectiveness of methods like Difference-in-Differences and Synthetic Control for that specific business scenario. All these methods require a pre-determined set of assumptions to hold in order to generate a valid causal estimate of the impact. In order to develop a very strong understanding of these nuances associated with Causal Modeling, I would recommend the following resources.
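
As one simplified sketch of the matching branch of that flow chart (simulated data and my own code, not taken from the Uber blog post), Propensity Score Matching can be prototyped with off-the-shelf scikit-learn components:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 3))                            # observed confounders
p_treat = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = rng.binomial(1, p_treat)                           # non-random treatment
Y = 2 * T + X @ np.array([1.0, 1.0, 0.5]) + rng.normal(0, 1, n)

# 1) propensity scores from a logistic regression
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# 2) nearest-neighbour matching of treated units to controls on the score
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = control[idx.ravel()]

# 3) effect on the treated: difference in means on the matched sample
att = Y[treated].mean() - Y[matched_controls].mean()
print(att)                                             # close to the simulated effect of 2
```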

a. Mostly Harmless Econometrics

Josh Angrist's textbook on Causal Modeling is still the standard reading material on the subject, recommended in many applied methods courses at graduate schools. This is the source I used at school to develop my initial knowledge of and intuition for selection bias, the bread and butter of any Causal Modeling professional. It is one's ability to identify and understand selection bias that differentiates a Causal Modeling expert from a standard Data Scientist whose work is mostly focused on running A/B tests.

This book is focused on the theoretical aspects of the implementation of causal methods in social science. It might not be the most heavily used resource by a Tech professional in their day-to-day job, and I did not find it to be that helpful when I was preparing for interviews.

b. Paul Goldsmith-Pinkham’s Lecture Series

As already mentioned, this is an invaluable resource, available at zero cost, for anyone who wants to learn the theory behind the impact evaluation methods widely used in the industry. It draws on methods and techniques laid out in the most recent papers and academic research in this area, and gives viewers a detailed understanding of the most popular and widely used causal models, the assumptions needed for these models to be valid experimental designs, and the pitfalls and shortcomings of each method.

However, each video in the lecture series is an hour long and might not be suitable for a candidate who is short on time. There is another very effective source available on Medium itself: the series of blog posts published by Matteo Courthoud, who started writing them while he was still in graduate school. He is the current online guru on this topic, as he has presented many complex concepts in the Causal Modeling field in a very lucid manner using dummy data. All his analysis and Python code are also available for readers to replicate his results and strengthen their understanding of the subject.

c. Matteo Courthoud’s Blogs

As I said before, Matteo Courthoud has written many blog posts on various topics in Causal Inference Modeling. My recommendation would be to start from his blog post on Matching, Weighting and Regression and follow the prompts offered by Medium's recommendation system. His coverage of the more advanced Causal Models like Causal Trees and Double ML is also very insightful and informative. However, if the reader is looking for a more academic discussion of these advanced Causal Methods, then he or she can also watch Prof Athey's lecture series on YouTube.

d. Prof Athey’s Lecture Series on Advanced Causal Methods

This is a great lecture series, a must-watch if one is curious to learn how various Machine Learning algorithms are utilized in the more advanced methods of the Causal Modeling literature. The lecture series starts with a discussion of the estimation of Average Treatment Effects (ATE), which is what the standard Causal Inference model is focused on, and then slowly builds on that literature and progresses towards introducing the various advanced tools used by causal economists today to evaluate Heterogeneous Treatment Effects (HTE).

Measuring Heterogeneous Treatment Effects is what the industry is primarily focused on right now. The majority of Causal Modeling specialization roles expect the candidate to have hands-on experience implementing some of the HTE algorithms. A standard Causal model or an A/B test evaluates the ATE, which in Equation 1 is β: the difference in average outcomes between the treatment and control populations, E[Yi|Ti = 1] − E[Yi|Ti = 0] (or, conditioning on covariates, E[Yi|Ti = 1, Xi] − E[Yi|Ti = 0, Xi], where Xi represents the vector of all other variables that also impact Y for individual i). If the treatment is randomly assigned, then the A/B test simply takes the difference in the average outcome of the treatment population and the control population. If that difference is statistically significant, the A/B test concludes that the new product feature is causing the lift in the target metric in the population on average, and the recommendation is to introduce the new product feature to the entire population. If the difference is not statistically significant, the A/B test concludes that the new product feature does not have a significant impact, and the standard recommendation is to not launch the product.

However, just because a new product feature did not register an impact on the population average of the target metric Y does not mean that it did not have any impact at all. The impact might be present in some of the upper or lower percentiles of the distribution of Y; it was probably just not strong enough to register on the average of that distribution. Similarly, just because the product was successful on the population average, it might not make sense to implement it in the same way for all individuals in the population. For example, if the new product feature is a price decrease for a promotional offering of a new product, it might not make sense for the business to offer the same price discount to all individuals in the population. It might make more sense for the business to offer a higher price discount (like 40%) to customers belonging to a lower income bucket and a lower price discount (like 10%) to customers belonging to a higher income bracket. To tackle these types of scenarios, Heterogeneous Treatment Effect (HTE) and uplift models are widely used in the industry (a small simulated illustration follows below). A very useful resource for learning about the various use cases is EconML's documentation.
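
Here is a minimal simulated illustration of that point (my own example, with a hypothetical low_income segment flag): an effect that is invisible in the population average can still be large, with opposite signs, within subgroups.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
low_income = rng.integers(0, 2, n)            # hypothetical segment flag
T = rng.integers(0, 2, n)                     # randomized treatment
# +2 lift for the low-income segment, -2 for everyone else: average effect ~ 0
effect = np.where(low_income == 1, 2.0, -2.0)
Y = 50 + effect * T + rng.normal(0, 5, n)

print("overall lift:", Y[T == 1].mean() - Y[T == 0].mean())       # ~ 0
for seg in (0, 1):
    m = low_income == seg
    print("segment", seg, "lift:",
          Y[m & (T == 1)].mean() - Y[m & (T == 0)].mean())        # ~ -2 and ~ +2
```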

e. EconML's Documentation

EconML's causal inference package is now the industry standard. Gone are the days when one needed to code Causal Models, and all the necessary visualizations associated with them, from scratch. The package has simplified the lives of many Data Scientists, as its built-in estimators and graphs are now utilized by most folks to generate quick results when operating under a time constraint. EconML also has very rich documentation of business-specific use cases where Causal Models are utilized to tease out the exact causal impact of a new product feature across the various individuals in the population.
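
As a hedged sketch of what using the package typically looks like (based on the usage pattern in EconML's documentation; CausalForestDML is one of several available estimators, and the simulated data here stands in for real experiment logs):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from econml.dml import CausalForestDML

rng = np.random.default_rng(5)
n = 5_000
X = rng.normal(size=(n, 4))                   # user features driving heterogeneity
T = rng.binomial(1, 0.5, n)                   # randomized binary treatment
tau = 1 + 2 * (X[:, 0] > 0)                   # simulated effect depends on X[:, 0]
Y = tau * T + X[:, 1] + rng.normal(0, 1, n)

est = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=20),
    model_t=RandomForestClassifier(min_samples_leaf=20),
    discrete_treatment=True,
    random_state=0,
)
est.fit(Y, T, X=X)
cate = est.effect(X)                          # per-user treatment effect estimates
print(cate[X[:, 0] > 0].mean(), cate[X[:, 0] <= 0].mean())   # roughly 3 vs 1
```

The per-user estimates returned by effect() are what an uplift-style targeting policy, such as the tiered discounts described above, would be built on.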

I hope this article has provided the reader with a comprehensive overview of the resources that he or she can use to prepare for the Statistics, Experimentation and Causal Modeling portions of Data Science interviews. Causal Modeling is a growing literature, and newer methods are being developed very fast by leading organizations like Netflix, Microsoft and Uber. My recommendation would be to keep an eye out for the most recent work in the industry by reading the tech blogs of these companies, in order to stay informed about the most recent developments in the field. I have tried my best, in this article, to lay out enough resources to give the reader a good head start and help him/her in the interview preparation. Good Luck and all the Best!
