Advertising Campaign Measurement with Causal Inference and Snowflake Data Clean Rooms


Measuring Sales Lift

Advertisers use “sales lift,” an estimate of the increase in sales attributable to a specific advertisement, to measure the effectiveness of an advertising campaign. Doing so allows advertisers to make more informed decisions about where to allocate their advertising budget, what types of ads to use, and how to target those ads to maximize their effectiveness.

A naive estimate of sales lift simply compares those who have been exposed to ads with those who haven’t, using descriptive analytics. However, in the absence of “randomized controlled trials” (RCTs), which are often expensive and infeasible for advertising, failing to account for confounding effects can lead to misleading conclusions. Confounders may lead advertisers to attribute an increase in sales to the advertisement when it is actually due to pre-existing differences between those targeted and those not targeted: variables like age, income, gender, and other features that directly affect both the probability of purchasing and the probability of receiving the ad impression in the first place. Failing to account for these variables can result in an overestimation of the sales lift effect (such as when targeting existing purchasers) or an underestimation (such as when prospecting a competitor’s customers). If you are new to these concepts, I highly recommend reading Jim Warner’s blog post here.
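To make the bias concrete with hypothetical numbers: suppose a campaign targeted at loyal customers shows a 10% purchase rate among exposed users versus 4% among unexposed users, a naive lift of 6 percentage points. If those exposed users, being high-intent to begin with, would have purchased at 8% even without the ad, the true lift is only 2 points, and the naive comparison overstates the effect threefold.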

The other big challenge for advertisers is getting access to ad impression data at the individual level in order to derive more accurate insights into the true effectiveness of an ad campaign.

The Snowflake Solution

By uncovering causal factors using a Snowflake solution, advertisers can ensure that their conclusions about the impact of an ad campaign are accurate without requiring a costly holdout group.

The Snowflake Data Cloud offers many capabilities for brands to take control of their advertising data (please see the blog post from Jim Warner, Media and Advertising Industry Field CTO at Snowflake). One of these capabilities is the plethora of datasets readily available to query in the Snowflake Marketplace from data providers like Experian, Epsilon, or Acxiom.

Another key differentiator is Snowflake data clean rooms (DCRs), which allow two or more Snowflake accounts to analyze data together without exposing PII to one another.

In a typical advertiser-publisher setting, the publisher brings the ad exposure data and the advertiser has the sales data. Combining these two datasets via data clean rooms is a key advantage for measuring sales lift. Causal inference techniques (traditionally used for clinical research) can be applied to the combined dataset to estimate the average treatment effect of ad impressions, either in the advertiser’s or publisher’s Snowflake account or via a third-party account, without any data leaving the Snowflake Data Cloud. By getting access to ad impression data via clean rooms, advertisers can use “causal inference” to identify and control for confounding variables that influence both ad exposure and sales, ensuring accurate conclusions. Additionally, adjusting for non-confounding variables that affect only the outcome can help reduce noise in the data and improve the precision of the causal effect estimate.

Below is a high-level architecture of the solution leveraging Snowflake data clean rooms (DCR) and Snowpark ML capabilities:

In this sample implementation, a common identity field (e.g., email or an identity provider identifier) is used to make secure requests for ad impression data for each customer/subscriber, and a combined dataset (a simplified version is shown below) is used to build the causal inference model with the causallib Python package in the advertiser’s Snowflake account.
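As an illustration only (the identity key and sample values below are hypothetical; the column names match the fields used later in this post), the combined dataset might look like this when loaded into pandas:

# Illustrative only: a simplified combined advertiser + publisher dataset.
# EMAIL_HASH is a hypothetical shared identity key; the remaining columns
# are the confounders, treatment, and outcome used in this post.
import pandas as pd

combined = pd.DataFrame({
    "EMAIL_HASH":      ["a1f0", "b7c3", "9d2e"],
    "AGE":             [34, 52, 27],
    "HOUSEHOLD_INDEX": [3, 5, 2],
    "MARITAL_STATUS":  ["MARRIED", "SINGLE", "MARRIED"],
    "SAW_AD":          [1, 0, 1],   # treatment: ad impression from the publisher
    "PURCHASED":       [1, 0, 0],   # outcome: sale recorded by the advertiser
})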

First, we perform Exploratory Data Analysis (EDA) in a notebook to understand our data and experiment with different Python packages, examining the confounders (AGE, HOUSEHOLD_INDEX, MARITAL_STATUS) as well as the treatment variable (SAW_AD) and their effects on purchasing. Below is the result of a causal model we created with the DoWhy package, indicating those effects.
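A minimal sketch of that EDA step, assuming the combined data is in a pandas DataFrame named df; the specific estimator below is an illustrative choice, not necessarily the one used in the original notebook:

# Rough sketch of a DoWhy causal model for EDA (illustrative assumptions:
# df is the combined pandas DataFrame; the estimator choice is ours).
from dowhy import CausalModel

model = CausalModel(
    data=df,
    treatment="SAW_AD",
    outcome="PURCHASED",
    common_causes=["AGE", "HOUSEHOLD_INDEX", "MARITAL_STATUS"],
)
# model.view_model() can render the causal graph for inspection.
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_weighting"
)
print(estimate.value)  # estimated average treatment effect of SAW_AD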

Next, we read the data as a Snowpark DataFrame and apply feature engineering techniques like one-hot encoding using the sp4py utilities. After experimenting with different Python packages for causal inference, we chose the causallib package to create our causal model and estimate the ATE (Average Treatment Effect) of the treatment variable. The ATE is the difference in mean outcomes between the treated and untreated groups after the effect of confounding has been removed, i.e., the expected change in the outcome if everyone were treated versus if no one were treated.

https://medium.com/towards-data-science/hands-on-inverse-propensity-weighting-in-python-with-causallib-14505ebdc109
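The sp4py preprocessing code itself is not reproduced here; as a stand-in sketch under the assumption of a Snowpark DataFrame named combined_sdf, the same shape of transformation can be illustrated with pandas:

# Stand-in for the sp4py preprocessing step (illustrative only; the actual
# implementation applies the sp4py utilities to a Snowpark DataFrame).
import pandas as pd

pdf = combined_sdf.to_pandas()                        # Snowpark DataFrame -> pandas
X = pd.get_dummies(pdf[["AGE", "HOUSEHOLD_INDEX", "MARITAL_STATUS"]],
                   columns=["MARITAL_STATUS"])        # one-hot encode the categorical confounder
a = pdf["SAW_AD"]                                     # treatment assignment
y = pdf["PURCHASED"]                                  # outcome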

Since the causallib package is not currently part of the Snowflake Anaconda channel, we uploaded the wheel file to a stage (@lib_stg/lib3rdparty/causallib-0.9.1-py3-none-any.whl) so we can reference it from the stored procedure that runs the training code. The causallib library allows estimating counterfactual outcomes and the ATE using different techniques (please see the link for more).

In our implementation, we chose an inverse propensity weighting (IPW) model based on propensity scores and observed an ATE of ~15%, which indicates the true effectiveness of seeing the ad. We also created another ML model to estimate purchasing for new customers and used the Python UDF Batch API for inference. Because we are dealing with a very large dataset (~4 million rows), we ran the code in a Snowflake Python stored procedure using a Medium Snowpark-optimized warehouse.
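A minimal sketch of the causallib IPW estimation, assuming the X, a, and y variables from the preprocessing sketch above (the production training code lives in estimate_ate_sp.py):

# Minimal causallib IPW sketch (illustrative; the production code runs
# inside the estimate_ate_sp stored procedure).
from sklearn.linear_model import LogisticRegression
from causallib.estimation import IPW

ipw = IPW(learner=LogisticRegression(max_iter=1000))   # propensity model P(SAW_AD | confounders)
ipw.fit(X, a)

# Weighted mean outcome under treatment (1) and control (0), then the ATE.
outcomes = ipw.estimate_population_outcome(X, a, y)
ate = ipw.estimate_effect(outcomes[1], outcomes[0])    # "diff" effect by default
print(ate)   # e.g. ~0.15 in our run, i.e. roughly a 15% lift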

-- The following resources are assumed and pre-existing
use role &APP_DB_role;
use warehouse &SNOW_CONN_warehouse;
use schema &APP_DB_database.&APP_DB_schema;

-- =========================
-- Load Python packages and scripts to the stage
-- Load the wheel loader script into the stage
PUT file://./src/python/wheel_loader.py @lib_stg/scripts
overwrite = true;

-- Load the custom preprocessing package and the causallib wheel
PUT file://./src/python/sp4py_preprocessing.zip @lib_stg/lib3rdparty
overwrite = true;
PUT file://./src/python/causallib-0.9.1-py3-none-any.whl @lib_stg/lib3rdparty
overwrite = true;
-- =========================
-- The model code is written in a .py file for the stored procedure.
-- The stored procedure requires the Python libraries above to be uploaded to the lib stage;
-- this activity can also be done via the Streamlit app.
PUT file://./src/python/estimate_ate_sp.py @lib_stg/scripts
overwrite = true;

-- =========================
-- Create a stored procedure for estimating ATE
create or replace procedure estimate_ate_sp(
treatment_var varchar,
outcome_var varchar,
confounder_vars array,
model_name varchar,
table_name varchar)
returns variant
language python
volatile
runtime_version = '3.8'
imports = ('@lib_stg/scripts/estimate_ate_sp.py',
'@lib_stg/scripts/wheel_loader.py',
'@lib_stg/lib3rdparty/sp4py_preprocessing.zip',
'@lib_stg/lib3rdparty/causallib-0.9.1-py3-none-any.whl')
packages = ('snowflake-snowpark-python', 'pandas', 'joblib', 'scikit-learn',
'statsmodels', 'matplotlib')
handler = 'estimate_ate_sp.main';
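The full estimate_ate_sp.py handler is not shown in this post; the following is a heavily simplified sketch of what its main() could look like, with the wheel loading done by appending the staged wheel to sys.path (the actual implementation uses the staged wheel_loader.py helper, and also saves the model artifact and registers the predict() UDF shown below):

# Simplified, hypothetical sketch of the estimate_ate_sp.py handler
# (assumption: the causallib wheel is pure Python and can be added to
# sys.path; the real code uses the wheel_loader.py helper instead).
import os
import sys
import pandas as pd

def main(session, treatment_var, outcome_var, confounder_vars, model_name, table_name):
    # Files listed in IMPORTS are staged under the import directory.
    import_dir = sys._xoptions.get("snowflake_import_directory")
    sys.path.append(os.path.join(import_dir, "causallib-0.9.1-py3-none-any.whl"))

    from causallib.estimation import IPW               # now importable from the wheel
    from sklearn.linear_model import LogisticRegression

    pdf = session.table(table_name).to_pandas()
    X = pd.get_dummies(pdf[list(confounder_vars)])      # one-hot encode confounders
    a = pdf[treatment_var]
    y = pdf[outcome_var]

    ipw = IPW(learner=LogisticRegression(max_iter=1000))
    ipw.fit(X, a)
    outcomes = ipw.estimate_population_outcome(X, a, y)
    ate = float(ipw.estimate_effect(outcomes[1], outcomes[0])["diff"])

    # The real procedure also saves the model to @model_stg and registers
    # the predict() UDF shown later in this post.
    return {"model_name": model_name, "ate": ate}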

We can call the stored procedure as follows; it runs the model training code, saves the model artifact to a stage, and creates a Python UDF called predict() for inference:

use warehouse SNOWPARK_OPT_WH;
call estimate_ate_sp(
'SAW_AD',
'PURCHASED',
array_construct('AGE_BAND', 'MARITAL_STATUS', 'HOUSEHOLD_INDEX'),
'test_model_1.0',
'customers'
);
Inside estimate_ate_sp.py, a helper registers the predict() UDF from the trained model:

# Imports needed by this snippet
import pandas as pd
from snowflake.snowpark import types as T
from snowflake.snowpark.functions import udf

def create_udf(snf_session, model, input_cols):

    @udf(name='predict', is_permanent=True, stage_location='model_stg', replace=True, max_batch_size=1000,
         packages=['pandas', 'scikit-learn', 'snowflake-snowpark-python', 'statsmodels', 'matplotlib', 'joblib', 'cachetools'],
         imports=('@lib_stg/scripts/wheel_loader.py.gz', '@lib_stg/lib3rdparty/causallib-0.9.1-py3-none-any.whl', '@model_stg/test_model.joblib'),
         session=snf_session)
    def predict(ds: T.PandasSeries[dict]) -> T.PandasSeries[float]:
        # Each batch arrives as a series of OBJECT values; flatten to the model's input columns
        df = pd.io.json.json_normalize(ds)[input_cols]
        return model.predict_proba(df)[:, 1]

As a last step, we use the Python UDF to predict the probability of purchase for each new customer under both SAW_AD=0 and SAW_AD=1, and calculate the difference between the two probability scores to determine which customers are the best people to advertise to.

p[purchasing impacted by ad] = p[purchase | saw ad] - p[purchase | !saw ad]
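As a sketch of that scoring step, assuming a Snowpark DataFrame of new customers named new_customers_df and the input column set used when the model was trained, the predict() UDF can be called twice per row, once with SAW_AD forced to 1 and once with it forced to 0:

# Illustrative uplift scoring with the predict() UDF (column names and
# DataFrame name are assumptions, not the verbatim pipeline).
from snowflake.snowpark import functions as F

features = ["AGE_BAND", "MARITAL_STATUS", "HOUSEHOLD_INDEX"]

def as_input(saw_ad_value):
    # Build the OBJECT the UDF expects, forcing SAW_AD to a fixed value.
    pairs = []
    for c in features:
        pairs += [F.lit(c), F.col(c)]
    pairs += [F.lit("SAW_AD"), F.lit(saw_ad_value)]
    return F.object_construct(*pairs)

scored = (
    new_customers_df
    .with_column("P_IF_EXPOSED", F.call_udf("predict", as_input(1)))
    .with_column("P_IF_NOT_EXPOSED", F.call_udf("predict", as_input(0)))
    .with_column("PURCHASING_PROB", F.col("P_IF_EXPOSED") - F.col("P_IF_NOT_EXPOSED"))
    .sort(F.col("PURCHASING_PROB").desc())
)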

Finally, we create a Streamlit UI for the solution that executes the various steps of this implementation, demonstrating the DCR request, the setup, model training, and the predictions on new customer data produced by model inference.
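As an example of one such step (a hypothetical, simplified Streamlit snippet reusing the CALL statement shown earlier; session is an existing Snowpark session):

# Hypothetical Streamlit action: run the causal analysis stored procedure
# from the UI and display the returned result.
import streamlit as st

if st.button("Run causal analysis"):
    result = session.sql("""
        call estimate_ate_sp(
            'SAW_AD',
            'PURCHASED',
            array_construct('AGE_BAND', 'MARITAL_STATUS', 'HOUSEHOLD_INDEX'),
            'test_model_1.0',
            'customers')
    """).collect()
    st.write(result)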

Step 1: Streamlit UI demonstrating a DCR request between the Advertiser and the Publisher
Step 2: Streamlit UI for the setup in the Advertising Snowflake account for Causal Inference and Predictive Analytics
Step 3: Streamlit UI for executing the stored procedure for causal analysis and predictive model training
Displaying model training results captured in a Snowflake table
Purchasing probabilities of new customers: PURCHASING_PROB = p[purchase | saw ad] - p[purchase | !saw ad]

As shown in the results above, ordered by PURCHASING_PROB in descending order, the users with the highest PURCHASING_PROB values are the best audience to target with advertising.

Distributions showing which audience to target with advertising

Conclusion

We demonstrated the business value of advertising measurement using causal inference in a Snowflake solution. Causal inference is a set of techniques for estimating the effect of a treatment, such as an ad impression, while accounting for confounding variables that influence both ad exposure and sales. The ML models are built from attributes available to the advertiser, such as household income and ad impressions, all natively in Snowflake using Snowpark Python capabilities. The advertiser also uses the models to score consumers based on their likelihood of being influenced by the campaign and to target those consumers. With this Snowflake solution, advertisers can get better insights into the true effectiveness of their advertising campaigns and optimize their advertising efforts and spend by making data-driven decisions.

Special thanks to Jim Warner, Kesav Rayaprolu, Tom Manfredi and Ilyes Mehaddi for their valuable contributions.

