Marketing Channel Attribution with Markov Chains in Python — Part 1: The “Simpler” Approach


Any business that’s actively running marketing campaigns should be interested in identifying what marketing channels drive the actual conversions. It is no secret that the return on investment (ROI) on your marketing efforts is a crucial KPI.

In this article we’re going to cover:

  1. Why is channel attribution important?

The Markov chains approach in this article takes the “simple” route by leveraging the R package ChannelAttribution. For the full Python implementation of this solution, see part 2 of this series.

Why is attribution important?

As the array of platforms on which businesses can market to their customers is increasing, and most customers are engaging with your content on multiple channels, it’s now more important than ever to decide how you’re going to attribute conversions to channels. A 2017 study showed that 92% of consumers visiting a retailer’s website for the first time aren’t there to buy (link).

To illustrate the importance of attribution, let’s consider a simple example of a user journey leading to conversion. In this example, our user is named John.


DAY 1:

John’s awareness of your product is sparked by a YouTube ad, and he subsequently visits your website to browse your product catalog.

After a bit of browsing, John is aware of your product, yet he does not intend to complete a purchase.

DAY 2:

The next day, while John is scrolling through his Facebook feed, he sees another ad for your product, which pushes him to return to your website. This time, John completes the purchasing process.

In this case, when you look to calculate your ROI by marketing channel, how would you attribute the $ generated by John towards a marketing channel?

Traditionally, channel attribution has been tackled by a handful of simple but powerful approaches such as First Touch, Last Touch, and Linear.

Standard Attribution Models

3 standard attribution models

Last Touch Attribution
As the name suggests, Last Touch is the attribution approach where any revenue generated is attributed to the marketing channel that a user last engaged with.

While this approach has its advantage in its simplicity, you run the risk of oversimplifying your attribution, as the last touch isn’t necessarily the marketing activity that generates the purchase.

In the above example of John, the last touch channel (Facebook) likely didn’t create 100% of the intent to purchase. The awareness stems from the initial spark of watching the YouTube ad.

First Touch Attribution
The revenue generated by the purchase is attributed to the first marketing channel the user engaged with, on the journey towards the purchase.

Just as with the Last Touch approach, First Touch Attribution has its advantages in simplicity, but again you risk oversimplifying your attribution approach.

Linear Attribution
In this approach, the attribution is divided evenly among all the marketing channels touched by the user on the journey leading to a purchase.

This approach is better suited to capture the trend of the multi-channel touch behaviour we’re seeing in consumer behaviour. However, it does not distinguish between the different channels, and since not all consumer engagements with marketing efforts are equal this is a clear drawback of this model.
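To make the three models concrete, here is a minimal sketch that applies them to John's journey from earlier. The helper name attribute_journey and the revenue figure of 100 are assumptions made for illustration; they do not come from the dataset used later in this article.

```python
from collections import defaultdict


def attribute_journey(touchpoints, revenue=1.0):
    """Split `revenue` across channels under three standard models.

    `touchpoints` is the ordered list of channels a user engaged with
    before converting (hypothetical helper, not part of any library).
    """
    first_touch = {touchpoints[0]: revenue}
    last_touch = {touchpoints[-1]: revenue}

    # Linear: every touch gets an equal share; repeated channels accumulate.
    linear = defaultdict(float)
    share = revenue / len(touchpoints)
    for channel in touchpoints:
        linear[channel] += share

    return {
        "First Touch": first_touch,
        "Last Touch": last_touch,
        "Linear": dict(linear),
    }


# John's journey: YouTube ad on day 1, Facebook ad on day 2, then a purchase.
# The revenue of 100 is an assumed figure for illustration.
john = attribute_journey(["YouTube", "Facebook"], revenue=100.0)
for model, credit in john.items():
    print(model, credit)
```

First Touch gives all the credit to YouTube, Last Touch gives it all to Facebook, and Linear splits it evenly between the two.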

Other standard attribution approaches worth mentioning are Time Decay Attribution and Position Based Attribution.

An advanced attribution model: Markov Chains

With the 3 standard attribution approaches above, we have easy-to-implement models to identify the ROI of our marketing channels.

However, the caveat with those 3 approaches is that they are oversimplified. This may lead to overconfidence in the results attributed to each marketing channel, which can be detrimental by misguiding future business and marketing decisions.

To overcome this oversight, we may consider employing a more advanced approach: Markov chains.

If you have taken a statistics course, you may have come across this theory. Markov chains are named after the Russian mathematician Andrey Markov, and describe a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Markov chains, in the context of channel attribution, give us a framework to model user journeys and how each channel factors into users moving from one channel to another to eventually purchase (or not).

We won’t go too deep into Markov chains theory in this article. (Setosa.io has a good read if you’re interested in knowing more about the math/statistics that take place behind the scenes.)

Example of a simple Markov chain with 2 events A and E

The core concept of Markov chains is that we can use our journey data to estimate the probabilities of moving from one event to another in our network of potential marketing channel events and conversion events.
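As a rough illustration of that idea, the sketch below counts how often one state follows another in a handful of made-up journeys and normalises the counts into probabilities. The function name transition_probabilities and the toy paths are assumptions for this example; this is not how the ChannelAttribution package is implemented internally.

```python
from collections import Counter, defaultdict


def transition_probabilities(paths):
    """Estimate first-order transition probabilities from journey strings."""
    counts = defaultdict(Counter)
    for path in paths:
        states = [s.strip() for s in path.split(">")]
        # Count each consecutive pair (from_state -> to_state).
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1

    # Normalise each row so the outgoing probabilities sum to 1.
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Made-up journeys; "(start)" and "(conversion)" mark the two ends.
toy_paths = [
    "(start) > YouTube > Facebook > (conversion)",
    "(start) > YouTube > (conversion)",
    "(start) > Facebook > (conversion)",
]
probs = transition_probabilities(toy_paths)
print(probs["YouTube"])
```

In this toy data, a user who has just seen YouTube is equally likely to move on to Facebook or to convert directly.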

In the next section, we’ll go through the Python code for implementing any of these attribution frameworks.

How to build the 4 attribution models in Python

If you wish to follow along, the dataset we’ll be using for this example can be downloaded here.

The Markov Chain model in this article is built using the ChannelAttribution package in R. For the full Python implementation see part 2.

Our dataset has one row per user journey and one column per engagement, in chronological order. Each marketing channel is assigned a fixed number, and column n holds that number if the user’s n’th engagement was with that channel. Channel 21 is a conversion, and our dataset only contains records of converting user journeys.

Sample of our dataset

The first thing we want to do is to import the necessary libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import subprocess

Next, let’s load in our dataset and clean up the data points

# Load in our data
df = pd.read_csv('Channel_attribution.csv')

# Grab list of columns to iterate through
cols = df.columns

# Iterate through columns to change all ints to str and remove any trailing '.0'
for col in cols:
    df[col] = df[col].astype(str)
    df[col] = df[col].map(lambda x: str(x)[:-2] if '.' in x else str(x))

The Markov chain framework expects each user journey in a single variable, in the form Channel 1 > Channel 2 > Channel 3 > …, so the next loop creates exactly that:

# Create a total path variable
df['Path'] = ''
for i in df.index:
    #df.at[i, 'Path'] = 'Start'
    for x in cols:
        df.at[i, 'Path'] = df.at[i, 'Path'] + df.at[i, x] + ' > '

Since channel 21 in our dataset is a conversion event, we will separate that channel from the path and create a separate conversion variable holding the number of conversions happening (still only 1 in our user journey level data)

# Split path on conversion (channel 21)
df['Path'] = df['Path'].map(lambda x: x.split(' > 21')[0])

# Create conversion value we can sum to get total conversions for each path
df['Conversion'] = 1
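For readers who prefer to see the cleaning, path-building and conversion-splitting steps on a small example, here is a sketch that does all three in one row-wise helper. The column names touch_1 to touch_4, the toy values, and the helper row_to_path are assumptions for illustration; the real dataset's column names may differ.

```python
import numpy as np
import pandas as pd

CONVERSION = 21  # channel id the article uses for a conversion

# Made-up journeys in the same wide layout: column n holds the n-th touch.
toy = pd.DataFrame({
    "touch_1": [10, 4, 10],
    "touch_2": [20, 21, 3],
    "touch_3": [21, np.nan, 20],
    "touch_4": [np.nan, np.nan, 21],
})


def row_to_path(row):
    """Join the channels of one journey, stopping at the conversion."""
    channels = []
    for value in row.dropna():
        if int(value) == CONVERSION:
            break
        channels.append(str(int(value)))
    return " > ".join(channels)


toy_paths = toy.apply(row_to_path, axis=1)
print(toy_paths.tolist())
```

Working on the numeric values directly avoids having to strip a trailing '.0' or a trailing ' > ' afterwards, at the cost of a Python-level loop per row.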

We’re now almost done with the initial data manipulation work. Our data still contains all the original columns, so we grab the subset of columns that we need going forward. Since some users may have taken the same journey we will group our data by unique user journeys and our conversion variable will hold the number of conversions for each respective journey.

# Select relevant columns
df = df[['Path', 'Conversion']]

# Sum conversions by Path
df = df.groupby('Path').sum().reset_index()

# Write DF to CSV to be executed in R
df.to_csv('Paths.csv', index=False)

The last line in the above piece of code will output our data to a CSV file now that we’re done with the data manipulations. It might be handy to have this data available for transparency purposes, and, in our case, we will also use this CSV file to run the Markov chain attribution approach.

There are a few ways to do this. Since Python doesn’t at this time have a library put together for this, one way would be to build out the actual Markov chains/networks in Python yourself. While this would allow you to have a complete overview of your model it would also be the most time-consuming approach. To be more efficient, we’ll make use of the ChannelAttribution R library which has the theory behind Markov chains centered in a single application.

We will use the standard Python library subprocess to run the following piece of R code that calculates our Markov network for us.

# Read in the necessary libraries
if(!require(ChannelAttribution)){
  install.packages("ChannelAttribution")
  library(ChannelAttribution)
}
# Set Working Directory
setwd('C:/Users/Morten/PycharmProjects/Markov Chain Attribution Modeling')
# Read in our CSV file outputted by the python script
df <- read.csv('Paths.csv')
# Select only the necessary columns
df <- df[c(1,2)]
# Run the Markov Model function
M <- markov_model(df, 'Path', var_value = 'Conversion', var_conv = 'Conversion', sep = '>', order=1, out_more = TRUE)
# Output the model output as a csv file, to be read back into Python
write.csv(M$result, file = "Markov - Output - Conversion values.csv", row.names=FALSE)
# Output the transition matrix as well, for visualization purposes
write.csv(M$transition_matrix, file = "Markov - Output - Transition matrix.csv", row.names=FALSE)

The next piece of Python code will execute our R script and load in the resulting CSV file.

# Define the path to the R script that will run the Markov Model
path2script = 'C:/Users/Morten/PycharmProjects/Markov Chain Attribution Modeling/Markov.r'

# Call the R script (an argument list does not need shell=True, and works the same way across platforms)
subprocess.call(['Rscript', '--vanilla', path2script])

# Load in the CSV file with the model output from R (same file name the R script writes)
markov = pd.read_csv('Markov - Output - Conversion values.csv')

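# Select only the necessary columns and rename them
markov = markov[['channel_name', 'total_conversion']].copy()
markov.columns = ['Channel', 'Conversion']
# Label the rows so they can be told apart once combined with the other models
markov['Attribution'] = 'Markov'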
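The call to Rscript above does not check whether the R script actually succeeded, so a failure would only surface later as a missing or stale CSV file. One possible safeguard, sketched here with a hypothetical helper named run_script, is to inspect the exit status and surface the error output.

```python
import subprocess
import sys


def run_script(command):
    """Run an external command and raise if it exits with an error."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed ({completed.returncode}): "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


# For the R step this would be called as (path2script as defined earlier):
#   run_script(['Rscript', '--vanilla', path2script])
# Demonstrated here with the Python interpreter so the sketch runs without R.
print(run_script([sys.executable, "-c", "print('ok')"]))
```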

If you want to get around having to create a separate R script to run the Markov calculations, then a Python library that you could use is rpy2. rpy2 allows you to import R libraries and call them directly in Python. This approach, however, did not prove very stable in my experience, and therefore I opted for the separate R script approach.

Channel Attribution using Markov Chains can be seen in the below chart. This chart should tell you that channel 20 is driving a large portion of conversions while channels 18 and 19 are attributed very low total conversion values.

Channel contributions for Markov chain approach

While this output may be what you’re looking for, there’s a great deal of value in comparing it with the outputs of the traditional approaches.

To calculate attributions for Last Touch, First Touch and Linear, we run the following piece of code

# First Touch Attribution
df['First Touch'] = df['Path'].map(lambda x: x.split(' > ')[0])
df_ft = pd.DataFrame()
df_ft['Channel'] = df['First Touch']
df_ft['Attribution'] = 'First Touch'
# df holds one row per unique path, so weight each row by its conversion count
df_ft['Conversion'] = df['Conversion']
df_ft = df_ft.groupby(['Channel', 'Attribution']).sum().reset_index()

# Last Touch Attribution
df['Last Touch'] = df['Path'].map(lambda x: x.split(' > ')[-1])
df_lt = pd.DataFrame()
df_lt['Channel'] = df['Last Touch']
df_lt['Attribution'] = 'Last Touch'
df_lt['Conversion'] = df['Conversion']
df_lt = df_lt.groupby(['Channel', 'Attribution']).sum().reset_index()

# Linear Attribution
channel = []
conversion = []
for i in df.index:
    touches = df.at[i, 'Path'].split(' > ')
    for j in touches:
        channel.append(j)
        # Split the path's conversions evenly across its touches
        conversion.append(df.at[i, 'Conversion'] / len(touches))
lin_att_df = pd.DataFrame()
lin_att_df['Channel'] = channel
lin_att_df['Attribution'] = 'Linear'
lin_att_df['Conversion'] = conversion
lin_att_df = lin_att_df.groupby(['Channel', 'Attribution']).sum().reset_index()
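The linear attribution loop can also be written without explicit Python loops. The sketch below is one possible alternative using pandas' explode on made-up grouped paths; the toy frame and its values are assumptions for illustration. It also shows a useful sanity check: the attributed conversions should add up to the total number of conversions.

```python
import pandas as pd

# Made-up grouped journeys: one row per unique path with its conversion count.
paths = pd.DataFrame({
    "Path": ["10 > 20", "10 > 3 > 20", "4"],
    "Conversion": [3, 2, 1],
})

touches = paths.assign(Channel=paths["Path"].str.split(" > "))
# Each touch receives an equal share of that path's conversions.
touches["Conversion"] = touches["Conversion"] / touches["Channel"].str.len()
# One row per (path, touch) pair.
touches = touches.explode("Channel")

linear = touches.groupby("Channel", as_index=False)["Conversion"].sum()
print(linear)

# Sanity check: linear attribution redistributes conversions, it does not create them.
assert abs(linear["Conversion"].sum() - paths["Conversion"].sum()) < 1e-9
```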

Let’s merge all our 4 approaches together and evaluate the differences in outputs.

# Concatenate the four data frames to a single data frame
df_total_attr = pd.concat([df_ft, df_lt, lin_att_df, markov])
df_total_attr['Channel'] = df_total_attr['Channel'].astype(int)
df_total_attr.sort_values(by='Channel', ascending=True, inplace=True)


# Visualize the attributions
sns.set_style("whitegrid")
plt.rc('legend', fontsize=15)
fig, ax = plt.subplots(figsize=(16, 10))
sns.barplot(x='Channel', y='Conversion', hue='Attribution', data=df_total_attr)
plt.show()

Channel contributions across all attribution approaches

From looking at the above chart we can quickly conclude that most user journeys start with Channel 10 and end with Channel 20, while no user journeys start at Channel 20.

To get an idea of how the different channels affect the potential user journeys we can look at the total transition matrix, which can be visualized in a heatmap

Transition Probability Heatmap for Markov chain approach

By running the following piece of code:

# Read in transition matrix CSV
trans_prob = pd.read_csv('Markov - Output - Transition matrix.csv')

# Convert data to floats
trans_prob['transition_probability'] = trans_prob['transition_probability'].astype(float)

# Convert start and conversion event to numeric values so we can sort and iterate through
trans_prob.replace('(start)', '0', inplace=True)
trans_prob.replace('(conversion)', '21', inplace=True)

# Get unique origin channels
channel_from_unique = trans_prob['channel_from'].unique().tolist()
channel_from_unique.sort(key=float)

# Get unique destination channels
channel_to_unique = trans_prob['channel_to'].unique().tolist()
channel_to_unique.sort(key=float)

# Create new matrix with origin and destination channels as columns and index
trans_matrix = pd.DataFrame(columns=channel_to_unique, index=channel_from_unique)

# Assign the probabilities to the corresponding cells in our transition matrix
for f in channel_from_unique:
    for t in channel_to_unique:
        x = trans_prob[(trans_prob['channel_from'] == f) & (trans_prob['channel_to'] == t)]
        prob = x['transition_probability'].values
        if prob.size > 0:
            # Use .loc[row, column] rather than chained indexing, which may write to a copy
            trans_matrix.loc[f, t] = prob[0]
        else:
            trans_matrix.loc[f, t] = 0

# Convert all probabilities to floats
trans_matrix = trans_matrix.apply(pd.to_numeric)

# Rename our start and conversion events
trans_matrix.rename(index={'0': 'Start'}, inplace=True)
trans_matrix.rename(columns={'21': 'Conversion'}, inplace=True)

# Visualize this transition matrix in a heatmap
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(22, 12))
sns.heatmap(trans_matrix, cmap="RdBu_r")
plt.show()
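The nested loop that fills the matrix can also be expressed as a single pivot. The sketch below shows the idea on a made-up long-format table that uses the same three column names as the R output; the values are assumptions for illustration.

```python
import pandas as pd

# Made-up long-format transitions with the same three column names.
toy_trans = pd.DataFrame({
    "channel_from": ["(start)", "(start)", "1", "2"],
    "channel_to": ["1", "2", "(conversion)", "(conversion)"],
    "transition_probability": [0.6, 0.4, 1.0, 1.0],
})

toy_matrix = toy_trans.pivot(
    index="channel_from",
    columns="channel_to",
    values="transition_probability",
).fillna(0.0)  # transitions never observed get probability 0
print(toy_matrix)
```

Note that pivot orders the labels as strings, so with the real data you would still need to reindex the rows and columns to get the channels in numeric order.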

Conclusion

Different marketing channel attribution approaches will fit different businesses. In this article, we’ve outlined 4 possible ways to evaluate the effectiveness of your marketing spend. We’ve explored 3 approaches that are fixed in the sense that they are not dependent on the structure of your data, which may lead to overconfidence. On the other hand, a Markov chain approach will look to model channel attribution by accounting for how your user journey data is structured; though this approach is more complex.

Analyzing the output of the Markov chain model will give you a “snapshot” of marketing channel effectiveness, at a specific point in time. You might be able to gain extra insights by looking at the model output for data just before and after a new marketing campaign launch, giving you essential information on how the campaign affected the performance of each channel.

By adding even more granularity and running daily attribution models, you could evaluate the relationship between PPC or marketing dollar spent and channel contribution using correlation models.
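As a very rough sketch of what such an analysis could start from, the snippet below computes the correlation between daily spend and daily attributed conversions for a single channel. All figures are made up for illustration, and the column names spend and attributed_conversions are assumptions.

```python
import pandas as pd

# Entirely made-up daily figures for a single channel, for illustration only.
daily = pd.DataFrame({
    "spend": [100.0, 150.0, 120.0, 200.0, 180.0],
    "attributed_conversions": [10.0, 14.0, 12.5, 19.0, 18.0],
})

# Pearson correlation between spend and the model's attributed conversions.
corr = daily["spend"].corr(daily["attributed_conversions"])
print(round(corr, 3))
```

A high correlation on its own does not show that spend causes conversions, so a result like this is best read alongside what you know about the campaigns that ran on those days.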

While adding more complexity to the approach presented in this article could increase the value of the model outputs, the real business value will come from being able to interpret these quantitative model results and combine these with domain knowledge on your business and the strategic business initiatives that have produced your data.

Combining these model results with the knowledge of your business will allow you to best incorporate the model findings into future initiatives.

Marketing channel attribution can be a complex task, with consumers being reached by more marketing than ever. As technology advances and more channels become available to marketers, it becomes more important to identify precisely the channels that are driving the most ROI.

How do you dig out the valuable attribution information from your data?

About the Author

Morten works as a Data Scientist at Wealthsimple where he uses data science to help people achieve financial freedom.

The Data Science team at Wealthsimple is always looking for new innovative, smart and ambitious people to join the team. Check out our career page or reach out on LinkedIn.
