Hierarchical Bayesian Modeling in R for Marketing Program Evaluation

kuanysh italmassov
16 min read · Apr 3, 2019


If you work in Marketing Analytics, program evaluation can be the place where the well-intended assumptions of classical statistics meet the harsh reality of a fast-moving business environment. For example, when it comes to measuring the effectiveness of your in-store activities, chances are you will be asked to do the evaluation after the program has finished, long after you could have defined treatment levels and split stores into random samples controlling for things like size, location and so on. As a result, traditional models come up with estimates that are unreasonably high or low.

In this post I approach a similar problem using a public dataset of sliced cheese sales. Using a Hierarchical Bayesian Model (HBM) it was possible to derive a meaningful estimate of the program effect with moderate predictive accuracy. The parameters estimated by the model also made it possible to segment all stores and devise a marketing strategy.

If you are new to Bayesian modeling, the very short summary is that it estimates a whole distribution for each quantity of interest, instead of just a single number as in the classical approach.

One advantage of the Bayesian approach is that it is easy to model correlated data by assuming they come from one distribution. This is convenient when modeling sales data coming from one store. Similarly, it is possible to model the effectiveness of the program at each store while assuming that there is a common distribution of “effectiveness” from which the individual stores’ performance is drawn.

As a result, data points “borrow strength” from each other, allowing for more powerful inference. This structure of parameters (individual stores under a common total) is the reason this approach is called hierarchical.
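To make “borrowing strength” concrete, here is a minimal sketch of partial pooling in a normal-normal model, with made-up numbers (not taken from the cheese data): a store's estimate is a precision-weighted average of its own mean and the overall mean, so stores with few observations are pulled more strongly toward the total.

# Partial pooling sketch (hypothetical numbers)
store_mean  = 5.2   # mean log-volume observed in one store
global_mean = 4.6   # mean log-volume across all stores
n           = 8     # number of observations for this store
sigma2      = 1.0   # within-store variance
tau2        = 0.5   # between-store variance

# weight on the store's own data grows with n and shrinks when the data is noisy
w = (n / sigma2) / (n / sigma2 + 1 / tau2)
pooled_estimate = w * store_mean + (1 - w) * global_mean
pooled_estimate   # lies between the store mean and the global mean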

Tools

For the analysis I use R and JAGS, one of the leading software packages for estimating Bayesian models. Both are free and available on Mac, Linux and Windows. You will need to install several R packages. The full source code is in the appendix.

Problem

We selected the problem of optimal marketing mix, in other words selecting the best combination of available marketing tools for the best business outcome. Solving this problem not only creates business value, but also reduces waste in the form of inefficient resource usage, write-offs of expired products and consumer dissatisfaction.

In this particular case we are stepping into the shoes of a Marketing Executive for Borden Sliced Cheese in the US. For the purpose of this exercise we are selling only one product (shown below) and our goal is to maximize profit with a given budget.

We sell our product through a multitude of independent grocery stores across states. We have only two marketing tools available to us at the moment:

1. Price. We need to decide at what retail price we sell our product to the customer. For this exercise we ignore the different sales tax rates across states and assume that we can set different prices across chains and states.

2. Display advertising. How much of the provided advertising area we use in stores. We assume that every store has a standard advertising area, of which we can use a different percentage.

In this simplified case we also ignore competition to keep the analysis manageable.

The company has data on past sales performance: the selling price, the percentage of advertising area used and the number of units sold across all 88 retailers. We decide to analyse this data using the Bayesian approach in order to make optimal decisions for the future.

Data

We are using the cheese dataset from the R package bayesm, credit to Boatwright, Peter, Robert McCulloch, and Peter Rossi (1999). Please see the bayesm package for the full reference. The data can be loaded into R using the data("cheese") command.
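A minimal session to pull the data in looks like this (assuming bayesm has been installed, e.g. with install.packages("bayesm")):

library(bayesm)
data("cheese")
str(cheese)    # 5555 observations of RETAILER, VOLUME, DISP, PRICE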

The original dataset includes 5,555 observations of the following 4 variables:

$RETAILER: a list of 88 retailers

$VOLUME: unit sales

$DISP: percent ACV on display (a measure of advertising display activity)

$PRICE: price in U.S. dollars

Below are the first few rows of the data:

              RETAILER VOLUME       DISP    PRICE
1  LOS ANGELES - LUCKY  21374 0.16200000 2.578460
2 LOS ANGELES - RALPHS   6427 0.12411257 3.727867
3   LOS ANGELES - VONS  17302 0.10200000 2.711421
4   CHICAGO - DOMINICK  13561 0.02759109 2.651206
5      CHICAGO - JEWEL  42774 0.09061273 1.986674
6       CHICAGO - OMNI   4498 0.00000000 2.386616

The dataset is panel data, meaning that there are multiple observations for each retailer, which allows for more accurate estimates.

In addition to these columns we added $LOCATION and $KEY_ACCOUNT variables by splitting the $RETAILER column, as shown below. This will make our analysis more actionable in terms of making recommendations across locations and chains.
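The split uses tidyr::separate on the " - " delimiter in the retailer names (the same code appears in the appendix):

# RETAILER values look like "LOS ANGELES - LUCKY": location, hyphen, chain
cheese2 = tidyr::separate(cheese, c("RETAILER"), sep = " - ",
                          into = c("LOCATION", "KEY_ACCOUNT"),
                          remove = FALSE)
cheese2$KEY_ACCOUNT = factor(cheese2$KEY_ACCOUNT)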

The selected dataset is clean and specially tailored for the purpose of estimating in-store marketing effectiveness. It doesn't require the cleaning and preparation usually needed for real-life data of this nature. We chose this data because it is hard to find public data that has the same level of depth, detail and consistency we need to demonstrate the power of Bayesian analysis. We assume that if a reader chooses to replicate this analysis on their company's data, it is available to company employees or at least can be collected and put together from different sources.

One limitation of this dataset is the absence of a date dimension, so for the purpose of this analysis we ignore the effects of autocorrelation in the time series, including seasonality.

Data exploration

This dataset covers 88 retailers, a wide range of prices from $1.32 to $4.64 and advertising usage from 0% to 100%.

Sales volume

Sales volume data is heavily right-skewed. The rotated boxplot of 5 randomly sampled stores below indicates that retailers differ significantly in terms of sales volume.

There are 50 key accounts identified, the majority of them with a small overall contribution. However, there are also large key accounts that have a big impact on the business. The top 10 key accounts represent 58.6% of total sales, with the top 3 (Kroger Co, Winn Dixie and Lucky) representing 30%. This indicates that our analysis has to take into account the difference in scale between stores.

Top 10 Key accounts

There are 46 locations, which appear to have less explanatory power than key accounts (please see the ANOVA output for both models in the appendix). It therefore appears that Key Account has more predictive power than Location; however, we will revisit this assumption in the modeling part.

Price

The selling price distribution is more balanced than sales volume, although significant right skewness is still present.

Advertising

Advertising volume is heavily right-skewed: in the absolute majority of observations no advertising was done.

Below are pairwise scatterplots of the numeric variables in the data. Volume is log-transformed to compensate for the right skewness.
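The transformation and the scatterplot matrix are two lines of base R (the same code is in the appendix):

cheese2$logVolume = log(cheese2$VOLUME)           # compensate for right skewness
pairs(cheese2[c("logVolume", "PRICE", "DISP")])   # pairwise scatterplots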

There is the expected negative correlation between price and volume; however, there is no clear relationship between the size of in-store advertising and sales. This means that decisions about advertising should take multiple factors into account to make sure the investments are efficient.

Model

We will try several models in order to demonstrate the thinking process and different methods of modeling. As a reference model we are going to use a simple linear regression, then try the Bayesian approach with different grouping variables. We make the decision based on the DIC value and the Mean Absolute Percentage Error (MAPE) on the test (out-of-sample) dataset.

Our goal, in addition to understanding the impact of the factors, is also to make the most accurate prediction. Therefore our methodology includes splitting the data into training and testing sets in order to evaluate model performance on out-of-sample data. Potentially this analysis could be improved by using k-fold cross-validation.
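MAPE here is computed with mape() from the Metrics package; it is equivalent to the following hand-rolled version:

# Equivalent to Metrics::mape(actual, predicted)
mape_manual = function(actual, predicted) {
  mean(abs((actual - predicted) / actual))
}
mape_manual(c(100, 200), c(110, 150))   # 0.175, i.e. 17.5%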

Reference model

The linear regression output is shown below. It is clear that the model has low predictive power, but on top of that the coefficients produced do not make much business sense, implying a near-zero sales change for a $1 price increase and an implausibly large sales increase for a 1 percent increase in advertising. Obviously, the problem is the different scale of sales across stores. One alternative is to run 88 separate regressions, one for each store; however, there is not enough data for each store and the coefficients produced are also unreasonably high. For more on this please see P. E. Rossi, G. M. Allenby and R. McCulloch (2005).

Call:
lm(formula = VOLUME ~ PRICE + DISP, data = cheese2)

Residuals:
   Min     1Q Median     3Q    Max
 -7898  -2662   -970   1059 139587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10798.4      439.6   24.57   <2e-16 ***
PRICE        -2308.9      146.8  -15.73   <2e-16 ***
DISP          5221.6      481.0   10.86   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5753 on 5552 degrees of freedom
Multiple R-squared: 0.07116, Adjusted R-squared: 0.07083
F-statistic: 212.7 on 2 and 5552 DF, p-value: < 2.2e-16
Diagnostic plots below demonstrate that there is a certain pattern in the residuals and that their distribution is far from normal.

Prediction error on the test dataset is expectedly high, with a Mean Absolute Percentage Error of 101%.

Bayesian approach

As an alternative we will consider several Hierarchical Bayesian regression models where we fit the data with effects at different levels of grouping (Key Account and Retailer) and choose a model based on the DIC value and test-data accuracy.

Grouped by Key Account

The first model we are going to check fits the data with effects at each Key Account level. It is reasonable to hypothesize that within one chain the effectiveness of a price change and of in-store advertising should be similar across stores. At the same time we assume that the overall effectiveness parameters of each key account come from a shared distribution. The description of the hierarchical model is below:

y_i | k, price_i, ad_perc_i, β, σ² ~ N(β0_k + β1_k·price_i + β2_k·ad_perc_i, σ²),  for i = 1, 2, …, 5555

β0_k | μ0, τ0² ~ iid N(μ0, τ0²),  for k = 1, 2, …, 50

β1_k | μ1, τ1² ~ iid N(μ1, τ1²),  for k = 1, 2, …, 50

β2_k | μ2, τ2² ~ iid N(μ2, τ2²),  for k = 1, 2, …, 50

where β0_k, β1_k and β2_k are the intercept, the price elasticity coefficient and the in-store display effectiveness coefficient for each Key Account, identified by the index k. Accordingly, we assume that each coefficient comes from a Normal distribution with mean μ0, μ1 and μ2 respectively.

We use non-informative Normal priors for the means and Inverse Gamma distributions for the variances. See the full code in the appendix.

Because of the high number of parameters we had to run 30 thousand iterations in 3 chains to get good convergence. The multivariate psrf according to gelman.diag is 1.28 (due to the large number of parameters we report only results applicable to the whole model). The effective sample size varies widely by parameter, as a function of the number of data points available for each Key Account, ranging from 80 to 30,000. However, we believe it is sufficient to draw conclusions about model performance. The Mean Absolute Percentage Error is significantly lower at 29.5% (vs. 101%). However, the DIC value is quite high at 6648, driven mostly by the mean deviance. See the output below:

Mean deviance:  6521 
penalty 126.6
Penalized deviance: 6648
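For reference, the convergence diagnostics and DIC reported above come from the standard rjags/coda calls on the fitted model (the full setup is in the appendix):

gelman.diag(mod_sim1)                    # multivariate psrf across monitored parameters
effectiveSize(mod_sim1)                  # effective sample size per parameter
dic1 = dic.samples(mod1, n.iter = 1e3)
dic1                                     # mean deviance, penalty and penalized deviance (DIC)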

Grouped by Retailers

The second model we are testing is more granular. It is reasonable to assume that a retailer-level model will be more accurate on the training data but will have lower generalization ability. We expect a higher DIC value due to the penalty term and a high MAPE on test data. The description of the hierarchical part is below:

y_i | r, price_i, ad_perc_i, β, σ² ~ N(β0_r + β1_r·price_i + β2_r·ad_perc_i, σ²),  for i = 1, 2, …, 5555

β0_r | μ0, τ0² ~ iid N(μ0, τ0²),  for r = 1, 2, …, 88

β1_r | μ1, τ1² ~ iid N(μ1, τ1²),  for r = 1, 2, …, 88

β2_r | μ2, τ2² ~ iid N(μ2, τ2²),  for r = 1, 2, …, 88

This model description is almost identical to the previous one, except that the coefficients are indexed by r (the corresponding retailer).

Because of the even higher number of parameters we run the model for 50 thousand iterations in 3 chains. The multivariate psrf according to gelman.diag is 1.31.

The effective sample size again varies widely by parameter, as a function of the number of data points available for each retailer, ranging from 230 to 50,000. The Mean Absolute Percentage Error on test data is significantly lower at 22.5% (vs. 101% for the reference model and 29.5% for the first hierarchical model). It appears that this model generalizes better than the previous one.

Surprisingly, the DIC value is also substantially lower, at 507. The penalty is much larger, but it is more than offset by the much lower mean deviance. See the output below:

Mean deviance:  280 
penalty 227.4
Penalized deviance: 507.4

Therefore, contrary to our expectation, the second, more granular model not only fits the data better but also has better generalization ability. Below we build our analysis on this model.

Analysis of HB Model on Retailers

An illustration of the second model's accuracy on data not used for model estimation is shown below.

It appears that the derived coefficients predict sales of our product rather well. However, in order to build an effective marketing plan we have to look more closely at the model. The Hierarchical Bayesian Model allows us to look at coefficients by store and decide on our marketing strategy.

Intercept coefficient

The intercept coefficient represents the log of the volume sold when the price is $0 and there is no advertising. By itself it is of little value to us, but we can interpret it as the "baseline" volume of each store and use it as a proxy for store size. The distribution of the intercepts is expectedly positive and left-skewed due to one outlier at zero. The mean of the distribution estimated by the HB model is 10.32.
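Since the model is fit on log(VOLUME), the intercept translates back to units with exp(); a quick back-of-the-envelope check of the mean intercept:

exp(10.32)   # roughly 30,000 units: the implied "baseline" volume at the mean intercept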

Price sensitivity coefficient

The price sensitivity coefficients are expectedly negative. There is slight left skewness, but to a lesser extent. The mean of the distribution estimated by the HB model is -0.8 (exp(-0.8) ≈ 0.45). We interpret this as follows: when the price of cheese goes up by $1 above the mean price of $2.86, sales go down by roughly 55%. As can be seen, some stores are more price sensitive than others, and in some, volume will decline less in response to a price increase. This information is important for building the marketing strategy.
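The arithmetic behind that interpretation, given the log-linear form of the model:

exp(-0.8)       # ~0.45: a $1 price increase multiplies expected volume by about 0.45
1 - exp(-0.8)   # ~0.55: i.e. roughly a 55% drop in volume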

In-store Ad effectiveness coefficient

The ad effectiveness distribution is right-skewed, with a mean of 1.17 (exp(1.17) ≈ 3.2). Our interpretation is that when display advertising is increased from 0% to 100% of the available space, sales increase roughly 3.2 times. For designing the marketing strategy it is important to note that in some stores the expected uplift from advertising is not as high.
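Because DISP enters the model linearly on the log scale, intermediate display levels scale accordingly; a quick illustration (an implication of the model form, not a separate model output):

exp(1.17)         # ~3.2x expected volume when display goes from 0% to 100%
exp(1.17 * 0.5)   # ~1.8x expected volume at 50% of the display space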

Segmentation of stores

In order to define our marketing strategy we split the stores into clusters based on their price sensitivity and ad effectiveness coefficients. For segmentation we used hierarchical clustering and, by trial and error, found that a split into 5 groups produces meaningful results (the clustering call is shown below). For each of the 5 groups a Marketing manager will be able to develop a different approach to increase the effectiveness of investments.
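The segmentation itself is a few lines of base R on the per-store coefficients (the same code as in the appendix, where coef_df2 holds the posterior mean coefficients by retailer):

coef_dist = dist(coef_df2[1:88, c("b1", "b2")])    # distance between stores on price and ad coefficients
hc = hclust(coef_dist, method = "complete")        # complete-linkage hierarchical clustering
plot(hc)                                           # dendrogram used to choose the number of groups
coef_df2$group = factor(c(cutree(hc, k = 5), NA))  # cut into 5 segments; NA marks the pooled-means row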

All groups are shown on the bubble chart below, where each store is placed according to the values of its price and ad response coefficients. The size of the bubble represents the "baseline" volume.

Our interpretation of the 5 groups above and the proposed marketing strategy for each is as follows:

Group 1: Value stores. Consumers in these stores have higher price sensitivity and low response to ads. It appears that price acts as the main choice criterion. Sales volume there is also relatively high, so these stores are volume drivers. Marketing money should be invested in price promotions, and the limited advertising should focus on price comparisons.

Group 2: Standard groceries. Consumers in these stores have low price sensitivity and relatively high ad response. This is the battleground where we can increase our share through active advertising.

Group 3: Loyal stores. Consumers in these stores have lower price sensitivity but are also less responsive to ads. This group could be used as a profit driver, where consumers are ready to pay a premium for their brands. Marketing money should be invested in loyalty programs that reward consumers for staying with the brand.

Group 4: Dynamic stores. Consumers here are open to ads, but also highly price sensitive. This is a relatively small group where the cost of maintaining share is relatively high due to the need for active ad and promo investment. It could be a secondary priority compared to the other groups.

Group 5: Flagship store. In this one store consumers are ready to pay a high premium and are also responsive to ads. This store could be used for high-end marketing activities, increasing the brand's premium image overall.

In real life the analysis would be augmented with additional information about store channels, projected revenue and expected return on investment. However, that goes beyond the scope of this exercise.

Further Improvements

This model could be improved along multiple avenues. For example, one could investigate interactions between price and ads, as sketched below. Another opportunity is to augment the data with information about location (by state).
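As a rough sketch of the first idea, a price-by-display interaction could be added to the retailer-level JAGS model by introducing a third store-level coefficient. This is an assumed extension, not something fitted in this post:

# Inside the likelihood loop of mod_string2 the linear predictor would become:
#   nu[i] = b0[r[i]] + b1[r[i]]*PRICE[i] + b2[r[i]]*DISP[i] + b3[r[i]]*PRICE[i]*DISP[i]
# with a matching prior in the coefficient loop:
#   b3[j] ~ dnorm(mu3, prec3)
# and hyperpriors mirroring the existing ones:
#   mu3 ~ dnorm(0.0, 1.0/1.0e6)
#   prec3 ~ dgamma(1/2.0, 1*10.0/2.0)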

Source code

library("bayesm")
library("dplyr")
library("tidyr")
library("ggplot2")
library("stringr")
library("Metrics")
library("rjags")
library("reshape2")
library("plotly")
library("gapminder")

data("cheese")
head(cheese)

#adding additional columns
cheese2 = tidyr::separate(cheese, c("RETAILER"), sep = " - ",
                          into = c("LOCATION", "KEY_ACCOUNT"),
                          remove = FALSE)
cheese2$KEY_ACCOUNT = factor(cheese2$KEY_ACCOUNT)

# Data exploration
cheese %>%
group_by(RETAILER) %>%
summarise(n=n())

range(cheese$PRICE)
range(cheese$DISP)

hist(cheese$VOLUME, 50, main = "Sales volume distribution", xlab = "")

# comparison of random sample of stores
retailers = unique(cheese$RETAILER)
ret_sample = sample(retailers, 5)
sel_df = cheese[cheese$RETAILER %in% ret_sample, ]
sel_df$RETAILER = factor(sel_df$RETAILER)
sel_df$RETAILER2 = str_wrap(sel_df$RETAILER, width = 10)

ggplot(sel_df, aes(RETAILER2,VOLUME)) +
geom_boxplot() + coord_flip() + xlab("Retailer") + ylab("Volume")

#number of Key Accounts
unique(cheese2$KEY_ACCOUNT)
unique(cheese2$LOCATION)

KA = cheese2 %>%
  group_by(KEY_ACCOUNT, RETAILER) %>%
  summarise(VOLUME = sum(VOLUME)) %>%
  group_by(KEY_ACCOUNT) %>%
  summarise("Number of stores" = n(), Volume = sum(VOLUME)) %>%
  arrange(-Volume) %>%
  mutate("Volume share" = round(Volume / sum(Volume) * 100, 2),
         "Cumulative share" = cumsum(`Volume share`))

KA_top = KA[KA$`Number of stores` > 1, ]
head(KA, 10)

#number of Locations
Location = cheese2 %>%
  group_by(LOCATION, RETAILER) %>%
  summarise(VOLUME = sum(VOLUME)) %>%
  group_by(LOCATION) %>%
  summarise("Number of stores" = n(), Volume = sum(VOLUME)) %>%
  arrange(-Volume) %>%
  mutate("Volume share" = round(Volume / sum(Volume) * 100, 2),
         "Cumulative share" = cumsum(`Volume share`))

anova_KA = aov(VOLUME~KEY_ACCOUNT, data=cheese2)
summary(anova_KA)

anova_LOC = aov(VOLUME~LOCATION, data=cheese2)
summary(anova_LOC)

#price exploration
hist(cheese2$PRICE, 50, main = "Selling price distribution", xlab = "")

#advertising exploration
hist(cheese2$DISP, 50, main = "Display Advertising distribution", xlab = "")

#pairwise distribution
cheese2$logVolume = log(cheese2$VOLUME)

pairs(cheese2[c("logVolume", "PRICE", "DISP")])

## split by train and test ############
set.seed(123)
train_id = NULL
cheese2$row_id = 1:dim(cheese2)[1]

train_id = vector()
for(KA in unique(cheese2$KEY_ACCOUNT)){
# KA = unique(cheese2$KEY_ACCOUNT)[1]
cur_id = cheese2[cheese2$KEY_ACCOUNT == KA, "row_id"]
cur_sample = cur_id[1:(length(cur_id) - 1)]  # all but the last observation of each key account go to training

if(length(train_id)==0){
train_id = cur_sample
}else{
train_id = c(train_id,cur_sample)
}
}
train <- cheese2[train_id, ]
test <- cheese2[-train_id, ]

dim(train)
dim(test)

#Modeling ######################
# Simple regression model
smp_size <- floor(0.8 * nrow(cheese2))

# reference model
mod_lm = lm(log(VOLUME)~PRICE+DISP, data=cheese2)
summary(mod_lm)

par(mfrow=c(1,2))
plot(mod_lm)
par(mfrow=c(1,1))

exp(-0.39283)  # back-transformed PRICE coefficient from the log-volume model
exp(0.54045)   # back-transformed DISP coefficient

#test model prediction
test$pred = exp(predict(mod_lm,test))
par(mfrow=c(1,1))
plot(test$VOLUME, test$pred, xlim=c(0,10000), ylim=c(0,10000))
mape(test$VOLUME, test$pred)

## Bayesian hierarchical model ##################
## Model 1: at Key Account level
mod_string1 = "model{
#likelihood
for(i in 1:length(y)){
y[i] ~ dnorm(nu[i], prec)
nu[i] = b0[KA[i]]+b1[KA[i]]*PRICE[i]+b2[KA[i]]*DISP[i]
}

#prior
for(j in 1:max(KA)){
b0[j] ~ dnorm(mu0,prec0)
b1[j] ~ dnorm(mu1,prec1)
b2[j] ~ dnorm(mu2,prec2)
}

mu0 ~ dnorm(0.0,1.0/1.0e6)
mu1 ~ dnorm(0.0,1.0/1.0e6)
mu2 ~ dnorm(0.0,1.0/1.0e6)

prec0 ~ dgamma(1/2.0, 1*10.0/2.0)
prec1 ~ dgamma(1/2.0, 1*10.0/2.0)
prec2 ~ dgamma(1/2.0, 1*10.0/2.0)
tau0 = sqrt(1.0/prec0)
tau1 = sqrt(1.0/prec1)
tau2 = sqrt(1.0/prec2)

prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig = sqrt(1.0/prec)
}"

set.seed(116)
data_jags1 = list(y=log(train$VOLUME), KA=as.numeric(train$KEY_ACCOUNT),
PRICE=train$PRICE,
DISP=train$DISP)

params1 = c("b0", "b1", "b2", "mu0", "mu1", "mu2", "sig", "tau0", "tau1", "tau2")
mod1 = jags.model(textConnection(mod_string1),data=data_jags1,n.chains=3)
update(mod1,1e3)

mod_sim1 = coda.samples(model=mod1,
variable.names = params1,
n.iter=30e3)

#convergence diagnostics
# plot(mod_sim)

gelman.diag(mod_sim1)
effectiveSize(mod_sim1)

dic1 = dic.samples(mod1, n.iter=1e3)
summary(mod_sim1)

mod_csim1 = as.mcmc(do.call(rbind, mod_sim1))

pred_coef1 = apply(mod_csim1, 2, mean)
pred_coef1[c("mu0", "mu1", "mu2")]

exp(pred_coef1["mu1"])
exp(pred_coef1["mu2"])

# predict train data
train$pred1 = 0
for(i in 1:length(train$VOLUME)){
# i = 2
KA = as.numeric(train$KEY_ACCOUNT)[i]
train$pred1[i] = exp(pred_coef1[paste0("b0[", KA, "]")] +
                     pred_coef1[paste0("b1[", KA, "]")] * train$PRICE[i] +
                     pred_coef1[paste0("b2[", KA, "]")] * train$DISP[i])
}
par(mfrow=c(1,1))
hist(train$pred1, breaks = 40)
plot(train$VOLUME, train$pred1)

mape(train$VOLUME, train$pred1)

# predict test data
test$pred1 = 0
for(i in 1:length(test$VOLUME)){
# i = 2
KA = as.numeric(test$KEY_ACCOUNT)[i]
test$pred1[i] = exp(pred_coef1[paste0("b0[", KA, "]")] +
                    pred_coef1[paste0("b1[", KA, "]")] * test$PRICE[i] +
                    pred_coef1[paste0("b2[", KA, "]")] * test$DISP[i])
}

plot(test$VOLUME, test$pred1)
mape(test$VOLUME, test$pred1)

## Model 2: at store level
mod_string2 = "model{
#likelihood
for(i in 1:length(y)){
y[i] ~ dnorm(nu[i], prec)
nu[i] = b0[r[i]]+b1[r[i]]*PRICE[i]+b2[r[i]]*DISP[i]
}

#prior
for(j in 1:max(r)){
b0[j] ~ dnorm(mu0,prec0)
b1[j] ~ dnorm(mu1,prec1)
b2[j] ~ dnorm(mu2,prec2)
}

mu0 ~ dnorm(0.0,1.0/1.0e6)
mu1 ~ dnorm(0.0,1.0/1.0e6)
mu2 ~ dnorm(0.0,1.0/1.0e6)

prec0 ~ dgamma(1/2.0, 1*10.0/2.0)
prec1 ~ dgamma(1/2.0, 1*10.0/2.0)
prec2 ~ dgamma(1/2.0, 1*10.0/2.0)
tau0 = sqrt(1.0/prec0)
tau1 = sqrt(1.0/prec1)
tau2 = sqrt(1.0/prec2)

prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig = sqrt(1.0/prec)
}"

set.seed(116)
data_jags2 = list(y=log(train$VOLUME), r=as.numeric(train$RETAILER),
PRICE=train$PRICE,
DISP=train$DISP)

params2 = c("b0", "b1", "b2", "mu0", "mu1", "mu2", "sig", "tau0", "tau1", "tau2")
mod2 = jags.model(textConnection(mod_string2),data=data_jags2,n.chains=3)
update(mod2,1e3)
mod_sim2 = coda.samples(model=mod2,
variable.names = params2,
n.iter=50e3)

#convergence diagnostics
# plot(mod_sim)

gelman.diag(mod_sim2)
effectiveSize(mod_sim2)

dic2 = dic.samples(mod2, n.iter=1e3)

summary(mod_sim2)
dic1
dic2

mod_csim2 = as.mcmc(do.call(rbind, mod_sim2))

pred_coef2 = apply(mod_csim2, 2, mean)
pred_coef2[c("mu0", "mu1", "mu2")]

exp(pred_coef2["mu1"])
exp(pred_coef2["mu2"])

# predict train data
train$pred2 = 0
for(i in 1:length(train$VOLUME)){
# i = 2
r = as.numeric(train$RETAILER)[i]
train$pred2[i] = exp(pred_coef2[paste0("b0[", r, "]")] +
                     pred_coef2[paste0("b1[", r, "]")] * train$PRICE[i] +
                     pred_coef2[paste0("b2[", r, "]")] * train$DISP[i])
}
par(mfrow=c(1,1))
hist(train$pred2, breaks = 40)
plot(train$VOLUME, train$pred2)

mae(train$VOLUME, train$pred2)
mae(train$VOLUME, train$pred1)

dic2

# predict test data
test$pred2 = 0
for(i in 1:length(test$VOLUME)){
# i = 2
r = as.numeric(test$RETAILER)[i]
test$pred2[i] = exp(pred_coef2[paste0("b0[", r, "]")] +
                    pred_coef2[paste0("b1[", r, "]")] * test$PRICE[i] +
                    pred_coef2[paste0("b2[", r, "]")] * test$DISP[i])
}

plot(test$VOLUME, test$pred2, xlab = "Actual", ylab = "Predicted",
     main = "Model output vs. actual volume \non out-of-sample data")

mape(test$VOLUME, test$pred)
mape(test$VOLUME, test$pred1)
mape(test$VOLUME, test$pred2)

#### Analysis of coefficients
coef_df = data.frame(pred_coef2)
coef_df$coef_str = row.names(coef_df)

coef_df$coef = gsub("(.+)\\[.*$", "\\1", coef_df$coef_str)                           # coefficient name without the index
coef_df$ret_code = as.numeric(gsub(".+\\[([0-9]+)\\].*$", "\\1", coef_df$coef_str))  # retailer index (NA for hyperparameters)

retailers = levels(cheese2$RETAILER)
coef_df$retailer = retailers[coef_df$ret_code]
head(coef_df)

coef_df2 = dcast(data = coef_df[c("retailer", "coef", "pred_coef2")],
                 formula = retailer ~ coef, fun.aggregate = sum)

head(coef_df2,2)

## histograms for coeficients
hist(coef_df2$b0, 30, main = "Intercept distribution (b0)", xlab = "")
hist(coef_df2$b1, 30, main = "Price sensitivity distribution (b1)", xlab = "")
hist(coef_df2$b2, 30, main = "Ad effectiveness distribution (b2)", xlab = "")

coef_df2$mu1
coef_df2$mu2

exp(10.31895)   # baseline volume implied by the mean intercept
exp(-0.799909)  # multiplicative effect of a $1 price increase
exp(1.160073)   # multiplicative effect of going from 0% to 100% display

mean(cheese2$PRICE)
mean(cheese2$DISP)

#scatterplot
par(mfrow=c(1,1))
plot(coef_df2$b1, coef_df2$b2, xlab = "Price coefficient", ylab = "Ad coefficient",
     main = "Stores clusterisation")
abline(v = coef_df2$mu1[89], lty = 2)  # row 89 holds the pooled means (mu parameters) after dcast
abline(h = coef_df2$mu2[89], lty = 2)

#segmenting stores based on coeficients
##Group 1: high sensitivity & low ad effectiveness
grp1 = coef_df2[(coef_df2$b1 < coef_df2$mu1[89]) & (coef_df2$b2 < coef_df2$mu2[89]), ]
dim(grp1)

##Group 2: high sensitivity & moderate ad effectiveness
grp2 = coef_df2[(coef_df2$b1 < coef_df2$mu1[89]) & (coef_df2$b2 > coef_df2$mu2[89]), ]
dim(grp2)

##Group 3: low sensitivity & low ad effectiveness
grp3 = coef_df2[(coef_df2$b1 > coef_df2$mu1[89]) & (coef_df2$b2 < coef_df2$mu2[89]), ]
dim(grp3)

#bubble chart
p <- plot_ly(coef_df2, x = ~b1, y = ~b2, text = ~retailer, type = "scatter", mode = "markers",
             marker = list(size = ~b0, opacity = 0.5)) %>%
  layout(title = "Price sensitivity and ad efficiency by retailers",
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))

# hierarchical clustering of stores on the price and ad coefficients
coef_dist = dist(coef_df2[1:88, c("b1", "b2")])
hc = hclust(coef_dist, method = "complete")
plot(hc)

coef_df2$group =factor(c(cutree(hc, k = 5),NA))

p <- ggplot(data = coef_df2[1:88, ],
            mapping = aes(b1, b2, size = b0, text = retailer, color = group)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  geom_vline(xintercept = coef_df2$mu1[89]) +
  geom_hline(yintercept = coef_df2$mu2[89]) +
  ggtitle("Clusterization of stores") +
  xlab("price sensitivity") +
  ylab("ad response")

ggplotly(p)
