Analyzing time between bookings and events with R + Google Analytics
As a freelance Digital Analyst, you end up working with loads of different business models, even if the overall business objective is always the same. But focusing too much on aggregated objectives can mislead you. What’s valuable, and challenging, is understanding the customer behavior that led to a prioritized action and final conversion.
Most of us work with clients that have some kind of booking engine, and one recurring question is the time between a booking and the actual date of the event. This event could be anything from a flight or train journey to cinema tickets, a restaurant reservation, a live performance or a hotel booking. Let’s give this metric an acronym and call it BTE (Booking to Event).
So what’s the story? To increase marketing efficiency, we need to understand customer behavior. Who are the customers, where are they coming from and what are their motivations? By gaining insights about channels and individuals that show either spontaneity or a more rational, planned behavior, we can prepare a number of efforts. Regardless of the category of event, a prevailing hypothesis is that few days between the online booking and the event (a low BTE) indicates a spontaneous action.
To understand what effect each digital marketing channel has on booking behavior, we need to start by visualizing the current situation. Of course, today’s digital consumer journey is complicated and demands more than just analyzing the last interaction before conversion, but we have to crawl, walk and run as always.
Let’s run an example that contains real data from a booking engine. I used Google Analytics as the data source and the statistical programming language R for importing, wrangling and visualizing the data. If you’re stuck reporting the same data over and over again, you might be interested in increasing your level of efficiency and learning the basics of R. I’m convinced that reproducible, efficient analysis can’t be ignored if you want to stay productive and competitive much longer.
Installing R & RStudio
Start by downloading R and the IDE, RStudio. Then go to the best available online guide to get started with Google Analytics for R, including some awesome standalone examples from Mark Edmondson and Tim Wilson. The correlation between these guys’ contributions to the digital analytics community and my yearly turnover is strong. Hats off to you! 📈 🍾
Step 1 — Are you asking the right questions?
The first thing we need to do before putting our headphones on is to make sure that the actual questions will lead to some measurable business impact. I prefer sitting down with the marketing and sales people to collect all input at once and get answers about what the potential insights will ultimately lead to. Two common questions from the sales department are:
“How can we increase the number of bookings in the short term?” and “Which user segment should be activated to increase sales?”
The marketing team might be more specific and ask:
“How can we increase marketing efficiency, and where should we place our budget?” and “How fast can we expect results if we boost marketing channel X?”
The question “Which user segment should we activate?” means that we need to make sure that users can be categorized. The question “How fast can we expect results?” requires that conversion is measured correctly (which is not always the case), but also insight into the number of visits before purchase. We also have to know how to perform simple predictive analytics and provide qualified answers based on historical data. Always make sure that both competence and quality-assured data are in place before proceeding.
These questions need to be reformulated into the language of analytics, and approved, before crunching time. Let’s exemplify:
- What’s the duration, in weeks, between an online reservation and the actual event?
- What are the behavioral differences between the marketing channels?
- Are there any differences between various device and user categories?
- What’s the ideal marketing spend on channel X to achieve our revenue goal?
Step 2 — Efficiency through planning and preparation
The value of trial and error can surprise you, but I do recommend defining all the metrics and dimensions you need before assembling your data and conducting a proof of concept, especially when working with R. For this specific example, I used custom dimensions installed in Google Analytics to receive information about the unique transaction id, the date of the reservation, the date of the event and the customer type (B2C/B2B).
I usually start with Google Analytics Query Explorer to make sure that the data can be exported. From there you can easily export the data in .TSV format, convert it to Excel and import it as a dataset in RStudio if that’s your thing.
Keep in mind that the success of your process and initiatives is related to how well the analytics deliverable meets the expectations of stakeholders. Therefore you need an agreed plan for how, when and to whom the data should be reported, since there could be different stakeholders with various competencies. Besides asking the right questions, it’s necessary to agree on the preferred communication method, activities, deliverable format and time frame. And as always, tell a story based on learned insights; don’t report exact numbers.
N.B. In this specific case, we know that a significant part of the bookings was made during the first visit, which is why I didn’t separate the data between new and returning visits. To verify this yourself, just add the filter ga:userType equals New Visitor when pulling your GA data.
Step 3 — Pull data and create your data frame
N.B., I’m a novice user of R with the slogan “make it work before refactoring”, so you might be able to do this with much less code; I hope you do. 🤓
First of all, we need to install and load the needed libraries. You can either do this in your source code or click Packages > Install from the CRAN repository.
# Load the library, authorize Google Analytics and set your View ID
library(googleAnalyticsR)
ga_auth()
view_id <- xxxxxxxxx
If you have followed the initial setup guide on Dartistics, the ga_auth() function should load .httr-oauth. I chose to store my Google API credentials in a .Renviron file as an environment variable that loads on RStudio startup, but there are other ways of doing this. If you’re planning to re-run your scripts for different GA Views, I’d recommend this method.
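As a sketch of that environment-variable approach (the variable name GA_VIEW_ID is my own choice, not a convention):

```r
# In ~/.Renviron (read once when R/RStudio starts):
# GA_VIEW_ID=123456789

# In the script, read the View ID instead of hard-coding it
view_id <- Sys.getenv("GA_VIEW_ID")
```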
# Create your data frame based on given conditions
gadata <- google_analytics_4(view_id,
                             date_range = c(Sys.Date() - 365, Sys.Date() - 1),
                             metrics = "users",
                             dimensions = c("date", "transactionId", "ga:dimension1", "ChannelGrouping"),
                             filters = c("ga:dimension2==b2c"),
                             anti_sample = TRUE)
anti_sample = TRUE will collect your GA data sequentially to avoid sampling. If no error messages fired, you should be able to view your data frame by writing View(gadata). In this case only 37 000 rows are downloaded, but for larger datasets I recommend returning a limited number of rows instead, e.g. by writing head(gadata, n = 100).
If you feel like renaming your columns to something more explanatory, which you should, go ahead and install the library dplyr and follow this guide.
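A minimal dplyr sketch of that renaming, on a toy frame standing in for the GA export (the new names are just examples):

```r
library(dplyr)

# Toy frame standing in for the GA export
gadata <- data.frame(dimension1      = "2017-05-01",
                     ChannelGrouping = "Direct")

# rename() takes new_name = old_name pairs
gadata <- rename(gadata,
                 EventDate = dimension1,
                 Channel   = ChannelGrouping)
```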
One common issue when pulling data, especially from GA, is that dimensions and metrics may arrive in a format that needs to be converted before you can plot or perform calculations, e.g. converting characters to numeric values, or rearranging ISO-standard dates (YYYY-MM-DD to DD-MM-YYYY) etc. Check your variable types by writing str(gadata).
So, let’s continue with some preparatory work.
# Convert dimension booking date (Char) to Date
gadata$BookingDate <- as.Date(gadata$dimension1)
# Add a new column for days between booking and the event
gadata$Days <- gadata$BookingDate - gadata$date
# Aggregate days to weeks
gadata$Days <- gadata$Days / 7
# Round to zero digits
gadata$Days <- round(gadata$Days, digits = 0)
# Convert the column from datatype difftime to numeric
gadata$Days <- as.numeric(gadata$Days)
# Rename df column Days to Weeks (dplyr syntax is new_name = old_name)
gadata <- rename(gadata, Weeks = Days)
# Or, instead of dplyr, use the following code
names(gadata)[names(gadata) == 'Days'] <- 'Weeks'
So, we start by calculating the days between the two dates. Then we create a new column and convert days into weeks by dividing by 7, and round the weeks to zero digits. Don’t forget to convert the variable Days from difftime to a numeric data type. Finally, rename the variable Days to Weeks.
Let’s view our updated data frame by typing head(gadata, n = 10) in the source code. If you prefer another tool for visualizing your data, you can export it as CSV with write.csv(gadata, file = "MyData.csv"). The file will be placed in your working directory by default.
Q1. What’s the duration between an online booking and the actual event?
Let’s go ahead and plot our data frame with ggplot2, choosing a histogram to understand the distribution of our data. A histogram lets us quickly see where a relatively large amount of data is situated and where little data is to be found. We choose to plot 30 weeks, one week at a time. Changing the by-value to 10 would aggregate weeks into 0–10, 10–20 and 20–30, but let’s stick to 1. (I’m not aiming for any design awards, so let’s keep it simple but understandable.)
# Plot number of weeks between booking and the event with ggplot2
ggplot(data = gadata, aes(gadata$Weeks)) +
  geom_histogram(breaks = seq(0, 30, by = 1),
                 alpha = 0.9) +
  labs(title = "Weeks between online booking and the event") +
  labs(x = "# of Weeks", y = "# of Bookings") +
  theme(panel.border = element_blank(), panel.background = element_blank())
Clench your fist, close your eyes and…
Weekly distribution between online booking and the event
Voilà! This shows the distribution of weeks for BTE. Now count the share of bookings and events that occurred within 1 week. The answer we get is 29 %.
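The 29 % figure is just the share of rows with at most one week of BTE; a sketch of the calculation on made-up numbers (not the article’s data):

```r
# Toy vector standing in for gadata$Weeks
Weeks <- c(0, 0, 1, 2, 5, 12, 0, 30, 1, 3)

# Share of bookings whose event happened within one week
share_one_week <- round(mean(Weeks <= 1) * 100)
share_one_week
```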
- 3 out of 10 events were booked within a 7-day period
- The booking service clearly shows a short-term planning behavior
- Fast strategic marketing efforts seem to pay off in a short amount of time
- Share insights about the spontaneous behavior
- Continue and isolate the behavior that drives a 7-day BTE
Q2. What’s the difference between each marketing channel?
If we’re planning online marketing activities, we need to know where to place our budget to gain an effect in the shortest period of time. Simply add facet_grid(ChannelGrouping ~ ., scales = "free") + to the ggplot call. Before hitting run, we need to remove the source “(Other)”, since that channel tells us nothing except that there’s room for improvement in GA, and it often contains few records.
# Remove rows where Channel Group = "(Other)"
gadataChannels = gadata[gadata$ChannelGrouping != "(Other)", ]
gadataChannels = data.frame(gadataChannels)
To visualize the plot with a density line, comment out the original histogram line:
#geom_histogram(breaks=seq(0, 30, by = 1), alpha = 0.9) +
and replace it with a version that puts density on the Y-axis, adding geom_density:
geom_histogram(aes(y = ..density..), breaks = seq(0, 30, by = 1)) +
geom_density(alpha = .3, col = "#4294F7", fill = "#4294F7") +
BTE per marketing channel
The chart shows the number of bookings (y-scale) and total weeks (x-scale, 0–30) to an event by marketing channel.
- Clearly, most visitors came from Direct, Organic Search and Paid Search
- These channels show a shorter BTE and naturally represent a more significant proportion of returning users
- Referral and Social show a longer customer journey than Organic Search, Paid Search and E-mail marketing
- The data strengthens the case for using channels not only to increase awareness and reach, but also for communicating with and nurturing existing customers
- The exact number of bookings and events is of course interesting, but let’s dig deeper and increase the granularity by calculating, for each marketing channel, the percentage of BTE that occurred within the first seven days
# Create a new df (ChannelDF) based on Channel and Booking Days
BookingChannel = gadata$ChannelGrouping
BookingDays = gadata$BookingDate - gadata$date
ChannelDF = data.frame(BookingChannel, BookingDays)
# Convert difftime to num and double to char
ChannelDF$BookingDays <- as.numeric(ChannelDF$BookingDays)
ChannelDF$BookingChannel <- as.character(ChannelDF$BookingChannel)
# Create new df's from each marketing channel
Organic    <- ChannelDF[ChannelDF$BookingChannel == "Organic Search", ]
Mail       <- ChannelDF[ChannelDF$BookingChannel == "Email", ]
Direct     <- ChannelDF[ChannelDF$BookingChannel == "Direct", ]
Referral   <- ChannelDF[ChannelDF$BookingChannel == "Referral", ]
PaidSearch <- ChannelDF[ChannelDF$BookingChannel == "Paid Search", ]
Social     <- ChannelDF[ChannelDF$BookingChannel == "Social", ]
We created a data frame named ChannelDF containing all of the marketing channels together with the days between booking and event from our earlier variables. To calculate the share of total BTE that occurred within the first week, we need to run through each channel. Let’s exemplify this with Organic Search.
# Sum all rows in Organic and create the variable SumOrganic
SumOrganic <- nrow(Organic)
# Convert Days to Weeks and round to 1 digit
Organic <- round(Organic$BookingDays/7,digits=1)
# Count bookings where the rounded number of weeks is below 2
OneWeekOrganic <- length(subset(Organic, Organic < 2))
# Calculating share of total
OrganicShare <- OneWeekOrganic/SumOrganic*100
The variable OrganicShare gives us the value 37.7805 (38%). So almost 4 out of 10 bookings were followed by an event within a seven-day period for Organic Search. Now add the same code for the rest of the channels to compare the results.
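Rather than repeating the block once per channel, the same shares can be computed in a single pass with dplyr; a sketch on toy data (column names follow ChannelDF above, the numbers are invented):

```r
library(dplyr)

# Toy stand-in for ChannelDF
ChannelDF <- data.frame(
  BookingChannel = c("Organic Search", "Organic Search", "Direct", "Direct"),
  BookingDays    = c(3, 40, 5, 6)
)

# Per channel: share of bookings with a rounded BTE below 2 weeks
shares <- ChannelDF %>%
  mutate(Weeks = round(BookingDays / 7, digits = 1)) %>%
  group_by(BookingChannel) %>%
  summarise(PercentBTE = round(mean(Weeks < 2) * 100, digits = 2))
```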
# Create two new vectors with values for each channel
PercentBTE <- c(OrganicShare, MailShare, DirectShare, PaidSearchShare, ReferralShare, SocialShare)
ChannelGroup <- c("Organic", "Mail", "Direct", "Paid Search", "Referral", "Social")
# Round percentage variables and concatenate variables to a new df
PercentBTE <- round(PercentBTE,digits=2)
#Construct a df based on the two vectors above
DFCG = data.frame(PercentBTE, ChannelGroup)
Finally, create a plot based on the data frame DFCG above.
# Plot data showing BTE for each channel
ggplot(DFCG, aes(x = ChannelGroup, y = PercentBTE)) +
  geom_bar(stat = "identity", fill = "#80c3e5") +
  geom_text(aes(label = PercentBTE), vjust = 0) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(panel.border = element_blank(), panel.background = element_blank()) +
  labs(title = "Share of BTE within a 7-day period") +
  labs(x = "Channel Groups", y = "% BTE within 7 days")
BTE within 7 days per channel group
The figure visualizes the total share of bookings that were made within a seven-day period of the actual event, for each marketing channel.
- Paid Search and E-mail marketing are the most suitable channels for boosting revenue in the shortest period of time
- Referral and Social seem to have a wide time range between booking and event
- The effect of SEO often takes time, and traffic from referral and direct is more an indicator and effect of something else, and is therefore down-prioritized
- Take into consideration that Social and Referral are apparently more common at the very beginning of the customer journey
- To boost sales within a short time frame, we need to go for Paid Search
As we know, everything is about audience targeting and the psychology of the ad. To boost sales we need to create a sense of urgency in our ads. But before placing all our bets on PPC, we need to understand which audiences show the highest conversion rates. Let’s collect insight into how the chart above is reflected within different user categories.
Q3. Are there any differences between various device and user categories?
To make sure we maintain a cost-efficient ad spend, we can analyze each device category. Let’s reuse the same data frame as in the beginning and add a filter for each device category (desktop, mobile, tablet).
filters = c("ga:deviceCategory==desktop;ga:dimension5==b2c")
Let’s pull the GA data once again, now with our new filter in place. Run through the complete R code from earlier and construct three data frames, one per device category.
DFCG_Desktop = data.frame(PercentBTE, ChannelGroup)
names(DFCG_Desktop)[names(DFCG_Desktop) == 'PercentBTE'] <- 'Desktop'
DFCG_Mobile = data.frame(PercentBTE, ChannelGroup)
names(DFCG_Mobile)[names(DFCG_Mobile) == 'PercentBTE'] <- 'Mobile'
DFCG_Tablet = data.frame(PercentBTE, ChannelGroup)
names(DFCG_Tablet)[names(DFCG_Tablet) == 'PercentBTE'] <- 'Tablet'
# Merge all of the df above to one (DFCG_Device)
DFCG_Device = data.frame(DFCG_Desktop$ChannelGroup, DFCG_Desktop$Desktop, DFCG_Mobile$Mobile, DFCG_Tablet$Tablet)
# Rename df columns
names(DFCG_Device)[names(DFCG_Device) == 'DFCG_Desktop.ChannelGroup'] <- 'ChannelGroup'
names(DFCG_Device)[names(DFCG_Device) == 'DFCG_Desktop.Desktop'] <- 'Desktop'
names(DFCG_Device)[names(DFCG_Device) == 'DFCG_Mobile.Mobile'] <- 'Mobile'
names(DFCG_Device)[names(DFCG_Device) == 'DFCG_Tablet.Tablet'] <- 'Tablet'
# View df
View(DFCG_Device)
Each device df is then merged into one single df and renamed. Before going further and plotting our data, we need to convert it to long format by melting it with the library reshape2. There are of course several ways to configure your chart with ggplot; this is just an example of some settings to play around with.
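The melt step could look like this (m.device is my own name for the long-format result; the numbers are invented):

```r
library(reshape2)

# Toy stand-in for DFCG_Device
DFCG_Device <- data.frame(
  ChannelGroup = c("Organic", "Direct"),
  Desktop      = c(38, 45),
  Mobile       = c(42, 50),
  Tablet       = c(30, 33)
)

# Wide to long: one row per channel/device combination,
# with the device in "variable" and the share in "value"
m.device <- melt(DFCG_Device, id = "ChannelGroup")
```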
# Plot data showing BTE for each channel per device category
# (m.device is the melted long-format df; reshape2's melt() names the
# columns "variable" and "value")
ggplot(m.device, aes(x = ChannelGroup, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.6), width = 0.5) +
  scale_fill_manual("legend", values = c("Desktop" = "#13A1FC", "Mobile" = "#052940", "Tablet" = "#0B507F")) +
  geom_text(aes(label = paste(value, "%")),
            position = position_dodge(width = 0.5), vjust = -0.4, size = 3.5) +
  theme(panel.border = element_blank(), panel.background = element_blank()) +
  theme(axis.text = element_text(size = 11)) +
  theme(axis.title = element_text(size = 13)) +
  labs(title = "Share of BTE per Marketing Channel") +
  labs(x = "Channel Groups", y = "Share% BTE within 7 days")
BTE within seven days per marketing channel
The chart gives us the share of BTE that occurred within seven days for each marketing channel, split by device. So, are there any knockout differences among the devices? Aiming for Paid Search, we’d like to know the differences between mobile and desktop in particular. It seems that regardless of device, Paid Search is the winner when aiming for spontaneous customers.
- Desktop users show a surprisingly high BTE for direct traffic
- Each device shows few days of BTE for Paid Search
- Users from Social might be social human beings, but far from spontaneous
- Mobile users seem to be more “on the go” and spontaneous
- Apparently users don’t read many e-mails on their tablets
- Increase the PPC budget for mobile and maybe desktop, but not tablet
- Adjust the social media ad strategy to be considered a long-term activity
- Collect insights about when during the day users convert from Paid Search
Still here? You’re a hero! 🏄🏻 So, let’s collect insight about converting users from Paid Search. We start by creating a new df and pulling the requested data from GA.
# Create your data frame based on given conditions
df_paidsearch <- google_analytics_4(view_id,
                                    date_range = c("2017-01-01", "2017-12-31"),
                                    metrics = c("transactions"),
                                    dimensions = c("dayOfWeek", "hour", "userGender", "userAgeBracket"),
                                    filters = c("ga:channelGrouping==Paid Search;ga:dimension5==B2C"),
                                    anti_sample = TRUE)
Let’s skip the code for removing BTE longer than 7 days and continue. A heatmap is a killer visualization for presenting multiple values by category. You can choose to plot the heatmap with or without text values. If your data has loads of outliers (observation points that are distant from the other observations), you might need to divide your data into several breaks instead.
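One way to handle such outliers (a sketch, not from the article) is to bin the fill values into discrete breaks with cut() and map the bins to the fill aesthetic instead:

```r
# Toy transaction counts with one extreme outlier
transactions <- c(3, 10, 25, 60, 7600)

# Bin into discrete intervals so one spike doesn't flatten the colour scale
bins <- cut(transactions,
            breaks = c(0, 10, 50, 100, Inf),
            labels = c("0-10", "11-50", "51-100", "100+"))
```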
# Plot the df values as a ggplot heatmap
df_paidsearch %>%
  ggplot(aes(x = dayOfWeek, y = hour, fill = transactions)) +
  geom_tile(colour = "black") +
  geom_text(aes(label = transactions), size = 3, lineheight = 4, colour = "black") +
  # Split by age and gender (faceting inferred from the chart caption)
  facet_grid(userAgeBracket ~ userGender) +
That’s the core functionality, let’s carry on with the cosmetics.
# Re-order and rename the X-scale (GA's dayOfWeek uses 0 = Sunday)
scale_x_discrete(limits = c("1", "2", "3", "4", "5", "6", "0"),
                 labels = c("1" = "Monday",
                            "2" = "Tuesday",
                            "3" = "Wednesday",
                            "4" = "Thursday",
                            "5" = "Friday",
                            "6" = "Saturday",
                            "0" = "Sunday")) +
# Rename the Y-scale hours
scale_y_discrete(labels = c("00" = "00:00", "01" = "01:00", "02" = "02:00",
                            "03" = "03:00", "04" = "04:00", "05" = "05:00",
                            "06" = "06:00", "07" = "07:00", "08" = "08:00",
                            "09" = "09:00", "10" = "10:00", "11" = "11:00",
                            "12" = "12:00", "13" = "13:00", "14" = "14:00",
                            "15" = "15:00", "16" = "16:00", "17" = "17:00",
                            "18" = "18:00", "19" = "19:00", "20" = "20:00",
                            "21" = "21:00", "22" = "22:00", "23" = "23:00")) +
# Style color of the heatmap
scale_fill_gradient(low = '#FFFFA6', high = 'red', name="Bookings",limits= c(0,7600)) +
# Declare description texts for title, Y and X scale
labs(title = "BTE within 7 days from Paid Search split by age and gender", x="Weekday", y="Hour of day") +
# Change the appearance and the orientation angle
theme(
  axis.text.x = element_text(face = "plain", color = "black", size = 9, angle = 90),
  axis.text.y = element_text(face = "plain", color = "black", size = 9, angle = 0),
  axis.line = element_line(colour = "black", size = 0, linetype = "solid"),
  strip.text.x = element_text(size = 11, angle = 0, face = "bold"),
  strip.text.y = element_text(size = 11, face = "bold"),
  strip.background = element_rect(colour = "white", fill = "#FFFFFF"),
  panel.background = element_blank())
BTE within 7 days from Paid Search per Age Category and Gender
The heatmap shows the number of bookings (transactions) by hour of day and weekday (Monday–Sunday), split by age and gender category. The highest value is shown in red, the lowest in light yellow. These categories can be replaced to show device (see below), interests, returning vs. new visits, marketing channels, conversion rates, custom dimensions etc. Just replace the metrics and dimensions in the df.
BTE within 7 days from Paid Search per Age Category and Device
- Most bookings are made between 10:00 and 13:00 (and broadly 09:00–16:00)
- Monday to Friday seem to be popular days for reservations
- People between 25–44 make most of the bookings
- The female population makes more reservations than men
- During weekends, device behavior switches from desktop to mobile
- As we get older, our sleep patterns change
- The age category 18–24 doesn’t seem to use tablets that often
- Target your audience group to people between 25–54
- Also, place more media spend on women than on men
- Check whether women’s and men’s mobile behavior differs throughout the day
- Print the plot and put it up somewhere in the organization
- Adjust the target group differently during weekends to match the behavior
Q4. How do we determine the advertising cost needed to gain X revenue?
Now we know the booking behavior for each marketing channel, which channel to boost, and which audience group to target. It’s time to predict the budget for Paid Search needed to reach our sales objectives. Let’s start!
First of all, we need to know whether there’s any correlation between advertising cost for Paid Search and booking revenue. Let’s kick off, create our last data frame and pull the data from GA. The challenge with AdWords dimensions in GA is that not all of them can be combined with others; e.g., the dimension ga:adCost is not combinable with channelGrouping or any other segment. Therefore we construct a filter on paid visits (medium = cpc) to isolate Google AdWords traffic.
df_ppc_budget <- google_analytics_4(view_id,
                                    date_range = c("2017-08-01", "2018-02-22"),
                                    metrics = c("adCost", "transactionRevenue"),
                                    dimensions = c("date"),
                                    filters = c("ga:medium==cpc;ga:dimension5==b2c"),
                                    anti_sample = TRUE)
# Check the variables' formats
str(df_ppc_budget)
# Remove the outlier on row 116 (in this case, Black Friday)
df_ppc_budget <- df_ppc_budget[-c(116), ]
# Convert date from char to Date
df_ppc_budget$date <- as.Date(df_ppc_budget$date)
# Check the correlation between Cost and Revenue
round(cor(df_ppc_budget$adCost, df_ppc_budget$transactionRevenue), 2)
First we pull our GA data, then we make sure to remove any outliers. I usually don’t recommend deleting data from a data frame, but since Black Friday causes crazy spikes, I remove the observation on row 116. Then we conduct a correlation study between the variables adCost and transactionRevenue.
The correlation coefficient measures the strength and direction of a linear relationship between two variables in a scatterplot. It ranges from -1.0 to +1.0; the closer r is to +1 or -1, the more closely the two variables are related (positive or negative correlation). Our correlation (r = 0.63) is not perfect, but still strong; the 95 % confidence interval ranges from 0.53 to 0.72.
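That interval comes from a correlation test; in R, cor.test() reports both r and the 95 % confidence interval directly (a sketch on simulated data, not the article’s numbers):

```r
set.seed(42)

# Simulated cost/revenue pairs with a linear relationship plus noise
adCost             <- runif(100, min = 100, max = 1000)
transactionRevenue <- 5 * adCost + rnorm(100, sd = 800)

ct <- cor.test(adCost, transactionRevenue)
ct$estimate  # correlation coefficient r
ct$conf.int  # 95 % confidence interval for r
```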
Let’s create a more visual picture of the correlation. First we melt the columns adCost and transactionRevenue into one single column. Then we plot our chart with the two variables on the Y-axis and date on the X-axis.
# Melt adCost with transactionRevenue to one single column
m.df <- melt(df_ppc_budget, id="date")
# Plot the result with a smooth line
ggplot(m.df, aes(x = date, y = value, colour = variable)) +
  geom_smooth(se = FALSE) +
  scale_x_date(date_breaks = "1 month",
               limits = as.Date(c('2017-08-01', '2018-02-22'))) +
  labs(title = "Correlation study between Cost & Revenue", x = "Date", y = "Cost & Revenue")
Correlation between adCost and transactionRevenue by month
Between August and November 2017 the variables seem to correlate. Some outliers around Christmas and early 2018 make revenue go up before it returns to normal again. It’s possible to understand the relationship between cost and revenue with this view, but to calculate predictions more accurately we need to build a linear regression model first. Hold on…
# Build the linear regression model. The response goes on the left side of the ~, the predictor(s) on the right
linearMod <- lm(adCost ~ transactionRevenue, data = df_ppc_budget)
# Print the summary
summary(linearMod)
# Call for predicted values of adCost based on transaction revenue
predict(linearMod, data.frame(transactionRevenue = c(20000, 15000, 10000)))
To predict the cost needed to reach specific revenue objectives, we fill the transactionRevenue variable with an optional set of numbers. Don’t forget to think about seasonal effects and the prediction period. You could also aggregate dates to weeks before calling for predicted values.
At the bottom of the console we can read our three predicted values. So if our targeted revenue is 20 000, the estimated ad cost will be 2 845, and so on.
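The weekly aggregation suggested above could be sketched like this (toy data; cut() with breaks = "week" groups dates by calendar week):

```r
set.seed(7)

# Toy daily cost/revenue data
df <- data.frame(
  date               = seq(as.Date("2017-08-01"), by = "day", length.out = 14),
  adCost             = runif(14, 100, 300),
  transactionRevenue = runif(14, 500, 2000)
)

# Collapse daily rows to one row per calendar week
df$week <- cut(df$date, breaks = "week")
weekly  <- aggregate(cbind(adCost, transactionRevenue) ~ week,
                     data = df, FUN = sum)
```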
Since our correlation is strong but not very high, we could instead hold out part of the data, predict on it, and calculate the error margin. I guess there are better ways of training on a data set, so I’ll leave it there for now.
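A minimal holdout sketch of that idea (simulated data): fit the model on 80 % of the observations and measure the prediction error on the remaining 20 %:

```r
set.seed(1)

# Toy cost/revenue data with a linear relationship plus noise
df <- data.frame(adCost = runif(200, 100, 1000))
df$transactionRevenue <- 6 * df$adCost + rnorm(200, sd = 500)

# 80/20 train/test split
idx   <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Same model shape as above: predict cost from revenue
fit  <- lm(adCost ~ transactionRevenue, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error as a simple error margin
rmse <- sqrt(mean((test$adCost - pred)^2))
```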
Fredrik Cederlöf, Digital Analyst at Insight Works