Visitor Acquisition, Retention, and Attrition, Using BigQuery and a Simple Data Model

Written by Hazem Mahsou and originally posted on e-nor.com/blog

It’s a well-known fact in business that retaining your existing clients is easier and less expensive than acquiring new ones. Nevertheless, marketers put a lot of focus on acquiring new clients. In the Google Analytics web interface, there are many reports to analyze acquisition of new site users and how they reached the site. After all, one of GA’s earliest purposes was measuring the performance of online ads.

But what about retention and attrition? Frankly, measuring retention and attrition is hard compared to measuring acquisition, but we’ll take on the challenge below.

Attrition or Churn?
Some prefer to use the term “churn rate” while others prefer to use “attrition rate”. Churn may imply replacing leaving clients by newly acquired ones. In this article, we will use the term “attrition rate”.

In a previous post, we relied on BigQuery’s capabilities to join data from different tables in ways that are not possible in the Google Analytics interface. The process that we follow below goes into deeper analysis.

Exploratory Data Analysis

In this post, we will discuss using BigQuery’s capabilities to measure acquisition, retention, and attrition, in ways completely different from the Google Analytics web interface.

Broadly speaking, we’ll use exploratory data analysis: a series of queries that will progressively allow us to refine and deepen an initial observation and reach an actionable conclusion. More specifically, we’ll start with a data point in one query — Twitter is generating the highest retention rate, as we define it — and proceed with additional queries until we determine that the social media team in our example should maintain the same level of tweeting through the year.

Attrition/Retention Rates by Traffic Source

We’ll start with a simple data model. We will then write a series of queries that reveal different aspects of retention and attrition, such as the table below, until we achieve an actionable insight for our marketing team (i.e., they need to maintain a consistent level of tweeting through the year).

This BigQuery table indicates attrition (i.e. Percentage Lost) and retention rates by traffic source. In this example, Twitter show a high session count as well as high retention.

In this way, we’ll go beyond the single-session metrics to display an important and actionable multi-session performance indicator.

Data for Demo Only
Some of the data shown in this exercise was altered for educational purposes. Do not assume that any of the metrics in this blog post reflect the performance of your own traffic sources. Make sure to run the queries on your own Google Analytics data in BigQuery so you can analyze and take data-driven action accordingly.

Different Definitions of Engagement, Retention and Attrition
 First, the definition of engagement, retention and attrition vary greatly from one business model to another. For some business models (e.g. B2B), it’s quite satisfactory if the users make a weekly visit to the website, while with other businesses, you’d want users to visit your site or app daily. Thus, one factor that you must decide is the period of time after which a visitor is determined to be “lost”: two weeks, one month, two months? Your business model can help you decide when to consider a user to have become inactive long-term.

Dependency on Authentication or Cookies
 Also, unless your site requires the users to sign in every time they use it, multiple-device users pose a great challenge to measuring retention and attrition. Without authentication, we must accept the fact that at least some percentage of our users will appear as lost when, in reality, they just migrated to another device or browser, cleared their cookies (including the _ga cookie that identifies returning visitors in Google Analytics), or had just used private browsing in the previous session.

On the other hand, unsubscribing from the site is the only official way to leave the site. With the vast majority of sites, users leave the site silently which makes it difficult to measure attrition.

Model and Assumptions: Attrition Means Two Months of Inactivity

If we cannot get the actual attrition rates, let’s try to create a model in order to generate useful trends. We’re approaching our task with a few assumptions:

  • The Google Analytics property is already linked to BigQuery and the data is already flowing from Analytics to BigQuery. (For more details on GA-to-BQ export setup, see Getting Your Feet Wet in the Data Lake).
  • We have a website that requires no registration and no subscriptions.
  • In our retention/attrition analysis, we’re also using the acquisition data (i.e., source/medium) exported to BigQuery from Google Analytics.
  • You’re using Google Analytics campaign parameters whenever applicable to capture correct Medium and Source values for clickthroughs (so, for instance, the social links that you post always generate social as the medium instead of direct in the case of an app clickthrough, and your emails always generate email as the medium instead of direct or referral in the case of a clickthrough from a standalone email client such as Outlook or an online email service such as gmail).
  • As discussed below, we’ll consider a user to have abandoned the site if they didn’t make a visit in two months.

In this way, we’ll go beyond the single-session metrics to display an important and actionable multi-session performance indicator.

Our data model will focus on visitors acquired every month and count who was retained and who was lost over the following two months.

For our retention/attrition analysis, it would be meaningful to restrict counting visitors — whether retained or lost — with respect to acquired visitors over a certain period of time.

As the diagram shows, we will focus on visitors who were acquired in a certain month, e.g. January. Only this segment of visitors -acquired in January- will be investigated if they were retained or lost during the following 2 months, i.e. February and March. That is, green is marking the time window of acquisition and yellow is marking the time window for counting retention or attrition.

Consequently, this leaves some visitors out of our data model. For example:

  • Visitors who were acquired before January won’t be included.
  • Visitors who were acquired in January and came for the second time after March will be considered lost still.
  • Visitors who were acquired in January and came for the second time in Feb/Mar but never came back later will be considered retained.

This should be totally acceptable, since we are only trending and not measuring actual long-term retention and attrition rates. Also, that’s why defining the length of both time windows — acquisition and retention — can vary greatly from one site to another and can change the results significantly. You can decide the retention/attrition window that make the most sense for your business and adjust the queries accordingly.

That is, remember that you’re not bound by these numbers. As you run the queries on your own data, you can certainly decrease or increase to align with your business model and typical/desired user behavior.

Let’s now begin building our queries.

Step 1: Extracting Relevant Fields To An Intermediate Table

The following query will generate an intermediate table, extracting only a few relevant fields and simplifying following queries. Specifically, the query will extract the visitor id, visit id, visit number, and visit time in different formats. The calcMonth is the number of months starting with January 2015, the start of our report.

Before you run this query, replace XXX_ViewID_XXX with the view id of your Google Analytics view that is exported to BigQuery.

Listing 1: Intermediate table to extract visitor id, visit id, visit number, and visit time and calculates calcMonth, all to be used in queries below.

SELECT
fullVisitorid,
sourceMedium,
visitId,
visitNumber,
visitStartTime,
visitYear,
visitMonth,
(visitYear-2015)*12+visitMonth AS calcMonth,
FROM (
SELECT
fullVisitorId,
trafficSource.source + '/' + trafficSource.medium AS sourceMedium,
visitid,
visitNumber,
visitStartTime,
YEAR(MSEC_TO_TIMESTAMP(visitStartTime*1000)) AS visitYear,
MONTH(MSEC_TO_TIMESTAMP(visitStartTime*1000)) AS visitMonth,
FROM
TABLE_DATE_RANGE([XXX_ViewID_XXX.ga_sessions_],
TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31')))

To save the output as an intermediate table, named All_Visitors, set the options as follows:

Saving the output of the query to the All_Visitors intermediate table.
Intermediate table containing visitor id, traffic source, visit id, visit number, visit time, and calcMonth

Step 2: Aggregations and Calculations

Now that we have the raw data of all visits, let’s run the following query against the intermediate table to calculate the retained and lost visitors:

Listing 2: Calculating new, retained and lost visitors per month.

SELECT
firstMonth,
COUNT(fullVisitorId) AS newVisitors,
SUM(lostVisitor) AS lostVisitors,
COUNT(fullVisitorId)-SUM(lostVisitor) as retainedVisitors
FROM (
SELECT
fullVisitorId,
firstMonth,
if ((secondMonth IS NULL
AND thirdMonth IS NULL), 1, 0) AS lostVisitor,
FROM (
SELECT
fullVisitorId,
firstMonth,
MAX(secondMonth) AS secondMonth,
MAX(thirdMonth) AS thirdMonth
FROM (
SELECT
f.fullVisitorId AS fullVisitorId,
f.firstMonth AS firstMonth,
s.secondMonth AS secondMonth,
t.thirdMonth AS thirdMonth
FROM (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber = 1) AS f
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-1 AS secondMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS s
ON
f.fullVisitorId = s.fullVisitorId
AND f.firstMonth = s.secondMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-2 AS thirdMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS t
ON
f.fullVisitorId = t.fullVisitorId
AND f.firstMonth = t.thirdMonth)
GROUP BY
fullVisitorId,
firstMonth))
GROUP BY
firstmonth
ORDER BY
firstMonth

To dissect this query, we have to start by the innermost select statement and move outwards.

  1. The innermost query lists all visitors and the month when they made their first visit. Then, we join this list of visitors with any visit they made on second month or third month. If there isn’t any visit, we will get a NULL.
  2. Then, we remove the redundancy of multiple visits, using the MAX functions, to ensure we are counting every visitor only once.
  3. Finally, we count all visitors, the number of visitors who did come again and those who did not.

Google has recently released a very helpful feature, which serves to integrate BigQuery with Google Sheets. Once we save the query output to a Google Sheet, we can make use of the the rich features of Google sheet, which will allow us, for example, to easily apply formulas to the BigQuery data and calculate attrition and retention percentages.

The query yielded attrition (i.e., Percentage Lost) rates that did not vary considerably through the year and were therefore not quite actionable.

A few points worth mentioning here:

  • Since the number of acquired visitors varies from month to month, calculating percentages — of retained/lost visitors versus acquired visitors — will be more meaningful (last 2 columns).
  • Remember that the lost and retained visitors of January, for example, were not lost in January but in the following 2 months.
  • Because of the way we defined the acquisition and attrition time windows, the last two months will always have a corrupt result. Numbers for the last 2 months should be ignored. That is, using this formula, we can only create the trend up to 2 months ago.
Google Analytics Cohort Analysis Report
The data in the figure above bears some resemblance to the Cohort Analysis report that is available directly within Google Analytics. The current (beta) version of the report is somewhat limited by the maximum three-month lookback and the inability to add secondary dimensions such as Source/Medium. (You could, of course, create and apply segments for each of your top Source/Medium values, but this would be a bit ungainly from a reporting standpoint, and we could not simultaneously apply more than four segments.)
The Google Analytics Cohort Analysis report goes back three months maximum.
Despite its limitations relative to the flexibility of BigQuery, the Cohort Analysis report’s ability to define cohorts by day, week, or month, as well as the additional metrics (such as goal conversion rate and E-commerce revenue) available for the cohorts, can provide some useful perspectives on the behavior of your users over a period up to three months.

Step 3: Improvements to the User Segment

As noted, the attrition rates in the query above don’t vary that much from one month to another. One also notices that the vast majority of the acquired visitors are considered lost by this formula.

This is another area where the formula varies from one site to another and the question we should ask ourselves becomes:

Whom should we consider an acquired user and not just passing by?

The previous query covered all visitors, even bouncing visitors. We can add more conditions to make sure we’re focusing on the right segment of users. For your site, you may rewrite the query to only include users who have achieved a goal, such as signing up for newsletters, made a purchase, or, simply, made more than one visit to the site.

The following query is rewritten to only include visitors who made more than one visit to the site during the first month.

Listing 3: modified query by focusing on users with multiple visits in the first month.

SELECT
firstMonth,
COUNT(fullVisitorId) AS newVisitors,
SUM(lostVisitor) AS lostVisitors,
COUNT(fullVisitorId)-SUM(lostVisitor) as retainedVisitors
FROM (
SELECT
fullVisitorId,
firstMonth,
if ((secondMonth IS NULL
AND thirdMonth IS NULL), 1, 0) AS lostVisitor,
FROM (
SELECT
fullVisitorId,
firstMonth,
MAX(secondMonth) AS secondMonth,
MAX(thirdMonth) AS thirdMonth
FROM (
SELECT
f.fullVisitorId AS fullVisitorId,
f.firstMonth AS firstMonth,
sv.firstMonth AS secondVisitMonth,
s.secondMonth AS secondMonth,
t.thirdMonth AS thirdMonth
FROM (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber = 1) AS f
JOIN (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber = 2) AS sv
ON
f.fullVisitorId = sv.fullVisitorId
AND f.firstMonth = sv.firstMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-1 AS secondMonth
FROM
[XXX_VIEWID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS s
ON
f.fullVisitorId = s.fullVisitorId
AND f.firstMonth = s.secondMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-2 AS thirdMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS t
ON
f.fullVisitorId = t.fullVisitorId
AND f.firstMonth = t.thirdMonth)
GROUP BY
fullVisitorId,
firstMonth))
GROUP BY
firstmonth
ORDER BY
firstMonth

Again, saving the output to Google Sheets comes in handy; add two columns to calculate the percentage and ignore last two months.

With bounced visitors removed, we can see that new visitors in January (rows 1 and 13) show the highest retention rates based on the following two months of activity.

We notice there’s a general decline in retention over the two years. We also notice that the retention is higher at the beginning of the year, both years. Is it the result of better marketing efforts at the beginning of each year?. Let’s look deeper and from a different perspective.

Higher Retention for January Visitors

With bounced sessions now removed from our data, the remaining non-bounce user segment is showing markedly higher retention rates for user who first visit in January, which could have arisen from two basic factors (assuming that user experience remained the same through the year):

  • Retention rate for all traffic sources fell after the beginning of the year.
  • We were driving a greater percentage of good (e.g., high-retention) traffic earlier in the year.

Let’s continue our exploratory data analysis and run a few more queries to shed some additional light.

Step 4: Another Perspective: Source/Medium

Since we are calculating retention and attrition based on the acquired visitors, it makes sense to look at it with respect to another dimension, say, the traffic source.This can tell us which channels bring visitors that we are more likely to stay. Conversely, it may point us to the channels that cause the biggest attrition.

The following query will count the acquired, retained and lost visitors by source and medium, excluding traffic sources that generated fewer than 100 visitors acquisitions during the time period selected. Note that 100 visitors is another number that varies greatly from one site to another.

Listing 4: Query to calculate acquired, retained and lost visitors by source and medium.

SELECT
sourceMedium,
COUNT(fullVisitorId) AS newVisitors,
SUM(lostVisitor) AS lostVisitors,
COUNT(fullVisitorId)-SUM(lostVisitor) AS retainedVisitors
FROM (
SELECT
fullVisitorId,
firstMonth,
sourceMedium,
if ((secondMonth IS NULL AND thirdMonth IS NULL),1, 0) AS lostVisitor,
FROM (
SELECT
fullVisitorId,
firstMonth,
sourceMedium,
MAX(secondMonth) AS secondMonth,
MAX(thirdMonth) AS thirdMonth
FROM (
SELECT
f.fullVisitorId AS fullVisitorId,
f.firstMonth AS firstMonth,
f.sourceMedium AS sourceMedium,
sv.firstMonth AS secondVisitMonth,
s.secondMonth AS secondMonth,
t.thirdMonth AS thirdMonth
FROM (
SELECT
fullVisitorId,
sourceMedium,
calcMonth AS firstMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber = 1) AS f
JOIN (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber = 2) AS sv
ON
f.fullVisitorId = sv.fullVisitorId
AND f.firstMonth = sv.firstMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-1 AS secondMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS s
ON
f.fullVisitorId = s.fullVisitorId
AND f.firstMonth = s.secondMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-2 AS thirdMonth
FROM
[XXX_VIEW_ID_YYY.All_Visitors]
WHERE
visitNumber <> 1) AS t
ON
f.fullVisitorId = t.fullVisitorId
AND f.firstMonth = t.thirdMonth)
GROUP BY
fullVisitorId,
firstMonth,
sourceMedium))
GROUP BY
sourceMedium
HAVING
newVisitors > 100

As done previously, let’s export the query result to a Google sheet, add two columns to count the percentage of lost and retained visitors, and sort by percentage retained. The result tells us which traffic sources are driving the greatest retention percentages.

The resulting table appears at the beginning of this post.

Now that we can see attrition/retention performance by traffic source/medium, let’s isolate the performance of that source/medium over time.

Step 5: Looking Deeper at One Source/Medium

We can look deeper by filtering by one traffic source/medium, say twitter/social, and then evaluate its month to month retention trend.

Listing 5: Querying retention month by month only for visitors acquired by our twitter/social campaigns

SELECT
firstMonth,
COUNT(fullVisitorId) AS newVisitors,
SUM(lostVisitor) AS lostVisitors,
COUNT(fullVisitorId)-SUM(lostVisitor) as retainedVisitors
FROM (
SELECT
fullVisitorId,
firstMonth,
if ((secondMonth IS NULL
AND thirdMonth IS NULL), 1, 0) AS lostVisitor,
FROM (
SELECT
fullVisitorId,
firstMonth,
MAX(secondMonth) AS secondMonth,
MAX(thirdMonth) AS thirdMonth
FROM (
SELECT
f.fullVisitorId AS fullVisitorId,
f.firstMonth AS firstMonth,
sv.firstMonth AS secondVisitMonth,
s.secondMonth AS secondMonth,
t.thirdMonth AS thirdMonth
FROM (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[8839142.All_Visitors]
WHERE
sourceMedium = 'twitter/social' and
visitNumber = 1) AS f
JOIN (
SELECT
fullVisitorId,
calcMonth AS firstMonth
FROM
[8839142.All_Visitors]
WHERE
visitNumber = 2) AS sv
ON
f.fullVisitorId = sv.fullVisitorId
AND f.firstMonth = sv.firstMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-1 AS secondMonth
FROM
[8839142.All_Visitors]
WHERE
visitNumber <> 1) AS s
ON
f.fullVisitorId = s.fullVisitorId
AND f.firstMonth = s.secondMonth
LEFT JOIN (
SELECT
fullVisitorId,
calcMonth-2 AS thirdMonth
FROM
[8839142.All_Visitors]
WHERE
visitNumber <> 1) AS t
ON
f.fullVisitorId = t.fullVisitorId
AND f.firstMonth = t.thirdMonth)
GROUP BY
fullVisitorId,
firstMonth))
GROUP BY
firstmonth
ORDER BY
firstMonth

By limiting our month-over-month query to twitter/social, we can see more traffic early in the year and very consistent attrition/retention rates for this source/medium.

The query results indicate consistent performance over time for twitter/social traffic. The stronger overall retention performance (for all traffic) in January was likely due in part to a greater percentage of Twitter traffic in that month (rows 1 and 13).

Note:
The data presented in this discussion was compiled for demonstration only and is not in any way intended to indicate relative retention performance of any time period or traffic source. By exporting your own GA data to BigQuery and running the queries included in this post, you can determine your actual retention rates by month and by Source/Medium.

Insight and Action: Maintain Traffic From Your Highest-Retention Traffic Sources Through The Year

Observations

Let’s review the trends from the from the previous table for two consecutive years.

  • Retention rate is high for Twitter traffic.
  • Average retention rates for all traffic are significantly higher early in the year.
  • Twitter generates a consistently high retention rate throughout the year.
  • More visits are received from Twitter earlier in the year.

Insight

The higher overall retention rate earlier in the year is due at least in part to the higher proportion of Twitter clickthroughs.

Hypothesis and Action

We speak with our social media coordinator and determine that she tends to tweet less as the year progresses, particularly during the summer months. You and she decide to maintain a consistent pace of tweeting throughout the year, or even to try increasing tweeting somewhat beyond early-year levels in an effort to attract even more high-retention visitors.

After changes in the Tweet occurrence, we can run the same queries to validate our hypothesis. We can run also similar queries for all of our top source/mediums to further determine the factors behind our strong retention performance in January and then take steps throughout the year to to gain more high-performing traffic.

Conclusion: BigQuery’s Capabilities Generates Additional Insights

Google Analytics has done a great job of optimizing analytics reports for the Web. It also has provided great support for customization. However, exporting Analytics 360 data to BigQuery has taken analysis customization to a whole new level.

Trending acquisition, retention and attrition was just one example. With access to the raw analytics data, you enjoy virtually limitless possibilities for the types of exploratory data analysis that you can do with different data models and for the reports — and insights — that you can generate from a single analytics dataset.


Originally published at www.e-nor.com on February 28, 2017.