Retention: machine learning at home

Published in Machine Intelligence Report, Jul 14, 2015

My Dad always tells me that computers should help people. So Nikita, our data scientist, and I decided to utilise machine learning to help us understand why users leave our app.

I won't tell you why, sorry :) Well, maybe in a year or so. But I will show you how you can do it for your app.

There are multiple approaches and algorithms available (John Foreman explains most of them in his book "Data Smart"), but we decided to try only two of them:

Clustering: understanding the behavioral characteristics common to users who leave (e.g. the number of flights they tracked, the types of in-app purchases they made, etc.) and then devising a retention strategy for each cluster.
Predictive modeling: creating models that predict whether a user will leave the app, based on patterns found in historical data.

I will cover only the first approach in this post; the next post will describe the second.

Clustering

R is free, which makes it the most affordable way to perform cluster analysis. I won't go into the details of working with R and RStudio (here is a Coursera course and a good tutorial), but I will explain the key steps and the script we use.

Data preparation

This is actually the most important step: remember the "garbage in, garbage out" problem and be careful here. Start by brainstorming the metrics or events that characterise a user's behavior in your app. Please note that each characteristic must be expressed as a number, otherwise the cluster analysis cannot run.

Below is a sub-set of such characteristics for our app:

You will end up with an Excel file that has a column for each characteristic and a row of values for each user analysed. Save it as a CSV file so that R can read it (ours uses a semicolon as the separator).
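
To make the target layout concrete, here is a minimal sketch of such a table built directly in R. The column names and values are invented for illustration and are not our real metrics; the point is only that every characteristic ends up numeric and that the file is written with the same semicolon separator the script reads.

```r
# Hypothetical per-user table: one row per user, one column per characteristic.
# Column names and values are made up for illustration.
users <- data.frame(
  user_id          = c(101, 102, 103),
  flights_tracked  = c(12, 0, 3),
  in_app_purchases = c(1, 0, 0),
  facebook_linked  = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Cluster analysis needs numbers, so encode the yes/no flag as 1/0.
users$facebook_linked <- as.numeric(users$facebook_linked == "yes")

# Every column except the id should now be numeric.
all_numeric <- all(sapply(users[-1], is.numeric))

# Write a semicolon-separated file, matching sep = ";" when it is read back.
out_path <- tempfile(fileext = ".csv")
write.table(users, out_path, sep = ";", row.names = FALSE)
```

If `all_numeric` is `FALSE`, some column still holds text and needs to be encoded before clustering.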

Data normalisation & clustering

Next, normalise and cluster the data in RStudio. Here is the script, with comments explaining some of the commands:

#read csv
aitadata <- read.csv('App_in_the_Air_users_join_2014_for_R.csv', sep = ";")
#take 2:19 columns only, no user ids for clustering
aitadata <- subset(aitadata,select=c(2:19))
#if needed, replace character columns with numeric ones, we need taRifx library for this
library( taRifx )
aitadata <- japply( aitadata , which(sapply(aitadata, class)=="character"), as.numeric )
#replace NA with 0 values
is.na.data.frame <- function(x) do.call(cbind, lapply(x, is.na))
aitadata[is.na(aitadata)] <- 0
#check — first 5 rows (just to make sure)
head(aitadata,5)
#normalise data
aitadata <- scale(aitadata)
#check if normalisation worked
head(aitadata,5)
#find the optimal number of clusters
wss <- (nrow(aitadata)-1)*sum(apply(aitadata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(aitadata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
#perform k-means algorithm for clustering, number of clusters should be determined on previous step by observing the graph
fit <- kmeans(aitadata, 6)
#add cluster info to initial data
aggregate(aitadata,by=list(fit$cluster),FUN=mean)
aitadata <- data.frame(aitadata, fit$cluster)
#shows centers of each cluster to understand common characteristics
fit$centers
#number of users in cluster
cluster = aitadata$fit.cluster
cluster.freq = table(cluster)
cluster.freq
#append user clusters to user ids
aitadata2 <- read.csv('App_in_the_Air_users_join_2014_for_R.csv', sep = ";")
aitadata <- data.frame(aitadata2[1], aitadata)
f <- file("/Users/nkosholkin/Desktop/App_in_the_Air_users_join_2014_for_R_users_clusters.csv")
write.csv(aitadata, file = f)
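
If you want to try the normalisation and "elbow" steps before touching real data, the sketch below runs them on synthetic data. Everything in it is an assumption made for illustration: the three artificial groups, the column names, the seed, and the use of `nstart = 10` and `tot.withinss`, which are not in our script. It is a toy version of the idea, not a replacement for the script above.

```r
set.seed(42)  # k-means starts from random centers; a fixed seed makes runs repeatable

# Synthetic stand-in for the user table: 3 well-separated groups, 2 characteristics.
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("flights_tracked", "countries_visited")

# scale() subtracts each column's mean and divides by its standard deviation.
toy_scaled <- scale(toy)
col_means <- colMeans(toy_scaled)      # should be (numerically) zero
col_sds   <- apply(toy_scaled, 2, sd)  # should be one

# Total within-cluster sum of squares for k = 1..8.
# nstart = 10 runs k-means from 10 random starts and keeps the best result.
max_k <- 8
wss <- sapply(1:max_k, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# Share of the k = 1 sum of squares that remains at each k.
# The "elbow" is where this stops dropping sharply.
print(round(wss / wss[1], 3))
```

Plotting `wss` against `1:max_k` with `type = "b"`, as in the script, gives the same picture graphically. On this toy data the curve should flatten after three clusters, because that is how the data was generated.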

Understanding clusters

Once you have the cluster centers, you can export them to an Excel file, apply conditional formatting and see what is common to each cluster:

For example, you can see that the 3rd cluster clearly stands out from the others in the number of future flights, the airlines its users fly with, and the countries and airports they have been to. We had about 10 other characteristics that gave a clear picture of the common behavior of users in this cluster.
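
One thing to keep in mind when reading the centers: because the data was normalised, they are in standard-deviation units, not in flights or countries. The sketch below shows one way to translate them back to the original units, using the mean and standard deviation that `scale()` stores as attributes. The data and column names here are hypothetical.

```r
set.seed(1)

# Hypothetical raw data in original units (names and values are made up).
raw <- data.frame(
  future_flights = c(0, 1, 0, 9, 8, 10),
  countries      = c(1, 2, 1, 14, 12, 15)
)

scaled <- scale(raw)
fit <- kmeans(scaled, centers = 2, nstart = 10)

# scale() keeps the column means and standard deviations as attributes.
mu    <- attr(scaled, "scaled:center")
sigma <- attr(scaled, "scaled:scale")

# Undo the scaling column by column: original = z * sd + mean.
centers_original <- sweep(sweep(fit$centers, 2, sigma, "*"), 2, mu, "+")

# Cross-check: average the raw columns within each cluster.
centers_check <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)

print(centers_original)
```

Both routes should give the same numbers; the second one (averaging the raw data per cluster) is often easier to explain to the rest of the team.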

Designing retention strategy

You end up designing a retention strategy for each cluster, e.g.:

Motivating users to "avoid" characteristics common to churned users. For example, if a poor profile is an important characteristic, you give users some "perks" for connecting Facebook or Twitter.
Retargeting users: since you know the user ids (and probably the emails) of the users in each cluster, you can reach them on social networks or in other apps and motivate them to use the app again.
Notifying users by push or email: you can send out push notifications and/or personalised emails.
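
For retargeting or notifications you will typically need one list of user ids per cluster. A minimal sketch of that step, assuming a table with hypothetical `user_id` and `cluster` columns like the one the script writes out:

```r
# Hypothetical output of the clustering step: one row per user with a cluster label.
clustered <- data.frame(
  user_id = c(101, 102, 103, 104, 105),
  cluster = c(1, 2, 1, 3, 2)
)

# One vector of user ids per cluster, named by cluster number.
ids_by_cluster <- split(clustered$user_id, clustered$cluster)

# Write one file per cluster (into a temp directory here) for whatever
# campaign tool you use.
out_dir <- tempdir()
for (k in names(ids_by_cluster)) {
  path <- file.path(out_dir, paste0("cluster_", k, "_users.csv"))
  write.csv(data.frame(user_id = ids_by_cluster[[k]]), path, row.names = FALSE)
}
```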

Re-designing product

Most probably (this was our case) the analysis will give you lots of ideas for re-designing product onboarding or specific user flows to improve retention. This is the most important outcome you will get! Make sure you discuss the cluster centers with your team and brainstorm possible product changes.

NB

Once you have the cluster centers and a cluster assigned to each user, PLEASE PLEASE PLEASE spend some time checking the outcomes for common sense and checking the source data for possible mistakes. Really, we changed the source data 5–6 times before all the mistakes in data preparation and gathering were fixed.
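
A few cheap checks catch many data preparation mistakes before you even look at the clusters. The sketch below is only a starting point, run on a made-up table with two deliberate problems; the column names are hypothetical and the checks are not an exhaustive list.

```r
# Hypothetical prepared table with two deliberate problems:
# a missing value and a column that is the same for every user.
prepared <- data.frame(
  flights_tracked = c(12, 0, NA, 3),
  purchases       = c(1, 0, 0, 2),
  always_zero     = c(0, 0, 0, 0)
)

# How many values are missing in each column?
na_counts <- colSums(is.na(prepared))

# Constant columns carry no information, and scale() turns them into NaN
# because it divides by a standard deviation of zero.
col_sd <- sapply(prepared, sd, na.rm = TRUE)
constant_cols <- names(col_sd)[col_sd == 0]
nan_after_scale <- all(is.nan(scale(prepared)[, "always_zero"]))

# Eyeball minimum, maximum and quartiles for values that cannot be right.
print(summary(prepared))
print(na_counts)
print(constant_cols)
```

Columns flagged in `constant_cols` are best dropped before normalisation, and unexpected counts in `na_counts` usually point to a join or export problem in the source data.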

Good luck! And do read John Foreman’s book & this research paper :)