What does ancient Roman military has in common with data analysis?

Data analysis is fun. But what’s more, it actually matters!

I am happily a part of Miracas.com, an online fast fashion brand. For me, sales data straight out of the database dump is almost like golden sand to play. Although some of the terms used by the gurus of the industry can be pretty intimidating. For example… Cohorts analysis. When I searched the term cohorts in the dictionary, it took me to the history of roman military, only to find out that cohorts basically means groups.

Essentially the idea behind cohorts analysis is to divide users into groups, or cohorts, and study the behaviour of these groups for better insight. For example, all users acquired in a given month is the cohort we can look at in order to understand how they stick around over subsequent months.

We must ask the right question before asking for the right answer. Or else we might end up with 42, which may as well be the correct answer but we won’t understand how. Let me start with asking the following questions in order to understand the aim of my analysis.

What user behavior do I want to study?

I want to understand how my users stick around with me over time. So I would like to calculate the number of orders made by existing users over time. My sample data spans a year, so it makes sense for me to define a month as the unit of time for the analysis.

How do we want to divide users into cohorts?

Basically what I am asking is, how to break users into meaningful groups. For example Age group, demographics or value of the first order can be used to define the cohorts. In the context of the previous question I asked, let me define the cohorts as all the unique users who ordered for the first time in a given month.

In a nutshell, what I would like to understand is, given the sales data of my company, how the users I acquired in a given month stick around for subsequent months?

The following plot is one of the ways to come up with an answer

Sample data used here is purely for illustration purpose. It does not reflect the typical industry indicators and must not be quoted elsewhere.

The chart shows that on an average, about 7% of the user’s order again in the second month. The number increases to about 8% and 9% in the third and fourth month respectively. The orange envelope shows the maximum and the minimum retention for different cohorts used in the analysis.

So now that the question is established, let’s start playing with sand.

R is an incredibly powerful tool for data crunching. I came across R during my days of semiconductor modeling. I grew incredibly fond of it pretty soon. The ease it provides to someone who loves to play with data can be pretty addictive. For processing the data I will be using basic R in this article and will create some aesthetically pleasing charts at the end using ggplot2.

I used the following R packages for the analysis.

  • plyr : Tools for Splitting, Applying and Combining Data
  • reshape2 : As the name suggests, a package for reshaping data
  • ggplot2 : One of the most awesome things about R

In addition to these libraries, I also used RColorBrewer, colorRamps and gridExtra packages in order to improve the aesthetics of the plots.

I generated some sample data for the purpose of this article.

Here is what it looks like.

Here is how I read the sample data.

dr <- read.csv('sample_date.csv', stringsAsFactors = F)
dr$date_add <- as.Date(dr$date_add)
dr$date_add_month <- as.Date(cut(dr$date_add, breaks = "month"))
dr <- subset(dr, (date_add_month > as.Date("2015-05-01") & (date_add_month < as.Date("2016-06-01") )))
number_of_months <-length(unique(as.character(dr$date_add_month)))

The first line of code just reads the csv file into a data frame. In second line I am converting the data values into dates format. The next line breaks the dates into months so that all the orders in a given month can be bucketed together. The 4th line of code is just a subset command

Tip: always use R date format for dates data. This lets R to be intelligent about it and treat them appropriately when plotting and other calculations.

I also extracted the number_of_months value here. I will need it later when I split the data into cohorts.

This, as one can guess, is the most important step. We will divide the users into groups based on the month they ordered for the first time. Let me reiterate my aim for this analysis.

The idea is the understand how many users acquired in a given month, stick around and make more orders in subsequent months.

Here is my approach to achieve this.

  1. Make a list of unique users in a given month ( ulist )
  2. Subset the data with users only in this list into a new data frame ( dr1 ). Remove the data from original data frame ( dr ).
  3. Count the number of orders made by these users in each month using dr1. I summarized the results into another data frame called dr1s
  4. Do this for each month starting from the first month. With every iteration, I am filtering out the users acquired in that one month m.
  5. Combine the result into a single data frame called Cohorts.

Here is the code that does all this. Although it might look scary at first glance, it’s actually pretty simple to understand (with a little bit of R knowledge)

Cohorts <- data.frame() ## empty cohorts data frame.
for (m in unique(as.character(dr$date_add_month))){

u <- subset(dr, dr$date_add_month == m) ## -- data for month m

ulist <- unique(u$user_id) ## -- orders by unique users in month m
dr1 <- subset(dr, dr$user_id %in% ulist) ## -- only month m users
dr <- subset(dr, !(dr$user_id %in% ulist)) ## -- remove from dr

## -- Number of orders by these users for every month
dr1s <- ddply(dr1, .(date_add_month), summarize, total = length(user_id))

## -- Combine the calculations into the Cohorts data frame.
colnames(dr1s) <- c("Month", m)
a <- c(m, dr1s[,m])
a <- data.frame(t(a))
Cohorts <- rbind.fill(Cohorts, a)
## Some more processing with the final data frame
col_names = paste("Month", array(1:number_of_months), sep = " ")
colnames(Cohorts) <- c("Month", col_names)
Cohorts["Month"] <- as.Date(Cohorts$Month)
Cohorts["max"] <- as.numeric(as.character(Cohorts[,2]))

I have used ddply here to sum the number of orders monthwise. The function comes from the plyr library and it is one of the most handy tools one can use to subset the rows of the dataframe based on multiple conditions, run calculations for each subsets, and combine the results into another data frame. I highly recommend spending some time playing with ddply to any R enthusiast.

Here is how our fruit of labor, the data frame Cohorts looks like.

Awesome isn’t it!

The toughest part is done. At this point I already have the numbers I set out to calculate. Now what remains is to plot this data into an aesthetically pleasing chart. But before that let me talk a little about reshaping the data.

Swords of course… but data too! 
Reshaping data is very important when doing any kind of analysis. R has a powerful reshape library which helps one melt and cast data as needed. Here is a nice tutorial.

With this great knowledge at hand, let’s reshape the cohorts data frame into something more plottable. Basically we need the data into a format that can be easily consumed by ggplot2, have I already mentioned that this is one of the most awesome thing about R. believe me… it is !

df_plot <- melt(Cohorts, id.vars = c('Month', 'max'), value.name = 'Orders', variable.name = 'Month_after')
df_plot <- na.omit(df_plot)
df_plot["Value"] <- as.numeric(df_plot$Orders)
df_plot["Pc_value"] <- ceiling(df_plot$Value*100/df_plot$max)
df_plot["Percentage"] <- paste(ceiling(df_plot$Value*100/df_plot$max), "%", sep="")
df_plot["Percentage_2"] <- ifelse(df_plot$Percentage == "100%", df_plot$Value , df_plot$Percentage)

Our new data frame called df_plot looks like this.

Note that the column Percentage_2, is added just for plotting convenience. The values in Percentage_2 indicated the total number of orders in Month 1 and repeat order percentage for subsequent months.

Now come my favorite part, adding some colors. I have used RColorBrewer and colorRamps libraries to add to the charm… however the default ggplot options are more than enough for our purpose.

Here goes the code

## Cohorts tile chart ##
hm.palette <- colorRampPalette((brewer.pal(9, 'YlOrRd')), space='Lab', bias=10)
p <- ggplot()
p <- p + geom_tile(data = na.omit(df_plot), aes(x = Month, y = Month_after, fill=Value), width=31, height=1)
p <- p + geom_text(data = na.omit(df_plot), aes(x = Month, y = Month_after, label = Percentage_2), color = "white")#, fontface = "bold")
p <- p + scale_fill_gradientn(colours = hm.palette(100)) + theme_bw() + coord_flip()
p <- p + theme(panel.background = element_blank(), axis.line.x = element_blank(), panel.border = element_blank())
p <- p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position="none") + ylab("")
p <- p + xlab("Orders by new users") 

And Voylla!! Here is what we get

At this point I would like to mention that the data I used here is generated purely for purpose of this article and is not the actual sales data. The numbers calculated here must not be quoted elsewhere.

With some more reshaping, I can represent this data with another interesting chart.

Average customer retention rate

I will leave it upto you to write the code for this chart.

These charts are very important to any business. Typically the Customer acquisition cost (CAC) is a substantial fraction of average order value for any e-commerce company. It is important therefore, to measure how users sticks around and generate more orders.

The first chart shows the number of orders made by unique users acquired in a give month (Month 1), and what is the percentage of these users who order again in subsequent months. For example in July 2015, total 479 unique users ordered for the first time, out of these, about 11% ordered again in Month 2 (i. e. August 2015). The number dropped to about 4% in Month 11 (June 2016).

Customer retention rate can be very different for different businesses. for me It indicated how satisfied my customers are. By itself it might not make a complete sense. However when looked at along with other indicators like average order value, Customer acquisition cost, Customer lifetime value, number of customer complaints etc, it can be very helpful in steering the business in the right direction.

Some of these parameters can be calculated using the same sample data I used here. Though I will save that analysis for another day.

The importance of analysing user behavior can’t be overstated. Cohorts analysis is a nice way to do this. Just like Roman Military, Data becomes more manageable when subsetted into meaningful groups. It is important though to ask the right questions before trying to find an answer. It is equally important to learn the right tool to play with the data. With R and its vast repertoire one can hardly go wrong learning it, at least to a basic level.

If I can be of any help regarding this article or in any other way, please feel free to reach out. Thanks for reading, hope you enjoyed.

Originally published at medium.com on July 14, 2017.

About the author: Author is the CEO of Miracas.com, an online fast fashion brand. An alumni of IIT Bombay, he is a technology enthusiast with more than 10 years of experience in the fields of electronics, data science and web development and has authored multiple patents in these fields. 
He has a keen interest in application of Arts in Technology to invent more humane systems. In his free time he likes to read, write and make long road trips. Please contact him at rahul@miracas.com