How to Make a Cohort Analysis in Excel

Breno Teixeira
6 min readApr 7, 2024

What is Cohort Analysis?

Cohort analysis involves studying groups of individuals who share a common characteristic or experience within a particular time frame. Initially rooted in medicine, cohort analysis has found extensive application in online business and analytics.

Types of Cohorts

Time-based Cohort

Among the various types of cohorts, the time-based cohort is one of the most common in business analytics. This method involves grouping customers based on the timing of a specific event, such as their initial sign-up or first purchase.

For instance, in an e-commerce setting, we can segment customers by the month of their first purchase. Subsequently, their retention rates over subsequent months can be analyzed. Here’s an example:

Each row in the cohort analysis chart represents a distinct group of customers, while the columns display the percentage of customers from each group who returned in the months following their initial purchase.

Segment-Based Cohort

In a segment-based cohort, groups can be formed based on various characteristics other than time. We can segment customers by the marketing channel through which they were acquired, their geographic location, the level of service they’ve chosen, among others.

Combining time-based and segment-based cohorts can yield significant insights. For instance, consider a scenario where customers are segmented based on marketing channels A, B, and C. Upon analysis, it’s discovered that customers from channel B are more likely to stay than those from channels A and C. In response, the business may decide to allocate more resources and focus on channel B.

Furthermore, integrating time-based cohorts into the analysis proves invaluable. By tracking the effects of marketing spending decisions over time, businesses can assess the long-term impact and refine their strategies accordingly.

Cohort Analysis in Excel

Data Overview

For this analysis, we will use a dataset encompassing e-commerce orders spanning from January 2021 to December 2022, comprising 18,810 orders from 1169 unique customers. Below is a snapshot of the initial lines of the dataset.

Data Preprocessing

In this data, a client can have multiple orders in a month. So, before we conduct our monthly cohort analysis, we’ll need to augment the dataset with additional columns:

  • year/month: representing the year and month in the format, e.g., Jan-2021.
  • customer_acquisition: the month-year of a customer’s first purchase.
  • is_birthday: binary column indicating whether or not it’s the month of a customer’s first purchase (yes or no).
  • came_back (1 or 0): this column will be zero if the current month coincides with the birthday month of a customer and 1 otherwise.

These transformations will be clearer once we implement them in Excel.

Preparing the data

With the necessary columns defined, let’s proceed with the step-by-step process of creating them.

year/month

This column is straightforward to generate. We simply extract the year from the date and concatenate it with the month_name. The formula for that column is:

=month_name&"-"&YEAR(date)

customer_acquisition

To create this column we need to order our table by customer ID and date in ascending order. It also required that we use a normal range to implement the following formula:

=IF(current_user_id<>previous_user_id, year-month, customer_acquisition_month).

In each cell, we evaluate whether the current user ID differs from the previous one. If it is, we assign the value of the year/month column; otherwise, we retain the value from the cell above. Consequently, for each user ID, this column will be populated with the month of their first purchase. It is crucial that the data is ordered by user ID and year/month for this formula to work correctly. The illustration below demonstrates its functionality.

is_birthday

To populate the is_birthday column, we’ll use the following formula:

= IF(AND(current_user_id<>previous_user_id, month=cust_acquisition),"yes","no")

This formula evaluates whether the current user ID differs from the previous one AND if the current month matches the customer’s acquisition month. If both conditions are met, it returns “yes”; otherwise, it returns “no”.

came_back

The formula we will use for this one is the following:

=IF(AND(previous_user_id = current_user_id, is_birthday <> “yes”, 1, 0).

This formula returns one if the previous user ID is equal to the current user ID AND if it’s not the month of the user’s first purchase (as indicated by the “is_birthday” column); otherwise, it returns 0.

Cohort Analysis — Retention

Now we are all set to conduct our analysis. We’ll analyze customer retention for 2021. The final result will look like this:

Each row represents a cohort, starting with the initial month. Subsequent columns (numbered from 1 to 12) denote retention rates for each subsequent month. Notably the first column (representing month 0) typically exhibits 100% retention and can be omitted from the visualization.

Let’s clarify the calculation for a single row. In the cell corresponding to (Jan-2021, 1), we observe the count of distinct customers who made their first purchase in January 2021. Moving to (Jan-2021, 2), we track the number of customers from the Jan-2021 cohort who returned in February 2021, and so forth, until December. This process is repeated for every cohort, with our analysis concluding by December 2021.

Formulas

To facilitate calculations, transforming the range into a table is recommended, it will make referencing the columns in the formulas easier. Let’s outline the formulas for computing the number of distinct customers in the initial month (column 1) and those who returned in subsequent months.

Initial Month Count

To compute the number of distinct customers in the initial month (column 1), we utilize the following formula:

=COUNTIFS(test[customer_acquisition], initial_month, test[is_birthday], "yes")

We use the COUNTIFS function specifying two criteria: the customer acquisition month must match the initial month, and the “is_birthday” columns must be “yes”.

Returning Customers

Calculating the number of customers who returned in a given involves a slightly more complex formula:

=SUMIFS(test[came_back], test[customer_acquisition],$D$9, test[is_birthday],"no",test[year/month], INDEX($D$9:$D$20,F8,0))

Here, we sum the values in the “come_back” column using the SUMIFS function. The criteria include customer acquisition matching the initial month in the respective row, “is_birthday” being “no”, and the year/month matching the corresponding month for the column. For instance, in the analysis of the Mar-2021 cohort, column 2 corresponds to April 2021.

With the number of customers for each cohort calculated for every month, we can now determine the retention rate. Simply divide the number for each month by the number of customers in the initial month, as depicted in the figure below.

That concludes our exploration of cohort analysis in Excel. Armed with this knowledge, you can apply cohort analysis to your own dataset, examining not only the number of customers but also revenue and other metrics of interest.

Conclusion

In this article, we’ve delved into the process of conducting cohort analysis using Excel. While there are dedicated tools for such analysis, Excel remains widely used across various industries. I hope this post proves valuable to you. If you have any questions or comments, feel free to leave them below. Happy analyzing!

--

--