Using the MongoDB Aggregation Framework to provide data for Cohort Analysis

Rafael Ochoa
Nov 3, 2023


As a backend engineer at a small startup, I’ve often been asked by our Growth team to provide data for Cohort Analysis, mainly to analyze user retention and engagement with our product. They are familiar neither with the data structure of our operational DB nor with the MongoDB Query Language, so they need pre-processed data in a format they can work with.

The Aggregation Framework is a MongoDB feature that allows data to be transformed and reshaped to support different types of analysis. It works with pipelines, so transformations can be done in stages, where each stage’s input is the output of the previous one.
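
As a minimal illustration (the collection and fields here are hypothetical, not our actual data), a pipeline is just an array of stages passed to aggregate(), where each stage reshapes the documents produced by the previous one:

// Hypothetical example: count orders per status, only for recent documents.
db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2023-01-01T00:00:00.000Z") } } }, // stage 1: filter
  { $group: { _id: "$status", count: { $sum: 1 } } }                        // stage 2: group and count
]);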

[Cartoon by Timo Elliott. Source: https://timoelliott.com/blog/cartoons/analytics-cartoons]

We first needed to work with the Growth team to understand the analysis they wanted to do, so we could figure out a good way to provide them with the data. After some back and forth, we agreed that this could be achieved with a table containing one row per user per period (either weeks or months), including the count of each user’s most relevant interactions in the given date bucket.

Then the tech challenges started. To demonstrate the solution, we will calculate the count of likes, dislikes, and total interactions sent per user per week on the following simplified dataset.

[
  {
    "_id": "654559d3d99d55b786221257",
    "interactions": [
      {
        type: "LIKE",
        date: ISODate("2023-08-30T08:55:10.000Z")
      },
      {
        type: "LIKE",
        date: ISODate("2023-08-30T12:51:10.000Z")
      },
      {
        type: "DISLIKE",
        date: ISODate("2023-08-31T08:30:10.000Z")
      },
      {
        type: "LIKE",
        date: ISODate("2023-08-31T08:51:10.000Z")
      },
      {
        type: "DISLIKE",
        date: ISODate("2023-09-06T08:51:10.000Z")
      },
      {
        type: "LIKE",
        date: ISODate("2023-09-07T10:51:10.000Z")
      },
      {
        type: "LIKE",
        date: ISODate("2023-09-09T12:51:10.000Z")
      }
    ],
    "createdAt": ISODate("2023-08-30T08:51:10.000Z")
  },
  {
    "_id": "654559e574e73904c2b5b084",
    "interactions": [
      {
        type: "LIKE",
        date: ISODate("2023-08-25T09:55:10.000Z")
      },
      {
        type: "LIKE",
        date: ISODate("2023-08-28T12:51:10.000Z")
      }
    ],
    "createdAt": ISODate("2023-08-25T08:51:10.000Z")
  }
]

In the sample dataset, there is an interactions array embedded in the user document. In real life, this array may come from data in different collections, so it could be the result of a previous $lookup stage in the pipeline.
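
For example, if the interactions lived in their own collection instead of being embedded, a $lookup stage along these lines could produce an equivalent array (the interactions collection name and its userId field are assumptions for illustration, not our actual schema):

// Hypothetical $lookup stage: joins an "interactions" collection into each
// user document, producing an embedded array like the one in the sample data.
{
  $lookup: {
    from: "interactions",      // assumed collection name
    localField: "_id",         // the user's _id
    foreignField: "userId",    // assumed reference field on interaction documents
    as: "interactions"
  }
}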

The first step is to add an $unwind stage to our pipeline, to have a single document for each interaction sent by a user:

let pipeline = [
  {
    $unwind: "$interactions"
  }
];
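
After this stage, each user document is repeated once per interaction, with the interactions field holding a single element instead of an array. For the first user in the sample dataset, the first unwound document looks roughly like this:

{
  "_id": "654559d3d99d55b786221257",
  "interactions": {
    type: "LIKE",
    date: ISODate("2023-08-30T08:55:10.000Z")
  },
  "createdAt": ISODate("2023-08-30T08:51:10.000Z")
}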

Once we have a document for each interaction, the next step is to group results by the desired date buckets. Every interaction has a date, but we want to group all documents whose date falls within a given period (for example, week 1 of 2023 or January 2023). We tried several different and rather complex approaches to achieve this until we realized that $dateToString allowed us to transform a date into a string representing what we call a dateBucket. So we add a $set stage that adds a field containing that expression for the chosen timeframe:

pipeline.push({
  $set: {
    dateBucket: {
      $dateToString: {
        date: "$interactions.date",
        format: "%Y-%U"
      }
    }
  }
});

The documentation of $dateToString includes a full explanation of its syntax. In this case, we’re doing a weekly analysis, so we used %Y, which is the full year, followed by a dash, followed by %U, which is the week of the year in two digits, zero-padded. If we needed a monthly analysis, the expression would be %Y-%m, %m being the month in two digits, zero-padded.
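
For reference, a monthly variant of the same stage (not part of the weekly pipeline we’re building here) would only differ in the format string:

// Monthly variant: produces buckets like "2023-08" instead of "2023-35".
{
  $set: {
    dateBucket: {
      $dateToString: {
        date: "$interactions.date",
        format: "%Y-%m"
      }
    }
  }
}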

Now we have the desired timeframe in the dateBucket field, so we can add a $group stage to the pipeline, whose _id is an object containing userId and dateBucket. We need to provide a count of all of the grouped documents, as well as a count of those whose type is “LIKE” and of those whose type is “DISLIKE”. The first one just requires a $sum accumulator, adding 1 for each document. For the other two, we combine the $sum accumulator with a $cond operator, so that only documents of the matching type are counted.

pipeline.push({
  $group: {
    _id: {
      userId: "$_id",
      dateBucket: "$dateBucket"
    },
    likesCount: {
      $sum: {
        $cond: [
          { $eq: ["$interactions.type", "LIKE"] },
          1,
          0
        ]
      }
    },
    dislikesCount: {
      $sum: {
        $cond: [
          { $eq: ["$interactions.type", "DISLIKE"] },
          1,
          0
        ]
      }
    },
    interactionsCount: {
      $sum: 1
    }
  }
});

Then we need to add a new $set stage to expose userId and dateBucket as part of the root document, and a $project stage to get rid of the _id object, so our result is a flat document that maps more naturally to a table:

pipeline.push({
  $set: {
    userId: "$_id.userId",
    dateBucket: "$_id.dateBucket"
  }
});

pipeline.push({
  $project: { _id: 0 }
});

This is the output we get after executing the pipeline:

[
  {
    "dateBucket": "2023-35",
    "dislikesCount": 1,
    "interactionsCount": 4,
    "likesCount": 3,
    "userId": "654559d3d99d55b786221257"
  },
  {
    "dateBucket": "2023-34",
    "dislikesCount": 0,
    "interactionsCount": 1,
    "likesCount": 1,
    "userId": "654559e574e73904c2b5b084"
  },
  {
    "dateBucket": "2023-36",
    "dislikesCount": 1,
    "interactionsCount": 3,
    "likesCount": 2,
    "userId": "654559d3d99d55b786221257"
  },
  {
    "dateBucket": "2023-35",
    "dislikesCount": 0,
    "interactionsCount": 1,
    "likesCount": 1,
    "userId": "654559e574e73904c2b5b084"
  }
]

A final step is to use an $out stage to save these results in a new collection, so they can be directly queried using SQL (with the Atlas SQL interface) or even downloaded to a CSV for further analysis.
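
A sketch of that final stage could look like this (the target collection name is just an example):

// Writes the pipeline results into a new collection, replacing its contents if it already exists.
pipeline.push({
  $out: "userInteractionCohorts"
});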

This playground contains the sample dataset and the pipeline to play with the results.
