A Recommendation playthrough [1/3]: tracking users with Mixpanel and exporting data to AWS S3

Johan Chataigner
Published in BeTomorrow
Jul 20, 2021

Having users wander through your application, browsing its content, watching videos, and … getting recommendations or a personalized experience seems to be the go-to model when creating a content-based platform. But do we really know what technologies are behind such features?

This is exactly our objective here, in this series of three articles.

We aim to show, from the first tracking event to the last recommendation query, how to implement an architecture capable of handling this kind of feature.

For this purpose, we made a sample application built on the YouTube API. With it, users can see and subscribe to channels, and watch and like videos. It will be the technical basis on which we work throughout the three articles of this series.

We made the technical decision to use Mixpanel to track our users, and Amazon Web Services (AWS S3, Glue and Personalize) to build this recommendation architecture.

In this first article, we will mainly deal with Mixpanel to see how user tracking actually works. In addition, we will show how you can export the generated data to AWS S3, as Mixpanel provides a tool to automate such exports.

If you are more interested in creating an ETL pipeline with AWS Glue to transform a dataset, you can go directly to the second article: A Recommendation playthrough: transforming data with AWS Glue.

Finally, if what you really want to see is a walkthrough of implementing a recommendation engine with AWS Personalize, the last article is for you: A Recommendation playthrough: creating and exposing a Personalize model.

Before we start, you need a Mixpanel account with the Data Pipelines add-on activated (in trial or not, it does not matter). Similarly, an AWS account with billing configured is also required.

Let’s get started!

What is Mixpanel?

According to Mixpanel itself,

Mixpanel is a tool that allows you to analyze how users interact with your Internet-connected product. It’s designed to make teams more efficient by allowing everyone to analyze user data in real-time to identify trends, understand user behavior, and make decisions about your product. It employs an event-based, user-centric model that connects each interaction to a single user. Mixpanel’s data model is built on the concepts of users, events, and properties.

From a developer or entrepreneur perspective, Mixpanel is a really powerful tool for analyzing and understanding how your users interact with your product (your application). You can create cohorts and dashboards, manage users, and connect everything to many other products and services. For instance, you can connect a cohort to Braze and create personalized communication campaigns based on its data.

Mixpanel comes with an SDK available for multiple platforms (Java, Python, Android, iOS, and even React Native or Flutter), and it is really easy to use. You just need to specify your project token to start sending events.

Tracking your users with Mixpanel

With the help of Mixpanel, you can track the users of your custom service.

This means that each user is associated with a distinct ID, and all their actions are tied to this ID. That way, each user has a profile on your Mixpanel project with all the information you need, bound to a history of the events you tracked.

Example of a Mixpanel user profile

A Mixpanel event can be any user interaction you want to track in your app. For instance, it can be a like on a post, the purchase of an item, or even something simpler like the user logging in.

Here is an example of an event:

Example of an event details

As you can see, an event has a name that you choose (here likeEvent). Then, each event comes with a list of properties that can be split into two categories:

  • Custom properties
  • Mixpanel properties

You can put any information you need in an event's properties; these are the custom properties. Mixpanel then adds its own properties to all your events, such as the distinct ID of the user who triggered the event and other information like the time of the event.

Now, let's see what tracking looks like in code with the Mixpanel SDK!

The Mixpanel SDK provides libraries for various languages and frameworks like Python, React Native, and Node.js. As a consequence, you can track on the client side as well as on the server side.

In our case, we made a Flutter app that lets users interact with YouTube content (videos, channels, and playlists). That's why the following code samples use the Flutter library of the Mixpanel SDK.

First of all, you need to create a Mixpanel instance before tracking users and events:

Mixpanel initialization
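As a minimal sketch, the initialization can look like the following with the mixpanel_flutter package. The MixpanelManager wrapper and the token value are placeholders of ours, not the app's actual code, and recent versions of the SDK require the trackAutomaticEvents flag:

```dart
import 'package:mixpanel_flutter/mixpanel_flutter.dart';

/// Hypothetical singleton wrapper so any screen can reach Mixpanel.
class MixpanelManager {
  static Mixpanel? _instance;

  static Future<Mixpanel> init() async {
    // 'YOUR_PROJECT_TOKEN' is a placeholder for your project's token.
    _instance ??= await Mixpanel.init('YOUR_PROJECT_TOKEN',
        trackAutomaticEvents: false);
    // We are in Europe, so point the SDK at Mixpanel's EU servers.
    _instance!.setServerURL('https://api-eu.mixpanel.com');
    return _instance!;
  }
}
```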

This way, you have access to your Mixpanel instance everywhere in your app. We chose this approach because in our case every screen of our app can trigger Mixpanel events.

Note: be careful with the server URL you specify; here we set it to api-eu.mixpanel.com since we are in Europe.

Tracking events

Tracking events with the SDK is as simple as this code snippet:

Event tracking example
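Here is a minimal sketch of what this looks like; the trackVideoLike helper and the property names are illustrative, not the app's actual code:

```dart
import 'package:mixpanel_flutter/mixpanel_flutter.dart';

// Track a user liking a video; property names are illustrative.
void trackVideoLike(Mixpanel mixpanel, String author, String title) {
  mixpanel.track('likeEvent', properties: {
    'videoAuthor': author,
    'videoTitle': title,
  });
}
```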

This example tracks a user liking a video in our app. The first parameter is the event name, and the properties parameter is an object where you can add your custom event properties. Here, for example, we added information about the liked video, such as its author and title.

Tracking users

Tracking users with the SDK is a bit trickier in terms of logic, but just as simple as event tracking in terms of code. If you want to know the details of how it works, we invite you to check the docs. Since ours is just an example app, we didn't need to manage user identity the way a production app with a backend would. To manage user identity, we simply used this code:

Manage user identity sample code
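A minimal sketch, assuming the same mixpanel_flutter instance as above (the loginAs helper is ours):

```dart
import 'package:mixpanel_flutter/mixpanel_flutter.dart';

// Tie all future events to the distinct ID chosen on the login screen.
void loginAs(Mixpanel mixpanel, String distinctId) {
  mixpanel.identify(distinctId);
}
```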

To choose which user is tracked in our app, we made a simple login screen where we pick the user's distinct ID. This distinct ID is passed to the identify method to tie all future events to this user. Calling identify with a non-existent ID simply creates a new Mixpanel user profile, which is not a problem for our simple use case.

Generating data for a Mixpanel project

As a reminder, the final goal of this series of articles is to train a custom recommendation solution based on the data we track with Mixpanel. To reach this objective, a good amount of data is needed.

However, our example app is only used by developers (us) to test it, which does not produce the large number of users and events we need to work with.

Fortunately for us, the Mixpanel SDK supports Python, so we just wrote a script that generates users and events tied to those users. This is pretty straightforward and similar to what is shown above in Dart, so we won't give a code sample here.

Set up a data pipeline

Mixpanel's Data Pipelines feature lets you connect your Mixpanel project to your own data lake in the cloud, for various providers such as Google Cloud and Amazon Web Services. Here, we will use this feature to connect to Amazon Web Services storage.

To manage Mixpanel data pipelines, go to this page. We want to create a data pipeline to export our users and events to AWS S3, so let’s go to the Data Pipelines API section at the bottom of the left menu:

Click on the Create pipeline option and let’s start creating our export to S3!

First of all, we need to give the request our project's API secret so the pipeline is created for the right project:

API secret input

You can find this button on the right of the request URL. Your project's API secret can be found in your project settings, in the Access Keys section:

Project access keys

This being done, we can now choose the type of our pipeline. What we want here is a pipeline to AWS S3, so let’s select ‘AWS S3 or Glue’ pipeline type in the dropdown:

Now we can start setting up the pipeline's configuration. The first parameter asked for is type, but only one value is available (aws) and it is set by default, so no change is needed.

Then we have to choose a value for the trial parameter. Setting it to true gives you a default configuration for your pipeline. In our case, we set it to false since that configuration doesn't match our needs.

Note: the default value is documented as false but is set to true when you arrive on the page, so be careful to switch it off if you don't want the trial configuration.

Trial default configuration

The next parameter is the data schema type, which can be set to monoschema or multischema. Since our events don't hold the same properties, it is more appropriate to store them in different tables, so we set the schema type to multischema here.

Output schema type

To continue, we are asked to choose a data source for the pipeline we're creating. You can choose between users and events. For our example recommendation app, we need both to be exported to S3, which means creating one data pipeline per data source. Don't worry, the configuration is nearly the same, and pretty straightforward.

Pipeline possible data sources

The sync value defaults to false. Setting it to true will automatically update previously exported data whenever your Mixpanel dataset changes. We set this to true for the events pipeline; it can be useful, for instance, if you send an event immediately when it occurs but later need to add a value you were still waiting for (from an API call, for example) as a custom property of the event.

Note: As we are writing this article, setting sync to true is unavailable for a users pipeline.

The next two parameters, from_date and to_date, simply allow you to export data only within a fixed time window. We won't use them here, but they can be useful if the early events and users in your Mixpanel project are just test data you don't want to export. To have an indefinite export window and export data "forever", just leave to_date empty and set from_date to your project's creation date, for example (it is a required parameter).

Pipeline time window

We can then set the frequency parameter to tell the pipeline how often it should export our data. We choose hourly here, which is one of the reasons we didn't set trial to true: the trial configuration sets the frequency to daily, and we didn't want to wait a full day before data started arriving in our S3 bucket.

Data export frequency

The last set of parameters about Mixpanel data, events and where, lets you filter the exported data by event names and properties. We don't use them here since we want to export all our data at once, but they can be great depending on your needs.

Then, choose a data format as follows:

It can be either parquet or json; we choose json here.

Now, we can start giving the pipeline information about S3 and Glue. Here is the main information required by the pipeline:

Pipeline S3 configuration

The s3_prefix will be the name of the root folder containing your exported data.

For the role, you have some extra steps to do. Mixpanel's documentation covers them perfectly, but here is a slightly shortened version: you are going to create a policy and a role tied to Mixpanel's main AWS account, to enable it to write to your S3 destination.

First, go to your AWS portal and open the IAM console. You will have to create a policy (with whatever name you like) that grants read and write access to your bucket (PutObject, GetObject, ListBucket, and DeleteObject).
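As a sketch, such a policy could look like this (the bucket name is a placeholder; note that ListBucket applies to the bucket itself while the object actions apply to its contents):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-export-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-export-bucket"
    }
  ]
}
```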

After creating this policy, you have to associate it with a new role dedicated to Mixpanel. Go back to the IAM console, open the roles tab, and click on "Create role". Then, select the "Another AWS Account" type of trusted entity and enter Mixpanel's account ID: 485438090326. You can now attach the policy you created to this role and validate its creation. Finally, you need to change the role's trust relationship: find your role, click on it, then go to "Trust relationships" and "Edit trust relationship".

Role trust relationship interface

In the JSON editor, you just have to update the "Principal" object's "AWS" line with the following ARN: arn:aws:iam::485438090326:user/mixpanel-export.
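The resulting trust policy should look roughly like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::485438090326:user/mixpanel-export"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```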

And that’s all!

The last values to fill in concern S3 encryption, which we do not use here, and the Glue setup. We tried to use Glue at first, but we finally removed it from the pipeline and set it up later in the AWS console. This is explained later in this article.

Finally, to create your pipeline, you can either press the "Send" button at the top of the page or copy-paste the request from the right pane and launch it with your preferred tool (curl, for example).

Note: be careful with the URL used in the request. By default it is data.mixpanel.com, but since we are in Europe the default doesn't work for us. If you are doing this from Europe, use the following URL instead: data-eu.mixpanel.com
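To give an idea, the generated request looks roughly like the sketch below. The parameter names mirror the fields described above, but double-check the exact endpoint and names against the right pane before launching it; the secret, bucket, region, and role values are placeholders:

```bash
curl https://data-eu.mixpanel.com/api/2.0/nessie/pipeline/create \
  -u "YOUR_API_SECRET:" \
  -d type=aws \
  -d trial=false \
  -d schema_type=multischema \
  -d data_source=events \
  -d sync=true \
  -d from_date=2021-05-01 \
  -d frequency=hourly \
  -d data_format=json \
  -d s3_bucket=your-export-bucket \
  -d s3_prefix=lm_ \
  -d s3_region=eu-west-1 \
  -d s3_role=arn:aws:iam::123456789012:role/your-mixpanel-role
```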

Mixpanel will now export your data on an hourly or daily basis, depending on the frequency you chose.

Your S3 bucket files will match the following pattern: s3://{your-bucket}/[prefix]/{mixpanel_project_id}/{mixpanel_event_name}/{year}/{month}/{day}/your-dataset.json.gz.

For instance, in our case: s3://mixpanel-export/lm_/123456789/likeevent/2021/05/15/some-dataset.json.gz.

Now that our data is continuously exported to our S3 bucket, it is time to reference its structure and create our dataset!

Reference your data with Glue Data Catalog

What are AWS Glue and Glue Data Catalog?

As we will see in the next article of this series, we can use AWS Glue to reference and process our data. In this article we will not focus on precisely defining what Glue is, beyond saying that it is Amazon Web Services' "data integration service", which can be summed up as ETL and the tooling around it. With Glue you can visualize, prepare, structure, and engineer your data.

Glue Data Catalog is the part of AWS Glue designed to structure your data. It behaves like a NoSQL database in which you register databases and tables. Each table is described by a schema (the dataset's columns) and references a path where the data is located. The path can be either an AWS S3 path or an AWS Kinesis/Apache Kafka reference (Glue Data Catalog can also be used to structure your event pipelines).

Once this is done, these tables can be used in many other AWS services (Athena, the Glue Jobs we will see in the next article, and many others) and greatly simplify the way you access your data.

Glue Data Catalog and Mixpanel

In the previous steps, we asked you not to select the option that lets Mixpanel create Glue Data Catalog tables. While exploring the service, we ran into problems with the entries created by Mixpanel.

In Glue Data Catalog, you can partition your data, and the partition in which a record is located is concretely represented by a partition key inside each record of the table. In our case, Mixpanel creates a hierarchy of directories based on the output date. For instance, the output for 8:00 AM to 9:00 AM on May 20, 2021 will be located in s3://…/event/2021/05/20/content.json. Mixpanel is thus creating partitions based on the date. However, when we ask Mixpanel to create Glue Data Catalog entries (a database and tables), it only creates the tables with their schemas, linked to the main event directory in S3, without any partition key for the date. As a consequence, when you later want to access this data, your tools do not know where to look for it, because they are not aware of any partitions or of the structure associated with them.

We have to reference the data ourselves, and this is exactly what Glue Crawlers are for. AWS Glue Crawlers are workers designed to browse an entire tree of directories to discover how your data is structured, and then to create (or update) Data Catalog tables whose paths and schemas match your data's.

Creating Crawlers

To create your first crawler, just go to AWS Glue and select the "Crawlers" tab. Next, click on the "Add crawler" button, and here we go!

Crawlers list in Glue

After entering your crawler's title, you can move on to the behavior your crawler will have when browsing your datasets.

Crawler’s behavior setup

Our source is indeed a data store, and because Mixpanel may output multiple times into the same directory (we configured it to export data every hour), we choose to crawl all folders. A crawler may be run several times if your data changes over time.

Crawler’s data store configuration

Next, we have to set up the connection to the data source. Here you can choose among a classical JDBC driver, AWS DynamoDB, or AWS S3, which is what we are going to connect to.

By clicking on "Add connection", a new window opens. The fields are quite simple to fill in; just pay attention to the VPC and its subnets. Be sure that the VPC you associate with this connection has access to AWS S3 (it should by default).

Once this is done, we just have to select which Mixpanel event type we want to crawl, and select the main directory in our S3 bucket where the associated data is located.

Finally, you will have to associate a role with this crawler. Make sure you created a role with access to the bucket the crawler is about to analyze, and that the crawler also has access to the Glue console. To do so, create a policy that can read from your S3 bucket. Then, create a role whose name starts with "AWSGlueServiceRole-" (mandatory) and ends with whatever you want. To finish, attach the policy you created and the managed policy "AWSGlueServiceRole" to the role.

The next step concerns the frequency of runs; then you can configure how your data will be output to your AWS Glue database (you can create one right there by clicking on "Add database"). I personally prefer to use a prefix, as it makes it easier to recognize and group tables. One option that is important in our case is the optional "Create a single schema for each S3 path". This is what tells our crawler to partition data based on each dataset's path. Otherwise, you will end up with as many tables as you have paths (at least one per day). In the additional configuration options, you can set behaviors such as what to do if the schema changes over time.

You are now done! You can run your crawler. Once the run has finished, you finally have your Glue Data Catalog table entry!

Example of output schema…

And most importantly, your data is now partitioned.

…which is partitioned

You can rename fields as you wish. Here, partition_0 is the year, partition_1 is the month, and partition_2 is the day.
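To illustrate why these partitions matter, here is a hypothetical Athena query over such a table (assuming a database named mixpanel, a table named likeevent, and the default partition column names); filtering on the partition columns lets Athena prune the scan down to the matching S3 prefix:

```sql
-- Read one day of like events by filtering on the partition columns.
SELECT *
FROM mixpanel.likeevent
WHERE partition_0 = '2021'
  AND partition_1 = '05'
  AND partition_2 = '15'
LIMIT 10;
```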

About pricing

To finish this article, we will give you some hints about the pricing of the various tools we used and presented.

Mixpanel

In short, Mixpanel's pricing depends on the client. Their pricing is not fixed and there aren't many hints on their dedicated page. You will need to either contact them to get a clearer idea of how much their solution will cost you, or settle for rough estimates found on the Internet.

Glue

We won't cover everything about Glue's pricing here, only what relates to what we have used so far.

The storage pricing of Glue Data Catalog is really simple: you won't pay anything as long as you stay under 1M objects stored and 1M requests each month. If you exceed those limits, you will pay $1.00 per 100K objects over 1M, and $1.00 per additional 1M requests.

For the crawlers, the billing unit used by AWS is the DPU-hour.

A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory.

You pay for your crawlers' run time in increments of 1 second, with each run billed as if it lasted at least 10 minutes.
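For a rough idea, assuming AWS's published rate of $0.44 per DPU-hour and a crawler consuming 2 DPUs, a run that finishes in under 10 minutes is billed at the minimum: 2 × (10/60) × $0.44 ≈ $0.15 per run (rates vary by region, so treat these numbers as an illustration).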

These features of Glue won't cost you much for what we are doing here. For example, you will only need to run your crawlers once, and then again each time your data schema changes.

Conclusion

To conclude this first article of the series, we saw that tracking users with Mixpanel is pretty simple thanks to the Mixpanel SDK. The data export step with Mixpanel Data Pipelines proved to be pretty straightforward too. In a nutshell, the tricky part is knowing your data's structure well enough to correctly set up how it will be stored (here with Glue Data Catalog and S3).

The only thing that may remain unclear at the end of this article is Mixpanel's pricing, which is really specific to your project's needs.

Now that we have data on our cloud storage, we are ready to work on it and continue our journey towards recommendation with the next article: A Recommendation playthrough: transforming data with AWS Glue.

If you are already an experienced user of AWS Glue, you can skim over the next article and go straight to the last one about recommendation: A Recommendation playthrough: creating and exposing a Personalize model.

Thank you.

I hope you enjoyed this article! Please share your comments with us in the section below👇

BeTomorrow is a consulting, design and software agency. Click here to have a look at what we do!
