A Recommendation playthrough [3/3]: creating and exposing an AWS Personalize model

Johan Chataigner
Published in BeTomorrow · 13 min read · Jul 20, 2021

In the first article of this series, we saw how to track interactions between users and content with the help of Mixpanel, and export all this data to AWS S3. Then, in the second article, we talked about ETL with AWS Glue to prepare this data for AWS Personalize.

As you can guess, this last article will be about ingesting the previously prepared data into AWS Personalize and training a custom recommendation solution. As an example, we will use the interactions between users and YouTube videos extracted from our app. Let’s get to it!

How does recommendation work?

Recommendation, and more precisely a recommender system, aims to suggest to users things that they are likely to purchase, like, and so on.

To do such a thing, a recommender system needs a lot of data, composed of user profiles, item information, and interactions between these two entities. If you want to learn more about this topic, you can check this article.

In our case, we will focus on how recommendation works with AWS Personalize.

AWS Personalize offers the possibility to train a recommendation solution on your custom data. To do that, you can choose between different recipes, which are algorithms that match specific use cases grouped into 3 categories:

  • Similar items: find items that are the most similar to a given one;
  • Personalized ranking: personalize the results of a search, for instance;
  • User personalization: recommend the next item that your user is most likely to buy for example.

For example, for one functionality of your service you may want to know which item a user is most likely to purchase next. And for another functionality, you may need to show your users items similar to the ones they viewed recently. AWS Personalize recipes are here to optimize recommendations for such use cases.
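If you want to explore the available recipes from code, here is a minimal sketch (assuming Python and boto3, with AWS credentials and region already configured):

```python
import boto3

# Assumes AWS credentials and region are already configured.
personalize = boto3.client("personalize")

# List every recipe AWS Personalize offers and print its name and ARN.
response = personalize.list_recipes()
for recipe in response["recipes"]:
    print(recipe["name"], "->", recipe["recipeArn"])
```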

Create a Dataset group

A dataset group is the AWS Personalize entity that contains the datasets describing your users, your items, and the interactions between them. Creating one is the first step to build and train a solution with AWS Personalize.

Dataset groups interface

Only a name is required here to create it; all the work comes just after. As an example, we want to train AWS Personalize to make recommendations based on users’ interactions with YouTube videos. So first, we create a dataset group for this purpose:

Dataset group creation

Going back to the dataset groups dashboard, we can see that the operation worked.

An active dataset group

To continue, let’s see what comes next by clicking on our fresh dataset group, here called user-videos.

Dataset group dashboard

This dashboard shows the different steps to follow. In our case we’ve just created an empty dataset group, so let’s start by uploading datasets.

Uploading your datasets

Dataset group required datasets

As you can see, 3 datasets are needed by AWS Personalize to create a solution. Each one matches one of the 3 entities involved in a user-item interaction, which are:

  • User
  • Item
  • The interaction itself

For each of them, you need a single CSV dataset file in an S3 bucket, which can, for example, be the output of your AWS Glue ETL jobs.

The first step to import a dataset is to create a Schema, which is a JSON representation of your data as a list of fields that must exactly match your CSV file’s column names and data types. AWS Personalize uses it to know the structure of your data and parse it.

Your 3 schemas (for your 3 datasets) must follow some more rules in addition to matching your CSV.

User Schema

User schema example

Here is an example of a simple user schema given by AWS.

Any user schema requires a USER_ID field, with this exact name. If you are building your datasets with Glue, for example, you should rename the column that identifies your users to USER_ID.

Moreover, AWS Personalize requires at least one metadata field in a user schema, which must be a numerical field or a categorical string field. In the example, this corresponds to the age and the gender. You can use between 1 and 5 metadata fields in your user schema. Metadata fields are required because some recommendation algorithms recommend the same content to users with similar profiles. To compute the similarity (distance) between two profiles, you need those metadata fields, simply because you can’t compare two users with nothing but their unique IDs…
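As a sketch, here is how such a user schema could be created programmatically (assuming Python and boto3; AGE and GENDER are just illustrative metadata fields):

```python
import json
import boto3

personalize = boto3.client("personalize")

# Avro-style schema: USER_ID plus 1 to 5 metadata fields (here AGE and GENDER).
user_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "AGE", "type": "int"},
        {"name": "GENDER", "type": "string", "categorical": True},
    ],
    "version": "1.0",
}

response = personalize.create_schema(
    name="user-videos-users-schema",
    schema=json.dumps(user_schema),
)
print(response["schemaArn"])
```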

Item Schema

Video item schema example

Similarly to the user schema, the item schema requires a unique identifier field, this time called ITEM_ID. In addition, at least one metadata field is required, and you can use up to 50 metadata fields in an item schema. This is for the same reason as the user schema: some solutions are built on similarities between items.

As you can see in this example, it is possible to specify the nullability of a field in the schema by using a list as the field’s type.
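As an illustration, the fields of an item schema with a nullable metadata field could look like this (GENRE and DESCRIPTION are hypothetical field names):

```python
# Illustrative item schema fields: ITEM_ID plus metadata,
# with DESCRIPTION made nullable by listing "null" among its types.
item_schema_fields = [
    {"name": "ITEM_ID", "type": "string"},
    {"name": "GENRE", "type": "string", "categorical": True},
    {"name": "DESCRIPTION", "type": ["null", "string"]},
]
```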

Interactions Schema

Interaction schema (minimal) example

The interactions dataset stores the interactions between users and items. That is why the interactions schema requires both the USER_ID and ITEM_ID fields. In addition, a TIMESTAMP field of type “long” is also required for an interaction. One reason for the TIMESTAMP requirement is that some algorithms used by AWS Personalize base their recommendations on the user’s interaction history.

This schema (and dataset) is where you want to put information about the user’s device (type, OS, …) and the event type (a click, a view, or a like, for example).
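As an illustration, the fields of such an interactions schema could look like this (EVENT_TYPE and DEVICE are hypothetical metadata field names):

```python
# Illustrative interactions schema: the three required fields plus
# optional metadata about the event type and the user's device.
interactions_schema_fields = [
    {"name": "USER_ID", "type": "string"},
    {"name": "ITEM_ID", "type": "string"},
    {"name": "TIMESTAMP", "type": "long"},
    {"name": "EVENT_TYPE", "type": "string"},
    {"name": "DEVICE", "type": "string", "categorical": True},
]
```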

You can find more about datasets’ schemas here.

Note: Keep in mind that once you have created a schema, AWS doesn’t offer a way to modify it, and deleting a dataset won’t delete the schema it uses, since the same schema can be shared by multiple datasets. Typically, you will need a user dataset for each dataset group, and it will often be the same dataset with the same schema.

Once you’ve created your schema, hitting “Next” will verify that your schema follows AWS Personalize requirements as explained above. Then you will be able to see the following page:

Dataset import job setup

Here, you can first choose a name for the job AWS Personalize will create to import your data. Then, you need to choose an IAM role that allows it to access your data, i.e. the bucket where your CSV file lives.

Next, you need to give AWS Personalize some permissions to access your data in S3. To do so, you need to create an access role, for example AmazonPersonalizeAccessRole, with the following policy attached to it:

Policy to attach to Personalize access role

Then, you will also need to attach a policy to your S3 bucket:

Policy to attach to your S3 bucket

After that, AWS Personalize should be able to access your data when it imports your dataset. You can find more detailed documentation here.
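As a rough sketch of what this setup can look like from code (assuming Python and boto3; the bucket name is a placeholder, and you should double-check the exact policies against the AWS documentation linked above):

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-personalize-datasets"  # placeholder bucket name

# Bucket policy letting the AWS Personalize service read the CSV files.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(bucket_policy))

# The IAM role you pick for the import job (e.g. AmazonPersonalizeAccessRole)
# needs an equivalent policy (s3:GetObject and s3:ListBucket on the bucket)
# and a trust relationship allowing personalize.amazonaws.com to assume it.
```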

Finally, you can specify the path to your CSV data file. Hitting the “Finish” button will check whether your schema matches your CSV data. If you are new to AWS Personalize, this step may take a few tries before your data imports successfully. Be careful with your field names and your data types. This step is a bit tedious because schemas cannot be modified. Also, you can’t change the schema of a dataset once you’ve reached the second page of the import steps. So each time an import fails because of errors in your schema, you will need to delete the corresponding dataset entity in your dataset group (which takes a few seconds to take effect) and create a new one with a new schema.
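If you prefer to automate this step, the same import can also be launched through the SDK; here is a minimal sketch (assuming Python and boto3; all ARNs and the S3 path are placeholders):

```python
import boto3

personalize = boto3.client("personalize")

# Launch an import job that loads the CSV from S3 into an existing dataset.
# All ARNs and the S3 path below are placeholders.
response = personalize.create_dataset_import_job(
    jobName="import-users",
    datasetArn="arn:aws:personalize:eu-west-1:123456789012:dataset/user-videos/USERS",
    dataSource={"dataLocation": "s3://my-personalize-datasets/users/users.csv"},
    roleArn="arn:aws:iam::123456789012:role/AmazonPersonalizeAccessRole",
)
print(response["datasetImportJobArn"])
```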

Once your datasets are being imported into your dataset group, you can take a break and grab some coffee, because it will take some time.

Dataset import job statuses

You can launch at most 5 dataset import jobs simultaneously, but it still takes at least 10 minutes for a dataset to be imported successfully.

You should finally see something like this for each of your dataset group’s 3 datasets:

Dataset import success

You are now ready to train an AWS Personalize recommendation solution!

Create a solution

With our datasets ready, we are now able to train a custom model to make recommendations.

Note that you can start creating solutions without having all 3 datasets ready (this is an alternative to your previous coffee break).

To create a solution, you will need to choose a name and pick a recipe. AWS Personalize offers a list of recipes, which are basically algorithms that match certain recommendation use cases.

For example, you may want to recommend items to users based on their tastes, or retrieve items similar to a given one (like when you buy something on Amazon.com).

As a consequence, each recipe requires different things as input to give you recommendations. You can find the docs about recipes here.

In our case, we want to recommend to a user YouTube videos that they might like, based on their interactions (views, likes, dislikes, …) with other videos in our app. Our use case belongs to the ‘User personalization’ category of recipes. For this example, we chose the ‘aws-user-personalization’ recipe, which takes a user ID as input and gives as output a list of recommended item IDs (here, YouTube video IDs).

Solution creation interface

To improve your solution, AWS Personalize can optimize the hyperparameters of the algorithm that will be trained. To do so, just make sure the “Perform HPO” switch is enabled, as in the figure below.

Hyperparameter optimization interface

You can then change the default values set by AWS Personalize, like the number of training jobs during HPO or the Hidden dimension (depending on the complexity of your data). For this example, we will just perform HPO with the default values.
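For reference, here is a minimal sketch of the equivalent solution creation through the SDK (assuming Python and boto3; the dataset group ARN is a placeholder):

```python
import boto3

personalize = boto3.client("personalize")

# Create a solution using the aws-user-personalization recipe with HPO enabled.
solution = personalize.create_solution(
    name="user-videos-solution",
    datasetGroupArn="arn:aws:personalize:eu-west-1:123456789012:dataset-group/user-videos",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
    performHPO=True,  # let Personalize tune hyperparameters during training
)

# Training actually happens when you create a solution version.
version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"]
)
print(version["solutionVersionArn"])
```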

Once you have finished this step and hit “Finish”, you should see this on your dataset group’s dashboard:

Solution creation statuses

Your solution will now be trained on the data you imported into AWS Personalize. This can take a while, depending on the size of your dataset. To go to the next step, you will need to wait for the training to finish.

Create a campaign for your solution

To get recommendations from your trained solution, you need to create a Campaign. This is the entity you will communicate with (through the AWS SDK for example) to get your solution’s outputs.

Campaign creation interface

As usual, you need to choose a name for your campaign. Then you can pick the solution of your choice among all the ones you trained. And that’s it!

(Note: you can choose the minimum number of transactions per second for your campaign; we will leave it at the default value of 1 for this example.)
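For reference, a minimal sketch of the same step through the SDK (assuming Python and boto3; the solution version ARN is a placeholder):

```python
import boto3

personalize = boto3.client("personalize")

# Deploy a trained solution version behind a campaign with 1 provisioned TPS.
campaign = personalize.create_campaign(
    name="user-videos-campaign",
    solutionVersionArn="arn:aws:personalize:eu-west-1:123456789012:solution/user-videos-solution/<version>",
    minProvisionedTPS=1,
)
print(campaign["campaignArn"])
```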

Make your predictions available

The last step to get recommendations from your trained algorithm is to use the AWS SDK to communicate with your campaign.

For this purpose, we chose in this example to use AWS Lambda and AWS API Gateway. We won’t go into depth on this topic because the main subject here is AWS Personalize. In short, you need to set up a Lambda triggered by a route of your API. In our example, we want to get recommendations for a given user, so your route needs a user ID as a path parameter, for example. Then you can get recommendations for this user with the following sample code:

Get recommendation code sample

The SDK requires your campaign ARN, the user ID and can also take a maximum number of results.
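As a rough sketch of what such a handler could look like (assuming Python and boto3; the campaign ARN and the userId path parameter name are placeholders):

```python
import json
import boto3

# The runtime client is the one used to query a campaign.
personalize_runtime = boto3.client("personalize-runtime")

# Placeholder campaign ARN.
CAMPAIGN_ARN = "arn:aws:personalize:eu-west-1:123456789012:campaign/user-videos-campaign"

def handler(event, context):
    # The user ID comes from the API Gateway path parameter (hypothetical name).
    user_id = event["pathParameters"]["userId"]

    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        numResults=10,  # maximum number of recommended items to return
    )

    # itemList is a list of {"itemId": ..., "score": ...} entries.
    item_ids = [item["itemId"] for item in response["itemList"]]
    return {"statusCode": 200, "body": json.dumps({"recommendations": item_ids})}
```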

This will give you a list of recommended items, more precisely a list of item IDs. In our case, we get a list of YouTube video IDs:

Example of response from our recommendation engine

Bonus: Set up Personalize tracking

So far, we have successfully trained an algorithm that makes custom recommendations for our YouTube video recommendation app. However, this algorithm was trained at a given moment on a fixed set of user-video interactions.

Fortunately, to adapt it to the incoming interactions of your users with your content, AWS Personalize offers the possibility to add new data to your recommendation solutions and datasets in real time. This is done through the Event Trackers feature, which allows your solutions to learn from the most recent interactions without retraining.

Event trackers allow you to track new interactions from your users and integrate them into your datasets. With the help of trackers, AWS Personalize can update the recommendations it makes based on the new events tracked in real time.

I made this section a bonus because it is the icing on the cake for your custom recommendation engine with AWS Personalize. You can find more details about this in the docs.

Event trackers interface

To create a tracker, you will just need to give it a name like in the following example:

Event tracker creation

Then, the tracking ID will allow you to track events with your new tracker using the AWS SDK for example.

Event tracker tracking ID

For our YouTube video recommendation app, we also chose to work with Lambda to track new events. In the Lambda’s code, tracking a new event looks like this:

Event tracking code sample

You will obviously need the IDs of the user and the item involved in the interaction, as well as the timestamp of when the event occurred. This information matches the fields of your interactions dataset’s schema (USER_ID, ITEM_ID, TIMESTAMP).

The tracking ID is the one mentioned above, which you get after creating a tracker. The session ID is useful if you want to keep track of events that occurred before the user logged in (when you don’t have the user ID yet).
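As a rough sketch of what such a call could look like (assuming Python and boto3; the tracking ID, event type, and identifiers are placeholders):

```python
import time
import boto3

# The events client is the one used to send interactions to a tracker.
personalize_events = boto3.client("personalize-events")

# All identifiers below are placeholders.
personalize_events.put_events(
    trackingId="11111111-2222-3333-4444-555555555555",
    userId="user-42",
    sessionId="session-42",  # lets you link anonymous events to a user later
    eventList=[
        {
            "eventType": "VIDEO_VIEW",
            "itemId": "video-123",
            "sentAt": int(time.time()),  # timestamp of the interaction
        }
    ],
)
```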

How does AWS Personalize Pricing work?

To finish this article and make it complete, let’s talk about what you will pay for using the AWS Personalize service.

The first thing you will pay for is data ingestion, which costs $0.05 per GB of data. This includes the data you import from S3 as explained earlier, but also the data ingested in real time by your trackers. Note that the first 20 GB of each month are free! So this is obviously not what will cost you the most money.

The next thing you will pay for when using AWS Personalize is the training of your solutions. The unit used to compute what you will pay is the training hour.

A training hour represents 1 hour of compute capacity using 4 vCPUs and 8 GiB of memory.

Note that sometimes you will pay for more training hours than the actual elapsed time of your trainings because:

Amazon Personalize automatically chooses the most efficient instance types to train your data, which may be an instance that exceeds the baseline specifications in order to complete your job more quickly.

Similarly to data ingestion, there is a free tier of 100 training hours; beyond that, a training hour costs $0.24.

Finally, the last (but not least) feature of the service that you will pay for is inference. We will only cover real-time inference here. To put it simply, you pay for real-time inference as if you were renting a remote machine.

The service supports real-time recommendations, whose throughput is measured in transactions per second (TPS).

You have to tell the service the minimum TPS you want, according to your needs. Then, the amount you pay is based on the TPS-hour quantity, which is roughly computed with the following formula, evaluated for each hour and summed over the month:

TPS-hours = max(TPS provisioned, TPS actual)

TPS provisioned corresponds to the minimum TPS you specified when creating your campaign, and TPS actual is the actual amount of TPS you consumed. You can find more details and examples about this pricing here. Inference pricing works as a bulk discount: the first 20K TPS-hours in a month are billed at $0.20 per unit, the next 180K cost $0.10 per unit, and beyond 200K TPS-hours you pay $0.05 each. The more you use it, the less you pay! (There is also a monthly free tier of 50 TPS-hours.)
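To make the bulk discount concrete, here is a small, purely illustrative computation using the prices above (the usage figure is made up and the free tier is ignored):

```python
# Illustrative only: monthly inference cost for a made-up usage of 250,000 TPS-hours,
# using the tiered prices quoted above.
tps_hours = 250_000

first_tier = min(tps_hours, 20_000) * 0.20                      # first 20K at $0.20
second_tier = min(max(tps_hours - 20_000, 0), 180_000) * 0.10   # next 180K at $0.10
third_tier = max(tps_hours - 200_000, 0) * 0.05                 # beyond 200K at $0.05

total = first_tier + second_tier + third_tier
print(f"${total:,.2f}")  # $4,000 + $18,000 + $2,500 = $24,500
```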

Because we are nice people, here is a conversion of these thresholds in terms of TPS (and not TPS-hour) to help you choose your provisioned TPS:

  • 20K monthly TPS-hours correspond to ~28 provisioned TPS (assuming ~720 hours in a month)
  • 200K monthly TPS-hours represent ~278 provisioned TPS

To sum up on pricing, the important thing to keep in mind is that inference is what will cost you the most, and you start paying for it as soon as you create your campaign.

Conclusion

As a conclusion, I would simply say: it works.

To save yourself a good amount of time, we advise you to think carefully about the structure of your data before trying anything with AWS Personalize. This way, you will avoid struggling to import your datasets into the service.

Also, be careful with the pricing: don’t keep campaigns alive that you no longer use, because they can cost you an arm and a leg even when idle.

Finally, it is not possible (for now at least) to get recommendations directly from your client side. You will need either a serverless approach (as we did with Lambda) or a dedicated backend server for this purpose.

You have finally reached the end of this series of articles! We hope you learned some useful information along these 3 articles.

Thank you.

I hope you enjoyed this article! Please share your comments with us in the section below👇

BeTomorrow is a consulting, design and software agency. Click here to have a look at what we do!
