Building a Movie Recommendation System with AWS ML Services

3 min read · Jun 5, 2024

Introduction

Movie and product recommendation is one of the most popular applications of machine learning. Recommendation systems model users' preferences for items based on their past interactions (e.g., click events, user ratings).

There are two ways to build a recommendation system on the AWS Cloud platform.

Amazon SageMaker — Factorization Machine Algorithm
Amazon Personalize

Amazon Personalize vs SageMaker: Factorization Machine

The following table compares Amazon Personalize with the SageMaker FM algorithm.

**Amazon Personalize vs SageMaker Factorization Machine**

In this article, let's walk through a quick movie recommendation example using Factorization Machines.

Factorization Machine — Intro

Designed for sparse data
An extension of the linear model that also learns pairwise feature interactions in sparse data
Supervised algorithm
Can be used for classification or regression problems
Expects all categorical values to be one-hot encoded
For model training, it expects data in the RecordIO-protobuf format
Expects values in the float32 data type
CSV input is not supported for training
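To make "linear model plus pairwise interactions" concrete, here is a minimal NumPy sketch of the standard second-order Factorization Machine score. It illustrates the general FM formula and is not SageMaker's internal implementation; the names `fm_score`, `w0`, `w`, and `V` are chosen for this example only. For binary classification, the raw score is typically passed through a sigmoid to get a probability.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Raw FM score for one feature vector x.

    x: (n,) features, w0: global bias, w: (n,) linear weights,
    V: (n, k) latent factors, one k-dimensional row per feature.
    """
    linear = w0 + w @ x
    # Pairwise term sum_{i<j} <V[i], V[j]> * x_i * x_j, computed in O(n*k) as
    # 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

def fm_score_naive(x, w0, w, V):
    """Same score with the explicit double loop, useful for checking."""
    total = w0 + w @ x
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total

rng = np.random.default_rng(0)
n, k = 6, 3
# Toy one-hot row: feature 1 stands for a user, feature 4 for a movie
x = np.array([0, 1, 0, 0, 1, 0], dtype=np.float64)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
```

For a one-hot user/movie row like `x`, the score reduces to the bias, the two linear weights, and a single dot product between the user's and the movie's latent vectors. That is why the model can generalize to user/movie pairs it has never seen together.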

What is RecordIO-Protobuf?

RecordIO is a file format used for efficient data storage and retrieval, particularly in the context of deep learning and data processing.
It stores large amounts of data in binary format, making files more compact and faster to read.
Protobuf (Protocol Buffers) is a language-neutral, platform-neutral, and extensible data interchange format.
Together, RecordIO and Protobuf provide a way to store structured data efficiently in binary format.

OneHotEncoding

One-hot encoding gives a categorical feature a numeric representation that does not imply any ordering or magnitude between categories.
Each category gets its own binary column. A 1 marks the presence of that category in a row and a 0 marks its absence.
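As a small illustration of the idea, the sketch below one-hot encodes a handful of made-up user IDs with plain NumPy. The helper name `one_hot` and the sample IDs are hypothetical. The walkthrough later in this article uses scikit-learn's `OneHotEncoder`, which does the same job and returns a sparse matrix.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) where matrix[r, c] is 1.0 iff values[r] == categories[c]."""
    categories, codes = np.unique(values, return_inverse=True)
    matrix = np.zeros((len(values), len(categories)), dtype=np.float32)
    # Set exactly one column per row: the column of that row's category
    matrix[np.arange(len(values)), codes] = 1.0
    return categories, matrix

user_ids = np.array([7, 3, 7, 9])  # made-up IDs for illustration
categories, encoded = one_hot(user_ids)
```

Each row of `encoded` contains a single 1, so an ID such as 9 is not treated as "larger" than 3. It simply occupies a different column.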

Building the FM Model: High-Level Steps

The steps below cover dataset preparation, preprocessing, training, and deployment of a Factorization Machine model.

Training & Test Dataset

In this article, we will use the MovieLens 100K dataset, which consists of 100,000 ratings (1–5) from 943 users on 1682 movies.
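A quick back-of-the-envelope calculation shows why a sparse-data algorithm fits this dataset. The numbers below come straight from the dataset description. The feature count is an upper bound, since an encoder fitted only on a training split may see slightly fewer distinct movies.

```python
n_users, n_movies, n_ratings = 943, 1682, 100_000

# Fraction of the user x movie grid that actually has a rating
density = n_ratings / (n_users * n_movies)

# One-hot encoding userId and movieId gives one column per user plus one per movie,
# and each rating row has exactly two non-zero entries (one user, one movie)
n_features = n_users + n_movies
nonzeros_per_row = 2

print(f"density of the rating matrix: {density:.2%}")
print(f"one-hot feature columns (upper bound): {n_features}")
print(f"non-zero share of each encoded row: {nonzeros_per_row / n_features:.4%}")
```

Only a small share of all possible user/movie pairs is rated, and each encoded row is almost entirely zeros.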

Dataset link: https://grouplens.org/datasets/movielens/100k/

The dataset consists of the following attributes.

userId = unique identifier of the user. [Input Feature]
movieId = unique identifier of the movie. [Input Feature]
rating = User rating of the movie (1–5). [Target]
timestamp = time the rating was given (Unix seconds); not used as a feature in this example.

1. Download the dataset

# !wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
# !unzip -o ml-100k.zip

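2. Create the training and test DataFrames

import pandas as pd

# The archive extracts into the ml-100k/ directory; ua.base and ua.test are tab-separated
ratings_train_df = pd.read_csv('ml-100k/ua.base', sep='\t', names=['userId','movieId','rating','timestamp'] )
ratings_test_df = pd.read_csv('ml-100k/ua.test', sep='\t', names=['userId','movieId','rating','timestamp'] )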

3. Convert the target variable (rating) into a binary label: ratings of 4 or higher become 1, all others 0

ratings_train_df['rating_bin'] = (ratings_train_df.rating>=4).astype('float32')
ratings_test_df['rating_bin'] = (ratings_test_df.rating>=4).astype('float32')

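4. One-hot encode the userId and movieId features

from sklearn.preprocessing import OneHotEncoder

# Fit on the training set only; unseen users or movies in the test set encode as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(ratings_train_df[['userId','movieId']])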

x_train = encoder.transform(ratings_train_df[['userId','movieId']]).astype('float32')
y_train = ratings_train_df['rating_bin']

x_test = encoder.transform(ratings_test_df[['userId','movieId']]).astype('float32')
y_test = ratings_test_df['rating_bin']
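The training step expects the encoded data as RecordIO-protobuf files in S3. One way to produce them is sketched below, assuming the SageMaker Python SDK's `sagemaker.amazon.common.write_spmatrix_to_sparse_tensor` helper and boto3 are available. The function names `prepare_for_recordio`, `to_recordio_protobuf`, and `upload_buffer` are made up for this example, and the bucket and key are placeholders you would supply.

```python
import io
import numpy as np
import scipy.sparse as sp

def prepare_for_recordio(features, labels):
    """Coerce inputs to a float32 CSR matrix and a 1-D float32 label vector of matching length."""
    features = sp.csr_matrix(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32).ravel()
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels must have the same number of rows")
    return features, labels

def to_recordio_protobuf(features, labels):
    """Serialize features and labels into an in-memory RecordIO-protobuf buffer."""
    # Imported here so the helper above works without the SageMaker SDK installed
    import sagemaker.amazon.common as smac
    features, labels = prepare_for_recordio(features, labels)
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, features, labels)
    buf.seek(0)
    return buf

def upload_buffer(buf, bucket, key):
    """Upload the buffer to S3 and return its s3:// URI (bucket and key are placeholders)."""
    import boto3
    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)
    return f"s3://{bucket}/{key}"
```

Under these assumptions, something like `train_data = upload_buffer(to_recordio_protobuf(x_train, y_train), bucket, "fm/train/train.protobuf")`, and the same for the test set, would yield the S3 URIs used as training channels.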

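5. Configure the training job: instance count, instance type, and maximum run time

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve

role = get_execution_role()
session = sagemaker.Session()

# Container image for the built-in Factorization Machines algorithm in the current region
training_image = retrieve(region=boto3.Session().region_name, framework="factorization-machines", version='latest')

# output_path_prefix is the S3 URI for model artifacts; it must be defined before this cell
fm = sagemaker.estimator.Estimator(
    training_image, role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    volume_size=30,  # training volume size in GB
    max_run=86400,  # maximum training time in seconds (24 hours)
    output_path=output_path_prefix,
    sagemaker_session=session,
)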

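6. Set the hyperparameters and data channels, then start the training job

# One feature column per one-hot encoded userId and movieId
columns = x_train.shape[1]

fm.set_hyperparameters(
    feature_dim=columns,
    num_factors=64,  # required by the algorithm; 64 is a common starting point
    predictor_type='binary_classifier',
    mini_batch_size=200
)

# train_data and test_data are the S3 URIs of the RecordIO-protobuf training and test files
data_channels = {
    "train": train_data,
    "test": test_data
}

fm.fit(inputs=data_channels, logs=True)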

7. Deploy the model for prediction (this creates a real-time endpoint)

%%time
from sagemaker.deserializers import JSONDeserializer

fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    deserializer=JSONDeserializer()
)
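Once the endpoint is up, it can be queried with JSON requests. The sketch below builds a request body in the dense `instances` / `features` layout and reads labels out of a response. The helper names are hypothetical, and both the response field names (`predictions`, `predicted_label`) and the sample response values are assumptions to check against the SageMaker Factorization Machines inference documentation.

```python
import json
import numpy as np

def fm_request_body(rows):
    """Build a JSON-serializable request dict from dense feature rows."""
    dense = np.asarray(rows, dtype=np.float32)
    return {"instances": [{"features": row.tolist()} for row in dense]}

def predicted_labels(response):
    """Extract the predicted label of each instance from a parsed response dict."""
    return [p["predicted_label"] for p in response["predictions"]]

# Two toy rows with the same layout as the one-hot encoded features
body = fm_request_body([[0, 1, 0, 1], [1, 0, 0, 1]])
payload = json.dumps(body)

# Hypothetical response, shown only to illustrate the assumed shape
sample_response = {"predictions": [{"score": 0.81, "predicted_label": 1},
                                   {"score": 0.23, "predicted_label": 0}]}
labels = predicted_labels(sample_response)
```

With a JSON serializer set on the predictor (for example `fm_predictor.serializer = JSONSerializer()` from `sagemaker.serializers`), a call such as `fm_predictor.predict(fm_request_body(x_test[:5].toarray()))` would send the first five test rows. Since `x_test` is sparse, it has to be converted with `.toarray()` first.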

8. Clean up (delete the endpoint)

fm_predictor.delete_endpoint()

Conclusion

To summarize:

Amazon Personalize is a fully-managed service focused on real-time recommendation systems, providing ease of use and scalability.
Amazon SageMaker is a more versatile and customizable service that enables developers to build, train, and deploy custom machine learning models for various tasks.

We've seen a quick example of item recommendation using Factorization Machines; we'll cover Amazon Personalize in the next article.

Written by Bharathvajan G