Building a Movie Recommendation System with AWS ML Services

Bharathvajan G
3 min read · Jun 5, 2024

Introduction

Movie or product recommendation is one of the most popular applications of machine learning. Recommendation systems model users' preferences for items based on their past interactions (e.g., click events or user ratings).

There are two ways to build a recommendation system on the AWS Cloud platform.

  1. Amazon SageMaker — Factorization Machine Algorithm
  2. Amazon Personalize

Amazon Personalize vs SageMaker: Factorization Machine

The following table compares Amazon Personalize with the SageMaker Factorization Machine (FM) algorithm.

[Table: Amazon Personalize vs SageMaker Factorization Machine]

In this article, let's walk through a quick example of movie recommendation using the Factorization Machine algorithm.

Factorization Machine — Intro

  • Designed to deal with sparse data
  • Extends a linear model so that it also captures pairwise interactions between features, which suits sparse data
  • It is a supervised algorithm
  • Can be used for classification or regression problems
  • Expects all categorical values to be one-hot encoded
  • For model training, it expects the RecordIO-Protobuf data format
  • Expects data in the float32 data type
  • CSV input is not supported for training
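
To make the "extension of a linear model" point concrete, here is a minimal sketch of the standard second-order Factorization Machine scoring function as it is usually described in the literature, written in plain Python. It is not SageMaker's internal implementation; the parameter names (`w0`, `w`, `v`) and the toy values are assumptions chosen for illustration. The second function uses the well-known algebraic identity that avoids looping over every feature pair, which is what makes the model practical on wide, sparse inputs.

```python
from itertools import combinations


def fm_score_naive(x, w0, w, v):
    """Score one sparse example with an explicit loop over feature pairs.

    x  : dict mapping feature index -> value (only the non-zero features)
    w0 : global bias
    w  : list of per-feature linear weights
    v  : list of per-feature latent vectors (each a list of k floats)
    """
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for i, j in combinations(sorted(x), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_score_fast(x, w0, w, v):
    """Same score, computed per latent dimension instead of per feature pair."""
    k = len(v[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Toy one-hot example (illustrative values): 2 users + 2 movies -> 4 features.
# Feature 0 stands for "user A", feature 3 for "movie D".
x = {0: 1.0, 3: 1.0}
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [0.5, -1.0]]

naive = fm_score_naive(x, w0, w, v)
fast = fm_score_fast(x, w0, w, v)
print(naive, fast)  # both functions give the same score
```

With one-hot encoded user and movie IDs, the only interaction term that survives is the dot product between that user's latent vector and that movie's latent vector, which is why the model works as a recommender.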

What is RecordIO-Protobuf?

  • RecordIO is a file format for efficient data storage and retrieval, used particularly in deep learning and data processing.
  • It stores large amounts of data in binary form, which makes the data more compact and faster to read.
  • Protobuf (Protocol Buffers) is a language-neutral, platform-neutral, and extensible data interchange format.
  • Together, RecordIO and Protobuf provide a way to store training data efficiently in binary form.

OneHotEncoding

  • One-hot encoding gives a categorical feature a numeric representation that does not imply any order or magnitude between its categories.
  • Each category gets its own binary variable: 1 marks the presence of the category and 0 its absence.

Example: one-hot encoding
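
As a small illustration, the following dependency-free sketch encodes a few (userId, movieId) pairs by hand. The helper names `fit_categories` and `one_hot` and the sample IDs are hypothetical and exist only for this example; the article itself relies on scikit-learn's `OneHotEncoder` for the real dataset.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of each column."""
    columns = list(zip(*rows))
    return [sorted(set(col)) for col in columns]


def one_hot(row, categories):
    """Encode one row as a flat list of 0.0/1.0 values, one slot per category."""
    vec = []
    for value, cats in zip(row, categories):
        vec.extend(1.0 if value == c else 0.0 for c in cats)
    return vec


# Hypothetical (userId, movieId) pairs.
ratings = [(1, 10), (1, 20), (2, 10), (3, 30)]
categories = fit_categories(ratings)
encoded = [one_hot(r, categories) for r in ratings]

for row, vec in zip(ratings, encoded):
    print(row, vec)

# A user that was not seen while fitting gets all zeros in the user slots.
print(one_hot((4, 10), categories))
```

Each encoded row contains exactly two ones: one in the user block and one in the movie block. Everything else is zero, which is why the resulting matrix is sparse.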

Building FM Model : High level steps

The steps cover dataset preparation, preprocessing, training, and deployment of the Factorization Machine model.

Training & Test DataSet

In this article, we will use the MovieLens 100K dataset, which consists of 100,000 ratings (1–5) from 943 users on 1,682 movies.

The dataset is available at https://grouplens.org/datasets/movielens/100k/

The dataset consists of the following attributes.

  • userId = unique identifier of the user. [Input Feature]
  • movieId = unique identifier of the movie. [Input Feature]
  • rating = User rating of the movie (1–5). [Target]
  • timestamp = time the rating was recorded, in Unix seconds.

1. Download the dataset.

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

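2. Create the training and test dataframes

import pandas as pd

# The archive extracts into ml-100k/; ua.base and ua.test are its predefined train/test split.
ratings_train_df = pd.read_csv('ml-100k/ua.base', sep='\t', names=['userId','movieId','rating','timestamp'])
ratings_test_df = pd.read_csv('ml-100k/ua.test', sep='\t', names=['userId','movieId','rating','timestamp'])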

3. Convert the target variable (rating) into a binary label: 1 for ratings of 4 or higher, 0 otherwise

ratings_train_df['rating_bin'] = (ratings_train_df.rating>=4).astype('float32')
ratings_test_df['rating_bin'] = (ratings_test_df.rating>=4).astype('float32')

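4. One-hot encode the input features

from sklearn.preprocessing import OneHotEncoder

# Users or movies unseen during fit are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')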
encoder.fit(ratings_train_df[['userId','movieId']])

x_train = encoder.transform(ratings_train_df[['userId','movieId']]).astype('float32')
y_train = ratings_train_df['rating_bin']

x_test = encoder.transform(ratings_test_df[['userId','movieId']]).astype('float32')
y_test = ratings_test_df['rating_bin']
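
The training steps that follow refer to `train_data`, `test_data`, and `output_path_prefix`, which the listings in this article do not define. One plausible way to produce them is sketched below: serialize the sparse matrices to RecordIO-Protobuf with the SageMaker Python SDK's `write_spmatrix_to_sparse_tensor` helper and upload the result to S3 with boto3. The bucket name, prefix, and the helper functions `s3_uri` and `upload_recordio` are hypothetical; treat this as a sketch under those assumptions rather than the exact code behind the article.

```python
import io


def s3_uri(bucket, key):
    """Build the s3:// URI that a SageMaker training channel expects."""
    return f"s3://{bucket}/{key}"


def upload_recordio(x, y, bucket, key):
    """Serialize a scipy sparse matrix plus labels to RecordIO-Protobuf and upload it to S3."""
    # Imported lazily so the helpers can be defined without the AWS SDKs installed.
    import boto3
    import numpy as np
    from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

    buf = io.BytesIO()
    # Labels must be a 1-D vector with one entry per row of x.
    write_spmatrix_to_sparse_tensor(buf, x, np.asarray(y, dtype="float32"))
    buf.seek(0)
    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)
    return s3_uri(bucket, key)


# Hypothetical names: replace with your own bucket and prefix.
BUCKET = "my-example-bucket"
PREFIX = "fm-movielens"
output_path_prefix = s3_uri(BUCKET, f"{PREFIX}/output")

# In the notebook, after the encoding step (requires AWS credentials):
# train_data = upload_recordio(x_train, y_train, BUCKET, f"{PREFIX}/train/train.protobuf")
# test_data = upload_recordio(x_test, y_test, BUCKET, f"{PREFIX}/test/test.protobuf")
```

Keeping the matrices in float32 before serialization matters here, because the algorithm expects float32 tensors in the training files.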

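5. Configure the training job: instance count, instance type, and maximum run time

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve

role = get_execution_role()
session = sagemaker.Session()

# Look up the built-in Factorization Machines container image for the current region.
training_image = retrieve(region=boto3.Session().region_name, framework="factorization-machines", version='latest')

# output_path_prefix must hold the S3 URI where the model artifacts should be written.
fm = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    volume_size=30,   # EBS volume size in GB
    max_run=86400,    # maximum training time in seconds (24 hours)
    output_path=output_path_prefix,
    sagemaker_session=session,
)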

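6. Set the hyperparameters and data channels, then start the training job

fm.set_hyperparameters(
    feature_dim=x_train.shape[1],   # number of one-hot encoded columns
    num_factors=64,                 # required hyperparameter; 64 is a common starting value
    predictor_type='binary_classifier',
    mini_batch_size=200
)

# train_data and test_data are the S3 URIs of the RecordIO-Protobuf files.
data_channels = {
    "train": train_data,
    "test": test_data
}

fm.fit(inputs=data_channels, logs=True)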

7. Deploy the model for prediction (this creates a real-time endpoint)

%%time
from sagemaker.deserializers import JSONDeserializer

fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    deserializer=JSONDeserializer()
)
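
Once the endpoint is up, it can be queried with a JSON request. The sketch below builds a payload in the sparse JSON layout (`keys`, `shape`, `values`) that the SageMaker built-in algorithms document for inference. The helper name `fm_json_payload`, the example column indices, and the feature count 2625 (943 users + 1,682 movies) are illustrative assumptions; in practice the feature count is whatever the fitted encoder produced.

```python
import json


def fm_json_payload(active_indices_per_row, feature_dim):
    """Build a JSON request body for one-hot encoded rows.

    active_indices_per_row : list of lists with the column indices that are 1
    feature_dim            : total number of one-hot encoded columns
    """
    instances = []
    for indices in active_indices_per_row:
        instances.append({
            "data": {
                "features": {
                    "keys": list(indices),
                    "shape": [feature_dim],
                    "values": [1.0] * len(indices),
                }
            }
        })
    return json.dumps({"instances": instances})


# Hypothetical request: the user in column 0 paired with the movie in column 945.
payload = fm_json_payload([[0, 945]], feature_dim=2625)
print(payload)

# Sending it to the endpoint (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName=fm_predictor.endpoint_name,
#     ContentType="application/json",
#     Body=payload,
# )
# print(response["Body"].read())
```

Sending only the active indices keeps the request small, since each row has just two non-zero entries out of a few thousand columns.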

8. Clean up (delete the endpoint)

fm_predictor.delete_endpoint()

Conclusion

To summarize:

  • Amazon Personalize is a fully managed service focused on real-time recommendation systems, providing ease of use and scalability.
  • Amazon SageMaker is a more versatile and customizable service that enables developers to build, train, and deploy custom machine learning models for various tasks.

We've seen a quick example of item recommendation using the Factorization Machine algorithm. We'll cover Amazon Personalize in the next article.
