Building a Movie Recommendation System with AWS ML Services
Introduction
Movie or product recommendation is one of the most popular applications of machine learning. Recommendation systems model users' preferences for items based on their past interactions (e.g., click events, user ratings).
There are two ways to build a recommendation system on the AWS Cloud platform:
- Amazon SageMaker — Factorization Machine Algorithm
- Amazon Personalize
Amazon Personalize vs SageMaker: Factorization Machine
The following table compares Amazon Personalize with the SageMaker Factorization Machine (FM) algorithm.
In this article, let's walk through a quick example of movie recommendation using the Factorization Machine algorithm.
Factorization Machine — Intro
- Designed to work with sparse data
- An extension of the linear learning model that also captures pairwise interactions between features in sparse data
- A supervised learning algorithm
- Can be used for classification or regression problems
- Expects all categorical values to be one-hot encoded
- For model training, expects data in the RecordIO protobuf format
- Expects feature values in the float32 data type
- CSV input is not supported for training
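To make the "extension of a linear model" idea concrete, here is a minimal pure-Python sketch of how a second-order factorization machine scores one input: a bias, a linear term, and a pairwise interaction term built from per-feature factor vectors. The weights and the toy input are made up for illustration and are not taken from a trained SageMaker model; the naive version is included only to check the faster formulation.

```python
from itertools import combinations


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def fm_predict(x, w0, w, V):
    """Second-order FM score: bias + linear terms + pairwise interactions.

    x: feature values, w: linear weights,
    V: one k-dimensional factor vector per feature.
    Uses the identity sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    n, k = len(x), len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + dot(w, x) + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score computed by looping over every feature pair."""
    pairs = sum(dot(V[i], V[j]) * x[i] * x[j]
                for i, j in combinations(range(len(x)), 2))
    return w0 + dot(w, x) + pairs


# Toy one-hot input: 2 users + 2 movies, with user 0 and movie 0 active.
x = [1.0, 0.0, 1.0, 0.0]
w0 = 0.1                      # illustrative bias
w = [0.2, -0.1, 0.3, 0.0]     # illustrative linear weights
V = [[0.5, 0.1], [0.0, 0.2], [0.4, -0.3], [0.1, 0.1]]  # illustrative factors
score = fm_predict(x, w0, w, V)
```

With one-hot inputs only the active user and the active movie contribute, so the interaction term reduces to the dot product of that user's and that movie's factor vectors.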
What is RecordIO Protobuf?
- RecordIO is a file format used for efficient data storage and retrieval, particularly in deep learning and data processing.
- It stores large amounts of data as a sequence of binary records, which makes files more compact and faster to read.
- Protobuf (Protocol Buffers) is a language-neutral, platform-neutral, and extensible data interchange format.
- Together, RecordIO and Protobuf provide a way to store data efficiently in binary form.
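The general idea of storing data as binary records can be sketched in a few lines of standard-library Python. The example below is only an illustration of length-prefixed float32 records; it is not the actual RecordIO protobuf layout, which differs in its details, and the function names are hypothetical.

```python
import struct


def write_records(records):
    """Frame each list of floats as a uint32 byte length followed by float32 values."""
    out = bytearray()
    for values in records:
        payload = struct.pack(f"<{len(values)}f", *values)
        out += struct.pack("<I", len(payload)) + payload
    return bytes(out)


def read_records(blob):
    """Walk the buffer record by record using the length prefixes."""
    records, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        count = length // 4  # each float32 value occupies 4 bytes
        records.append(list(struct.unpack_from(f"<{count}f", blob, offset)))
        offset += length
    return records


rows = [[1.0, 0.0, 1.0], [0.0, 1.0]]
blob = write_records(rows)
restored = read_records(blob)
```

Because every record announces its own length, a reader can stream through the file without parsing text, which is the main reason such formats are faster to read than CSV.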
One-Hot Encoding
- One-hot encoding gives a categorical feature a numeric representation that does not imply any ordering or magnitude between categories.
- Each category gets its own binary variable. A one indicates the presence of that category and a zero indicates its absence.
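As a small illustration, the following pure-Python sketch one-hot encodes a handful of movie IDs. The helper names and the sample IDs are made up for this example; unseen categories are encoded as all zeros, which is also how scikit-learn's `OneHotEncoder(handle_unknown='ignore')` treats them.

```python
def fit_categories(values):
    """Map each distinct category to a column index (sorted for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, categories):
    """Return a 0/1 vector; a category not seen during fitting yields all zeros."""
    vec = [0.0] * len(categories)
    if value in categories:
        vec[categories[value]] = 1.0
    return vec


movie_ids = [50, 172, 50, 133]          # illustrative movie IDs
cats = fit_categories(movie_ids)        # one column per distinct ID
encoded = [one_hot(m, cats) for m in movie_ids]
unseen = one_hot(999, cats)             # ID absent from the fitted categories
```

Each encoded row has exactly one non-zero entry, which is why the resulting feature matrix is sparse and a good fit for Factorization Machines.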
Building the FM Model: High-Level Steps
The steps involved are dataset preparation, preprocessing, training, and deployment of the Factorization Machine model.
Training & Test Dataset
In this article, we will use the MovieLens 100K dataset, which consists of 100,000 ratings (1–5) from 943 users on 1,682 movies.
Here is the dataset link: https://grouplens.org/datasets/movielens/100k/
The dataset consists of the following attributes:
- userId = unique identifier of the user. [Input Feature]
- movieId = unique identifier of the movie. [Input Feature]
- rating = User rating of the movie (1–5). [Target]
- timestamp = time at which the rating was recorded (not used as a feature in this example).
1. Download the dataset
# !wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
# !unzip -o ml-100k.zip
2. Create the training and test dataframes
import pandas as pd
# Paths assume ml-100k.zip was unzipped into the current working directory.
ratings_train_df = pd.read_csv('ml-100k/ua.base', sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
ratings_test_df = pd.read_csv('ml-100k/ua.test', sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
3. Convert the target variable (rating) into a binary label
# Ratings of 4 or 5 become 1.0 (liked); ratings of 1 to 3 become 0.0.
ratings_train_df['rating_bin'] = (ratings_train_df.rating >= 4).astype('float32')
ratings_test_df['rating_bin'] = (ratings_test_df.rating >= 4).astype('float32')
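4. One-hot encode the userId and movieId features
from sklearn.preprocessing import OneHotEncoder

# Fit on the training set only; users or movies unseen at fit time encode to all zeros.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(ratings_train_df[['userId', 'movieId']])
# transform() returns a SciPy sparse matrix; the algorithm expects float32 values.
x_train = encoder.transform(ratings_train_df[['userId', 'movieId']]).astype('float32')
y_train = ratings_train_df['rating_bin']
x_test = encoder.transform(ratings_test_df[['userId', 'movieId']]).astype('float32')
y_test = ratings_test_df['rating_bin']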
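The training job reads RecordIO protobuf files from S3 rather than in-memory matrices, so the encoded data has to be serialized and uploaded before training. The sketch below shows one way this could be done, assuming the SageMaker Python SDK (v2) helper `write_spmatrix_to_sparse_tensor` and a boto3 S3 upload. The bucket name, object keys, and helper function names are hypothetical.

```python
import io


def to_recordio_protobuf(x_sparse, labels):
    """Serialize a SciPy sparse float32 matrix and a 1-D NumPy label array
    into an in-memory RecordIO protobuf buffer."""
    # Imported lazily so this sketch can be loaded without the SageMaker SDK installed.
    from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(buf, x_sparse, labels)
    buf.seek(0)  # rewind so the upload starts at the first byte
    return buf


def s3_uri(bucket, key):
    """Build the S3 URI that a training channel can point to."""
    return f"s3://{bucket}/{key}"


def upload(buf, bucket, key):
    """Upload the buffer to S3 and return its URI."""
    import boto3  # lazy import, same reason as above

    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)
    return s3_uri(bucket, key)


# Usage with a hypothetical bucket and keys (requires AWS credentials):
# train_data = upload(to_recordio_protobuf(x_train, y_train.to_numpy()),
#                     "my-bucket", "fm/train/train.protobuf")
# test_data = upload(to_recordio_protobuf(x_test, y_test.to_numpy()),
#                    "my-bucket", "fm/test/test.protobuf")
```

The returned URIs are the kind of values that `train_data` and `test_data` are expected to hold when the training channels are defined.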
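5. Configure the estimator with the instance count, instance type, and maximum run time
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve

role = get_execution_role()
session = sagemaker.Session()
# Look up the container image of the built-in Factorization Machines algorithm for this region.
training_image = retrieve(region=boto3.Session().region_name, framework="factorization-machines", version='latest')
# output_path_prefix is the S3 URI for model artifacts and must be defined beforehand.
fm = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    volume_size=30,    # EBS volume size in GB
    max_run=86400,     # maximum training time in seconds (24 hours)
    output_path=output_path_prefix,
    sagemaker_session=session,
)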
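Next, set the hyperparameters and data channels, then start the training job.
columns = x_train.shape[1]  # feature_dim must equal the number of one-hot encoded columns
fm.set_hyperparameters(
    feature_dim=columns,
    num_factors=64,  # required by the algorithm; 64 is an assumed starting value, not from the original
    predictor_type='binary_classifier',
    mini_batch_size=200,
)
# train_data and test_data are the S3 URIs of the RecordIO protobuf training and test files.
data_channels = {
    "train": train_data,
    "test": test_data,
}
fm.fit(inputs=data_channels, logs=True)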
6. Deploy the model for prediction (this creates a real-time endpoint)
%%time
from sagemaker.deserializers import JSONDeserializer

fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    deserializer=JSONDeserializer(),  # parse JSON responses into Python objects
)
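Once the endpoint is up, it needs a request body to score. The sketch below builds a JSON payload in the sparse layout that SageMaker documents for its built-in algorithms (`keys`, `shape`, `values`). The helper name and the tiny feature dimension are assumptions for illustration; in practice the active column indices come from the fitted one-hot encoder.

```python
import json


def fm_sparse_payload(rows, feature_dim):
    """Build a JSON request body from rows of active one-hot column indices.

    rows: list of lists, each holding the indices whose value is 1.0.
    feature_dim: total number of one-hot encoded columns.
    """
    instances = []
    for active in rows:
        keys = sorted(active)
        instances.append({
            "data": {
                "features": {
                    "keys": keys,
                    "shape": [feature_dim],
                    "values": [1.0] * len(keys),
                }
            }
        })
    return json.dumps({"instances": instances})


# One request row: columns 0 (a user) and 3 (a movie) are active out of 6 columns.
payload = fm_sparse_payload([[3, 0]], feature_dim=6)

# Possible way to send it (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName=fm_predictor.endpoint_name,
#     ContentType="application/json",
#     Body=payload,
# )
```

Sending only the non-zero indices keeps requests small, since each row has just two active columns regardless of how many users and movies exist.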
7. Clean up (delete the endpoint)
# Delete the endpoint when finished so it stops incurring charges.
fm_predictor.delete_endpoint()
Conclusion
To summarize:
- Amazon Personalize is a fully managed service focused on real-time recommendation systems, providing ease of use and scalability.
- Amazon SageMaker is a more versatile and customizable service that enables developers to build, train, and deploy custom machine learning models for various tasks.
We've seen a quick example of item recommendation using the Factorization Machine algorithm. We'll cover Amazon Personalize in the next article.