Factorization Machines: Pictures + Code (PyTorch)

Daniel Lam
Jul 26, 2023

TLDR:
Problem:
Given a dataset of users, movies, and ratings, use factorization machines to give movie recommendations.
Dataset: ml-1m.zip from https://grouplens.org/datasets/movielens/
Code: “03_factorization_machines.ipynb” https://github.com/Datadote/matrix-factorization-pytorch
Refs: 1) Factorization machine post
2) Original factorization machines paper by Rendle
3) Medium post on matrix factorization

Steps:
1) Describe the problem and explore dataset
2) Preprocess dataset for training and validation
3) Create factorization machine
4) Train model
5) Check results
6) Save files for Streamlit deployment

1) Describe the problem and explore dataset

Problem: Given a dataset of users, movies, and ratings, use factorization machines to give movie recommendations.
Dataset: ml-1m.zip from https://grouplens.org/datasets/movielens/
Data features: movies, ratings, timestamps, titles, and genres. User metadata features: age, gender, occupation, zip code. Movies without ratings are removed.

# Data:     https://files.grouplens.org/datasets/movielens/ml-1m.zip
# Metadata: https://files.grouplens.org/datasets/movielens/ml-1m-README.txt
DATA_DIR = './data/ml-1m/'
df_movies = pd.read_csv(DATA_DIR+'movies.dat', sep='::',
                        names=['movieId', 'title', 'genres'],
                        encoding='latin-1',
                        engine='python')
user_cols = ['userId', 'gender', 'age', 'occupation', 'zipcode']
df_users = pd.read_csv(DATA_DIR+'users.dat', sep='::',
                       header=None,
                       names=user_cols,
                       engine='python')
df = pd.read_csv(DATA_DIR+'ratings.dat', sep='::',
                 names=['userId', 'movieId', 'rating', 'time'],
                 engine='python')
# Left merge removes movies with no rating. # of unique movies: 3883 -> 3706
df = df.merge(df_movies, on='movieId', how='left')
df = df.merge(df_users, on='userId', how='left')
df = df.sort_values(['userId', 'time'], ascending=[True, True]).reset_index(drop=True)

2) Preprocess dataset

i) Convert columns into categorical with defaultdict(LabelEncoder)
This remaps each column's values into the range [0, len(unique(column)) - 1]. For example, the raw max movieId is 3952, but there are only 3706 unique movieIds. This remapping (3952 -> 3706) is important for reducing the memory used by the model's embeddings. Note the "_index" columns added below; these hold the labels after label encoding.

# Convert users and movies into categorical - use LabelEncoder
# Column categorical range remapped between [0, len(df.column.unique())-1]
# Remapping is important to reduce memory size of nn.Embeddings
d = defaultdict(LabelEncoder)
cols_cat = ['userId', 'movieId', 'gender', 'age', 'occupation']
for c in cols_cat:
    d[c].fit(df[c].unique())
    df[c+'_index'] = d[c].transform(df[c])
    print(f'# unique {c}: {len(d[c].classes_)}')

min_num_ratings = df.groupby(['userId'])['userId'].transform(len).min()
print(f'Min # of ratings per user: {min_num_ratings}')
print(f'Min/Max rating: {df.rating.min()} / {df.rating.max()}')
print(f'df.shape: {df.shape}')

ii) Add feature offsets so the factorization machine can use a single embedding matrix
The original paper uses one-hot encoding; here we use ordinal encoding to reduce memory usage.

For factorization machines, one additional step is needed after label encoding: adding feature offsets. With offsets, we can use a single embedding matrix instead of multiple embedding matrices plus a for loop, which makes training more efficient.

# To use 1 embedding matrix, need to calculate & add offsets to each feature column
# Orig. paper uses 1-hot encoding, here we use ordinal encoding
# Ordinal encoding reduces memory size. Important for train speed
feature_cols = ['userId_index', 'movieId_index', 'gender_index', 'age_index',
                'occupation_index']
# Get offsets
feature_sizes = {}
for feat in feature_cols:
    feature_sizes[feat] = len(df[feat].unique())
feature_offsets = {}
NEXT_OFFSET = 0
for k, v in feature_sizes.items():
    feature_offsets[k] = NEXT_OFFSET
    NEXT_OFFSET += v

# Add offsets to each feature column
for col in feature_cols:
    df[col] = df[col].apply(lambda x: x + feature_offsets[col])
print('Offset - feature')
for k, os in feature_offsets.items():
    print(f'{os:<6} - {k}')
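
For this dataset, assuming the unique counts printed earlier (6040 users, 3706 movies, 2 genders, 7 age groups, 21 occupations), the print loop above should produce offsets along these lines:

# Expected output (assuming the unique counts above)
# Offset - feature
# 0      - userId_index
# 6040   - movieId_index
# 9746   - gender_index
# 9748   - age_index
# 9755   - occupation_index

Each feature now maps to its own non-overlapping slice of one embedding table with 9776 rows in total.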

iii) Split data into train/validation. Create MovieDatasets + dataloaders
Each user has a minimum of 20 ratings. The 5 most recent ratings per user are used for validation; the remaining data is used for training. The features used are 1) userId_index 2) movieId_index 3) gender_index 4) age_index 5) occupation_index.

# Make train and val dataset. Use last 5 rated movies per user
# 6040 unique users. Each user has minimum 20 rated movies
THRES = 5
cols = ['rating', *feature_cols]
df_train = df[cols].groupby('userId_index').head(-THRES).reset_index(drop=True)
df_val = df[cols].groupby('userId_index').tail(THRES).reset_index(drop=True)
print(f'df_train shape: {df_train.shape}')
print(f'df_val shape: {df_val.shape}')
df_train.head(3)

Check dataloader shapes and output.

class MovieDataset(Dataset):
    """ Movie DS uses x_feats and y_feat """
    def __init__(self, df, x_feats, y_feat):
        super().__init__()
        self.df = df
        self.x_feats = df[x_feats].values
        self.y_rating = df[y_feat].values

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return self.x_feats[idx], self.y_rating[idx]

BS = 1024
ds_train = MovieDataset(df_train, feature_cols, 'rating')
ds_val = MovieDataset(df_val, feature_cols, 'rating')
dl_train = DataLoader(ds_train, BS, shuffle=True, num_workers=2)
dl_val = DataLoader(ds_val, BS, shuffle=True, num_workers=2)

xb, yb = next(iter(dl_train))
print(xb.shape, yb.shape)
print(xb)
print(yb)

3) Create Factorization Machine model

i) Why factorization machines?
Factorization machines are a general form of matrix factorization, and they also handle cold-start problems by using metadata features. Matrix factorization uses only 2 features (user, movie) and takes their dot product to train a model. What if we had other features, such as age or gender? Factorization machines let you train a model with any number of features. To model pairwise (feature-to-feature) interactions, we take a dot product of every feature with every other feature, then sum them all up.

In addition to the feature-to-feature dot products, the paper adds a global offset and per-feature biases. We include both in our implementation. Below is the equation from the paper, where "n" is the number of features and "k" is the embedding dimension.

Unoptimized equation (dot product of every pair of feature vectors). Time complexity: O(k·n²)
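
Since the equation image is not reproduced here, the model equation from the Rendle paper, restated in LaTeX, is:

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}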

In the equation above, every feature vector is dotted with every other feature vector. With "n" features of dimension "k", this leads to O(k·n²) time. Shown below, the paper derives a faster implementation with time complexity O(k·n).

Optimized equation. Time complexity: O(k·n)
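
Restated in LaTeX, the pairwise-interaction term from the paper rewrites as:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} \, x_i \right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]

This is exactly the "pow_of_sum - sum_of_pow" computation in the forward pass below.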

Below is a PyTorch factorization machine using the optimized equation plus the offset and biases. Note that we use an nn.Embedding layer to handle the inputs (which are ordinally encoded). You can think of an nn.Embedding layer as a one-hot vector followed by a Linear/Dense layer. A sigmoid range function (an idea taken from fastai) clamps the outputs for faster training.
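
To make the one-hot analogy concrete, here is a tiny standalone check (my illustration, not from the original notebook) showing that an embedding lookup equals a one-hot vector multiplied by the embedding weight matrix:

# Illustration: embedding lookup == one-hot vector @ embedding weights
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
idx = torch.tensor([3])
one_hot = F.one_hot(idx, num_classes=10).float()       # [1, 10]
print(torch.allclose(emb(idx), one_hot @ emb.weight))  # True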

class FM(nn.Module):
    """ Factorization Machine + feature biases, weight init., sigmoid_range
    Paper - https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
    """
    def __init__(self, num_feats, emb_dim, init, bias, sigmoid):
        super().__init__()
        self.x_emb = nn.Embedding(num_feats, emb_dim)
        self.bias = bias
        self.sigmoid = sigmoid
        if bias:
            self.x_bias = nn.Parameter(torch.zeros(num_feats))
            self.offset = nn.Parameter(torch.zeros(1))
        if init:
            self.x_emb.weight.data.uniform_(0., 0.05)

    def forward(self, X):
        # Derived time complexity - O(nk)
        x_emb = self.x_emb(X)                          # [bs, num_feats] -> [bs, num_feats, emb_dim]
        pow_of_sum = x_emb.sum(dim=1).pow(2)           # -> [bs, emb_dim]
        sum_of_pow = x_emb.pow(2).sum(dim=1)           # -> [bs, emb_dim]
        fm_out = (pow_of_sum - sum_of_pow).sum(1)*0.5  # -> [bs]
        if self.bias:
            x_biases = self.x_bias[X].sum(1)           # -> [bs]
            fm_out += x_biases + self.offset           # -> [bs]
        if self.sigmoid:
            return self.sigmoid_range(fm_out, low=0.5) # -> [bs]
        return fm_out

    def sigmoid_range(self, x, low=0, high=5.5):
        """ Sigmoid function with range (low, high) """
        return torch.sigmoid(x) * (high-low) + low
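
As a quick sanity check (a sketch I'm adding, not part of the original post), the pow-of-sum / sum-of-pow trick in forward() can be compared against the brute-force O(k·n²) pairwise sum:

# Verify the optimized interaction term matches the brute-force pairwise sum
import torch

torch.manual_seed(0)
bs, num_feats, emb_dim = 2, 5, 3
x_emb = torch.randn(bs, num_feats, emb_dim)

# Optimized O(n*k) version (same math as FM.forward)
fast = 0.5 * (x_emb.sum(dim=1).pow(2) - x_emb.pow(2).sum(dim=1)).sum(1)

# Brute-force O(n^2*k) version: dot products over all feature pairs i < j
slow = torch.zeros(bs)
for i in range(num_feats):
    for j in range(i + 1, num_feats):
        slow += (x_emb[:, i] * x_emb[:, j]).sum(dim=1)

print(torch.allclose(fast, slow, atol=1e-5))  # True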

4) Train model

The AdamW optimizer and mean-squared error (MSE) loss are used to train the model. Hyperparameters are put into a config dictionary (CFG) for ease of use.
*Important: to size the nn.Embedding, we take the max feature index and add 1 to it. The +1 accounts for 0-indexing.

CFG = {
    'lr': 0.001,
    'num_epochs': 8,
    'weight_decay': 0.01,
    'sigmoid': True,
    'bias': True,
    'init': True,
}
n_feats = int(pd.concat([df_train, df_val]).max().max())
n_feats = n_feats + 1  # "+1" to account for 0-indexing
mdl = FM(n_feats, emb_dim=100,
         init=CFG['init'], bias=CFG['bias'], sigmoid=CFG['sigmoid'])
mdl.to(device)
opt = optim.AdamW(mdl.parameters(), lr=CFG['lr'], weight_decay=CFG['weight_decay'])
loss_fn = nn.MSELoss()
print(f'Model weights: {list(dict(mdl.named_parameters()).keys())}')

Train & validation script:

epoch_train_losses, epoch_val_losses = [], []

for i in range(CFG['num_epochs']):
    train_losses, val_losses = [], []
    mdl.train()
    for xb, yb in dl_train:
        xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
        preds = mdl(xb)
        loss = loss_fn(preds, yb)
        train_losses.append(loss.item())
        opt.zero_grad()
        loss.backward()
        opt.step()
    mdl.eval()
    for xb, yb in dl_val:
        xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
        preds = mdl(xb)
        loss = loss_fn(preds, yb)
        val_losses.append(loss.item())
    # Start logging
    epoch_train_loss = np.mean(train_losses)
    epoch_val_loss = np.mean(val_losses)
    epoch_train_losses.append(epoch_train_loss)
    epoch_val_losses.append(epoch_val_loss)
    s = (f'Epoch: {i}, Train Loss: {epoch_train_loss:0.2f}, '
         f'Val Loss: {epoch_val_loss:0.2f}')
    print(s)

5) Check results

Let's do some sanity checks. The model's prediction range is [0.65, 5.45], which falls slightly outside the actual rating range of [1, 5]. The prediction distribution looks good.

# Check predictions distribution
lpreds, lratings = [], []
mdl.eval()
for xb, yb in dl_val:
    xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
    preds = mdl(xb)
    lpreds.extend(preds.detach().cpu().numpy().tolist())
    lratings.extend(yb.detach().cpu().numpy().tolist())

print(f'Preds min/max: {min(lpreds):0.2f} / {max(lpreds):0.2f}')
print(f'Rating min/max: {min(lratings):0.2f} / {max(lratings):0.2f}')
plt.figure(figsize=(4,2))
plt.hist(lratings, label='ratings', bins=(np.arange(1,7)-0.5),
         rwidth=0.25, color='blue')
plt.hist(lpreds, label='preds', bins=20, rwidth=0.5, color='red')
plt.title('Ratings & Predictions Distribution')
plt.grid()
plt.legend();

Use t-SNE to see whether the trained nn.Embedding vectors separate by movie genre. We can see that Children's, Horror, and Documentary groupings were learned.

# Check TSNE for genres - Make dataframe of movie + embeddings + biases
movies = df.drop_duplicates('movieId_index').reset_index(drop=True)
movies['movieId'] = d['movieId'].transform(movies.movieId)
# Get movie embeddings and biases
idxs_movies = torch.tensor(movies['movieId_index'].values, device=device)
movie_embs = mdl.x_emb.weight[idxs_movies]
movie_biases = mdl.x_bias[idxs_movies]
movies['emb'] = movie_embs.tolist()
movies['bias'] = movie_biases.tolist()

# Check TSNE, and scatter plot movie embeddings
# Movie embeddings do get separated after training
genre_cols = ['Children\'s', 'Horror', 'Documentary']
GENRES = '|'.join(genre_cols)
print(f'Genres: {GENRES}')

movies_subset = movies[movies['genres'].str.contains(GENRES)].copy()
X = np.stack(movies_subset['emb'].values)
ldr = TSNE(n_components=2, init='pca', learning_rate='auto', random_state=42)
Y = ldr.fit_transform(X)
movies_subset['x'] = Y[:, 0]
movies_subset['y'] = Y[:, 1]

def single_genre(genres):
    """ Filter movies for genre in genre_cols """
    for genre in genre_cols:
        if genre in genres: return genre

movies_subset['genres'] = movies_subset['genres'].apply(single_genre)
plt.figure(figsize=(5, 5))
ax = sns.scatterplot(x='x', y='y', hue='genres', data=movies_subset)

Get movie recommendations for "Toy Story 2 (1999)". Cosine similarity is used to find similar movies. Note that we have to add the feature offset to go from the labelEncoder movie index to the corresponding nn.Embedding row.

# Helper function/dictionaries to convert from name to labelEncoder index/label
d_name2le = dict(zip(df.title, df.movieId))
d_le2name = {v: k for k, v in d_name2le.items()}

def name2itemId(names):
    """ Given movie name(s), returns the labelEncoder label(s). This is before adding any offset """
    if not isinstance(names, list):
        names = [names]
    return d['movieId'].transform([d_name2le[name] for name in names])

# Input: movie name. Output: movie recommendations using cosine similarity
IDX = name2itemId('Toy Story 2 (1999)')[0] # IDX = 2898, before offset
IDX = IDX + feature_offsets['movieId_index'] # IDX = 8938, after offset to get input movie emb
emb_toy2 = mdl.x_emb(torch.tensor(IDX, device=device))
cosine_sim = torch.tensor(
    [F.cosine_similarity(emb_toy2, emb, dim=0) for emb in movie_embs]
)
top8 = cosine_sim.argsort(descending=True)[:8]
movie_sims = cosine_sim[top8]
movie_recs = movies.iloc[top8.detach().numpy()]['title'].values
for rec, sim in zip(movie_recs, movie_sims):
    print(f'{sim.tolist():0.3f} - {rec}')

Helper dictionaries showing the labelEncoder user metadata feature encodings.

d_age_meta = {'Under 18': 1, '18-24': 18, '25-34': 25, '35-44': 35,
              '45-49': 45, '50-55': 50, '56+': 56
              }
d_gender = dict(zip(d['gender'].classes_, range(len(d['gender'].classes_))))
d_age = dict(zip(d['age'].classes_, range(len(d['age'].classes_))))
print(f'Gender mapping: {d_gender}')
print(f'Age mapping: {d_age}')

Get cold-start movie recommendations for a male aged 18-24. To do this, create a metadata embedding by adding the gender and age embedding vectors. Then dot-product this metadata embedding with each movie embedding, add the movie biases, and sort the results.

# Get cold start movie recs for a male (GENDER=1), ages 18-24 (AGE=1)
GENDER = 1
AGE = 1
gender_emb = mdl.x_emb(
    torch.tensor(GENDER+feature_offsets['gender_index'], device=device)
)
age_emb = mdl.x_emb(
    torch.tensor(AGE+feature_offsets['age_index'], device=device)
)
metadata_emb = gender_emb + age_emb
rankings = movie_biases + (metadata_emb*movie_embs).sum(1)  # dot product
rankings = rankings.detach().cpu()
for i, movie in enumerate(movies.iloc[rankings.argsort(descending=True)]['title'].values[:10]):
    print(i, movie)

6) Save files for Streamlit app deployment

Save the model and helper utilities for deployment.

# Save files for Streamlit app deployment - https://github.com/Datadote/movieyelp
SAVE = False
if SAVE:
    movie_embs_cpu = movie_embs.cpu()
    d_utils = {'label_encoder': d,
               'feature_offsets': feature_offsets,
               'movie_embs': movie_embs_cpu,
               'movies': movies,
               'd_name2le': d_name2le,
               }
    pd.to_pickle(d_utils, 'data/d_utils.pkl', protocol=4)
    mdl_scripted = torch.jit.script(mdl)
    mdl_scripted.save('mdls/fm_pt.pkl')

This factorization machine model is deployed as a Streamlit app at movieyelp.com. The deployment code is located at https://github.com/Datadote/movieyelp

Streamlit deployment

Finished! Feel free to ask questions. Have a nice day.
