Building a Recommender System with Text Embeddings using Python and LightFM

Jun 19, 2023

Unleashing the Power of Jokes…

In today’s digital age, recommender systems have become an integral part of our online experiences. From e-commerce platforms to content streaming services, personalized recommendations help users discover relevant products, services, and content. While traditional recommender systems often rely on user-item interactions, incorporating textual information can greatly enhance the quality and accuracy of recommendations. In this article, we will explore how to build a collaborative filtering recommender system using Python and the LightFM package, with the assistance of the TensorFlow Universal Sentence Encoder for creating text embeddings.

What is LightFM?

LightFM is a Python library that provides a hybrid recommendation framework. It allows for the integration of various data sources, such as user-item interactions and item features, to build recommender systems. LightFM is designed to handle large datasets efficiently and provides a flexible API for training and evaluation.

Integrating Textual Data

Textual data, such as product descriptions or user reviews, contains valuable information that can enrich recommendation systems. By converting text into numerical representations called embeddings, we can capture the semantic meaning of words and phrases.
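
To make the idea of "semantic meaning as numbers" concrete, here is a minimal sketch that compares embedding vectors with cosine similarity. The three vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and the helper name cosine_similarity is ours.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings" of three jokes.
joke_about_lawyers = np.array([0.9, 0.1, 0.0, 0.2])
joke_about_attorneys = np.array([0.8, 0.2, 0.1, 0.3])
joke_about_cats = np.array([0.0, 0.9, 0.4, 0.0])

sim_close = cosine_similarity(joke_about_lawyers, joke_about_attorneys)
sim_far = cosine_similarity(joke_about_lawyers, joke_about_cats)
print(f"lawyers vs attorneys: {sim_close:.3f}")
print(f"lawyers vs cats:      {sim_far:.3f}")
```

Texts with similar meaning should end up with vectors that point in a similar direction, which is what the higher first score reflects.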

Harnessing the Power of Universal Sentence Encoder

The TensorFlow Universal Sentence Encoder allows us to encode text into fixed-dimensional embeddings (512 dimensions per text). In our case, we use it to embed the jokes from the Jester dataset. By encoding jokes into numerical representations, we capture their underlying meaning and give the recommender system a way to compare them. Note that the module loaded below is the English model. For the record, a multilingual variant of the encoder handles 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian). So you can be funny in 16 languages :-)

First we start with installing the LightFM package:

pip3 install lightfm

Load the Universal Sentence Encoder’s TF Hub module and Libraries

Next we import the libraries and load the model:

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset

# Alternative model: "https://tfhub.dev/google/universal-sentence-encoder-large/5"
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
sent_model = hub.load(module_url)
print(f"module {module_url} loaded")

Next we load the data from Google Drive. You can add the data to your own Drive and access it from within Colab.

# Load item information
items = pd.read_csv('/content/drive/MyDrive/jester_items.csv')
# Load user ratings
ratings = pd.read_excel('/content/drive/MyDrive/jester-data-1.xls', header=None)
# Remove the first column containing the number of jokes rated by that user.
ratings.drop(columns=0, inplace=True)
ratings = ratings.T
ratings.columns = ["user_" + str(col) for col in ratings.columns]

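# 99 marks a joke the user has not rated; replace it with 0 first,
# otherwise the clipping step below would turn it into a rating of 10
df_normalized = ratings.replace(99, 0)

# Then clip the remaining ratings to the range [-10, 10]
df_normalized = df_normalized.clip(lower=-10, upper=10)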

The data consists of two files. The first file holds the users and their ratings. The second file contains the joke id and the joke text. We load both files, and in the ratings file we drop the first column, since it contains the number of jokes each user rated. Next we transpose the table so that every column is a user, which lets us give the columns names (user_0, user_1, etc.). We treat the value 99, which stands for "not rated", as 0 and keep all other ratings within the range -10 to 10.
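
The order of the two cleaning steps matters, because the "not rated" value 99 is itself larger than 10. The sketch below illustrates this on a made-up array, using NumPy instead of the pandas calls from the article. The array values and the constant name NOT_RATED are assumptions for illustration.

```python
import numpy as np

NOT_RATED = 99  # sentinel used in the ratings file for "not rated"

# Toy ratings: rows are users, columns are jokes (values are made up).
raw = np.array([
    [7.5, 99.0, -3.2],
    [99.0, 10.4, -10.6],
])

# 1) Replace the sentinel first, so it is not mistaken for a high rating.
cleaned = np.where(raw == NOT_RATED, 0.0, raw)
# 2) Only then clamp the genuine ratings to the [-10, 10] range.
cleaned = np.clip(cleaned, -10.0, 10.0)

# For comparison: clamping first turns every 99 into a top rating of 10.
wrong_order = np.clip(raw, -10.0, 10.0)

print(cleaned)
print(wrong_order)
```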

Now we focus on the items. We need to transform the text into embeddings, so first we create a small helper function to embed our text.

def embed(texts):
    # texts: a list of strings; returns one embedding vector per string
    return sent_model(texts)

We take the second column of the items dataframe, which holds the joke text, and pass each joke to our embed function. We append the resulting embeddings to a list.

# Create item features
item_list = []
# Encode jokes with Universal Sentence Encoder
for joke in items.iloc[:, 1].values:
    embeddings = embed([joke])
    item_list.append(embeddings[0])

Next we can build our LightFM Dataset. We pass the users, the items and the length of the embedding vector to the Dataset and call its fit function. Then we create an interaction matrix from the users and their jokes.

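np_items = df_normalized.index.values.astype(int)
dataset = Dataset(item_identity_features=False)
dataset.fit(users=df_normalized.columns.values, items=df_normalized.index.values,
            item_features=list(range(len(item_list[0]))))
# Stack the joke-by-user table into a (joke, user) -> rating series.
# Assumption: a rating above 0 counts as a positive interaction.
liked = df_normalized.stack()
liked = liked[liked > 0]
# One (user, joke) pair per liked joke; zipping the user and joke lists
# directly would only pair the i-th user with the i-th joke.
(interactions, weights) = dataset.build_interactions(
    (user, joke) for joke, user in liked.index)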

Now that we have the interaction matrix, we do the same for the items and their embeddings. We attach each embedding to its corresponding joke id by calling the build_item_features function.

item_features = dataset.build_item_features(  
    ((k, {fn: fv for fn, fv in enumerate(v)}) for k, v in zip(df_normalized.T.columns.values, item_list)), normalize=False)
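
To give an intuition for what these item features do during training and prediction: in LightFM, an item's latent vector is the feature-weighted sum of the feature embeddings, and the score for a user is a dot product plus bias terms. Below is a rough NumPy sketch of that idea with toy sizes and random numbers standing in for learned parameters. All variable names are ours and not part of the LightFM API.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_features, n_components = 3, 4, 2  # toy sizes, chosen for illustration

# Item-feature matrix: one row per joke, one column per embedding dimension.
item_feature_matrix = rng.normal(size=(n_items, n_features))

# Parameters the model would learn (random stand-ins here).
feature_embeddings = rng.normal(size=(n_features, n_components))
feature_biases = rng.normal(size=n_features)
user_embedding = rng.normal(size=n_components)
user_bias = 0.1

# Each item's latent vector and bias are feature-weighted sums.
item_embeddings = item_feature_matrix @ feature_embeddings
item_biases = item_feature_matrix @ feature_biases

# Score of every item for this user: dot product plus both biases.
scores = item_embeddings @ user_embedding + item_biases + user_bias
print(scores)
```

Because we disabled the per-item identity features, two jokes with similar text embeddings also get similar latent vectors in this setup.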

Our datasets are ready. Now we initialize the model, pass the interactions and item features to LightFM, and train.

model = LightFM(loss='warp')
model.fit(interactions, item_features=item_features, 
                epochs=5)

This model is not optimized; you can experiment with different loss functions and numbers of epochs. See the LightFM docs for all available loss functions. Depending on the amount of data, training can take a while. If it takes too long, consider reducing the dimension of your embeddings with PCA. A dimension reduction step could look like this:

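from sklearn.decomposition import PCA
n_dims = 32  # target dimensionality; an illustrative value, tune as needed
pca = PCA(n_components=n_dims)
# Stack the per-joke embeddings into one (n_jokes, 512) array before fitting
reduced_items = pca.fit_transform(np.stack([e.numpy() for e in item_list]))
# Rebuild the Dataset and item features from reduced_items before training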

Now that our model is finished training we can call it to make the recommendations.

user_id = "user_943"
scores = model.predict(dataset.mapping()[0][user_id],
                                    np.arange(dataset.interactions_shape()[1]))

We look up a user in our Dataset and calculate a score for every joke. We then map the highest scores back to the corresponding joke ids.

recommendations = items['jokeText'].loc[np_items[np.argsort(-scores)][:5]]
print(f"Recommendations for  {user_id} \n{recommendations} ")

et voilà:

Recommendations for  user_943 
45    A couple has been married for 75 years. For th...
95    Two attorneys went into a diner and ordered tw...
41    Two men are discussing the age old question: w...
78    Q: Ever wonder why the IRS calls it Form 1040?...
71    On the first day of college, the Dean addresse...
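
The ranking step behind these recommendations can be illustrated on a toy score vector. This is a minimal sketch; the scores and joke ids are made up.

```python
import numpy as np

# Made-up scores for five items, in the model's internal item order.
scores = np.array([0.2, 1.7, -0.4, 0.9, 1.1])
# Joke ids in the same order (hypothetical values).
joke_ids = np.array([1, 2, 3, 4, 5])

top_k = 3
# argsort sorts ascending, so negate the scores to get the best first.
best_positions = np.argsort(-scores)[:top_k]
top_joke_ids = joke_ids[best_positions]
print(top_joke_ids)  # → [2 5 4]
```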

In this blog post we built a simple recommender system with the help of TensorFlow and LightFM. You can find the full code in this Colab notebook.

Written by Rick Kosse