Content-Based Recommendation Systems with Apache MXNet

Greg Chase
Published in Apache MXNet
Jan 23, 2019 · 10 min read

Introduction

Recommendation systems are known for improving the user experience across many industries and user bases. This post demonstrates how to build a content-based recommendation system using Scikit-Learn and MXNet.

This recommendation system returns the top N recommended news articles for a given news article, based on how similar the content of each article is.

Prerequisites

  • Download the blog post repository.
  • AWS r4.2xlarge EC2 instance for processing the full data set example. A subset of the first 1000 articles is included within the repository.

Background — Recommendation Systems 101

Recommendation systems have two fundamental architectures: collaborative filtering and content-based filtering. Each system has its strengths and weaknesses.

Collaborative Filtering — Robust System with Cold Start Issues

Collaborative filtering is a type of recommendation system which works via prior user behavior. Collaborative filtering was utilized to win the Netflix prize, and is in use by Amazon. For example, any time you see “you may also like…” or “Customers also bought…” on Amazon’s website, this is collaborative filtering at work. For more information, you can read this article.

Collaborative filtering at work. When reviewing a product to buy, Amazon’s system recommends other items purchased by users with similar purchasing habits.

However, because prior data is necessary, collaborative filtering suffers from the “cold start” problem. Meaning: the system has difficulty making good recommendations when the system has no user data. The common solution is to “front load” products. In the case of a company like Netflix, they front load their own content. Once you have enough data, the recommendation system starts to work, but not until then.
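As a rough illustration of the cold start problem, the sketch below builds a tiny item co-occurrence table with NumPy. The interaction matrix and the also_bought helper are hypothetical and are not how Amazon or Netflix implement their systems; they only show that an item with no purchase history has nothing to be recommended alongside.

```python
import numpy as np

# Hypothetical interaction matrix: rows are users, columns are items.
# A 1 means the user bought the item. Item 3 is brand new: nobody has bought it.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
])

# Item-to-item co-occurrence: entry (i, j) counts users who bought both i and j.
co_occurrence = interactions.T @ interactions
np.fill_diagonal(co_occurrence, 0)  # an item should not recommend itself


def also_bought(item, top_n=2):
    """Return up to top_n items most often bought together with `item`."""
    ranked = np.argsort(-co_occurrence[item], kind="stable")
    return [int(j) for j in ranked if co_occurrence[item, j] > 0][:top_n]


print(also_bought(0))  # [1, 2]
print(also_bought(3))  # [] because the new item has no history
```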

Content-Based Filtering — Solving the Cold Start Problem

For businesses that are just getting started, or have minimal data, content-based recommendation systems serve as a fantastic alternative for providing recommendations to users.

A content-based recommendation system is highly customized per user, and because it compares the content of the items themselves rather than relying on prior user behavior, the cold start problem is non-existent. Because of this, the business can provide recommendations starting from the beginning of their tenure.

Hands-On Example: Building A News Article Recommendation System

Download Data

For this tutorial, we’ll be using the “All the news” data set from Kaggle. You can download the data with a Kaggle account (free of charge), or use the Kaggle API. Additionally, there is a subset of 1000 articles provided in the repository.

Import Data

Once the data is downloaded, you’ll want to import each of the CSVs using Pandas, as shown below. For development purposes we’ll subset the first 1000 rows of the entire article collection.

import pandas as pd
import glob
import os
file_path = "../data/"
all_files = glob.glob(file_path + "*.csv")
# Articles combines the following steps:
# 1. For each CSV, import only the id, title, publication, and content columns.
extract_features = lambda f : pd.read_csv(f, usecols = ["id", "title", "publication", "content"])
# 2. Concatenate/combine all CSVs into a single Pandas DataFrame.
articles = pd.concat((extract_features(f) for f in all_files))
# Subset the articles
articles = articles.head(1000)

Create TF-IDF Matrix

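TF-IDF (term frequency-inverse document frequency) weights each token by how often it occurs in a document, discounted by how many documents in the corpus contain it. Tokens that appear almost everywhere receive low weights, while tokens that help distinguish a document receive high weights. When the TF-IDF Vectorizer is utilized, a vocabulary is created from the entire set of news articles, also referred to as “documents”.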
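To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. The documents are hypothetical, and the formula is the smoothed variant that Scikit-Learn applies by default (raw term counts, idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization); other TF-IDF definitions exist.

```python
import math
from collections import Counter

# Three tiny hypothetical "documents", already tokenized.
docs = [
    ["senate", "passes", "budget"],
    ["senate", "debates", "budget"],
    ["cancer", "research", "debates"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Smoothed TF-IDF weights for one document, scaled to unit length."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "passes" appears in only one document, so it ends up with a higher weight than "senate" or "budget", which each appear in two.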

After importing the documents, define the TfidfVectorizer from Scikit-Learn, and run against the content of all the articles. The parameters utilized are described below, and available in more detail in the Scikit-Learn documentation.

  • analyzer — Specifies if the feature should be made of word or character n-grams.
  • ngram_range — Defines the boundaries for creating n-grams, read as min_n <= n <= max_n. In this case, the range (1, 3) means the vectorizer creates single words, pairs of words (“bigrams”), and groups of up to 3 words occurring together (“trigrams”). If we wanted at most 2 words occurring together, we’d specify the range as (1, 2), which would yield single words and bigrams.
  • min_df — The cutoff value for ignoring words with a document frequency lower than the given threshold. In this example, if words occur in less than 20% of the documents, they’re removed. A higher value yields more aggressive text filtering.
  • stop_words — Words considered to possess no value in the context of the problem. The most common include words such as “I”, “me”, “the”, “and”, etc. In this case, we’ll filter out the common stop words in the English language. You can also create a custom stop word list for your specific problem.
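The effect of these parameters is easier to see on a small example. The sketch below uses a hypothetical four-sentence corpus (not the news data) and compares a unigram-only vocabulary with one built from n-grams of up to 3 words and filtered with min_df=0.5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, only for illustrating the parameters.
corpus = [
    "the stock market fell sharply",
    "the stock market rallied",
    "the housing market cooled",
    "voters head to the polls",
]

# Single words only, with no document-frequency cutoff.
unigrams = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
unigrams.fit(corpus)

# Up to 3 words together, keeping only n-grams found in at least half the documents.
filtered = TfidfVectorizer(
    analyzer="word", ngram_range=(1, 3), min_df=0.5, stop_words="english"
)
filtered.fit(corpus)

print(sorted(unigrams.vocabulary_))
print(sorted(filtered.vocabulary_))  # ['market', 'stock', 'stock market']
```

Only "market", "stock", and the bigram "stock market" occur in at least two of the four documents, so everything else is dropped by min_df.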

The fit_transform function returns a SciPy sparse matrix (scipy.sparse, in compressed sparse row format). In this type of matrix, the majority of the elements are zero, and only the non-zero elements are stored. If the majority of elements are not zero, this would be considered a dense matrix. The justification for a sparse matrix is a smaller data size and greater computational efficiency.
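As a rough sense of the savings, the sketch below compares the memory footprint of a dense NumPy array with the same data in SciPy's CSR format. The matrix is randomly generated and hypothetical (about 1% non-zero entries); real TF-IDF matrices will differ in shape and density.

```python
import numpy as np
from scipy import sparse

# Hypothetical 1000 x 5000 document-term matrix with roughly 1% non-zero entries.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000))
rows = rng.integers(0, 1000, size=50_000)
cols = rng.integers(0, 5000, size=50_000)
dense[rows, cols] = rng.random(50_000)

csr = sparse.csr_matrix(dense)

density = csr.nnz / (csr.shape[0] * csr.shape[1])
dense_bytes = dense.nbytes
# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"density:     {density:.2%}")
print(f"dense size:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse size: {sparse_bytes / 1e6:.1f} MB")
```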

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), min_df=0.2, stop_words="english")
tfidf_matrix = tf.fit_transform(articles["content"])

Create the Cosine Similarity Matrix

After the TF-IDF matrix has been created, we can create the cosine similarity matrix with Scikit-Learn. Cosine similarity is a metric which calculates the cosine of the angle between two vectors (in this case, two documents). Because the TfidfVectorizer scales each document vector to unit length by default (norm="l2"), the cosine similarity matrix can be calculated by multiplying the TF-IDF matrix by the transpose of itself. Each entry of that product is a number in the range [0, 1], since TF-IDF weights are never negative, and shows the relationship between a pair of documents. In the context of a content recommendation system, a value closer to 1 implies a stronger relationship.

Source: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
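The link between the dot product and cosine similarity can be checked numerically. The sketch below uses two hypothetical weight vectors and shows that, once each vector is scaled to unit length (which is what the TfidfVectorizer does by default with norm="l2"), a plain dot product gives the same value as the textbook cosine formula.

```python
import numpy as np

# Two hypothetical TF-IDF rows (non-negative weights), before normalization.
a = np.array([1.0, 2.0, 0.0, 2.0])
b = np.array([2.0, 0.0, 1.0, 2.0])

# Textbook cosine similarity: dot product divided by the product of the norms.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale each row to unit length first, then take a plain dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(round(cosine, 4), round(dot_of_units, 4))  # 0.6667 0.6667

# For a matrix of unit-length rows, X @ X.T holds every pairwise cosine similarity.
X = np.vstack([a_unit, b_unit])
similarity = X @ X.T
print(similarity.round(4))
```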

Create Recommendations with MXNet

Why would you use MXNet for building a recommendation system? For one, MXNet is highly optimized, which lends itself well to the mathematical computations involved in machine learning. More specifically, MXNet can run some computations, such as the dot product used in this post, faster than NumPy.

MXNet has also shown performance gains over other frameworks in certain cases. For example, AWS Technical Evangelist Julien Simon compared TensorFlow to MXNet on the MNIST and CIFAR10 datasets, using Keras as the front end. He found MXNet performed 60% faster than TensorFlow.

This being said, let’s compare the speed of NumPy to MXNet.

NOTE: The calculations shown below are for a dot product, not cosine similarity. Depending on the length of your text, dot product can be preferable.

import numpy as np
%timeit np.dot(tfidf_matrix, tfidf_matrix.T)
28.8 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

While this appears to be fast at first glance, 28.8 milliseconds for only 1000 articles is slow, and the cost grows quickly as the number of articles increases.
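The %timeit command is an IPython magic, so it only works in a notebook or an IPython shell. If you are running a plain Python script, a similar measurement can be sketched with the standard library's timeit module. The random matrix below is a hypothetical stand-in for the TF-IDF matrix, so the timings it prints are not comparable to the numbers in this post.

```python
import timeit

import numpy as np

# Hypothetical stand-in for the TF-IDF matrix: 200 documents, 50 features.
rng = np.random.default_rng(0)
matrix = rng.random((200, 50))

# Run the statement 10 times per measurement and take 7 measurements,
# mirroring the "7 runs, 10 loops each" that %timeit reports.
runs = timeit.repeat(lambda: np.dot(matrix, matrix.T), repeat=7, number=10)
per_loop_ms = [1000 * total / 10 for total in runs]

print(f"best of 7: {min(per_loop_ms):.3f} ms per loop")
```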

We’ll now convert the matrix to an MXNet Sparse NDArray, and perform the same dot product operation. Just like the TfidfVectorizer, mx.nd.sparse.array creates a sparse matrix, where the majority of the elements in the matrix are zero.

Notice the ctx parameter when creating the matrix; this is another killer feature of MXNet. The ctx parameter specifies the context where the data should reside. The context can be set to mx.cpu() for main memory and the CPU, or mx.gpu() for GPU memory. Furthermore, other parts of MXNet accept a list of contexts, such as [mx.gpu(0), mx.gpu(1), mx.gpu(N)], if you have multiple GPUs or work in a distributed computing environment.

import mxnet as mx
mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.cpu())

After converting to a Sparse NDArray, compute the dot product.

%%timeit
mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()
3.81 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Notice the dot product in MXNet runs in 3.81 milliseconds on the CPU. For comparison, let’s compute the same dot product on the GPU.

mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.gpu())

Once again, we’ll compute the dot product, but on the GPU.

%%timeit
mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()
1.84 ms ± 8.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We can calculate the speedup using the formula below.

numpy_time = 28.8
mxnet_time_cpu = 3.81
mxnet_time_gpu = 1.84
speedup_cpu = 1 - (mxnet_time_cpu / numpy_time)  # 0.8677
speedup_gpu = 1 - (mxnet_time_gpu / numpy_time)  # 0.9361

Based on the calculation above, MXNet takes roughly 87% less time than NumPy for the same calculation on CPU (about 7.6 times faster), and roughly 94% less time on GPU (about 15.7 times faster). Despite the overhead of converting the matrix into an NDArray, this is a substantial speedup!
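A percentage of time saved and a "times faster" factor are easy to confuse, so it can help to compute both from the same timings. The helper names below are hypothetical; the timings are the ones measured above.

```python
# Timings reported above, in milliseconds.
numpy_time = 28.8
mxnet_time_cpu = 3.81
mxnet_time_gpu = 1.84


def time_saved(baseline, new):
    """Fraction of the baseline run time that was eliminated."""
    return 1 - new / baseline


def speedup_factor(baseline, new):
    """How many times faster the new run is than the baseline."""
    return baseline / new


for label, t in [("CPU", mxnet_time_cpu), ("GPU", mxnet_time_gpu)]:
    print(f"{label}: {time_saved(numpy_time, t):.1%} less time, "
          f"{speedup_factor(numpy_time, t):.1f}x faster")
# CPU: 86.8% less time, 7.6x faster
# GPU: 93.6% less time, 15.7x faster
```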

Provide Recommendations

Once the cosine similarity matrix has been created, you’re now able to provide recommendations to your user. In this case, the recommendations are computed from the article content, but we’ll choose the article to start from by its title. First, print the first 10 article titles in your console.

In []: articles["title"].head(10)
Out[]:
0 Alton Sterling’s son: Everyone needs to protest the right way, with peace
1 Shakespeare’s first four folios sell at auction for almost £2.5m
2 My grandmother’s death saved me from a life of debt
3 I feared my life lacked meaning. Cancer pushed me to find some
4 Texas man serving life sentence innocent of double murder, judge says
5 My dad’s Reagan protests inspire me to stand up to Donald Trump
6 Flatmates of gay Syrian refugee beheaded in Turkey fear they will be next
7 Jaffas and daredevils: life on the world’s steepest street
8 NSA contractor arrested for alleged theft of top secret classified information
9 Donald Trump to dissolve his charitable foundation after mounting complaints

The article at index 3 looks compelling, and I’d want articles with similar content to this. To get the top 10 recommended articles similar to the one above, run the following commands. We’ll add an additional recommendation, since the first recommendation will be the article we’re comparing against.

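n_recs = 10
# Similarity matrix: the dot product of the TF-IDF matrix with its transpose.
cosine_sims = mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
# Similarity of every article to the article at index 3, as a 1-D NumPy array.
article_sims = cosine_sims[3].asnumpy().ravel()
# Positions of the n_recs + 1 most similar articles (the first is the article itself).
article_recs = np.argsort(-article_sims)[:n_recs + 1]
df_recs = articles.iloc[article_recs].copy()
df_recs["similarity"] = article_sims[article_recs]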

df_recs produces the top 10 articles.

Top 10 recommended articles from first 1000 articles.

While this is only the first 1000 articles in the dataset, we see the top articles are related in a few intuitive ways. First, all of them appear to be editorials, and all of them are written about life experiences. A few of the recommended articles lean more towards politics, but this is due to using only the first 1000 articles rather than the entire set of news articles.
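The selection step can be tried in isolation on a small array. The sketch below uses hypothetical similarity scores and a hypothetical helper, top_n_similar, which removes the query article by index instead of assuming it always sorts first. That assumption can fail if the data set contains duplicate articles with a similarity of 1.0.

```python
import numpy as np


def top_n_similar(similarity_row, query_index, n_recs=10):
    """Return positions of the n_recs most similar items, excluding the query."""
    order = np.argsort(-similarity_row, kind="stable")
    order = order[order != query_index]  # drop the query explicitly
    return order[:n_recs]


# Hypothetical similarity scores between article 3 and six articles
# (position 3 is the article compared with itself).
sims = np.array([0.10, 0.45, 0.05, 1.00, 0.80, 0.30])
print(top_n_similar(sims, query_index=3, n_recs=3))  # [4 1 5]
```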

Processing the Full Dataset (50K Rows)

For those that have access to AWS, you can process all the news article content. However, in order to process all articles, we’ll first lemmatize the article content using spaCy. Billed as “Industrial Strength Natural Language Processing”, spaCy is a powerful NLP library for parallel processing text data.

Lemmatization is a technique to reduce words to a common base form. Per the Stanford NLP documentation, “If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun”. Because context matters in news articles, lemmatizing the text is optimal.
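The difference between the two approaches can be sketched with a toy example. This is not how spaCy works internally: real lemmatizers rely on trained models and large lexicons, while the lookup table and suffix rules below are hypothetical and cover only a handful of words.

```python
# Toy illustration only. The lookup table is hypothetical and tiny.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("articles", "NOUN"): "article",
    ("running", "VERB"): "run",
}


def crude_stem(token):
    """Strip a few common suffixes without looking at context."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token, part_of_speech):
    """Look up the dictionary form of a token, given its part of speech."""
    return LEMMAS.get((token, part_of_speech), token)


print(crude_stem("running"), lemmatize("running", "VERB"))  # runn run
print(lemmatize("saw", "VERB"), lemmatize("saw", "NOUN"))   # see saw
```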

In the Python script, load the PreprocessText class, and run the lemmatization function. As a caveat: even though the text will be processed in parallel, the articles can take upwards of 60 minutes to complete processing.

prep = PreprocessText()
articles["articles_lemmatized"] = prep.lemmatization(articles["content"].values)

After lemmatizing the article text, convert each list of strings into a single string. This format is required by the TfidfVectorizer. Then, re-run the TfidfVectorizer against the full data set.

articles["articles_lemmatized"] = articles["articles_lemmatized"].apply(" ".join)
tfidf_matrix = tf.fit_transform(articles["articles_lemmatized"])

Create Recommendations with MXNet (Full Data Set) [50K Rows]

Once again, we can benchmark NumPy against MXNet for creating recommendations. But this time, we’ll create a cosine similarity matrix against the full data set. Be aware this will take about 30 minutes to complete.

%timeit np.dot(tfidf_matrix, tfidf_matrix.T)
2min 53s ± 3.59 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

OUCH! While NumPy is well coded, a dot product computed across a dense matrix can be incredibly slow, and pushes the limits of memory on your server. Even on the r4.2xlarge EC2 instance, computing each NumPy dot product took nearly all of the available 60 GB of DRAM!
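A back-of-the-envelope calculation shows why memory becomes the bottleneck. The sketch below assumes exactly 50,000 articles and a fully dense similarity matrix. It only estimates the size of the final result, and any temporary copies made during the computation would come on top of that.

```python
def dense_matrix_gb(n_rows, bytes_per_entry):
    """Size in gigabytes (10^9 bytes) of a dense n_rows x n_rows matrix."""
    return n_rows ** 2 * bytes_per_entry / 1e9


# Assumption: exactly 50,000 articles, one similarity score per pair of articles.
n_articles = 50_000
for dtype_name, bytes_per_entry in [("float32", 4), ("float64", 8)]:
    print(f"{dtype_name}: {dense_matrix_gb(n_articles, bytes_per_entry):.0f} GB")
# float32: 10 GB
# float64: 20 GB
```

Because the matrix has one entry per pair of articles, doubling the number of articles quadruples the memory required.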

Let’s convert to an MXNet Sparse NDArray, and create the cosine similarity matrix using MXNet.

mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.cpu())

%%timeit
mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()
25.9 s ± 183 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Despite being computed on the CPU, MXNet manages to compute the similarity matrix in under 26 seconds per run on average. Compared to the NumPy dot product at 2 minutes 53 seconds (173 seconds), this takes about 85% less time on CPU, or roughly 6.7 times faster.

Provide Recommendations (Full Data Set) [50K Rows]

As performed above, let’s get the top 10 recommended articles for the article at index 3.

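n_recs = 10
# Similarity matrix: the dot product of the TF-IDF matrix with its transpose.
cosine_sims = mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
# Similarity of every article to the article at index 3, as a 1-D NumPy array.
article_sims = cosine_sims[3].asnumpy().ravel()
# Positions of the n_recs + 1 most similar articles (the first is the article itself).
article_recs = np.argsort(-article_sims)[:n_recs + 1]
df_recs = articles.iloc[article_recs].copy()
df_recs["similarity"] = article_sims[article_recs]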

The following are the top 10 recommendations, including the original article.

3 I feared my life lacked meaning. Cancer pushed me to find some
26790 I trained myself to be less busy — and it dramatically improved my life
3363 Our life in three stages – school, work, retirement – will not survive much longer
3400 How a parenting prenup made my life amazing
1988 A moment that changed me: group therapy stopped me falling for versions of my dad
35881 I desperately wanted kids. It didn’t happen. And I’m okay with that.
15548 A Planet With Brains? The Peril And Potential Of Self-Aware Geological Change
27088 Why South Koreans now live longer than Americans
7430 The Extended Beauty Of Photosynthesis
13540 His Instrument Gave Me Wings: Remembering Synth Inventor Don Buchla

Although the recommendation system was only given the news article content, the article titles suggest that the processed text is producing good recommendations!
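If you only need recommendations for one article at a time, you may not need the full similarity matrix at all. The sketch below is an alternative that is not part of the original walkthrough: it multiplies a single TF-IDF row by the transposed matrix using SciPy's sparse operations, so the result has one entry per article instead of one per pair of articles. The mini-corpus is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the article content.
corpus = [
    "cancer pushed me to find meaning in life",
    "the senate passed the budget bill",
    "finding meaning in life after cancer treatment",
    "the budget bill stalled in the senate",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

query = 0
# One sparse row times the transposed matrix: a 1 x N result instead of N x N.
row_sims = (tfidf[query] @ tfidf.T).toarray().ravel()

# Rank the other articles by similarity to the query, most similar first.
ranked = [int(i) for i in np.argsort(-row_sims, kind="stable") if i != query]
print(ranked[0], row_sims.round(2))
```

Here the closest match to the first sentence is the third one, which shares the words "cancer", "meaning", and "life".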

Conclusion

As described above, content-based recommendation systems serve as a simple, straightforward way to provide recommendations, even with minimal data.

When utilizing MXNet to perform the necessary computations, the speedup is substantial, saving time to get results. When performing computations on a GPU, the speedup is that much more pronounced.

For even more information, review the MXNet tutorials!
