Content Discovery & Recommendations Using Spark Machine Learning on Linode’s High Memory Instances

Karthik Shiraly
Jun 29, 2017 · 11 min read

One reason I love statistics and machine learning is that they provide techniques to make computers solve problems smartly and quickly that would otherwise require considerable manual effort and time.

In this article, I describe how I approached one such problem related to content discovery and recommendations, using unsupervised machine learning techniques.

I also used the opportunity to explore a solution using Apache Spark, instead of using a more common machine learning platform like Python’s scikit-learn. An advantage of Spark’s machine learning implementations are that they support distributed cluster processing out of the box, unlike scikit-learn’s implementations.

I chose to run my experiments on Linode’s cloud for multiple reasons. First, it’s a good idea to get familiar with multiple clouds in case one needs to migrate or setup a fallback quickly in a disaster scenario.

Second, Linode recently introduced high memory instances which are ideal for Spark, and I was curious about their performance.

And third, I find Linode’s network bandwidth, limits and pricing are better, compared to other clouds.

In this article, I’ll describe the machine learning approach, the deployment architecture I used and optimizations I did during these experiments.

Now to the problem…

The problem

Ample content gets added to the internet every day (3 exabytes by some estimates). While most of it is undoubtedly irrelevant to me, there must also be plenty of useful content — like articles, discussions and videos — that fall into my areas of interest, but that I miss out on because I don’t have the time or patience to constantly monitor multiple websites for updates.

I already visit my favorite websites daily, use RSS feed readers and subscribe to online news aggregator portals. And they do help.

But I also wanted to explore approaches that are more automated, more flexible, more personalized, more private and demand less effort.

The problem statement I came up with: Given a list of my interests as inputs, could a system find and recommend content that matches those interests?

An unsupervised machine learning approach

There are multiple approaches to a problem like this. Although “content” can also include non-textual content like images, videos and music, I will stick to text processing approaches and process any non-textual content using only the text data associated with visuals, such as video descriptions or image captions.

One straightforward approach is to crawl the web and use information retrieval approaches like TF-IDF (term frequencies-inverse document frequencies) features along with metrics like cosine similarity to find content which matches other content. I would use something like Apache Nutch and Solr or ElasticSearch for this. But the problem with such a system is that associated keywords are often not rich enough to adequately describe content, and any content that does not match those keywords explicitly is likely to be discarded.

Then there is the supervised classification approach. I would have to find some exemplar content that describe my interests, label them with one or more interests, and use supervised classification techniques, like support vector machines or random forests, to classify new content.

While this is a fine technique that yields good results, the problem I have with it is that finding content and labelling it requires manual effort of the kind I don’t want to perform. Whenever I were to develop a new interest (which I often do), I would have to put in some kind of labelling effort again.

Another problem with supervised classification is that if a list of interests gets lengthy because of the fine-granularity differences between some of them, the volume of training data required also increases.

A third approach is the unsupervised clustering approach, where content is grouped into multiple clusters. In the context of this problem, each cluster of content can be thought of as representative of one interest. Any new content found is then assigned to one of these clusters, and recommendations can be extracted made per cluster.

This is a low-effort approach, but its issue is that any content that is actually a mixture of topics — such as this very article — ends up being assigned to just one cluster while its relationships to other clusters are discarded. This is not how the real world works, and it affects recommendations adversely.

But the previous paragraph holds a key phrase — a mixture of topics — that leads us to a better approach, called Topic Modelling. It acknowledges the reality that the same content can belong to multiple bins. Among the many topic modelling techniques, the one I chose is Latent Dirichlet Allocation (LDA).

LDA is a technique that comes up with a list of so-called “topics” and views every piece of content as a probability distribution over those topics. But an LDA topic is not a simple label, as in classification. Instead, every LDA topic is itself a probability distribution over the words in a corpus of documents.

The word “latent” in statistics means hidden information that is not directly observed but is inferred using statistical techniques. “Latent” in LDA comes from the fact that it infers a set of topics using just the words in a corpus of documents.

The diagram below depicts an example of LDA’s world view, where a number of topics exist and every document in the corpus is a combination of those topics.

LDA derives probability distributions from the documents at top. Each color is a different topic. From the list of words in the topics, we can infer that Topic 1 is related to mathematics, Topic 2 to zoology, and Topic 3 to computer graphics or games. Unfortunately, when processing actual data, topic modelling rarely produces such clear-cut, distinct topics.

So, how does LDA help with my content discovery and recommendations problem?

The idea is that given a training corpus of content:

  • first discover a set of topics from the training corpus using LDA.

The next question is what to use as a training corpus of content for LDA? Remember, the actual inputs for this problem are my interests, and so a training corpus should be representative of those interests. However, interests are mere abstract concepts in my mind, while computers need something more concrete to crunch, like a table of numbers.

One set of readily available data that emerges from my interests are the contents of the URLs in my internet browsing history, because it’s unlikely I’d keep browsing something that does not interest me. Internet browsing history is, therefore, the training inputs in this solution.

There are other possible representative data that can be used — such as a portable EEG monitor recording my brain activity or a head -mounted camera capturing my activities while I indulge in different interests — but internet browsing history data is easily available, simple and practical.

Content Recommendations based on browsing history

Implementing the solution using Spark

Apache Spark comes bundled with its own library of machine learning algorithms, called MLlib. MLlib provides two different implementations of LDA:

  • a distributed LDA model, which uses expectation-maximization as the optimization algorithm, but does not support online learning.

Online learning refers to machine learning models that can update themselves as and when new data is received, which is very useful in environments where inferences should be drawn immediately using the latest data.

  • a non-distributed model, which uses variational Bayes algorithm and supports online learning.

I selected the distributed LDA model mainly because my simple solution did not make use of online learning. But I did try them both and have written about their relative performance in a later section.

For obtaining my browsing history data, I installed this no-frills History Export Chrome extension. It does not require any special permissions and exports the history to a JSON file without any fuss.

The recommender requires a set of target URLs to monitor and from which to recommend content. A complete solution would involve constantly crawling these URLs, fetching latest content and analyzing it for recommendations. But that would be overkill.

Instead, I configured the system to monitor only a few subreddits I know I’m interested in and the latest YouTube videos using its API. The system supports adding more content- fetching plugins if required.

Deployment architecture

Spark jobs can be executed on a single node in “local” mode or on a cluster of nodes if more resources are required. A Spark cluster consists of a master and a number of workers. The master coordinates and distributes operations, while workers execute operations.

The storage most commonly deployed with Spark is HDFS. For truly big data, it’s a great filesystem with high availability and durability. But for my simple application, HDFS was overkill.

However, I still needed some kind of shared filesystem, because every Spark machine should have access to the input and target data to process them. So, I just deployed plain old NFS shares on the Spark master.

Spark Cluster Deployment

Try it out

The scripts, code and instructions for the deployment architecture are on GitHub at https://github.com/pathbreak/content-recommender-spark-lda.

As of now, my Spark proof examines contents of only YouTube and HackerNews URLs in browsing history, and based on the topic distributions in those URLs, recommends latest YouTube videos or content from any RSS/ATOM feed (Reddit and WordPress blogs are good examples that provide comprehensive RSS feeds).

The system supports pluggable modules for handling URLs. If you want support for other websites, you are welcome to file GitHub issues or contribute your own plugins.

Both history and targeted content fetching are deliberately kept single threaded and infrequent, to avoid overwhelming website servers. Content fetching can take considerable time — 20 to 50 minutes should be expected.

A training run typically takes only about 5 to 10 minutes on a 32GB or 60GB high-memory Linode.

Recommender Output

Screenshots of the recommender’s output:

Recommender’s output. On the left are the list of recommendations and the basis for those recommendations . On the right are the topics and the top 10 words in their probability distributions.

Performance measurements & optimizations

What black box optimizations are possible when using Spark and MLlib?

Underlying Spark’s data processing algorithms are a set of libraries that provide linear algebraic, numerical computing, and optimization routines. The stack looks like this:

Spark’s computing stack

From the stack, two possible optimizations suggest themselves:

  • Instead of java/scala implementations of linear algebraic routines that are executed by the JVM interpreter, use a native BLAS/LAPACK implementation like OpenBLAS/ATLAS/MKL written in C/C++/Fortran and deployed as native components that execute directly on the CPU without any intermediate interpreter like the JVM.

Distributed LDA vs Online LDA

Spark’s distributed LDA implementation, which uses an expectation-maximization algorithm, turned out to be surprisingly faster and much lighter on system resources than its online LDA implementation that uses a variational Bayes algorithm.

Run times of the two algorithms. Distributed LDA that uses expectation-maximization is 5–6 times faster than Online LDA that uses Variational Bayes. “EM” is for expectation-maximization. “ON” is for online LDA. The numbers indicate the number of iterations, a parameter that is passed to the LDA model.

The algorithm selection is via a parameter to Spark’s API and does not require any other change in the Spark job.

The stark contrast in time taken and resources consumed between the two algorithms makes me think, perhaps, the online LDA implementation is not as optimized as the distributed one.

I had also expected improvement in run times by switching to native stacks like OpenBLAS, ATLAS or Intel’s MKL.
But as the run times bar chart above shows, none of them made much of an improvement for this app. While one possible explanation is that Spark’s default stack itself is highly optimized, I’m leaning towards either the job or the algorithm implementations being memory- and disk-bound rather than CPU-bound.

Disk and Memory Usage of the two implementations. “EM” is for expectation-maximization. “ON” is for online LDA. The numbers indicate the number of iterations, a parameter that is passed to the LDA model. Notice how EM barely consumes about a fifth of the 90GB disk space while Online LDA used so much /tmp space that it even ran out of disk space once and failed. Similarly, EM consumed only about a third of the 60GB RAM, while Online LDA was brushing against the maximum heap space allocated to it.
CPU Usage of the two implementations. “EM” is for expectation-maximization. “ON” is for online LDA. The numbers indicate the number of iterations, a parameter that is passed to the LDA model. Clearly, Online LDA places a much heavier load on the system for a longer time.

Online LDA is especially hard on disk and memory. I encountered frequent resource exhaustion problems with lower Linode configurations until I reached the Linode 60GB high memory instance with 90GB storage and 4 CPUs (and even there, I got a disk exhaustion failure once).

Based on these resource graphs and results, I recommend using the distributed LDA implementation whenever possible. The recommender app, too, uses it by default. Anything with 12GB RAM and 2 or more cores should be enough for distributed LDA.

Conclusions

A personalized, private recommender system using unsupervised machine learning is useful in many other scenarios — for example, recommending articles or discussions to visitors on large websites.

Building one using Spark was fun, if a bit unusual, because Spark is rarely used for such personal applications. Nevertheless, it helped me get good insights into Spark’s LDA implementations.

Linode’s configurations — even the mid-level ones — handled distributed LDA just fine. Online LDA was an entirely different story — only the highest configurations could handle it. Linode’s network bandwidth and generous data download quotas enabled fast and frequent content fetching, ensuring that not much useful content got missed.

If you are new to machine learning and want to learn about machine learning techniques, I suggest starting with this excellent and popular online course taught by Professor Andrew Ng, one of the researchers who introduced LDA.

Credits

  1. Contributors and committers of the Apache Spark project

About me: I’m a software consultant and architect specializing in big data, data science and machine learning, with 14 years of experience. I run Pathbreak Consulting, which provides consulting services in these areas for startups and other businesses. I blog here and I’m on GitHub. You can contact me via my website or LinkedIn.

Please feel free to share below any comments or insights about your experience using Linode, machine learning and the recommender system. You are welcome to report bugs or feature requests on the project’s GitHub repo. If you found this blog useful, consider sharing it through social media.

Linode Cube

We’re covering everything from tech news and industry…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store