Building Spotify’s “Discover Weekly” with Spark

An MLlib & PySpark implementation of an audio recommendation system using a collaborative filtering algorithm

Today is the third day of my internship at NBC Universal, and I’m inspired to achieve a new goal: mastering Spark before leaving 30 Rock & 1221 Campus (where I’ll be spending most of my time working) this summer.

As a company leading the media and entertainment industry, we have hundreds of petabytes of telecast data from all over the country and from international cable TV. As amazing and overwhelming as this might sound, there is no doubt that more and more companies nowadays rely on the power of recommendation systems to meet a variety of individual tastes and needs, and in turn enhance their customers’ satisfaction and loyalty.

In the entertainment industry, music tech giants like Spotify have made recommendation engines a salient part of their product. Detecting and analyzing patterns in user interests plays a critical role in providing personalized recommendations, and adding this dimension to the user experience lets recommendation systems do a surprisingly good job of identifying tracks we didn’t know we would like.

What algorithm to use?

Let me start with two main approaches we can use for building audio recommendation systems. The first one is content filtering, which uses known information about the products and users to make recommendations. With this approach, we create profiles based on products (e.g. movie information, price information, and product descriptions) and users (e.g. demographic and questionnaire information).

A fairly well-known implementation of content filtering is the Music Genome Project of the online radio Pandora. There, an expert scores a song on hundreds of characteristics, while a user provides information about his/her music preferences; recommendations are made by pairing these two sources.
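To make this concrete, here is a minimal, self-contained sketch of the content-filtering idea (the characteristics, scores, and user profile below are invented for illustration; this is not Pandora’s actual model):

import numpy as np

# Hypothetical expert-assigned scores per song over a few characteristics,
# e.g. [acoustic, electronic, vocal-centric] -- Music-Genome style.
songs = {
    "song_a": np.array([0.9, 0.1, 0.3]),
    "song_b": np.array([0.2, 0.8, 0.5]),
}

# Hypothetical user preference profile over the same characteristics.
user_profile = np.array([0.8, 0.2, 0.4])

def score(features, profile):
    '''cosine similarity between a song's features and the user profile'''
    return features.dot(profile) / (np.linalg.norm(features) * np.linalg.norm(profile))

# Rank songs by how well their profiles pair with the user's profile.
ranked = sorted(songs, key=lambda s: score(songs[s], user_profile), reverse=True)
print(ranked)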

While Pandora uses content filtering, Spotify uses collaborative filtering for its Discover Weekly recommendation system. This latter technique uses previous users’ input/behavior to make future recommendations, ignoring any a priori user or item information: we use the ratings of similar users to predict the rating.

One way to understand Spotify’s technique further is through a neighborhood-based approach, where you (1) first define a similarity score between you and other users based on how much your overlapping ratings agree, then (2) let other users vote on what you would like, weighted by these scores.
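As a toy illustration of that two-step idea (plain Python/NumPy with made-up play counts, separate from the Spark implementation below):

import numpy as np

# Toy play-count matrix: rows are users, columns are artists.
plays = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])

def cosine_sim(u, v):
    '''(1) similarity score between two users' rating vectors'''
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u.dot(v) / denom if denom else 0.0

def predict(user, item):
    '''(2) similarity-weighted vote of the other users on one item'''
    others = [v for v in range(len(plays)) if v != user]
    sims = np.array([cosine_sim(plays[user], plays[v]) for v in others])
    votes = np.array([plays[v][item] for v in others])
    return sims.dot(votes) / sims.sum() if sims.sum() else 0.0

print(predict(0, 2))  # predicted affinity of user 0 for artist 2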

A major appeal of collaborative filtering over content filtering is its domain-free approach, meaning that it doesn’t need to know what is being rated, just who rated what, and what the rating was.

Note that the two approaches are not mutually exclusive: content information can also be built into a collaborative filtering system to improve performance. Now that we understand how the algorithm works, let’s look at the technology we’ll be using to build the recommendation system: Spark.

Why Spark?

Apache Spark is a fast and general-purpose cluster computing system. As industries develop more creative, customer-based products and services, the need for machine learning algorithms to help develop personalizations, recommendations, and predictive insights becomes much more important.

Tools like R and Python have long been used to perform a variety of machine learning tasks. However, with the vast growth of data, computing efficiency has become critical for coping with high time and space complexity.

Furthermore, Spark provides data engineers and data scientists with a powerful, unified engine that is not only fast (up to 100x faster than Hadoop MapReduce for large-scale data processing) and easy to use, but also highly scalable and well integrated with other tools and languages, like R, SQL, Python, Scala, and Java.


My favorite thing about Spark is that it helps us data scientists solve highly complex machine learning problems involving graph computation, streaming, and real-time interactive query processing, all at a much greater scale.

A step by step implementation using PySpark

We will now implement a collaborative filtering algorithm to build an audio recommendation system using the Audioscrobbler dataset (download the compressed archive here).

Import libraries and data sets

Once you have the three files in the dataset, you can start a PySpark session. The first step in building a model is understanding your data and parsing it into forms that are useful for analysis in Spark. Let’s start by importing libraries and initializing your SparkContext to begin coding with PySpark.

import findspark
import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.mllib import recommendation
from pyspark.mllib.recommendation import *

'''initialize Spark in the VM'''
findspark.init('/usr/local/bin/spark-1.3.1-bin-hadoop2.6/')
try:
    sc = SparkContext()
except:
    pass  # a SparkContext already exists, so reuse it

Next, store each file in a variable. user_artist_data.txt holds the playlist data: each line contains a user ID, an artist ID, and a play count, separated by spaces. artist_data.txt maps opaque numeric IDs to artist names. artist_alias.txt maps artist IDs that may be misspelled or nonstandard to the ID of the artist’s canonical name; it contains two IDs per line, separated by a tab.

'''define variables'''
rawUserArtistData = sc.textFile("vagrant/user_artist_data.txt")
rawArtistData = sc.textFile("vagrant/artist_data.txt")
rawArtistAlias = sc.textFile("vagrant/artist_alias.txt")
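
Before parsing anything, it’s worth peeking at a few raw lines to confirm the formats described above (a quick sanity check; the exact output depends on your copy of the dataset):

'''inspect a few raw lines of each file'''
print(rawUserArtistData.take(2))  # "userID artistID playCount", space-separated
print(rawArtistData.take(2))      # "artistID<TAB>artistName"
print(rawArtistAlias.take(2))     # "badID<TAB>goodID"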

Preprocess data

We want to obtain a list of raw artist data with each ID and name stored as a tuple. Let’s use artist_data.txt to create this list.

def pairsplit(singlePair):
    '''parse an "ID<TAB>name" line into an (int ID, name) tuple'''
    splitPair = singlePair.rsplit('\t')
    if len(splitPair) != 2:
        return []
    else:
        try:
            return [(int(splitPair[0]), splitPair[1])]
        except:
            return []  # skip malformed lines

artistByID = dict(rawArtistData.flatMap(lambda x: pairsplit(x)).collect())

We also use artist_alias.txt to map “bad” artist IDs to “good” ones, instead of just using it as raw pairs of artist IDs. We convert bad IDs to good ones using the code below. The first entry, for instance, maps ID 6803336 to 1000010, which means it maps “Aerosmith (unplugged)” to “Aerosmith.”

def aliaslookup(alias):
    '''parse a "badID<TAB>goodID" line into an (int, int) tuple'''
    splitPair = alias.rsplit('\t')
    if len(splitPair) != 2:
        return []
    else:
        try:
            return [(int(splitPair[0]), int(splitPair[1]))]
        except:
            return []  # skip malformed lines

artistAlias = rawArtistAlias.flatMap(lambda x: aliaslookup(x)).collectAsMap()
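
We can sanity-check the mapping with the Aerosmith example above (assuming those two IDs appear in your copy of the dataset):

'''spot-check the alias mapping'''
print(artistAlias.get(6803336))  # expected: 1000010
print(artistByID.get(1000010))   # expected: 'Aerosmith'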

Build a model

We then create a lookup function to convert the data into Rating objects. Note that MLlib’s recommendation models refer to the items being rated generically as “products”; in our model, the products are artists. We will therefore use user_artist_data.txt to build our training data.

def ratinglookup(x):
    '''parse "user artist count" and resolve the artist to its canonical ID'''
    userID, artistID, count = map(lambda line: int(line), x.split())
    finalArtistID = bArtistAlias.value.get(artistID)
    if finalArtistID is None:
        finalArtistID = artistID
    return Rating(userID, finalArtistID, count)

# trainData is evaluated lazily, so bArtistAlias (broadcast below) only
# needs to exist before the first action runs on it
trainData = rawUserArtistData.map(lambda x: ratinglookup(x))
trainData.cache()

We also create a broadcast variable called bArtistAlias for artistAlias. This makes Spark send and hold in memory just one copy for each executor in the cluster. When there are thousands of tasks, and many execute in parallel on each executor, this can save significant network traffic and memory.

bArtistAlias = sc.broadcast(artistAlias)

Finally, we build our model with the ALS (alternating least squares) collaborative filtering algorithm, as follows. The operation will likely take minutes or more depending on your cluster; it took me around 15 minutes to run the model.

'''build model: rank 10, 5 iterations'''
model = ALS.trainImplicit(trainData, 10, 5)
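
The two positional arguments are the factorization rank (10 latent features) and the number of iterations (5). If you prefer, the same call can spell out the main hyperparameters by name; the lambda_ and alpha values below are the defaults in my PySpark version (check yours), and tuning them is a sensible next step:

'''equivalent call with named hyperparameters'''
model = ALS.trainImplicit(trainData,
                          rank=10,       # number of latent features
                          iterations=5,  # ALS sweeps over the data
                          lambda_=0.01,  # regularization strength
                          alpha=0.01)    # confidence scaling for implicit counts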

We should first see whether the artist recommendations make any intuitive sense by examining a user, his or her plays, and the recommendations for that user. Take, for example, user 2093760. Extract the IDs of the artists this user has listened to and print their names. That means searching the input for this user’s artist IDs, then filtering the set of artists by those IDs so you can collect and print the names in order:

'''test artist'''
spotcheckingID = 2093760
bArtistByID = sc.broadcast(artistByID)
rawArtistsForUser = (trainData
                     .filter(lambda x: x.user == spotcheckingID)
                     .map(lambda x: bArtistByID.value.get(x.product))
                     .collect())
print(rawArtistsForUser)

Get your recommendation

I want to take the top 10 artists based on my data model:

'''output recommendations'''
recommendations = map(lambda x: artistByID.get(x.product),
                      model.call("recommendProducts", spotcheckingID, 10))
print(list(recommendations))  # list() so the names also print under Python 3
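
The model.call(...) trick invokes the recommendProducts method on the underlying JVM model. Recent PySpark versions expose the method directly on the model, so there the following sketch should be equivalent (assuming your version has it):

'''direct API on newer PySpark versions'''
recs = model.recommendProducts(spotcheckingID, 10)
recommendations = [artistByID.get(r.product) for r in recs]
print(recommendations)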

Running this on my Spark VM cluster, I got the following output:

  • Jay Z, 50 Cent, Snoop Dogg, 2Pac, Nas, Kanye West, Outkast, Eminem, Dr. Dre, and Ludacris.
A screenshot of my model output. Note that your top 10 artists may come back in a different order.

Yup, these artists look like a mix of rappers! Remember, this set of artists was very popular in 2005, the year this dataset was pulled.

A possible next step? We could pull a set of recommended songs from another dataset of available songs and query it for our top artists.
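
A rough sketch of what that could look like, with a tiny invented songs dataset (the (artistID, songTitle) pairs below are made up for illustration):

'''hypothetical next step: artist recommendations -> song recommendations'''
# Invented (artistID, songTitle) pairs standing in for a real songs dataset.
songsByArtist = sc.parallelize([
    (1000010, "Dream On"),
    (1000010, "Sweet Emotion"),
])

# IDs of the top recommended artists for this user.
topArtistIDs = set(r.product
                   for r in model.call("recommendProducts", spotcheckingID, 10))

# Keep only songs whose artist appears in the recommendation list.
print(songsByArtist.filter(lambda kv: kv[0] in topArtistIDs).collect())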

Hope you enjoyed it, and happy hacking!

Code Reference

You can find my original code here: https://github.com/moorissa/audiorecommender


Moorissa is a graduate student currently studying machine learning at Columbia University, with the hope that someday she can leverage these skills to make the world a better place, one day at a time.