SAGE: an artificially intelligent band recommender

hate5six.com
Sep 8, 2017 · 22 min read

sage
/sāj/
noun
1. a profoundly wise person
2. an artificially intelligent band recommender developed by hate5six

One of the primary functions of hate5six has always been to connect people with music in a variety of ways. Aside from showcasing new artists on the site, a common theme in the trajectory of the project has been to leverage data in order to help facilitate band recommendations.

The neighborhood of bands around Youth of Today. For a given band, the adjacency view displays that band’s relations and some of the relations of those bands. If you think of it like a social network, it shows you your friends and some of your friends’ friends.
A demo of the community detection approach via Louvain modularity to identify communities of bands. This algorithm attempts to find a graph partition such that edges within a community are dense compared to edges between communities. You can think of it as finding a tight knit group of close friends, even though each person in that group might be loosely connected to different groups with other people.
A snapshot of the lyric-based approach to similar band clustering. This algorithm was able to find very intuitive relationships such as Agnostic Front + Madball, then identifying Sick of It All as most similar to the Agnostic Front/Madball cluster. Gorilla Biscuits and Youth of Today form a cluster with In My Eyes, likely due to the shared themes of positivity and clean living. Floorpunch and Negative Approach merge with The Rival Mob and Agitator, which makes sense given how lyrically hostile each band is. This was done by taking a collection of lyrics for each band and producing a term frequency-inverse document frequency (tf-idf) weighted vector of n-grams (single words up to sequences of 3 words) and running hierarchical agglomerative clustering using cosine similarity between the vectors.
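To make the lyric-based pipeline concrete, here is a minimal hand-rolled sketch. It uses unigram tf-idf rather than the 1–3-gram features described above, invented one-line "lyrics" rather than real corpora, and a greedy merge loop in place of a library implementation such as scikit-learn's, but the shape of the algorithm is the same: weight terms by tf-idf, then repeatedly merge the most cosine-similar pair of clusters.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weighted bag-of-words vectors (unigrams only, for brevity)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def agglomerate(names, vectors):
    """Repeatedly merge the most cosine-similar pair of clusters; return the merge order."""
    clusters = [({name}, vec) for name, vec in zip(names, vectors)]
    merges = []
    while len(clusters) > 1:
        i, j = max(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        merged = (clusters[i][0] | clusters[j][0],
                  dict(Counter(clusters[i][1]) + Counter(clusters[j][1])))
        merges.append(merged[0])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

lyrics = {  # hypothetical stand-ins for each band's collected lyrics
    "gorilla biscuits": "positive youth stay true start today",
    "youth of today": "positive youth break down the walls",
    "agnostic front": "streets of new york city blood and honor",
    "madball": "streets of new york hold it down",
}
for m in agglomerate(list(lyrics), tfidf_vectors(list(lyrics.values()))):
    print("merged:", sorted(m))
```

On these toy inputs the shared "streets of new york" vocabulary pulls Agnostic Front and Madball together first, and the shared "positive youth" vocabulary pairs Gorilla Biscuits with Youth of Today, mirroring the clusters described above.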

Who/What is Sage?

SAGE stands for: Sage Analyzes Graph Embeddings. A cute recursive acronym, but what does it mean? Before we explore “graph embeddings” and how Sage learns and leverages them, let’s first investigate how we can represent bands in a mathematical way. Throughout the remainder of this piece you will encounter the word vector, which you can think of simply as a list of numbers. Very loosely, a vector space is a collection of vectors that you can apply operations to, such as adding two vectors together to get a new vector. You can also compute how similar or “close” two vectors are, which will be important. (Note: for the mathematically inclined readers, I am sacrificing mathematical rigor for these hand-wavy definitions in order to maintain the readability and accessibility of this post.) Suppose we represent each band as a vector with a single 1 in the dimension reserved for that band:

Black Flag: [1, 0, 0, 0]

Minor Threat: [0, 1, 0, 0]

SSD: [0, 0, 1, 0]

Justin Bieber: [0, 0, 0, 1]

You can think of these as points existing in a 4-dimensional vector space. We can’t visualize in higher than 3 dimensions, so consider a 2 dimensional case where Black Flag is on the x-axis and Minor Threat is on the y-axis:


vAlice = [1, 1, 0, 0]

vBob = [0, 0, 1, 0]

vCindy = [0, 1, 0, 0]

vDarien = [0, 0, 0, 1]

Since Alice likes several bands, her vector exists in a region that is a combination of those bands, while Bob’s vector lies along just the band/dimension he cares about. Alice and Cindy share one band of interest (Minor Threat), so the angle between their vectors is less than 90°, namely 45°, and cosine(45°) = 0.707. Alice and Bob, on the other hand, share no bands, so their vectors are 90° apart.


cosine_similarity(vCindy, vBob) = 0

cosine_similarity(vCindy, vDarien) = 0

What this says is that Cindy and Bob are as dissimilar as Cindy and Darien. But that doesn’t make intuitive sense. Our knowledge about the world should tell us that Cindy and Bob should be closer together, since they both listen to punk bands, while Darien, who only listens to Justin Bieber, should be more dissimilar. Our model doesn’t capture that. Why not?
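Here is the cosine computation in code, a minimal sketch using the toy vectors above:

```python
import math

def cosine_similarity(u, v):
    """cos(angle between u and v) = dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v_alice  = [1, 1, 0, 0]   # Black Flag, Minor Threat
v_bob    = [0, 0, 1, 0]   # SSD
v_cindy  = [0, 1, 0, 0]   # Minor Threat
v_darien = [0, 0, 0, 1]   # Justin Bieber

print(round(cosine_similarity(v_alice, v_cindy), 3))  # 0.707 -- one shared band
print(cosine_similarity(v_cindy, v_bob))              # 0.0 -- orthogonal
print(cosine_similarity(v_cindy, v_darien))           # 0.0 -- also orthogonal: the problem
```

The last two lines are the issue: one-hot vectors make every pair of single-band listeners equally dissimilar, no matter how related the bands are.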

“Term frequency”, or tf, is simply the number of times term x appears in document y. However, we want to diminish the importance of very common words such as “the”, which is handled by the “inverse document frequency” component. As the document frequency df_x of term x approaches N (the total number of documents), the ratio N/df_x goes to 1, the value of log(N/df_x) approaches 0, and that nukes whatever contribution is made by the tf term. For example, if every band sings about “breaking the chains”, then words like “chain” and “chains” will occur very frequently and will not be a strong signal for differentiating between bands. The idf (and therefore the tf-idf) value of these words would effectively be 0.
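That intuition is easy to verify in code. A toy sketch, with made-up one-line "lyric sheets" where every band mentions chains:

```python
import math
from collections import Counter

docs = [  # hypothetical lyric sheets; every one mentions "chains"
    "break the chains break free",
    "chains of tradition hold us down",
    "wearing the chains of the past",
]
N = len(docs)
tokenized = [d.split() for d in docs]
# document frequency: in how many documents does each term appear?
df = Counter(t for d in tokenized for t in set(d))

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)
    return tf * math.log(N / df[term])

print(tfidf("chains", tokenized[0]))  # df = N, so log(N/N) = 0: no signal
print(tfidf("free", tokenized[0]))    # rare term: positive weight
```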
A word cloud generated from Martin Luther King Jr.’s “I Have A Dream” speech. The size of each word is determined by its term frequency-inverse document frequency (tf-idf) value. It’s a rudimentary but sensible way to quickly find the salient words/topics of a document.

Embedding with Artificial Neural Networks

The meaning of a word is its use in the language. — Ludwig Wittgenstein, Philosopher, 1953

You shall know a word by the company it keeps. — J.R. Firth, Linguist, 1957

In 1954, linguist Zellig Harris stated what is known as the Distributional Hypothesis: words that occur in similar contexts tend to have similar meanings. What have we been trained to do whenever we encounter new words in text? Look for context clues. Given an unknown word, we can infer its meaning by looking at the distribution of words around it. Therefore, we might expect words that have similar meanings to have similar distributions of words around them. The question is then: can we train an algorithm to learn this without any supervision? With a sufficiently large dataset, the answer is yes. It is done with an algorithm called word2vec. (Note: you can also do it with GloVe, but the current implementation of Sage uses word2vec, which will be the focus for the remainder of this piece.)

Aravind went to the ______ and bought groceries.

We know from our knowledge of the world (i.e. having learned from a sufficient amount of data through life experience) that the missing word is most likely “store” or “market” and probably not “school” or “zoo”.

A diagram of a simple shallow feed-forward artificial neural network
Without an activation function the model just produces linear combinations of the input features. These “activation functions” tend to be non-linear which results in non-linear decision boundaries. Simply put, a “decision boundary” defines a partition. In binary classification, the model learns a boundary such that points to one side of it are most likely one class and points on the other are most likely the second class. In this toy example, if the blue dots represent players who made the team and red dots players who didn’t make the team, you can see how a non-linear decision boundary (dotted line) is able to correctly classify cases that are missed by the linear decision boundary (dashed line). Learning robust decision boundaries that can generalize to new data is one of the things that makes machine learning/artificial intelligence difficult!
The CBOW architecture for a neural network. The input is a set of c context words, each represented by a one-hot encoded vector, and over time the network learns the appropriate weights to predict the most likely target word given that context. The internal set of weights yields the embedding.
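The hidden layer at the heart of CBOW can be sketched in a few lines. This is a toy illustration only (a tiny invented vocabulary, 3 dimensions instead of 100, random untrained weights); a real word2vec implementation also trains the weights by backpropagation so that the hidden vector predicts the target word:

```python
import random

vocab = ["aravind", "went", "to", "the", "store", "and", "bought", "groceries"]
V, D = len(vocab), 3          # vocabulary size, embedding dimension
random.seed(0)

# input weight matrix: one D-dimensional embedding row per vocabulary word
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

def embed(word):
    """Look up a word's embedding: the row of W_in its one-hot vector selects."""
    return W_in[vocab.index(word)]

def cbow_hidden(context):
    """CBOW hidden layer: the average of the context words' embeddings."""
    rows = [embed(w) for w in context]
    return [sum(col) / len(rows) for col in zip(*rows)]

# context for the blank in "Aravind went to the ____ and bought groceries"
h = cbow_hidden(["went", "to", "the", "and", "bought", "groceries"])
print(h)  # this hidden vector is scored against an output matrix to predict "store"
```

After training, it is exactly these W_in rows that are kept as the embeddings.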
An example of an autoencoder that tries to reconstruct an image by way of a compressed representation. There is some signal loss but the compression, by way of the hidden layers, is able to learn the key features of the image.

king - man + woman = queen

Another example you will find in the literature is:

Germany - Berlin + France = Paris

Let that sink in. We have a way to manipulate words mathematically while retaining their semantic meaning. And this is just scratching the surface.
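The arithmetic is exactly what it looks like: subtract and add the embedding vectors, then search for the word whose vector is nearest the result. A sketch with hand-made 2-dimensional toy vectors (real word2vec embeddings are learned, typically 100–300 dimensions):

```python
import math

# toy 2-d "embeddings": dimension 0 ~ royalty, dimension 1 ~ gender
emb = {
    "king":  [0.9,  0.7],
    "queen": [0.9, -0.7],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [0.0,  0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# nearest word to the result, excluding the query words themselves
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(emb[w], target))
print(best)  # -> queen
```

Subtracting "man" cancels the male component of "king" and adding "woman" restores royalty with the opposite gender, which is why the nearest remaining vector is "queen".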

The context subgraph based on the 1,100 bands currently on hate5six. The highlighted nodes are the related bands in the immediate vicinity of The Exploited.
Random walks of six steps around Inside Out:
inside out -> turning point -> dys -> death threat -> killing time -> caught in a trap -> district 9
Walk 1: inside out -> 108 -> coliseum -> heiress -> pulling teeth -> dangers -> cult leader
Walk 2: inside out -> hope conspiracy -> killing the dream -> poison the well -> la dispute -> brooks was here -> crows-an-wra
Walk 3: inside out -> burst of rage -> corrective measure -> mindset -> chain of strength -> 108 -> sinking ships
Walk 4: inside out -> unbroken -> give up the ghost -> backtrack -> caught in a trap -> icepick -> casey jones
Walk 5: inside out -> rival mob -> spine -> clear -> line of sight -> exit order -> odd man out
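Underneath, walks like these come from repeatedly hopping to a random neighbor in the band graph. This sketch uses a tiny hypothetical adjacency list, not the real hate5six graph; each walk is then treated like a "sentence" and fed to word2vec, which is the DeepWalk idea:

```python
import random

# hypothetical adjacency list for a few bands
graph = {
    "inside out": ["turning point", "108", "unbroken"],
    "turning point": ["inside out", "dys"],
    "108": ["inside out", "coliseum"],
    "unbroken": ["inside out", "give up the ghost"],
    "dys": ["turning point"],
    "coliseum": ["108"],
    "give up the ghost": ["unbroken"],
}

def random_walk(graph, start, steps, rng):
    """Take `steps` random hops from `start`, returning the visited sequence."""
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)
for _ in range(3):
    print(" -> ".join(random_walk(graph, "inside out", 6, rng)))
```

Bands that co-occur in many walks play the role of words that co-occur in many sentences, which is what lets word2vec embed them.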

Visualizing the Embeddings

Now that we’ve produced 100-dimensional embeddings for every band, the question becomes: how well do these vectors geometrically capture band relations? Let’s literally look and see.

Visualization of the embeddings using t-SNE

Sage in Action

Do you like Incendiary and Indecision? Here are some bands Sage thinks you might like:


Kid Rock - rap (i.e., DMX) = more country/southern rock-leaning bands

Conclusions

We’ve leveraged publicly available data about communal listening habits across over 200,000 bands and developed a novel model for finding new music. The model has learned fairly robust mathematical representations of bands that preserve their “context”: bands that share members, have similar tempos, or are lyrically and thematically related tend to cluster together in the embedded space. This enables users to define taste profiles capturing what they do and don’t like, and those profiles correspond to a well-defined set of mathematical operations on the embedded representations of bands.
