# SAGE: an artificially intelligent band recommender

sage

sāj/

noun

1. a profoundly wise person

2. an artificially intelligent band recommender developed by hate5six

One of the primary functions of hate5six has always been to connect people with music in a variety of ways. Aside from showcasing new artists on the site, a common theme in the trajectory of the project has been to leverage data in order to help facilitate band recommendations.

When looking for new music, one might ask a friend for suggestions based on what they do and don’t like. That assessment may be rooted in anything from acoustic similarity to lyric similarity to band member overlap to bands that play together often to something else entirely. If you ask enough people for recommendations based on your taste profile, you might expect the recommendations to converge after a while. In other words, there is value in asking “the crowd”.

In the past I’ve experimented with a number of band recommendation systems. In 2014 I launched adjacency as a tool for visualizing neighborhoods of bands that are “related”. The website last.fm tracks what bands people listen to and aggregates those listening habits. If Alice and Bob listen to Minor Threat and SSD, that tells us something about those bands even if we know nothing about them. If 1,000 or 10,000 or 100,000 other people also listen to Minor Threat and SSD, then that signal becomes stronger and we can naturally be more confident in the relationship. adjacency provided an interactive way to navigate the cloud of bands around a given artist: bands are nodes in a graph and edges are drawn if they are related according to last.fm, and the opacity of the edge is determined by the similarity score provided by last.fm. Exactly how they compute that score is a bit of a black box but they likely use collaborative filtering, which is (was?) how Netflix generates those (un)useful movie recommendations.

Rather than using last.fm data we could also try using hate5six viewership habits: if multiple people watch videos of the same bands, or alternatively, if someone watches a number of videos of a small group of bands, we can infer some type of approximate similarity. However, last.fm aggregates significantly more data and is a much cleaner and more reliable dataset than hate5six statistics. Though it never left the lab, I did experiment with viewership data and an approach called community detection. In this experiment I constructed a graph of related bands using viewership data, increasing the strength of the edge between bands as more people watched videos of both. I then looked for the most prominent bands using a method called PageRank, which is how Google ranks their search results, and then ran community detection on these “important” bands. You can see a demo of that research here.

Another approach I took analyzed common word/phrase usage and built up clusters of bands that are lyrically similar. While the de-facto method for many would be suggesting bands that sound similar, this was novel in that it was a simple approach to finding bands that are *thematically* similar. What’s more, the algorithm “learned” these relationships simply by analyzing what amounts to word and phrase frequency in the lyrics.

All of these approaches can lead to sensible band recommendations, but they require significant interpretation from the user. They are useful from an exploratory point of view but neither provides a sensible, natural way to interact with the underlying data in a meaningful fashion. There is no way to clearly define a taste profile and receive suggestions in a principled manner. Until today.

Say hello to Sage.

Sage is the first artificially intelligent band recommender developed by hate5six labs that learns from the crowd.

The purpose of this post is to introduce you to the machine learning mechanics that make Sage tick. Though parts unavoidably dive into computer science jargon, I am assuming for the most part the reader has no prior mathematical or computer science experience. Questions are encouraged in the comments and/or email to hate5sixproductions at gmail dot com.

**Who/What is Sage?**

SAGE stands for: *Sage Analyzes Graph Embeddings*. A cute recursive acronym, but what does it mean? Before we explore “graph embeddings” and how Sage learns and leverages them, let’s first investigate how we can represent bands in a mathematical way. Throughout the remainder of this piece you will encounter the word **vector**, which you can think of simply as a list of numbers. Very loosely, a **vector space** is a collection of vectors that you can apply operations on such as adding two vectors together to get a new vector. You can also compute how similar or “close” two vectors are, which will be important. (Note: for the mathematically inclined readers, I am sacrificing mathematical rigor for these hand-wavy definitions in order to maintain the readability and accessibility of this post).

Computers understand numbers better than they understand words. By design they can manipulate numbers easily, so the first task is to transform our data (bands and taste profiles) to some meaningful numerical form with a geometric interpretation. Better yet, we will aim to transform bands into some type of vector. In the field of **natural language processing/NLP **(a branch of artificial intelligence focused on processing and understanding text) a common representation is called *one-hot encoding.* Where the “vocabulary” in NLP is typically the set of English words, our vocabulary will be a set of band names.

Suppose our vocabulary or universe of bands consists of the following only: Black Flag, Minor Threat, SSD, and Justin Bieber. If we define a vector with four entries, we can assign each band a unique list of numbers such that only one entry is 1 and the rest are 0 (hence the name, “one hot” encoding). For example:

Black Flag: [1, 0, 0, 0]

Minor Threat: [0, 1, 0, 0]

SSD: [0, 0, 1, 0]

Justin Bieber: [0, 0, 0, 1]

You can think of these as points existing in a 4-dimensional vector space. We can’t visualize in higher than 3 dimensions, so consider a 2 dimensional case where Black Flag is on the x-axis and Minor Threat is on the y-axis:

One way to quantify how similar two vectors are is by computing the cosine of the angle between them. Under the one-hot encoding, a band’s vector exists along one dimension and is therefore **orthogonal **to all other vectors, meaning they are at right angles (90°) to one another. You may or may not recall from geometry that cosine(90°) = 0. As a technical aside, we are not allowing any of the entries in the vector to be negative, so the value of the cosine between any two vectors will necessarily be between 0 and 1, where completely dissimilar vectors will have a value of 0 and identical vectors a value of 1. This gives us some mathematical way of quantifying the similarity of two vectors.

We can define someone’s taste profile simply by adding together the vectors of the bands they like. For example, if Alice likes Black Flag and Minor Threat, and Bob likes SSD and Justin Bieber, and Cindy and Darien only listen to Black Flag and Justin Bieber, respectively, their taste profiles could be defined using the vectors above:

vAlice = [1, 1, 0, 0]

vBob = [0, 0, 1, 0]

vCindy= [0, 1, 0, 0]

vDarien = [0, 0, 0, 1]

Since Alice likes several bands, her vector exists in a region that is a combination of those bands, while Bob’s vector exists along just the band/dimension he cares about. However, since Alice and Bob do share one band of interest, the angle between their vectors is less than 90°, namely 45°, so cosine(45°) = 0.707.

There are a number of problems with this naive approach. First, the size of these vectors grows as rapidly as the space of bands of interest. If we want to capture and represent taste profiles across 200,000 bands, then we will need entries for each one, which quickly becomes intractable. Furthermore, taste profile vectors will be very **sparse**, meaning they will be mostly filled with 0’s, and this is problematic for many machine learning algorithms. The major shortfall though is that there is a fundamental disconnect between our mathematical representation and what we know about the world. Cindy and Bob share no bands in common, and neither do Cindy and Darien, so what we find is:

cosine_similarity(vCindy, vBob) = 0

cosine_similarity(vCindy, vDarien) = 0

What this says is that Cindy and Bob are *as dissimilar* as Cindy and Darien. But that doesn’t make intuitive sense. Our knowledge about the world should tell us that Cindy and Darien **should** be closer together since they both listen to punk bands. Darien, who only listens to Justin Bieber, is likely more dissimilar but our model doesn’t capture that. Why not?

While we’ve found a way to represent bands and people’s taste profiles, we are not capturing the context or semantics of what they represent. We’ve simply devised an encoding but fail to capture the *meaning*. Therefore, what we need is a way to encode the data we care about in such a way that things that are “close” in the world are mathematically similar. We also want our representation to not be bound to the size of the bands in the space, nor be sparse. We want **dense** vectors with many non-zero entries and a **fixed **number of dimensions.

Before we go down that route, let’s discuss how vectors were used in the lyric analysis research. There, the vectors were constructed using a vocabulary of tens of thousands of words and sequences of words. For example, if a band has the lyric “break the chains” there would be entries in the vector for “break”, “break the”, and “break the chains”. As you can imagine, the vectors end up having an obscenely large number of entries. The value of each entry in the vector was based on the tf-idf (term frequency-inverse document frequency) of that word/phrase.

While we could just use the number of occurrences of that word/phrase, it’s far more informative to use their tf-idf weighting, which is a way to scale their importance: words that occur frequently across the all bands have low weight (think: words like “the”) but words that appear in small bursts are likely more important. One way to restrict the size of the vector is to set a tf-idf threshold and only retain words in the vocab that meet or exceed that value. This effectively filters out unimportant words from the start and you can construct a vector with just the top n number of words. If you’ve ever seen a word cloud you’ve seen tf-idf in action. There are other statistical methods for achieving this, namely starting with a set of large vectors and determining which components, or **features**, are important and which can be discarded. This family of methods is known as dimensionality reduction.

The size of the words in a word cloud is based on the tf-idf score of that word in the document(s)! Not bad. And by looking at how similar these simple lyric vectors are we are able to produce the band clusters demonstrated earlier.

### Embedding with Artificial Neural Networks

The meaning of a word is its use in the language. — Ludwig Wittgenstein, Philosopher, 1953

You shall know a word by the company it keeps. — J.R. Firth, Linguist, 1957

In 1954, linguist Zellig Harris stated what is known as the Distributional Hypothesis. His claim was words that occur in similar contexts tend to have similar meanings. What have we been trained to do whenever we encounter new words in text? Look for context clues. Given an unknown word we can infer its meaning by looking at the distribution of words around it. Therefore, we might expect words that have similar meaning to have similar word distributions around them. The question is then: can we train an algorithm to learn this without any supervision? With a sufficiently large dataset, the answer is yes. It is done with an algorithm called word2vec (Note: you can also do it with GloVe, but the current implementation of Sage uses word2vec which will be the focus for the remainder of this piece.)

There are two architectures that can be used to train this model: Continuous Bag of Words (CBOW) and skip-grams.

The CBOW architecture tries to solve the problem: Given a context, predict the target word. For example, consider the following sentence:

Aravind went to the ______ and bought groceries.

We know from our knowledge of the world (i.e. having learned from a sufficient amount of data through life experience) that the missing word is most likely “store” or “market” and probably not “school” or “zoo”.

The skip-gram architecture tries to solve the other problem: give a word, predict the context.

To see how these problems are solved, we must develop a basic understanding of how the model actually works. One such computational model is called an artificial neural network (ANN), named after the fact that it loosely mimics how biological neurons in the brain operate and interact.

In its most basic form, an ANN consists of an input layer of nodes (input features), a hidden layer of nodes (the compression layer), and an output layer of nodes (the prediction), where nodes in adjacent layers are connected with directed edges:

Let’s suppose we want to predict whether an athlete will make the cut on a baseball team. Some “features” of the athlete we may consider could be: their age, height, weight, batting average, runs batted in, on base percentage, number of home runs, number of walks, number of errors, etc. In other words, each player could be represented by a vector consisting of these statistics. The ANN would have an input node for each of these features, and if we are trying to predict the binary case of “made the team” vs “didn’t make the team” the output layer would have two nodes with a prediction probability for each (technical point: you can get away with one node for binary classification. Can you see why?) The hidden layer can have an arbitrary number of nodes (we can also have an arbitrary number of hidden *layers*, but that ends up posing serious technical challenges which are beyond the scope of this primer, see: vanishing gradients). The nodes in the hidden layer essentially try to capture the *hidden *interactions between the features: it might learn that a weighted combination of batting average and number of errors is a strong predictor versus just either feature alone. The output nodes are similarly some combination of the hidden nodes. This model is just combining numbers and with a little massaging (i.e. applying a softmax) the output nodes will be a probability distribution over the classes (i.e. 70% “made the team”, 30% “didn’t make the team”).

The parameters the model attempts to learn are the *weights* along the edges. How does it do this? We can start by randomly initializing the weights in the network and feed it tons of data. Suppose we have historical stats for every player who ever made/didn’t make the team. One by one we could feed the player’s stats into the network via the input layer. The value at each node in the hidden layer will then take into account the values of the input features combined with the weights that are incident to it. Each hidden node will also apply something called an activation function which determines whether the combined signal is strong enough to “fire” to the next layer. In the biological setting, neurons in your brain are constantly firing signals to the next neuron, but if the signal is too weak then that next neuron may not fire.

When we get to the output layer we will have a prediction for that player. But since we know whether that player made the team or not, we can compare the output and quantify the degree to which it was correct or incorrect via something called an objective function or loss function or cost function.

If the prediction is wrong, the loss function tells the ANN how far off it was. The ANN can then readjust the weights using a method called backpropagation. What backpropagation does is modify the internal edge weights/parameters using gradient descent; it determines how much the weights need to be adjusted by such that the model does a better job at minimizing the value of the loss function. For example, the initial model with random weights might place too much emphasis on a player’s age, but as the network sees more data it may decrease its weight, thereby reducing the importance of age as a feature. In other words, the model adjusts itself or “learns” to correctly predict the output for that sample.

This process repeats for all samples in the input dataset, with potentially many repeated passes, or **epochs**, over the dataset. Over time, the hope is that the model slowly converges. By seeing enough samples it attempts to **learn **associations between the input features and the output classes. This is called model **training**. Once a model has been trained, we can feed it new data (i.e. a new player’s stats) and make a prediction (will they make the team?)

What you have is a complex network of simple interconnected machines that take signal(s) as input and either relay or don’t relay to the next. It is fascinating that human cognition, in all its intricacies and endless capacity for thought, emotion and creativity, essentially emerges from this style of network. Like ants in an ant colony, the collective transcends the individual. Though ANN’s are a very rough mathematical approximation with their own set of inherent limitations, they too have the ability to capture meaningful patterns and generalizations.

Under the CBOW architecture of word2vec, the input to the neural network is a sequence of one-hot encoded vectors corresponding to each word in the context, x. The network tries to then predict the vector for the target word, and in the process tries to learn a (compressed) representation of the input by way of the hidden layer. By feeding it a large corpus of text, over time the model should (and does) yield semantically consistent sets of weights. In other words, it captures Harris’ initial observation: words that are semantically similar (i.e. “store” and “market”) will have similar representations within the network.

What we see is that word2vec allows us to automatically learn distributed representations of things simply by feeding it a large amount of data. As the model trains it learns the context of each item and adjusts its internal representation (weights) accordingly. In the end the model is able to produce a condensed set of numbers (derived from the weights) for each item in the vocabulary. This set of weights becomes, you guessed it, the vector representation or **embedding **for the item!

One parameter to word2vec is the number of weights to use, which is completely determined by the user. You can think of this parameter as the degree to which you want to **compress **the input. We start out with something big and we want to squeeze it into a smaller, more compact representation. There might be some information loss but an efficient compression will retain most of the important detail. If you think about images, for example, compressing them into smaller formats might cause blurriness and some details to be lost, but the essence of the original image remains. The number of weights in word2vec allows us to control the degree to which we want to shrink the input.

There is a flavor of neural networks called autoencoders, in which the network tries to reconstruct the input by way of a small hidden layer. Because the hidden layer consists of fewer nodes, the model essentially learns a compressed representation of the input. For example, an autoencoder could take in as input a 100x100-pixel image (represented by a vector of 100x100 = 10,000 numbers) and attempt to predict the same set of numbers (or pixel values) after passing through a layer with 1,000 hidden nodes. If done efficiently, that smaller set of nodes from the hidden layer could be used as a set of core or important features from the original image. Ultimately, this is another form of dimensionality reduction.

So this model takes a word and “embeds” in into a vector space, as defined by the learned weights. But what is the intuition for why it works? In literature, the **context **of “king” tends to be similar to the context of “man” to the same degree that the context of “queen” is similar to that of “woman”. Recall in the earlier example, the context of “store” and “market” are related. As a result, word2vec’s representation for these words preserves the semantics and allows for solving analogies like “king is to man as queen is to ???”, simply by performing algebraic operations on the corresponding vectors:

king - man + woman = queen

Another example you will find inthe literature is:

Germany - Berlin + France = Paris

Let that sink in. We have a way to manipulate words mathematically while retaining their semantic meaning. And this is just scratching the surface.

How can we capture the context of a band in a way that is similar to how large amounts of text data can capture the context of a word? We know that last.fm’s data gives us some sensible notion of neighborhoods around bands, and we know that word2vec was designed for contextually analyzing sequences of tokens across a large number of documents. Let’s take a cue from previous work around something called DeepWalk, which was applied here, and construct a band graph and aimlessly walk around it.

Taking lessons learned from the work on adjacency, we can take a seed set of bands and query last.fm for all artists similar to them, then apply the process again to the new bands we discover. This defines what is called a **2-hop network** since for each band we expand the network around it by two sequential steps based on the bands connected to it. In this work, I began with a seed set of 1,100 bands, specifically the bands currently on hate5six. Generating the 2-hop network produces a graph with over 218,000 nodes (bands) and over 2.2 million edges.

This network is very dense and difficult to visualize in its entirety. The large mass in the image below is the subgraph conditioned on the 1,100 bands currently in hate5six, and the relations as defined by last.fm.

As stated earlier, we want to produce *sequences *that capture the context of a band. To achieve this, we can *walk* the graph and emit the band name at each step. If we had probabilities attached to each edge then from a given node we could walk along the edge with the highest probability to reach the next band. In the above example we might expect Discharge to be more similar to The Exploited than Terror, and therefore Discharge would have a higher transition probability from The Exploited. From Discharge we’d be more likely to move to GBH than, say, Gorilla Biscuits. In probability theory, this notion of “walking” based on only the current state and the transition probabilities to the next is called a **Markov process** or **Markov chain**.

This initial implementation of Sage assumes the transition probability distribution is uniform. In other words, from a given band there is equal probability of transitioning to any of the bands it is related to. This is called a **random walk **because we will be aimlessly walking around the graph for a fixed number of steps. There is no real reason why random walks were chosen over Markov chains, other than it being a quicker implementation. Last.fm *does* provide similarity scores between bands but they would need to be converted to transition probabilities. This will be left for future research. Regardless, we will see that random walks do yield meaningful results.

The illustration below shows 15 bands that are immediately similar to Inside Out, again according to last.fm. The red edges correspond to one possible random walk we could take:

inside out -> turning point -> dys -> death threat -> killing time

-> caught in a trap -> district 9

At each step we are considering all the possible edges we could take *from* that band and choosing one at random. From Turning Point we randomly chose DYS, and from DYS we randomly chose Death Threat, and so on. Note: each of the bands along the red path have their own networks, but they are not displayed here for simplicity. We could walk the graph indefinitely if we wanted, but the hope is that short walks will keep us in a close neighborhood around the band of interest. Walking around bands allows use to explore other bands on the periphery of the immediate neighborhood. Nothing beats surprise discoveries when seeking new music! Furthermore, because these walks are random we would expect another path to traverse any of the other blue edges first, and proceed accordingly. We do allow loops so it could be possible for the path to return to a band it has already seen, including the starting band.

The following is a sample of actual random walks we could take:

Walk 1: inside out -> 108 -> coliseum -> heiress -> pulling teeth

-> dangers -> cult leader

Walk 2: inside out -> hope conspiracy -> killing the dream

-> poison the well -> la dispute -> brooks was here -> crows-an-wra

Walk 3: inside out -> burst of rage -> corrective measure -> mindset -> chain of strength -> 108 -> sinking ships

Walk 4: inside out -> unbroken -> give up the ghost -> backtrack

-> caught in a trap -> icepick -> casey jones

Walk 5: inside out -> rival mob -> spine -> clear -> line of sight

-> exit order -> odd man out

So where word2vec is trained on a large corpus of millions of words in order to learn their contextual meaning, we are feeding it sequences of band paths and asking it to learn which bands are contextually similar.

Sage is aware of 218,898 bands (and growing!) that it learned from last.fm. For each band, Sage takes a fixed number of random walks, each of which consists of a fixed number of steps. This yields a corpus of 33 million band names. This is fed into a CBOW-flavored artificial neural network and trains for a fixed number of epochs and yields 100-dimensional embedding vectors for each band. A number of experiments were conducted varying the number of random walks, the number of steps to take, the size of the embedding, and the type of architecture (CBOW or skip-gram). Qualitatively, the best results were seen using the current set of parameters, but this is a potential area for future refinement.

### Visualizing the Embeddings

Now that we’ve produced 100-dimensional embeddings for every band, the question becomes: how well do these vectors geometrically capture band relations? Let’s literally look and see.

Because it’s impossible to visualize things in 100 dimensions, we must find a way to reduce the dimensionality while preserving the general **structure **of the space. In mathematics, a **projection** is a mapping from one set or structure into a subset or sub-structure. If you’ve ever made hand shadow puppets using a flashlight, you’ve *projected* the 3D geometry of your hands onto a lower dimensional space, namely a 2D wall, while preserving the key structures.

To visualize our embeddings we’ll use a method called t-Distributed Stochastic Neighbor Embedding, or t-SNE. Note: this “neighbor embedding” should not be confused with the embeddings Sage learned. Loosely, we are going to use this method to embed the embeddings.

Points in 2D space exist on a **surface** or **plane**. Our 100-dimensional embeddings exist on a **manifold**, which you can think of as a surface in high dimensions. t-SNE attempts to map the data into a lower dimensional space and is sensitive to local structure on the manifold. In other words, similar/close objects on the manifold are close together in the projection, and dissimilar objects are further apart.

The following 2D scatter plot shows the results of running t-SNE on the embeddings for a subset of bands.

This graph demonstrates how well we’ve been able to model band similarity with this approach. Bane and Have Heart are close together, there is a cluster of straightedge bands: Chain of Strength, Floorpunch, Project X, First Step, Mindset, Turning Point, etc. There’s a cluster of Dan Yemin’s bands: Paint It Black, Kid Dynamite, Lifetime. Civ, Shelter and Kill Your Idols represent a cluster of more upbeat hardcore bands. Protester, No Tolerance, Freedom, Violent Reaction, and Rival Mob are all neighbors, which makes sense as they are relatively recent hardcore bands with a more punk influence. Limp Wrist appears close to Los Crudos, as expected, and both are near Negative Approach, likely because of their shared tempo and tendency for vocal shrieking.

Sage is good. But Sage is not immune to failure.

We might expect to see American Nightmare close to True Love, as the former heavily influenced the latter, but Sage has failed to learn that.

### Sage in Action

Do you like Incendiary and Indecision? Here are some bands Sage thinks you might like:

If you happen to hate Minor Threat (wtf???) Sage recommends these heavier bands:

If you like your Inside Out with a side of Krishna Consciousness (Prema), Sage suggests 108 and Shelter. I like this result because it coincides with the fact that after Vic DiCara departed from Inside Out and became a Krishna devotee, he joined Shelter and later formed 108. Perhaps this is confirmation bias, but it’s a cool result nonetheless because Sage was not explicitly told about this fact.

Do you enjoy music from the culture vulture, Kid Rock? You’ll get a mix of southern rock, hard rock, hip hop influenced nu-metal, and country:

But what if you hate rap (DMX, for example)?

We’ve essentially shifted the context and removed all bands with a hip hop influence. In other words:

Kid Rock - rap (ie DMX) = more country/southern rock leaning bands

### Conclusions

We’ve been able to leverage publicly available data about communal listening habits across over 200,000 bands and developed a novel model for finding new music. The model has been able to learn fairly robust mathematical representations of bands that preserves their “context”: bands that share members, have similar tempos, are lyrically and thematically related, tend to cluster together in the embedded space. This enables the user to define taste profiles capturing what they do and don’t like, and that corresponds to a well-defined set of mathematical operations on the embedded representations of bands.

Conveying scientific and technical ideas is a tall order, so hopefully this post has been accessible to readers with no experience or prior knowledge of the field. If you’ve been able to follow the gist of things, congratulations! Sage employs some of the most recent advances in machine learning and artificial intelligence and walking away with at least a basic understanding of the mechanics contained herein is commendable. If you have questions or comments, feel free to leave them below or write to hate5sixproductions at gmail dot com.

Sage is wise, but this is just the beginning.