What I Learned from Building a Bad Song Recommender

Crawford Collins
Jul 25, 2019 · 3 min read

Find it at music-recommender.CrawfordC.com

A failed attempt at creating a different type of music recommendation.

Photo by Joseph Pearson on Unsplash

The Data

I scraped as much of genius.com as I could, extracting the sidebar metadata and the full lyrics from every page I could reach.
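The sidebar-plus-lyrics extraction can be sketched with Python's built-in HTML parser. This is a minimal illustration, not the actual scraper: the `metadata` and `lyrics` class names and the sample page are hypothetical, not Genius's real markup.

```python
from html.parser import HTMLParser

class SongPageParser(HTMLParser):
    """Collects text from divs whose class matches a field of interest."""

    def __init__(self):
        super().__init__()
        self.fields = {"metadata": [], "lyrics": []}
        self._current = None  # which field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.fields:
            self._current = cls

    def handle_endtag(self, tag):
        if tag == "div":
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current].append(data.strip())

# Hypothetical page fragment, standing in for a real song page.
page = """
<div class="metadata">Artist: Wu-Tang Clan</div>
<div class="lyrics">Cash rules everything around me</div>
"""
parser = SongPageParser()
parser.feed(page)
print(parser.fields["metadata"])  # ['Artist: Wu-Tang Clan']
```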

Problems with the data:

  1. I assumed that genius.com separated songs from literature. It does not: after a search for a Wu-Tang Clan song, the app recommended a Shakespearean play and a Bing Crosby song.
  2. The data was extremely incomplete. For many songs, the title and artist were the only available fields.

What I should have done.

  1. Tried harder to clean the data.
  2. Dropped the songs that had no information beyond title and artist. If no one is updating a song's page, I assume no one cares about it.
  3. Incorporated more of the available data, such as the Genius annotations or playlist data.
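Point 2 above — dropping songs with nothing beyond title and artist — is a one-liner in pandas. The column names and rows here are hypothetical, just to show the idea:

```python
import pandas as pd

# Hypothetical song table; only title and artist are always present.
songs = pd.DataFrame({
    "title":  ["C.R.E.A.M.", "Untitled", "Triumph"],
    "artist": ["Wu-Tang Clan", "Unknown", "Wu-Tang Clan"],
    "album":  ["Enter the Wu-Tang", None, "Wu-Tang Forever"],
    "lyrics": ["Cash rules...", None, "I bomb atomically..."],
})

# Keep a song only if at least one informative column is filled in.
info_cols = ["album", "lyrics"]
cleaned = songs.dropna(subset=info_cols, how="all")
print(len(cleaned))  # 2 -- the empty "Untitled" row is gone
```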

The components

My model used truncated SVD to cluster artists and Gensim’s Doc2Vec to vectorize the lyrics. I combined the outputs of those two algorithms and used the Annoy library to find songs similar to the request.
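The pipeline can be sketched as follows. The inputs are random stand-ins (real Doc2Vec vectors and an artist matrix would go in their place), and sklearn's exact `NearestNeighbors` substitutes for Annoy so the sketch runs without extra dependencies — Annoy's approximate index plays the same role:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-ins: an artist co-occurrence matrix and Doc2Vec lyric vectors,
# here random and only meant to show the shapes involved.
artist_matrix = rng.random((100, 50))
lyric_vectors = rng.random((100, 20))

# Step 1: reduce the artist matrix with truncated SVD.
svd = TruncatedSVD(n_components=10, random_state=0)
artist_embed = svd.fit_transform(artist_matrix)

# Step 2: concatenate the two representations per song.
combined = np.hstack([artist_embed, lyric_vectors])

# Step 3: query neighbors of a song (Annoy in the real app).
nn = NearestNeighbors(n_neighbors=6).fit(combined)
dist, idx = nn.kneighbors(combined[:1])
print(idx[0][0])  # the query song is its own nearest neighbor: 0
```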

Problems with the components of the recommendation

  1. I generated a network by using CountVectorizer to record which artists appear on each song. The resulting table was extremely sparse.
  2. I don’t think Doc2Vec worked for me. I did not read up on it enough and used it assuming it would just work.
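The sparsity problem in point 1 is easy to see on toy data. Each "document" below lists the artists credited on one song (hypothetical credits, hypothetical tokens):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One string of artist tokens per song.
credits = [
    "wu_tang_clan raekwon",
    "wu_tang_clan method_man",
    "bing_crosby",
]
vec = CountVectorizer()
X = vec.fit_transform(credits)

# 3 songs x 4 artists, but only 5 nonzero entries out of 12 --
# and real catalogs are far sparser than this.
print(X.shape)  # (3, 4)
print(X.nnz)    # 5
```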

What I should have done.

  1. I should have grouped artists with a graph embedding such as graph2vec, or with spectral clustering. I have used both before.
  2. I should have used a more interpretable text model, such as latent Dirichlet allocation (LDA).
  3. I should have prepared the lyric data better. I stripped only the HTML and bracketed section markers, so common words were probably the reason for the seemingly random predictions.
  4. Limiting the number of features from SVD and Doc2Vec might also have helped. Fewer dimensions could have reduced the noise passed to the model.
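Points 2 and 3 combine naturally: sklearn's LDA over a count matrix with English stop words removed. The toy lyric snippets are hypothetical; the point is that each song comes out as a distribution over topics, which is easier to inspect than a dense Doc2Vec embedding:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy lyric snippets; stop_words="english" drops the common words
# blamed in point 3 for the random-looking predictions.
lyrics = [
    "cash money hustle money cash",
    "love heart love forever heart",
    "cash hustle grind money",
    "heart love tonight forever",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(lyrics)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X)

# One row per song; each row sums to 1 over the two topics.
print(topics.shape)  # (4, 2)
```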

Overarching problems

  1. I leaned too hard on the computer to find latent connections. I should not have been afraid of algorithms that require more human input.
  2. I stubbornly rushed to get the app deployed. I could have resolved many of these problems if I had worked harder on them.
  3. The size of the data. There was a lot of it: simply reading it in could take up to 30 minutes. I had some workarounds, but no hard fixes.
  4. My goal was to find hidden connections between songs, and I assumed the process would default to recommending songs by the same artist. That was true in only about half of the queries.
  5. No test set. I wanted latent connections, but I expected the algorithm to at least return obviously similar songs. Instead, the results looked like they were pulled out of a hat.
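Even without labels, point 4 suggests a cheap sanity check that could have stood in for a test set: measure how often a song's top recommendation shares its artist. The data below is hypothetical; `top_rec[i]` is the index of song `i`'s top recommendation:

```python
# Hypothetical catalog and recommendations.
artists = ["wu_tang", "wu_tang", "crosby", "shakespeare"]
top_rec = [1, 0, 3, 2]  # index of each song's top recommendation

# Fraction of queries whose top recommendation shares the artist.
hits = sum(artists[i] == artists[r] for i, r in enumerate(top_rec))
hit_rate = hits / len(artists)
print(hit_rate)  # 0.5 -- roughly the "about half" observed above
```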

What went well

  1. The asynchronous web scraper is super fast.
  2. The app is fast and has not crashed so far.
  3. Now that the app works and the data is in hand, generating a new model should be straightforward.
