Find it at music-recommender.CrawfordC.com
A failed attempt at creating a different type of music recommendation.
I scraped as much of genius.com as I could. I extracted the information from the sidebar and the full lyrics of every possible page.
Problems with the data:
- I assumed that genius.com partitioned the data for songs and literature. This was not the case. The app surprisingly recommended a Shakespearean play and a Bing Crosby song after searching for a Wu-Tang Clan song.
- The data was extremely imbalanced. Many songs only had the song title and the artist as the available data.
What I should have done.
- I should have tried harder to clean the data.
- I should have dropped the data which had no information. If no one is updating information for a song, I assume that means no one cares about it.
- I should have incorporated more data, such as the Genius annotations or playlist data.
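As a rough illustration of dropping records that carry no information, here is a minimal sketch. It assumes each scraped page is stored as a dict; the field names (`title`, `artist`, `lyrics`, `sidebar`) and the `min_sidebar_fields` threshold are hypothetical and are not the scraper's actual schema.

```python
def has_enough_info(record, min_sidebar_fields=1):
    """Keep a record only if it has lyrics and at least some sidebar metadata.

    The field names are hypothetical stand-ins for the scraped schema.
    """
    lyrics = (record.get("lyrics") or "").strip()
    sidebar = record.get("sidebar") or {}
    # Count sidebar entries that actually carry a value.
    filled = sum(1 for value in sidebar.values() if value)
    return bool(lyrics) and filled >= min_sidebar_fields


def drop_empty_records(records, min_sidebar_fields=1):
    """Return only the records that pass has_enough_info."""
    return [r for r in records if has_enough_info(r, min_sidebar_fields)]


# Made-up example records.
records = [
    {"title": "Song A", "artist": "Artist 1", "lyrics": "some words",
     "sidebar": {"producer": "P"}},
    {"title": "Song B", "artist": "Artist 2", "lyrics": "", "sidebar": {}},
    {"title": "Song C", "artist": "Artist 3", "lyrics": "more words",
     "sidebar": {"producer": ""}},
]
kept = drop_empty_records(records)
print([r["title"] for r in kept])
```

Raising `min_sidebar_fields` would trade coverage for richer records, which is the balance the points above are about.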
My model used a Truncated SVD algorithm to cluster artists. Gensim’s Doc2vec vectorized the lyrics. I combined the results from those two algorithms and used the Annoy library to find songs similar to the request.
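The combine-then-search step can be sketched in plain Python. This is not the app's code: it assumes the two vectors per song are simply normalized and concatenated (the actual combination method is not described here), and it swaps in a brute-force cosine search where the app used Annoy's approximate nearest-neighbour index. All vectors and song ids are made up.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(artist_vec, lyric_vec):
    """Normalize each part separately, then concatenate, so neither
    representation dominates just because of its scale."""
    return l2_normalize(artist_vec) + l2_normalize(lyric_vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def most_similar(query_id, vectors, k=2):
    """Brute-force nearest neighbours by cosine similarity, excluding the query."""
    query = vectors[query_id]
    scored = [(cosine(query, vec), song_id)
              for song_id, vec in vectors.items() if song_id != query_id]
    scored.sort(reverse=True)
    return [song_id for _, song_id in scored[:k]]


# Toy vectors: (artist embedding, lyric embedding) per song.
songs = {
    "song_a": combine([1.0, 0.0], [0.9, 0.1, 0.0]),
    "song_b": combine([1.0, 0.1], [0.8, 0.2, 0.0]),
    "song_c": combine([0.0, 1.0], [0.0, 0.1, 0.9]),
}
print(most_similar("song_a", songs, k=2))
```

Brute force is fine for a toy example; with a large catalogue an approximate index such as Annoy is what keeps the lookup fast.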
Problems with the components of the recommendation
- I generated a network by using CountVectorizer to provide a record of appearance for each artist in each song. The table was sparse.
- I don’t think doc2vec worked for me. I did not read up on it enough. I used it assuming it would work.
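To make the sparsity problem concrete, here is a small sketch of a binary song-by-artist appearance table built in plain Python rather than with CountVectorizer. The song and artist names are invented, and the `credits` layout is an assumption about how the appearances might be stored.

```python
def artist_song_matrix(song_artists):
    """Build a binary song-by-artist matrix from {song: [artists]}.

    Returns (sorted artist list, list of rows), one row per song.
    """
    artists = sorted({a for names in song_artists.values() for a in names})
    column = {artist: i for i, artist in enumerate(artists)}
    rows = []
    for names in song_artists.values():
        row = [0] * len(artists)
        for artist in names:
            row[column[artist]] = 1
        rows.append(row)
    return artists, rows


def sparsity(rows):
    """Fraction of cells in the matrix that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(row.count(0) for row in rows)
    return zeros / total if total else 0.0


# Made-up credits: most songs list a single artist.
credits = {
    "song_1": ["Artist A", "Artist B"],
    "song_2": ["Artist C"],
    "song_3": ["Artist D"],
    "song_4": ["Artist A"],
}
artists, rows = artist_song_matrix(credits)
print(artists)
print(f"sparsity: {sparsity(rows):.2f}")
```

Even in this tiny example most cells are zero. With many more artists and mostly single-artist songs the zero fraction only grows, which leaves a factorization like SVD very little co-occurrence signal to work with.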
What I should have done.
- I should have tried to use a graph2vec function or spectral clustering to group artists. I have used these before.
- I should have used more interpretable text models, like latent Dirichlet allocation (LDA), a topic model.
- I should have better prepared the lyric data. I took out only the HTML and brackets of the data. Common words were probably the reason for seemingly random predictions.
- I think limiting the number of features from SVD and doc2vec might have helped. Fewer numbers could have decreased the noise passed to the model.
- I wanted the computer to find latent connections too much. I should not have been afraid to use algorithms that required more human input.
- I stubbornly wanted to get the app deployed. I could have resolved many of the problems if I worked harder.
- Size of the data. There was a lot of data with this project. Simply reading the data seemed like it could take up to 30 minutes. I had some workarounds, but no hard fixes.
- My goal was to find hidden connections between songs. I assumed that the process defaulted to recommending songs by the same artist. This was true in only about half of the operations.
- No test set. I wanted to find latent connections, but I expected the algorithm to return obviously similar songs first. Instead, the results looked like they were pulled out of a hat.
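Even without labeled data, the same-artist observation above could have been turned into a simple sanity-check metric over a held-out set of query songs. The sketch below is one hypothetical way to compute it; the data layout, function name, and song ids are assumptions for illustration.

```python
def same_artist_rate(recommendations, artist_of):
    """Share of recommended songs that have the same artist as the query.

    recommendations: {query_song: [recommended songs]}
    artist_of: {song: artist}
    """
    hits = 0
    total = 0
    for query, recs in recommendations.items():
        for rec in recs:
            total += 1
            if artist_of[rec] == artist_of[query]:
                hits += 1
    return hits / total if total else 0.0


# Made-up songs and recommendations.
artist_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "C"}
recommendations = {
    "s1": ["s2", "s3"],
    "s3": ["s4", "s5"],
}
print(same_artist_rate(recommendations, artist_of))
```

Tracking a number like this across model changes would at least show whether the obvious matches are getting better or worse.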
What went well
- The asynchronous web scraper is very fast.
- The app is fast and has not crashed so far.
- Now that the app is working and I have the data, I should be able to train a new model.