Create a song playlist from YouTube comments using NLP

Samrudha Kelkar
Published in tech-that-works
Apr 6, 2020

Where do you listen to your music? Mostly on apps, probably. And what do you listen to? Songs by your favorite artists, albums, playlists you built yourself. More often than not, you get hooked on the powerful recommendations that all the popular music streaming apps serve up.

What if people took the time and effort to suggest “songs” that you might like? Yes, what if they typed those songs out for you?

I do not mean creepily tracking their playlists. Last weekend I came across a YouTube music reactions channel where the YouTuber (his support page) listens to Indian Bollywood songs and reacts to them. Interestingly, people who love those songs suggest more songs to him. Remember, these are heartfelt recommendations that users really, really want to see featured in reaction videos, and most of them are simply amazing. So I thought: why not use NLP algorithms to extract all these beautiful recommended songs?

If we can do that, it will be one incredibly awesome playlist to have. Here is my experiment to build that playlist from YouTube comments.

A] Get the damn data: ML on YouTube comments is something I have always wanted to do. I came across this repository, which handles the Google API calls for you and gives you a nice newline-delimited output file.
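A minimal sketch of loading such a file, assuming each line is a JSON object that stores the comment under a "text" key. Both the JSON layout and the key name are assumptions, since the exact fields depend on the downloader you use; `read_comments` is a hypothetical helper, not part of that repository.

```python
import json


def read_comments(lines, text_key="text"):
    """Collect non-empty comment strings from newline-delimited JSON lines."""
    comments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        text = str(record.get(text_key, "")).strip()
        if text:
            comments.append(text)
    return comments


# Placeholder records, only to show the expected shape of the input.
sample = [
    '{"text": "placeholder comment one"}',
    "",
    '{"text": "  placeholder comment two "}',
]
print(read_comments(sample))
```

With a real file you would pass the open file handle instead of the list, for example `read_comments(open("comments.json", encoding="utf-8"))`, where the file name is again just an assumption.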

Data looks something like

B] Let’s do some NLP: The NLP tasks mainly involve tokenizing the comments, extracting entities, and POS tagging. The output of this pipeline lets us describe the patterns that the comments follow.

I explored Stanza, a cool new NLP library in town. I was thinking of using it for a custom NER pipeline with “song” and “artist” as my entities. But annotation turned out to be a pretty tough job here, since most of the comments are in Hinglish (a mixture of Hindi and English). Also, frankly, the documentation for training your own NER model in Stanza is not that great yet. The data directory structure is hard to crack and not worth the pain for my tiny data set (2,100 records in total).

Interestingly, spaCy has open-sourced spacy-stanza. It lets you do the familiar spaCy things while using Stanza models under the hood. You just have to import the Stanza pipeline and wrap it.

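# Needs the English Stanza model; run stanza.download("en") once beforehand
import stanza
from spacy_stanza import StanzaLanguage

# Build the Stanza pipeline, then wrap it so it behaves like a spaCy pipeline
# (StanzaLanguage is the pre-1.0 spacy-stanza API; later releases use load_pipeline)
snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)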

C] Algo: The output of the POS tagger and the resolved entities can then be used as features for the song extraction algorithm. Here I used spaCy’s built-in Matcher, which lets you write rules that run on the spaCy doc.
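To make the Matcher idea concrete, here is a rough sketch of one possible rule, not the actual rules used for this experiment. It assumes the spaCy v3 call style for `matcher.add`, uses a blank English pipeline so that no model download is needed, and relies on a hypothetical "Title-cased words, then by, then Title-cased words" pattern. The rule name `SONG_BY_ARTIST`, the helper `find_candidates`, and the sample comment are all made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, so no trained model is required.
rule_nlp = spacy.blank("en")

matcher = Matcher(rule_nlp.vocab)
# Hypothetical rule: Title-cased word(s), the word "by", Title-cased word(s).
pattern = [
    {"IS_TITLE": True, "OP": "+"},
    {"LOWER": "by"},
    {"IS_TITLE": True, "OP": "+"},
]
matcher.add("SONG_BY_ARTIST", [pattern])  # spaCy v3 call style


def find_candidates(text):
    """Return the text of every span that matches the rule."""
    doc = rule_nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# Made-up comment, used only to exercise the rule.
print(find_candidates("you should react to Ayat by Arijit Singh next"))
```

Because of the `+` operators, a rule like this can return several overlapping candidates for the same mention, so a filtering step is still needed afterwards. Real comments are often lowercase or in Hinglish, so a casing-based rule would miss many of them; POS tags or entity labels from the tagger are likely a better basis for real rules.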

Multiple matches can be found for the same text, but they share the same match_id. That’s the beauty of spaCy. Simple string matching might solve this task, but using a spaCy doc and the Matcher lets us filter duplicates easily using match_id. When filtering duplicates, I kept the longer spans of detected text.
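The "keep the longer span" step could look roughly like the sketch below. It works on plain `(match_id, start, end)` tuples, the shape the Matcher returns, and keeps the longest spans that do not overlap. The helper name `keep_longest` and the sample tuples are hypothetical. spaCy also ships `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects.

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (match_id, start, end) matches."""
    # Longest spans first; ties are broken by the earliest start token.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used_tokens = [], set()
    for match_id, start, end in ordered:
        tokens = set(range(start, end))
        if used_tokens.isdisjoint(tokens):
            kept.append((match_id, start, end))
            used_tokens |= tokens
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# Illustrative tuples: two overlapping matches plus one separate match.
matches = [(1, 4, 7), (1, 4, 8), (1, 10, 13)]
print(keep_longest(matches))
```

Note that this sketch drops any overlapping span regardless of its match_id. If several rules are in play and overlaps between different rules should survive, group the matches by match_id first and filter each group separately.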

Interested in getting the list of recommended singers? We can always use the PERSON tag from the entity tagger to get all the singers.
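As a rough sketch, counting PERSON entities across processed comments might look like this. In practice the entities would come from a trained pipeline such as the Stanza one above. To keep the example self-contained, the entity here is set by hand on a blank pipeline, which is purely an illustration device, and `count_people` is a hypothetical helper.

```python
from collections import Counter

import spacy
from spacy.tokens import Span


def count_people(docs):
    """Count PERSON entity mentions across docs, ignoring case."""
    counts = Counter()
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                counts[ent.text.lower()] += 1
    return counts


# Stand-in for a real NER model: tag the entity manually on a blank pipeline.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Please react to Ayat by Arijit Singh")
demo_doc.ents = [Span(demo_doc, 5, 7, label="PERSON")]  # tokens "Arijit Singh"

print(count_people([demo_doc]).most_common(5))
```

Keep in mind that PERSON covers any person name the tagger finds, not only singers, so the resulting list would likely still need a manual pass.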

Results:

The video whose comments I extracted features a song by the popular Indian Bollywood singer Arijit Singh. Here are a few of the suggested songs that I could extract:

Ayat by arijit singh, Chaap Tilak by Rahat Fateh
Sau dard hai by Sonu nigam, Ayat by arijit
aankhon ke Sagar by shafqat amanat
song Kalank, teri mitti by B praak
song Mai Rahoon, Tum ko, Ek dil, jaage Hain by ar Rahman
Sang Dhol Baje by Shreya Ghoshal
teri mitti’ by B, HARAM BY ATIF ASLAM, Ayaat song by arijit
KAHI by SONU NIGAM
AAYAT by Arjit Singh
OwnMusic song by Deep Chaubey
Teri mitti by B Praak
Kastoori’ By Indian Ocean

Most of these songs are by Arijit Singh. The others look really close to the song in the video in terms of music quality, melody, genre, etc. In total, I could extract 255 good songs (with a few duplicates) from close to 2k comments!!
