Lam Phan

Aug 17, 2020

Algo-rhythm: Tech & Music come together

Using song lyrics to predict the gender of the singer

Intuition & Problem Statement

When we started this project, our initial idea was to look at the factors behind songs appearing on the Billboard charts, for example by asking questions like ‘are female artists more likely to be on the charts than male artists?’ or, in terms of public sentiment, ‘are songs about a particular emotion such as love, sadness or heartbreak more likely to chart?’. However, as we proceeded, our objective evolved. Our final model takes lyrics of any length (in number of words) as input and uses a random forest over the top 200 words ranked by TF-IDF (term frequency-inverse document frequency) score to predict the gender of the singer.

Technologies used

Data collection

First, we scraped artist names from the Billboard top 100 artists list, which worked. We then tried to scrape all songs by those artists from azlyrics, but the website temporarily blocked us for suspicious activity, and with our time and knowledge constraints at that point we could not set up proxies to work around the block. We resorted to collecting lyrics manually, one artist at a time, 4 songs per artist, with each member in charge of 25 artists. Gender was also added manually each time we collected songs for an artist.

By the end of this stage, we had close to 400 songs with an unbalanced gender ratio of approximately 70:30 (Male:Female), which would turn out to be a problem we had not foreseen at that point.

Data pre-processing

We decided to use a pandas DataFrame as the main tool for interacting with our data. Before using NLTK, we needed to clean the lyrics because of certain formatting conventions on azlyrics. First, songs with more than one artist contain name tags in the lyrics marking each artist's part, which we removed using regex. We also built our own dictionary for converting song slang and informal contractions into standard written English. Next, we converted all letters to lowercase and removed punctuation and special characters.
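
The cleaning steps above could look roughly like the sketch below. It is only an illustration: the slang dictionary is a tiny made-up sample (our real one was hand-built and much longer), the artist-tag patterns are assumptions about how the tags appear, and `clean_lyrics` is a hypothetical helper name.

```python
import re

# Hypothetical, deliberately tiny slang dictionary for illustration only.
SLANG = {
    "i'm": "i am",
    "gonna": "going to",
    "wanna": "want to",
    "ain't": "is not",
    "'cause": "because",
    "lovin'": "loving",
}

# Assumed tag formats: "[Artist:]" in brackets, or "Artist:" alone on a line.
ARTIST_TAG = re.compile(r"\[[^\]]*\]|^[^\n:]{1,40}:[ \t]*$", flags=re.MULTILINE)

# Punctuation stripped from token edges before the slang lookup
# (apostrophes are kept so that keys like "'cause" still match).
EDGE_PUNCT = ".,!?;:\"()"


def clean_lyrics(raw: str) -> str:
    text = raw.replace("\u2019", "'")   # normalize curly apostrophes
    text = ARTIST_TAG.sub(" ", text)    # drop artist name tags
    text = text.lower()
    words = []
    for token in text.split():
        token = token.strip(EDGE_PUNCT)
        words.append(SLANG.get(token, token))  # expand slang and contractions
    text = " ".join(words)
    text = re.sub(r"[^\w\s]|_", " ", text)     # remove remaining punctuation
    return " ".join(text.split())              # collapse whitespace


sample = "[Rihanna:]\nI'm gonna love you, 'cause I wanna!\nDrake:\nAin't nobody lovin' you"
print(clean_lyrics(sample))
```

Doing the slang lookup before stripping the remaining punctuation matters here, because many informal contractions are only recognisable by their apostrophes.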

From there, using NLTK, we tokenized the lyrics into words, removed English stopwords such as pronouns and modal verbs, and lemmatized the words. To improve lemmatization quality, we wrote a function that gets the part-of-speech (POS) tag of each word (noun, verb, adjective, etc.).

Lastly, in this stage, we experimented with building visualisations such as bar charts and word clouds to see which words appear most frequently in song lyrics for each gender and to identify any significant underlying differences. Here are a few interesting word clouds we obtained.
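
A minimal sketch of POS-aware lemmatization with NLTK follows. The helper names `penn_to_wordnet` and `preprocess` are hypothetical, and the code assumes the lyrics are already lowercased and that the required NLTK resources (tokenizer, tagger, stopword list, WordNet) have been fetched with `nltk.download`; the exact resource names vary between NLTK versions.

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, which is also the lemmatizer's default


def preprocess(lyrics: str) -> list[str]:
    """Tokenize, tag, drop stopwords and non-words, then lemmatize with POS hints."""
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Tag the full token sequence first: the tagger relies on surrounding words.
    tagged = pos_tag(word_tokenize(lyrics))
    return [
        lemmatizer.lemmatize(word, penn_to_wordnet(tag))
        for word, tag in tagged
        if word.isalpha() and word not in stop_words
    ]
```

Without the POS hint the lemmatizer treats every word as a noun, so a verb form like "running" would be left unchanged instead of being reduced to "run".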

The first set of images is a comparison of the most frequent words before and after the processing of data. As one can see from the pre-processing word cloud, prepositions and pronouns form a major part of the song lyrics, and hence needed to be removed for effective analysis.

Pre-processing of data
Post-processing of data

The second set of images shows a comparison of the post-processing word clouds for male and female singers. For both genders, words like ‘get’, ‘like’ and ‘know’ appear a lot, while ‘love’ dominates one of the genders. Try and guess which word cloud corresponds to which gender! (don’t worry, we have the answer at the end of the article)
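
The counts behind such charts can be sketched with the standard library alone. The token lists below are made-up placeholders rather than our data, and `top_words_by_gender` is a hypothetical helper; plotting the result as a bar chart or word cloud is left out.

```python
from collections import Counter


def top_words_by_gender(songs, n=3):
    """songs: iterable of (gender, tokens) pairs; returns the n most common words per gender."""
    counts = {}
    for gender, tokens in songs:
        counts.setdefault(gender, Counter()).update(tokens)
    return {gender: counter.most_common(n) for gender, counter in counts.items()}


# Placeholder tokens standing in for pre-processed lyrics.
toy_songs = [
    ("F", ["love", "love", "know"]),
    ("M", ["get", "like", "get"]),
    ("F", ["love", "like"]),
]
print(top_words_by_gender(toy_songs, n=1))
```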


Model construction

We moved on to try decision trees instead. At this point, on top of the pre-processing done with NLTK, we decided to use TfidfVectorizer to get the term frequency-inverse document frequency (TF-IDF for short) of each word as normalized input to the models.

At this stage, we realized how small our dataset was, which could result in heavy overfitting. To compensate, we added 600 songs from Kaggle. However, this is when we finally identified the imbalance in the dataset: even after adding songs, Male still exceeded Female by a large number of data points. We therefore trimmed Male songs and ended up with a dataset of around 500 songs with a gender ratio much closer to 50:50.
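
One simple way to trim the majority class is random downsampling with pandas, sketched below. The column name `gender`, the helper `downsample_majority` and the toy data are assumptions for illustration; this version balances the classes exactly, whereas our trimmed dataset was only close to 50:50.

```python
import pandas as pd


def downsample_majority(df, label_col="gender", seed=42):
    """Randomly sample every class down to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    return (
        df.groupby(label_col)
        .sample(n=n_min, random_state=seed)
        .reset_index(drop=True)
    )


# Toy frame with a 70:30 imbalance, mirroring the ratio mentioned earlier.
toy_df = pd.DataFrame({
    "lyrics": [f"song {i}" for i in range(10)],
    "gender": ["M"] * 7 + ["F"] * 3,
})
balanced = downsample_majority(toy_df)
print(balanced["gender"].value_counts())
```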

We then experimented with various methods, including logistic regression, decision trees, bagging, AdaBoost and random forest, using the top n words (by highest TF-IDF score) from the dataset (n = 10, 20, 30… 100, 1000, 6000) and a few different learning rates, numbers of estimators and maximum depths. Based on the accuracy and F1 scores of those models, with KFold for cross-validation, we determined that a random forest with the top 200 words and n estimators = 100 was appropriate. Further hyperparameter tuning could have been done here to arrive at a better model; we consider this an area for improvement should we continue with this project.
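
A comparison loop of this kind might look like the sketch below, assuming scikit-learn. The synthetic data from `make_classification` merely stands in for the TF-IDF features and gender labels, the hyperparameters are mostly library defaults rather than the values we tried, and `compare_models` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, n_splits=5, seed=0):
    """Cross-validate several classifiers and report mean accuracy and F1 for each."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "bagging": BaggingClassifier(random_state=seed),
        "adaboost": AdaBoostClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=("accuracy", "f1"))
        results[name] = {
            "accuracy": scores["test_accuracy"].mean(),
            "f1": scores["test_f1"].mean(),
        }
    return results


# Synthetic stand-in for (TF-IDF features, binary gender labels).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for name, metrics in compare_models(X, y).items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1={metrics['f1']:.3f}")
```

Wrapping the comparison in a function makes it easy to repeat for each value of n by passing in a differently sized feature matrix.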


Deploying our application online involves several steps.

First, we train our model and save it as a file so that the application can make predictions from it. Every time we update our machine learning models, we can generate new model files and upload them to our application. This is done with the Python pickle module.
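
Saving and reloading a model with pickle can be sketched as follows. Bundling the vectorizer and the classifier into one scikit-learn pipeline is an assumption made here for convenience, not necessarily how our files were organised; the toy lyrics, labels and file name are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder training data standing in for the cleaned lyrics and labels.
toy_lyrics = ["love you baby", "money cars fame", "heart love tears", "street money hustle"]
toy_labels = ["F", "M", "F", "M"]

pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(toy_lyrics, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "gender_model.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(pipeline, f)       # save the trained model to disk
    with open(model_path, "rb") as f:
        restored = pickle.load(f)      # what the web application would do at startup

print(restored.predict(["love love love"]))
```

Pickle files should only be loaded from trusted sources, and with the same library versions that produced them.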

Second, we move the code that performs data processing from the Jupyter notebook into a separate Python file.

Third, we create a simple web application that lets us enter lyrics in a form, processes them and outputs the prediction. For this, we use the Flask web framework, which runs on Python and offers simplicity and flexibility in development. The process flow is that the application receives the text input in an HTTP POST request, a separate Python file converts it into TF-IDF vectors, and the model then makes its prediction.

Lastly, we deploy it on a cloud platform. We use DigitalOcean because it has a flexible pricing scheme and a user-friendly interface. Our application is hosted on an Ubuntu virtual machine with 3 vCPUs, 1 GB of RAM and 60 GB of disk storage. This virtual machine costs $15/month and performs our task almost instantaneously.
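
A minimal Flask sketch of this flow is shown below. It is not our actual application: the route, form field name and HTML are assumptions, and a toy pipeline is fitted inline so the example is self-contained, where the real application would load the pickled model and call the separate processing module.

```python
from flask import Flask, render_template_string, request
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Stand-in model; a real deployment would unpickle the trained model instead.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(
    ["love you baby", "money cars fame", "heart love tears", "street money hustle"],
    ["F", "M", "F", "M"],
)

app = Flask(__name__)

PAGE = """
<form method="post">
  <textarea name="lyrics"></textarea>
  <button type="submit">Predict</button>
</form>
{% if prediction %}<p>Predicted gender: {{ prediction }}</p>{% endif %}
"""


@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        lyrics = request.form.get("lyrics", "")
        # The pipeline converts the text to a TF-IDF vector, then classifies it.
        prediction = str(model.predict([lyrics])[0])
    return render_template_string(PAGE, prediction=prediction)
```

Saved as `app.py`, this can be tried locally with `flask run` before deploying to a server.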

Challenges faced & how we overcame them

Another challenge we faced was cleaning our dataset. While the NLTK library makes the whole process quite easy, one limitation was the absence of exhaustive dictionaries for informal contractions and slang (even Python’s pyspellchecker library couldn’t help). We therefore resorted to trial and error, creating our own dictionaries to the best of our knowledge.

Moving forward, we wanted to get to the exciting part: training our model and making gender predictions. This is when we realised, yet again, that the road was not that smooth. Since a wide variety of methods is available, we didn’t know exactly which to use with our dataset. Hence, we decided to try everything! This included clustering, logistic regression and tree-based methods, complemented by cross-validation. While it was tedious to try all these approaches with different model inputs (top 100, 200, 300,…, 1000, 2000,…, 6000 words), we appreciate the breadth of knowledge we gained. As we progressed, we found more efficient ways of implementing these approaches (defining functions, using for loops), which was a good way to apply our knowledge of programming basics to the machine-learning process.

There were also some challenges in processing text input from our application. Up to this point, we had trained machine learning models on many TF-IDF vectors generated from our dataset. To make a single prediction for text entered on our website, however, we had to work out how to generate a TF-IDF vector for a single data point. This took us some time, but luckily we managed to pull through.
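
With scikit-learn's `TfidfVectorizer` (assumed here), the key is to reuse the vectorizer that was fitted on the training data and call `transform` on a one-element list, rather than fitting a new vectorizer on the single text. A small sketch with placeholder lyrics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for the training lyrics.
corpus = ["love you baby", "money cars fame", "heart love tears"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(corpus)   # learns vocabulary and IDF weights

# A single new text must be wrapped in a list and passed to transform(), not
# fit_transform(), so it is mapped onto the same columns the model was trained on.
new_lyrics = "love and tears tonight"
single_vector = vectorizer.transform([new_lyrics])

print(train_matrix.shape, single_vector.shape)
```

Words the vectorizer never saw during training (here "and" and "tonight") are simply ignored, so the single vector always has the same number of columns as the training matrix.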

Experience with BIA & Key takeaways

Our learning objectives have evolved as we progressed through the different stages of our project but it has definitely been a constant self-learning journey.

We started off aiming to work on a project and simply get exposure to different analytics concepts, and initially spent a substantial amount of time on data cleaning and on obtaining a dataset well balanced between the 2 genders.

The project gave us a starting point for our analytics journey: we got to work on natural language processing (NLP) and web scraping, and learned how machine learning works, how to come up with a good model and how to choose the most suitable one among a set of models. It also gave us an initial conceptual introduction to machine learning, bagging, boosting, random forest models, visualization libraries in Python, hyperparameter tuning, the measures used to judge the accuracy of different models, and cross-validation.

One of the sentiments we all shared by the end of the DAP journey with BIA is that DAP indeed gave us a starting point to get into analytics. While we are nowhere near experts after just 10 weeks, we now know how to better navigate our learning journey, where to look for the right information and how to apply it in our project. Above all, we became part of a community where we could share new knowledge and learn alongside each other, which is one of the most valuable things as we move forward and develop our initial interest into greater and clearer goals in the analytics and data science fields. All our personal goals were achieved: we got to work on a project of our choice, figured out each stage together and learned from each other along the way.

Note: word cloud (a) is for male singers and word cloud (b) is for female singers.