Algo-rhythm: Tech & Music come together

Using song lyrics to predict the gender of the singer

Lam Phan
SMUBIA
10 min read · Aug 17, 2020


Intuition & Problem Statement

As part of the DAP Journey, our team (Team Algo-rhythm) decided to undertake a natural language processing project that uses song lyrics to predict a singer's gender.

When starting this project, our initial intuition was to look at the factors behind a chosen sub-dataset of songs appearing on the Billboard charts, for example answering questions like 'are female artists more likely to be on the charts than male artists?' or, on public sentiment, 'are songs about a particular emotion like love, sadness or heartbreak more likely to chart?'. However, as we proceeded, our objective evolved. Our final model takes lyrics of any length (in terms of the number of words) as input and uses a random forest on the top 200 words ranked by TF-IDF (term frequency-inverse document frequency) score to predict the singer's gender.

Technologies used

Our project can be roughly divided into 4 main stages: data collection, data pre-processing, model construction and deployment. The technologies used in each stage are explained below.

Data collection

At this early stage of the project, we had two main directions to choose from: using existing online datasets of song information, or scraping the data ourselves. Since existing online datasets do not include the artists' gender, which is one of our main focuses of analysis, and since we were all interested in getting hands-on experience with web scraping, we decided to prepare the dataset ourselves, mainly using the BeautifulSoup package in Python.

Firstly, we scraped artist names from the Billboard top 100 artists chart, which was successful. Then we attempted to scrape all songs of those artists from azlyrics, which was in vain as we got temporarily blocked from the website for suspicious activity, and at that point, due to time and knowledge constraints, we couldn't set up proxies to get around the block. We resorted to scraping lyrics manually, one artist at a time, four songs per artist, with each member in charge of 25 artists. Gender was also added manually each time we scraped an artist's songs.
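
For context, here is a minimal sketch of what the artist-name scraping step can look like with requests and BeautifulSoup. The URL and the CSS selector are placeholders for illustration, not the actual Billboard page structure.

```python
# A minimal scraping sketch with BeautifulSoup; the URL and the selector below
# are assumptions for illustration, not the real Billboard markup.
import requests
from bs4 import BeautifulSoup

CHART_URL = "https://www.billboard.com/charts/artist-100/"  # hypothetical endpoint

def scrape_artist_names(url):
    """Fetch the chart page and pull out the artist names."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The class name below is a placeholder; inspect the real page to find it.
    return [tag.get_text(strip=True) for tag in soup.select("h3.artist-name")]

if __name__ == "__main__":
    print(scrape_artist_names(CHART_URL)[:10])
```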

By the end of this stage, we had close to 400 songs, with an unbalanced gender ratio of approximately 70:30 (Male:Female), which would turn out to be a problem we had not yet foreseen at that point.

Data pre-processing

In the second stage, we wanted to understand our dataset well and learn how to interact with it properly for the subsequent stages. The main tool we used here was NLTK, with some dabbling in matplotlib, seaborn and wordcloud to better visualise the dataset.

We first decided to use a pandas dataframe as the main tool to interact with our data. Before using NLTK, we needed to clean the lyrics due to certain formatting conventions on azlyrics. Firstly, for songs with more than one artist, there are name assignments in the lyrics for each artist, which we removed using regex. We also built our own dictionary for converting song slang and informal contractions into standard written English. Next, we transformed all letters to lowercase and removed punctuation as well as special characters.
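
A rough sketch of that cleaning step is below; the regex pattern and the slang dictionary entries are illustrative examples, not the exact ones we used.

```python
# Illustrative lyric-cleaning sketch: strip artist name assignments, expand a
# small hand-built slang dictionary, lowercase, and remove punctuation.
import re
import string

SLANG_MAP = {"gonna": "going to", "wanna": "want to", "gotta": "got to"}  # sample entries only

def clean_lyrics(text):
    # Drop artist name assignments such as "[Chorus: Artist]" from multi-artist songs
    text = re.sub(r"\[.*?\]", " ", text)
    # Expand slang and informal contractions using the dictionary
    for slang, standard in SLANG_MAP.items():
        text = re.sub(rf"\b{slang}\b", standard, text, flags=re.IGNORECASE)
    # Lowercase, then strip punctuation and special characters
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_lyrics("[Verse 1: Artist] I'm gonna dance all night!"))
```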

From there, using NLTK, we tokenized the lyrics by word, removed English stopwords such as pronouns and modal verbs, lemmatized the words, and created a function to get the part-of-speech (POS) tag for each word (noun, verb, adjective, etc.) to improve the lemmatization quality.
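
A condensed sketch of that NLTK pipeline follows. It assumes the relevant NLTK corpora (punkt, stopwords, wordnet, averaged_perceptron_tagger) have been downloaded, and the tag-mapping helper is our own convenience function rather than an NLTK built-in.

```python
# Tokenize, drop English stopwords, map Penn Treebank tags to WordNet tags,
# then lemmatize each word with its POS tag.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Translate a Penn Treebank tag into the POS categories WordNet expects."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(lyrics):
    tokens = nltk.word_tokenize(lyrics)
    tokens = [t for t in tokens if t not in STOPWORDS]
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged]

print(preprocess("she was dancing under brighter lights"))
```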

Lastly, in this stage, we experimented with building visualisations such as bar charts and word clouds to see which words appear most frequently in song lyrics for each gender and to identify any significant underlying differences. Here are a few interesting word clouds we obtained.

The first set of images compares the most frequent words before and after the processing of data. As one can see from the word cloud before processing, prepositions and pronouns form a major part of the song lyrics and hence needed to be removed for effective analysis.

Pre-processing of data
Post-processing of data

The second set of images shows a comparison of the post-processing word clouds for male and female singers. For both genders, words like ‘get’, ‘like’ and ‘know’ appear a lot, while ‘love’ dominates one of the genders. Try and guess which word cloud corresponds to which gender! (don’t worry, we have the answer at the end of the article)

(a)
(b)
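
For reference, the word clouds above can be generated in a few lines with the wordcloud package; the text variable here is a placeholder for the concatenated, cleaned lyrics of one gender.

```python
# Minimal word cloud sketch with the wordcloud and matplotlib libraries.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

gender_lyrics = "love know like get baby night dance"  # placeholder for one gender's lyrics

cloud = WordCloud(width=800, height=400, background_color="white").generate(gender_lyrics)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```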

Model construction

Carrying on from the second stage, in order to identify the most appropriate model for predicting gender from song lyrics, we first explored clustering methods. Trying both K-Means clustering (using the elbow method and silhouette method to determine the optimal K) and hierarchical clustering with different linkage types, we found no significantly distinct clusters, and concluded that clustering was not suitable for our purpose.
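
The check we ran looked roughly like the sketch below: K-Means for a range of K, scored with inertia (for the elbow method) and the silhouette coefficient. X is a stand-in feature matrix; in our project the features came from the vectorized lyrics.

```python
# K-Means over a range of K with elbow (inertia) and silhouette diagnostics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.random((400, 50))  # placeholder for the lyric feature matrix

for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    print(f"k={k}  inertia={kmeans.inertia_:.1f}  silhouette={silhouette_score(X, labels):.3f}")
```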

We moved on to try decision trees instead. At this point, on top of the pre-processing done with NLTK, we decided to use TfidfVectorizer to get the term frequency-inverse document frequency (TF-IDF) of each word as normalized input for the models.
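
A minimal sketch of that vectorization step with scikit-learn is below. Note that max_features keeps the most frequent terms, which approximates but may not exactly match picking the top words by TF-IDF score; the sample lyrics are placeholders.

```python
# TF-IDF features for the lyrics with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

lyrics = [
    "i know you want to dance all night",
    "love me like you used to love me",
    "get up get down we run this town",
]

vectorizer = TfidfVectorizer(max_features=200, stop_words="english")
X = vectorizer.fit_transform(lyrics)        # sparse matrix: songs x words
print(vectorizer.get_feature_names_out())   # vocabulary kept by the vectorizer
print(X.toarray().round(2))
```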

At this stage, we realized how small our dataset was, which could lead to heavy overfitting. We decided to add 600 songs from Kaggle to the dataset to compensate. However, another issue arose when we finally identified the imbalance in the dataset: even after adding songs, Male still exceeded Female by a large number of data points. We decided to trim off some Male songs and ended up with a dataset of around 500 songs with a gender ratio much closer to 50:50.
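
That trimming can be done in a couple of lines with pandas, as in the sketch below; the dataframe and the 'gender' column are placeholders rather than our exact code.

```python
# Downsample the majority class (Male) to match the number of Female rows.
import pandas as pd

df = pd.DataFrame({
    "lyrics": ["placeholder lyrics"] * 10,
    "gender": ["Male"] * 7 + ["Female"] * 3,
})

n_female = (df["gender"] == "Female").sum()
male_sample = df[df["gender"] == "Male"].sample(n=n_female, random_state=42)
balanced = pd.concat([male_sample, df[df["gender"] == "Female"]]).sample(frac=1, random_state=42)
print(balanced["gender"].value_counts())
```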

We then experimented with various methods, including logistic regression, decision trees, bagging, AdaBoost and random forest, at different top n words (of highest TF-IDF score) from the dataset (n = 10, 20, 30, …, 100, 1000, 6000) with a few different learning rates, numbers of estimators and maximum depths. Using the accuracy score and F1 score of those models, together with K-fold cross-validation, we determined that a random forest with the top 200 words and 100 estimators was appropriate (further hyperparameter tuning could have been done here to arrive at a better model; this is an area for improvement should we decide to continue with this project).
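
The final evaluation looked roughly like this sketch: a random forest on the top-word TF-IDF features, scored on accuracy and F1 under K-fold cross-validation. X and y below are random placeholders for the vectorized lyrics and the gender labels.

```python
# Random forest with K-fold cross-validation, reporting accuracy and F1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.random((500, 200))        # placeholder TF-IDF features (top 200 words)
y = rng.integers(0, 2, size=500)  # placeholder gender labels

model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
print("accuracy:", scores["test_accuracy"].mean().round(3))
print("f1:", scores["test_f1"].mean().round(3))
```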

Deployment

Before this part, we had a Jupyter notebook that handled data collection, data processing and the machine learning models.

In order to deploy our application online, we went through several steps.

First, we need to train our model and save it as a file so that our application can make predictions based on the trained model. Every time we update our machine learning models, we can generate new model files and upload them to our application. This is done with the Python pickle module.
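
A small sketch of that save/load step is below; the file names and the toy training data are our own choices for illustration. Persisting the fitted vectorizer alongside the model matters, because the web app must vectorize new lyrics the same way.

```python
# Persist the trained model and the fitted vectorizer with pickle.
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder training data
lyrics = ["love me tonight", "we run this town", "dance all night", "hold me close"]
labels = [1, 0, 0, 1]

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=100).fit(vectorizer.fit_transform(lyrics), labels)

# Save both artifacts so the web application can load them at start-up
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Later, inside the application, load the model back
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
```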

Secondly, we need to move the code that performs data processing from the Jupyter notebook into a separate Python file.

Thirdly, we need to create a simple web application that lets us input lyrics on a form, processes them and outputs the prediction. For this purpose, we use the Flask web framework, which runs on Python and offers simplicity and flexibility in development. The process flow is that our application receives the text input in an HTTP POST request, which is converted into TF-IDF vectors by a separate Python file before our model makes its prediction.
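
A minimal sketch of that Flask flow is shown below. The route, the inline form template, the pickle file names and the label encoding (1 = Female) are all illustrative assumptions, not our exact application.

```python
# Flask sketch: a form posts lyrics, the fitted vectorizer turns the single
# input into a 1 x n TF-IDF matrix, and the pickled model predicts the gender.
import pickle
from flask import Flask, request, render_template_string

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)

FORM = """
<form method="post">
  <textarea name="lyrics"></textarea>
  <button type="submit">Predict</button>
</form>
{% if prediction %}<p>Predicted gender: {{ prediction }}</p>{% endif %}
"""

@app.route("/", methods=["GET", "POST"])
def predict():
    prediction = None
    if request.method == "POST":
        lyrics = request.form["lyrics"]
        features = vectorizer.transform([lyrics])  # single data point -> 1 x n matrix
        prediction = "Female" if model.predict(features)[0] == 1 else "Male"  # assumed encoding
    return render_template_string(FORM, prediction=prediction)

if __name__ == "__main__":
    app.run()
```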

Lastly, we deploy it on a cloud platform. In this case, we use DigitalOcean as it has a flexible pricing scheme and a user-friendly interface. Our application is hosted on an Ubuntu virtual machine with 3 vCPUs, 1 GB of RAM and 60 GB of disk storage. This virtual machine costs $15/month and performs our task almost instantaneously.

Challenges faced & how we overcame them

One of the major challenges we faced was getting a balanced dataset so that our predictions would be accurate. As mentioned earlier, we only realised the imbalance in our dataset (approximately 400 songs, with a disproportionately higher number of male songs) at quite a late stage of the project. We dealt with it by extracting more female songs from a Kaggle dataset while removing some male songs. Our dataset (around 500 songs in total) was now more or less balanced, but we still faced problems. The whole process of creating the dataset was so tedious (manually entering the gender of each singer, scraping lyrics song by song because azlyrics would block us otherwise) that we could not afford to work with a large dataset. This cost us in terms of test accuracy, as our model was not trained on enough data points.

The whole experience taught us, firstly, the importance of running summary statistics on your data at the very beginning of a project, so that you are aware of any imbalance before you start modelling. Such problems can then be addressed by working with a more balanced dataset or by adopting methods that work well on imbalanced data. Secondly, we learnt that in projects like these it is very important to have a large number of data points so that your model does not overfit (a problem we will address as we continue to improve our project).

Another challenge we faced was cleaning our dataset. While the NLTK library makes the process quite easy, one limitation we ran into was the absence of exhaustive dictionaries for informal contractions and slang (even Python's pyspellchecker library couldn't help). We had to resort to trial and error here, creating our own dictionaries to the best of our knowledge.

Moving forward, we wanted to enter the exciting part of training our model and making gender predictions. This is when we realised, yet again, that the road was not smooth. Since there is a wide variety of methods available, we didn't know exactly what to use with our dataset, so we decided to try everything! This included clustering, logistic regression and tree-based methods, complemented by cross-validation. While the process of trying all these approaches with different model inputs (top 100, 200, 300, …, 1000, 2000, …, 6000 words) was tedious, we do appreciate the breadth of knowledge we gained. As we progressed, we found more efficient ways of implementing these approaches (defining functions, writing for loops), which was a good way to apply our knowledge of programming basics to the machine-learning process.

There were some challenges in processing the text input from our application. Up to that point, we had trained machine learning models on many TF-IDF vectors generated from our dataset. However, to make a single prediction for text input from our website, we had to work out how to generate a TF-IDF vector for a single data point. This took us some time, but luckily we managed to pull it off.

Experience with BIA & Key takeaways

Throughout the project, Gabriel Sidik was a great mentor. Not only did he guide us from ideation and schedule planning through to the end of the project, he also became a very good friend. He suggested directions we could follow and helped us troubleshoot issues swiftly, and he questioned the way we implemented our models so that we could understand them better.

Our learning objectives have evolved as we progressed through the different stages of our project but it has definitely been a constant self-learning journey.

We started off simply aiming to work on a project and get exposure to different analytics concepts, and a substantial amount of our time initially went into cleaning the data and obtaining a dataset that was well balanced between the two genders.

The project gave us a starting point for our analytics journey: working on natural language processing (NLP) and web scraping, seeing how machine learning works, and learning how to come up with a good model and pick the most suitable one from a set of models. It also gave us an initial conceptual introduction to machine learning, bagging, boosting, random forest models, visualization libraries in Python, hyperparameter tuning, the measures used to assess the accuracy of different models, and cross-validation.

One of the sentiments we all shared by the end of the DAP journey with BIA is that DAP indeed gave us a starting point to get into analytics. While 10 weeks is nowhere near enough to make us experts, we now know how to better navigate our learning journey, where to look for the right information and how to apply it in a project. Above all, we became part of a community where we could share new knowledge and learn alongside each other, which is one of the most valuable things as we move forward and develop our initial interest into greater, clearer goals in analytics and data science. All our personal goals were achieved: we got to work on a project of our choice, figured out each stage together and learnt from each other along the way.

Note: word cloud (a) is for male singers and word cloud (b) is for female singers.
