Create a Rapping AI using deep learning — Part 1, Collecting the data

Max Leander
Published in Analytics Vidhya
6 min read · Oct 4, 2019

Greetings all Data Ninjas and ML Warriors!

This is the first post in a series about how I created an AI to generate completely fresh, but believable (!), rap lyrics. From scratch!

(That is, I created the AI from scratch. The AI doesn’t create anything from scratch. It needs to be trained on massive amounts of rap lyrics. Come to think of it, I was probably trained on massive amounts of data about how to train a model using data [recursive mind-explosion] to be able to create this AI. Strictly speaking, no one probably ever created anything from scratch. Anyways… the point is that YOU will be able to create your own Rap Generating AI™ (sort of) from scratch!)

Disclaimer: This will be a very practical, non-theoretical, code-heavy, math-less tutorial. There are plenty of resources on the theory behind language models, and I will try to reference them as I see fit. However, the purpose of this series of articles is to build something that can be used, re-used, and tweaked, and that will hopefully inspire you to build other similar applications, not to teach the theory of Deep Learning or NLP.
phew
And off we go…

The desired result of our project

Part 1 — Collecting the data

The first part of any interesting ML project is to collect data. In general, the more the better.

A rapping AI will naturally be trained on rap lyrics. Fortunately, the world’s biggest collection of song lyrics, Genius, has a free-to-use API that we can utilize to collect rap lyrics. The first thing you need to do is sign up for a Genius developer account. Go to the Developer page and click Create an API client. Two of the fields are mandatory, App name and App website URL, but the others can be left blank. Next, you will receive a client ID, a client secret, and an access token. Store them in a safe place, as they will be needed to access the Genius API and download lyrics.

Keep this info in a safe place

Next, we are going to implement a client script in Python, which will download all the lyrics we want by sending requests to the Genius API. We could use an HTTP client library to send requests directly to the Genius REST API. Luckily though, someone already implemented a wrapper (no pun intended) for us, in the Python library lyricsgenius, so let’s just use that.

To install lyricsgenius using pip, just type pip install lyricsgenius in your terminal.

In the following code snippet, you can see how I created my own wrapper class (pun intended) which uses lyricsgenius to download lyrics for a specific artist, in the order they are found in the Genius database:

Access Genius and download lyrics

As you can see, all you need to access the Genius API is the client_access_token that we generated earlier. All found lyrics will be stored in a subfolder called data in the root folder of the Python project. The lyrics of each song will be saved to a separate .txt file in a subfolder named after the artist.
Example: /rap_generator/data/Eminem/My name is.txt
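A minimal sketch of what such a wrapper class might look like, assuming the search_artist method of lyricsgenius (which returns an artist object with a songs list, or None if nothing is found). The safe_name helper and the data_dir and api parameters are my own illustrative additions, so treat this as one possible layout rather than the exact code behind this post:

```python
from pathlib import Path


def safe_name(name):
    """Replace path separators so a name can be used as a file or folder name."""
    return name.replace("/", "_").replace("\\", "_").strip()


class Genius:
    """Thin wrapper that saves an artist's lyrics as data/<artist>/<title>.txt."""

    def __init__(self, client_access_token, data_dir="data", api=None):
        # `api` is a hypothetical hook for swapping in a fake client in tests.
        if api is None:
            import lyricsgenius  # third-party: pip install lyricsgenius
            api = lyricsgenius.Genius(client_access_token)
        self.api = api
        self.data_dir = Path(data_dir)

    def scrape_artist(self, artist_name, max_songs=50):
        """Download up to max_songs songs and return how many were saved."""
        artist = self.api.search_artist(artist_name, max_songs=max_songs)
        if artist is None:  # no matching artist on Genius
            return 0
        artist_dir = self.data_dir / safe_name(artist_name)
        artist_dir.mkdir(parents=True, exist_ok=True)
        saved = 0
        for song in artist.songs:
            if not song.lyrics:  # skip songs without any lyrics text
                continue
            path = artist_dir / f"{safe_name(song.title)}.txt"
            path.write_text(song.lyrics, encoding="utf-8")
            saved += 1
        return saved
```

Usage would then be something like Genius(client_access_token).scrape_artist("Eminem"), which needs network access and a valid token.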

Running the Genius.scrape_artist method for each rapper you can think of is really tedious. Also, as I said earlier, we want lots of data to train our rapping AI on. Of course, I don’t know how many rappers you can name… but I realized that I would definitely need songs from artists I’ve never heard of to get to the volume of data I need. So I went to this Wikipedia page, copied all of the artist names and created a text file with one name per line. In total, ~1500 rappers. Then I read the names from that file and called Genius.scrape_artist for each name.

Download lyrics for all artists

This code snippet will read the names of rappers from the file rappers.txt and try to download lyrics for 50 songs from each rapper. Note that I pass the client_access_token as a command line argument, which is read using sys.argv[1]. This works, and eventually we will have lyrics for all ~1500 artists. Problem is… it will take a looong time. Each request to the Genius API has a long round-trip time. Fortunately, there is a solution, called concurrency! Hang in there…
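For reference, the sequential loop could be sketched roughly like this. The helper names read_names and scrape_all are hypothetical, and the per-artist error handling is an assumption on my part:

```python
import sys


def read_names(path):
    """Return the non-empty lines of a text file, one artist name per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def scrape_all(scrape_artist, names):
    """Call scrape_artist(name) for every name, one after another.

    Returns the names that raised an error, so they can be retried later.
    """
    failed = []
    for name in names:
        try:
            scrape_artist(name)
        except Exception as error:  # keep going if a single artist fails
            print(f"Skipping {name}: {error}", file=sys.stderr)
            failed.append(name)
    return failed
```

With the Genius wrapper class, the script body would be roughly genius = Genius(sys.argv[1]) followed by scrape_all(genius.scrape_artist, read_names("rappers.txt")).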

Python is usually blamed for not being the most parallelizable language due to the global interpreter lock and yadayadayada. This is true and all. However, when the interpreter spends most of its time blocked, waiting on external operations (like file I/O or web server requests), and these operations are independent of each other, we might as well run them all in parallel: Python releases the lock while a thread waits on I/O, so other threads can make progress in the meantime. And for this use case, Python shines!

The simplest (and probably most convenient) way of running Python code in parallel is through the concurrent.futures module. Just check out these very few lines of code:

Download lyrics concurrently

First, we read rappers.txt into a list. Then, we instantiate a ThreadPoolExecutor. (Which sounds way more complex than it is. A Thread is just a piece of code that can run in parallel with other pieces of code, i.e. other Threads, a Pool is just a bunch of Threads, and the Executor is just a thing that can execute the Threads in the Pool.) We have to specify how many threads to run in parallel (i.e. the size of the pool), and I chose 5.

Now, the next line is the cool thing:
executor.map(genius.scrape_artist, rappers)

When we run this line, the executor will take one element at a time from the list, rappers, that was passed as the second argument to executor.map. It will call the function, genius.scrape_artist, that was passed as the first argument to executor.map, and pass each element as the argument to that function. So, the first call will be scrape_artist(rappers[0]), the second will be scrape_artist(rappers[1]) and so on…

The cool thing is that the executor will automatically kick off 5 of these calls in parallel, wait until one of them finishes, kick off a new one, and so on until the entire list is exhausted!

Setting num_workers=5 as I did in this example will yield a speedup of roughly 5 times, since most of the time in this script is spent waiting for the requests to finish. Warning! If you increase this number too much, you might hit the server-side limit on simultaneous requests from one client. Exceeding this limit may even get you banned, so be careful…
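Here is a small, self-contained sketch of the executor.map pattern described above. The function name scrape_concurrently and the fake_scrape stand-in are hypothetical; in the real script the callable would be genius.scrape_artist. Note that ThreadPoolExecutor itself calls the pool size max_workers:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_concurrently(scrape_artist, rappers, num_workers=5):
    """Run scrape_artist(name) for every name, num_workers calls at a time."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # map() hands out one name per call and yields results in input order;
        # list() waits until every call has finished.
        return list(executor.map(scrape_artist, rappers))


def fake_scrape(name):
    """Stand-in for a slow web request (for illustration only)."""
    time.sleep(0.1)  # pretend we are waiting for the server
    return name.upper()
```

Calling scrape_concurrently(fake_scrape, names) on five names should take about as long as a single call, because the threads spend their time sleeping (or, in the real case, waiting on the network). One thing to keep in mind: if any call raises, the exception surfaces when the results are consumed, so you may want scrape_artist to catch its own errors.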

After downloading all of the lyrics, you will probably want to concatenate all of them into one big text file that you will use to train a generative NLP model:

Concatenate all downloaded lyrics into one text file

What happens here is that I list all of the subfolders in the data folder that we created earlier, and for each subfolder I iterate through all of the text files which contain the rap lyrics. I prepend the artist name and song title to the lyrics of each song. My hope is that the AI will be able to come up with artist names and song titles of its own, before the generated lyrics of each new song. How cool would that be?

Finally, I join all of the lyrics with a special divider (50 times the character '*') and write everything to a new file.

Note on exception handling: Since we downloaded a lot of files, some of them might be corrupt for…reasons… which is why I just swallow any errors that may occur. We will hopefully have enough data as it is anyway.
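The concatenation step could look roughly like the sketch below. The function name concatenate_lyrics, the exact header layout (artist on one line, title on the next), and the decision to skip only unreadable or undecodable files instead of swallowing every error are assumptions on my part:

```python
from pathlib import Path

DIVIDER = "*" * 50  # special divider between songs


def concatenate_lyrics(data_dir, out_file):
    """Join all data/<artist>/<title>.txt files into one big text file.

    Returns the number of songs written.
    """
    songs = []
    for artist_dir in sorted(Path(data_dir).iterdir()):
        if not artist_dir.is_dir():
            continue
        for song_file in sorted(artist_dir.glob("*.txt")):
            try:
                lyrics = song_file.read_text(encoding="utf-8")
            except (OSError, UnicodeDecodeError):
                continue  # corrupt or unreadable file: skip it
            # Prefix each song with the artist name and the song title.
            header = f"{artist_dir.name}\n{song_file.stem}"
            songs.append(f"{header}\n\n{lyrics.strip()}")
    Path(out_file).write_text(f"\n{DIVIDER}\n".join(songs), encoding="utf-8")
    return len(songs)
```

For example, concatenate_lyrics("data", "all_lyrics.txt") would write everything to a file called all_lyrics.txt (a made-up name) next to the data folder.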

That’s it! We now have a pretty decent dataset consisting of the lyrics for ~40,000 rap songs. Now it’s time to teach an AI how to write new lyrics in a similar style.

What did we learn today?

  • How to download song lyrics from Genius using Python
  • How to concurrently send web requests using Python
  • How to concatenate text files using Python

What will we learn the next time?

  • How to load a pre-trained general-purpose language model
  • How to fine-tune a pre-trained language model using your own dataset

To be continued…
