Rap God or Machine?

Silas Strawn
9 min read · Dec 18, 2019


source: https://robotunion.eu/robots-and-music-how-robotics-is-changing-the-musical-landscape/

Music generation has been at the forefront of machine learning research. As of 2018, The Independent reports that rap is the most popular music genre in the United States. Given rap’s large presence in our culture, we decided to build a rap song generator using machine learning.

To accomplish this, we needed to build two generators: one for the lyrics and one for the beat.

Lyric Generation

We experimented with three different kinds of models for writing lyrics, each more successful than the last. First, we tried a character-level LSTM-based model. Second, we implemented a word-level LSTM-based model. Lastly, we used transfer learning to train OpenAI’s transformer-based GPT-2 to generate rap lyrics. For all of these models, we used a Python library to scrape lyrics from the Genius API. You can find the lyrics we generated in this drive folder.

To give some background, Chris Olah explains the concept of Long Short-Term Memory networks (LSTMs) as well as attention (relevant to transformers) in great depth here and here. In short, LSTMs are based on Recurrent Neural Networks (RNNs). At its heart, an RNN differs from a standard neural network in that it has loops which feed the output back into the input, enabling information to persist in the model. For text generation, however, the model often needs information introduced much earlier in the text. Consider the sentence, “I grew up in Germany, therefore I am fluent in German.” It would be hard for a plain RNN to guess “German” because of the distance in the sentence between “Germany” and “German.” LSTMs, however, carry a notion of state through the network, which makes it easier for them to make predictions based on the past. Transformers use an attention mechanism to weight relevant earlier words, further improving the accuracy of next-word prediction.

Lyric Generation: Character-Level LSTM Model with Keras

In our research, we found a basic model for text generation here, which we replicated and experimented with. The model uses a Keras sequential model with LSTM layers to generate text character by character.

We scraped Genius for 20 Kendrick Lamar songs to train on. This amounted to about 80,000 characters. After acquiring this raw data and appending the songs together, we created 100-character-long sequences. These were generated by sliding a window across the data set, not by partitioning it. Each of these sequences represents an input-output pair, with the first 99 characters being the model’s input and the correct output being the 100th character. As outlined in the source above, word-level models tend to be more accurate than character-level ones. However, since the total number of unique characters in a given piece of text is generally much smaller than the number of unique words, a character-level model can cope with a smaller dataset, where individual words may not repeat often enough to learn from. The trade-off is that, for a fixed-size training text, far more character sequences are generated than word sequences, which means many more training examples to process and makes training slower.
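
As a rough sketch of this step (file and variable names are illustrative), the sliding window can be built like this:

```python
import numpy as np

SEQ_LEN = 100  # each training example spans 100 characters

# Illustrative file holding the concatenated Kendrick Lamar lyrics
text = open("kendrick_lyrics.txt").read().lower()

chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

inputs, targets = [], []
# Slide the window one character at a time rather than partitioning the text
for i in range(len(text) - SEQ_LEN + 1):
    window = text[i:i + SEQ_LEN]
    inputs.append([char_to_idx[c] for c in window[:-1]])  # first 99 characters
    targets.append(char_to_idx[window[-1]])               # the 100th character is the label

X = np.array(inputs)
y = np.array(targets)
```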

After acquiring data, we set up and trained our model. We used the Keras sequential model with the following layers:
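
(Shown here as a rough sketch, continuing from the preprocessing above; exact layer sizes may differ from what we used.)

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.utils import to_categorical

vocab_size = len(chars)  # number of unique characters in the corpus

model = Sequential([
    Embedding(vocab_size, 64, input_length=SEQ_LEN - 1),  # 99 input characters
    LSTM(256, return_sequences=True),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    Dense(vocab_size, activation="softmax"),  # probability of the next character
])

model.compile(optimizer="adam", loss="categorical_crossentropy")
# epoch count here is illustrative; a single epoch was not enough (see below)
model.fit(X, to_categorical(y, num_classes=vocab_size), epochs=20, batch_size=128)
```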

We trained this model with the ADAM optimizer and a categorical cross-entropy loss function. When training with just one epoch, the output was nonsensical: just one word repeated over and over. When we trained for more epochs, we were able to produce sensible lyrics. However, we found that the model simply reproduced segments of text found in the training lyrics. This is obviously non-ideal behavior. We want lyrics that make some sort of sense (grammatically) but are also original, not stolen from other artists. At this point, we decided to explore a word-level model instead.

Lyric Generation: Word-Level LSTM Model with Keras

After seeing that the character-level model just replicated the training data, we started exploring word-level NLP models. We again used the Keras sequential model with multiple LSTM layers, this time taking inspiration from Jason Brownlee’s article on the topic. We also experimented with adding attention to this model, but the addition seemed to worsen performance.

In contrast to the character-level model, this one required quite a bit of data pre-processing before it was ready to be trained. We again scraped the Genius API for song lyrics. We tried training both with just Kendrick Lamar songs and with a total of 52 songs from Drake, Kendrick Lamar, Jay-Z, Migos, Logic, and Snoop Dogg. We removed most punctuation from the lyrics and split the input text on spaces. Since we wanted the model to learn to open/close parentheses, end lines, and end songs, all of these were encoded as words. To allow the neural net to work with words, we used a Keras tokenizer to map words to integers. Just as with the previous model, we generated sequences from the lyrics to use as training data. Instead of being 100 characters long, these were 50 words long. When training with multiple artists, we produced 41,446 such sequences and had 4,254 unique words.
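
As a rough sketch (the file name and tokenizer settings are illustrative):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

SEQ_LEN = 50  # each training sequence spans 50 words

# Illustrative file of cleaned lyrics: punctuation stripped, with special
# "words" standing in for open/close parentheses, line ends, and song ends
lyrics_text = open("rap_lyrics_clean.txt").read()

tokenizer = Tokenizer(filters="")  # text is already cleaned, so keep all tokens
tokenizer.fit_on_texts([lyrics_text])
encoded = tokenizer.texts_to_sequences([lyrics_text])[0]
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

# Sliding 50-word windows: the first 49 words are the input, the 50th is the target
sequences = np.array([encoded[i:i + SEQ_LEN]
                      for i in range(len(encoded) - SEQ_LEN + 1)])
X, y = sequences[:, :-1], sequences[:, -1]
```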

The network architecture again contained two LSTM layers:
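
(Again shown as a rough sketch, reusing `X`, `y`, and `vocab_size` from the preprocessing above; exact layer sizes may differ.)

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical

model = Sequential([
    Embedding(vocab_size, 100, input_length=X.shape[1]),  # learn 100-d word vectors
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation="relu"),
    Dense(vocab_size, activation="softmax"),  # distribution over the next word
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(X, to_categorical(y, num_classes=vocab_size), epochs=100, batch_size=128)
```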

After setting up this neural net, we trained it with the ADAM optimizer and categorical cross-entropy loss for 100 epochs with a batch size of 128. We generate songs by first providing a seed text for the model to continue, selected at random from the song data. We decided not to include this first seed in the song, since it isn’t produced by the model. Once you have the generated text, you can feed it back in as the new seed and produce output until the neural net generates the word that marks the end of a song.
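
A minimal version of this generation loop might look like the following (the end-of-song token name is a placeholder for whatever marker the lyrics were encoded with):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_song(model, tokenizer, seed_text, seq_len=49, end_token="songend", max_words=500):
    """Feed the model its own output one word at a time until the end-of-song
    token (or a length cap) is reached. The seed itself is not returned."""
    words = []
    text = seed_text
    while len(words) < max_words:
        encoded = tokenizer.texts_to_sequences([text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_len)  # keep only the last seq_len words
        probs = model.predict(encoded, verbose=0)[0]
        next_word = tokenizer.index_word.get(int(np.argmax(probs)), "")
        if next_word == end_token:
            break
        words.append(next_word)
        text += " " + next_word
    return " ".join(words)
```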

The results of this model were mixed. Unfortunately, training on multiple artists rather than just Kendrick Lamar worsened the results: several parts of the generated song were taken directly from the training data in the multi-artist attempt. The model trained only on Kendrick Lamar lyrics, however, didn’t just copy the training data. This puts the word-level model above the character-level model we tried above. Sadly, the lyrics are not very grammatically correct or coherent. Thankfully, the GPT-2 model we explore below improves drastically on grammar and coherence.

Lyric Generation: GPT-2

GPT-2 is a transformer-based model. The original version was trained on 40GB of text scraped from the internet and has over 1.5B parameters, but it was initially withheld due to concerns about misuse. Smaller versions were released, however, including one with 774M parameters and one with 345M parameters. We chose the 345M-parameter model, as the 774M-parameter model ran out of memory when we attempted to train with it. Based on the lyrics generated, this seemed to be the most successful of the three models we tried. It appeared to have a better understanding of grammar and context than the two prior models. This performance makes sense: GPT-2 was already trained on a huge amount of data, which we simply could not match in the models we trained ourselves due to our limited resources. These results demonstrate the power of transfer learning. Unfortunately, GPT-2 also occasionally replicates lyrics found in the training data. However, the original lyrics it generates tend to be of higher quality than those written by the LSTM models.

If you’d like to see the code for the scraper, it is here.

To create a project like this, you first need a dataset. In our case, we scraped Genius.com for lyrics. First, you need to get an API key using the instructions here. We used the LyricsGenius library to gather various rappers’ lyrics.
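
As a rough sketch (replace the token placeholder with your own API key; the song count and output filename are illustrative):

```python
# pip install lyricsgenius
import lyricsgenius

genius = lyricsgenius.Genius("YOUR_GENIUS_API_TOKEN")  # key from the Genius API instructions above

artists = ["Kendrick Lamar", "Drake", "Jay-Z", "Migos", "Logic", "Snoop Dogg"]
with open("rap_lyrics.txt", "w") as f:
    for name in artists:
        artist = genius.search_artist(name, max_songs=10, sort="popularity")
        for song in artist.songs:
            f.write(song.lyrics + "\n")
```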

If you’d like to see the code for the model, it is here. The lyrics we generated can be found here.

After acquiring data, one can use a pre-trained model (in our case, GPT-2) to generate text. I updated this repo to use the 345M model over the 117M model.
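
If you’d rather not use that repo directly, the gpt-2-simple library provides an equivalent workflow; here is a rough sketch (training steps and sampling settings are illustrative, and newer releases of the library label the 345M checkpoint as "355M"):

```python
# pip install gpt-2-simple
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="345M")  # the mid-sized GPT-2 used in this project

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="rap_lyrics.txt",  # the scraped lyrics from earlier
              model_name="345M",
              steps=1000)

# Sample new lyrics from the fine-tuned checkpoint
gpt2.generate(sess, length=300, temperature=0.9, prefix="[Verse 1]")
```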

Music LSTM Model

The rap beat is composed of distinct layers, including the drum line, bassline, and melody. While we initially trained a classical music model with midi files that compressed all instruments into a single instrument, we found it necessary to isolate the various layers of the song, build separate models, and recombine them to create the most effective beat. However, this proved difficult due to the lack of data on the web, and we recommend creating your own training data for optimal results.

For the actual rhythm and “song” generation, we found a model that made use of Music21 and Keras to generate music using an LSTM. As with the text generation model, the music model also used a Keras sequential model with LSTM layers. This Keras model makes use of the layers listed below:
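
(A rough sketch following the tutorial the model is based on; `network_input` and `n_vocab` come from the Music21 parsing described in the next paragraph, and the optimizer may differ from what we used elsewhere.)

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, Activation

# network_input: (num_sequences, sequence_length, 1) array of encoded notes/chords
# n_vocab: number of distinct notes/chords in the training midi files
model = Sequential([
    LSTM(512, input_shape=(network_input.shape[1], network_input.shape[2]),
         return_sequences=True),
    Dropout(0.3),
    LSTM(512, return_sequences=True),
    Dropout(0.3),
    LSTM(512),
    Dense(256),
    Dropout(0.3),
    Dense(n_vocab),
    Activation("softmax"),  # probability over the next note or chord
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```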

The model uses Music21 to transform a song (in midi file format) to a sequence of notes and chords. From here the model is created with the layers listed above. More detail about the model can be found here. Our source for midi files is here (and YouTube).
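
A rough sketch of that transformation (the directory name is illustrative):

```python
import glob
from music21 import converter, note, chord

notes = []
for path in glob.glob("rap_midi/*.mid"):
    midi = converter.parse(path)
    for element in midi.flat.notes:
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))                             # e.g. "C4"
        elif isinstance(element, chord.Chord):
            notes.append(".".join(str(n) for n in element.normalOrder))  # e.g. "0.4.7"
```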

Originally, the generative LSTM model was used for classical music. For our purposes, we trained it on 30 rap midi files. We found that the music generated was garbled and bore little, if any, resemblance to the structure of rap music. After familiarizing ourselves with the model and listening to the rap midi files we created, we realized that the classical music the original model used only made use of one instrument (the piano).

We were able to create this sample piece with 10 epochs.

We were able to create this sample piece with 20 epochs.

In our next attempt, we built a model that used only drum beats and controlled for tempo and style. Using GarageBand, we isolated the drum line from some of the original midi files we had trained on. It proved difficult to find drum-line data for rap beats online, so ultimately we had only 10 songs in the training set. For optimal results, we recommend generating your own training data in a studio. We then retrained the model above on this modified data set and were able to produce music that sounded like this.

There are two main improvements that could be made to the model. First, change the model to predict beats instead of notes, since tempo and beat duration are the defining characteristics of a rap beat. Second, the model would ideally train on raw wav files instead of midi files so that timbre can be taken into account when generating the beat. This would require more computing power and a restructuring of the model as a whole, since the Music21 library works best with midi inputs. Similarly, a model that produced wav files instead of midi files would sound more realistic.

Conclusion

There is a lot of potential for creating a rap song generator. For lyric generation, we found promising results with LSTM-based word models; however, the lyrics generated through transfer learning with GPT-2 were far better. For music generation, we found that the LSTM model was most effective. We were able to repurpose classical music generation models to create the rap drum line. However, the ideal model would use wav files in place of midi files as both the training and the output data so that timbre is preserved.
