Create a Rapping AI using deep learning

Part 2: Fine-tune a pre-trained GPT-2 instance for a specific task

Max Leander
Analytics Vidhya
7 min read · Dec 6, 2019


Greetings all Data Travellers and ML Superheroes!

Welcome to Part 2 of this series, where I am trying to build an AI that can come up with entirely new rap lyrics in the style of famous rappers, and then turn the lyrics into an audio track!

Make sure that you have read Part 1 of this story, where we learned how to efficiently collect a pretty big dataset of rap lyrics using concurrent Python.

This week, we are going to train a model to generate new rap lyrics for us, and for this we will use deep learning.

Generative models for natural language have made very fast progress during the last few years. As I warned you in Part 1, this is going to be a very hands-on tutorial, so I will not dig into the theory of NLP models here. But if you want to understand how these models work in detail (and I think that you should), I highly recommend reading the following easy-to-understand blog posts:

Illustrated guide to Recurrent Neural Networks

Exploring LSTMs

The Transformer explained — Attention is all you need

For our rapping AI, we are going to use a pre-trained version of the GPT-2 model, which is based on the Transformer, and was released by OpenAI in February 2019. The model quickly became famous for not being released to the public…because it was too good! You can read the original paper here: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

A smaller, 345-million-parameter version of the model was, however, publicly released, and later an even bigger 774-million-parameter version followed. In this project, we will load the pre-trained 345M GPT-2 model and fine-tune it using the dataset of rap lyrics that we compiled in Part 1. This can be done on the Google Colab platform in order to utilize GPU training for free! (The 774M model is unfortunately too big for Google Colab at the time of this writing, but make sure to experiment with it if you have access to a beefier environment.)

In layman’s terms, you could say that GPT-2 has a long-term and a short-term memory, just like the brain. (I know, GPT-2 is based on attention, not LSTM, but I think the analogy still holds.) GPT-2 is trained in a semi-supervised fashion, which means that both supervised and unsupervised learning are used.

It was first trained in an unsupervised way for a very long time, on very large amounts of data. The purpose of this was to learn the relationships between different terms and the ways in which they contribute to different contexts. These relationships constitute low-level language concepts such as grammar, structure and style, and they represent the long-term memory in the brain analogy.

By fine-tuning, GPT-2 will learn the context and topics that are currently important to us, while keeping its long-term memory, i.e. the sense of grammar, structure and style that it picked up during the unsupervised training. The fine-tuning is done by training the model in a supervised way: we hand it specific input/output pairs and later have the model predict the output given new input. In our case, this new input could be the name of a favourite rapper, and hopefully the model will be able to spit out believable lyrics in the style of that rapper!

The first thing you should do is head over to Google Colab and create a new notebook with GPU support. Github user nshepperd has already created a nice repo to make it easy to experiment with the GPT-2 model, so we are going to clone that into our Google Colab workspace by running the following in the notebook:

!git clone https://github.com/nshepperd/gpt-2.git
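
Before moving on, it is worth confirming that the runtime actually has a GPU attached (you can enable one under Colab’s runtime settings). A quick sanity check:

!nvidia-smi

If this prints a table describing a GPU, you are good to go; if not, double-check the runtime type, because the fine-tuning step later on will otherwise be painfully slow.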

Next, we need to install all required Python packages:

%cd gpt-2
!pip3 install -r requirements.txt

In order to access the dataset from Colab, I have copied the text file with rap lyrics to my Google Drive. It can be easily accessed by mounting it in the Colab environment, using the following command:

from google.colab import drive
drive.mount('/content/drive')

This will bring up a URL that you need to click in order to authorise access.

To make sure that all of our input and output is decoded and encoded using UTF-8, we set the following environment variable:

%env PYTHONIOENCODING=UTF-8

When this is done, it is time to load the pre-trained GPT-2 (or rather, the lobotomised version of it) into memory. Run the following script from nshepperd's fabulous repository:

!python3 download_model.py 345M

The argument 345M tells the script to download the 345 million parameter version of GPT-2. Next, it’s time to load the training data which will be used for fine-tuning:

!cp -r /content/drive/My\ Drive/data/rap_training_data.txt /content/gpt-2/rap_training_data.txt

Make sure that you adjust the Google Drive path according to wherever you uploaded the training data.
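
To double-check that the file arrived intact, you can peek at the first few hundred characters (a minimal sketch, assuming the destination path used in the copy above):

# Print the beginning of the training data as a sanity check
with open('/content/gpt-2/rap_training_data.txt', encoding='utf-8') as f:
    print(f.read(500))

You should recognise the artist markers and lyrics from Part 1.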

Now, all we need to do is run the training script from nshepperd:

!PYTHONPATH=src ./train.py --dataset /content/gpt-2/rap_training_data.txt --model_name '345M'

If you can read Python code, I encourage you to head over to the repository, https://github.com/nshepperd/gpt-2/blob/finetuning/train.py, and study the code to find out what actually happens.

Basically, the dataset is turned into input/output pairs, where the input is a sequence of tokens and the target is the next token in that sequence (the model learns to assign it a high probability). Traditional RNN approaches include training on single characters (very slow to train, but able to handle any variant of case and format) and training on entire words (much faster, but harder to adjust for case and format, and completely unable to come up with new cool hiphop expressions, yo furr reaaallllzz). GPT-2 instead pre-processes the text with byte-pair encoding, which compresses it into a vocabulary of sub-word units that still carry grammatical and stylistic meaning. This turns out to be a very good compromise between the two!
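
To get a feel for what this encoding actually does, you can tokenise a line yourself. The snippet below is purely illustrative and uses the Hugging Face transformers tokenizer (which implements the same GPT-2 byte-pair-encoding vocabulary) rather than the encoder module inside the repository; it also shows how a token sequence becomes input/target pairs for next-token prediction:

# Illustration only: this tokenizer uses the same byte-pair-encoding vocabulary
# as GPT-2 (run pip install transformers first, if it is not already available)
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

line = "I'm rolling with my homies furr reaaallllzz"
token_ids = tokenizer.encode(line)

# Common words tend to stay whole, while made-up slang gets split into sub-word pieces
print(tokenizer.convert_ids_to_tokens(token_ids))

# For training, the targets are simply the inputs shifted by one position
inputs, targets = token_ids[:-1], token_ids[1:]
print(list(zip(inputs, targets))[:5])  # (current token, next token) pairs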

During training, chunks of text are sampled from our dataset and fed to TensorFlow, which uses these examples to update the weights of the pre-trained GPT-2 model.

Every 100 training steps, the script will print some randomly generated output, and it will regularly save a checkpoint. The generated samples should look more and more aligned with the training data over time, while not being direct copies of any of the content (if that happens, the model is overfitting).

When you think you are done training, just interrupt the kernel in Google Colab. Make sure to save your checkpoint to Google Drive, if you want to continue training later:

!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/rapping_ai
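
To continue training in a later session, copy the checkpoint back from Drive before rerunning the training command. As far as I can tell, nshepperd's train.py restores from the latest checkpoint under checkpoint/run1 by default, so something like this should pick up where you left off (a sketch, assuming the Drive path used above):

!cp -r /content/drive/My\ Drive/rapping_ai/checkpoint /content/gpt-2/
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/rap_training_data.txt --model_name '345M'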

nshepperd conveniently included a script to interactively sample from the model by giving it custom input and observing the output. Just copy your fine-tuned model to the appropriate folder and run the script:

!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/345M/
!python3 src/interactive_conditional_samples.py --top_k 40 --model_name "345M"

After training for ~2,000 steps, I fed Yung Lean, the famous rapper from Sweden, to the model by giving it the prompt <<Yung Lean (since this is how our training data is formatted, see Part 1), expecting the model to fill in a song title and some lyrics in the style of Yung Lean. Here goes:

<<Yung Lean — I Will Rock You Like A Nip>>

[Produced by D-Skeet & D.C. (Puff Daddy & Rick Ross)]

[Verse 1: Yung Lean]

You better believe me

She ain’t just a girl you can be with

You can be with my girl, girl

She’s like a lover for me

She makes my whole world feel small

She make my life so happy

I be on a roll as a girl

Cause I be rolling with you

[Chorus: Yung Lean]

I got a lot of lovers

I get a lot of love for the people that I love

I got people that they couldn’t stand or hate

They got it like we’re a family

It’s a love that’s so deep

I’m so proud I gotta give you a (guitar riff)

[Verse 2: Yung Lean]

I come from a country that’s very rich

I wanna give you something I need, you know (tape roll)

I’m from a country that’s very rich

I go to the same school that all my homies go there

I’m going to grad school and I want to get the world

I don’t know what you wanna give me

But I gotta give it away

Because it’s just so much love, it’s just so much love

[Chorus: Yung Lean]

I get a lot of lovers

I get a lot of love for the people that I love

I got people that they couldn’t stand or hate

They got it like we’re a family

It’s a love that’s so deep

I’m so proud to do the things that I do

That I put it on the shelf

’Cause it’s a love that’s so deep

I’m so proud I gotta give you a (guitar riff)

[Verse 3: Yung Lean]

A lot of love for the people that I love

A lot of love for the people that I love

It’s all love

It’s all love

It’s all love

It’s, everything (guitar riff)

Pretty good!

Actually, I have no idea whether this would pass as a real Yung Lean song. But at least it got the basic structure of a rap song right. And the topic of love is definitely consistent throughout the song.

Now go ahead and create your own dataset (like I did in Part 1) and generate something genuinely new in your domain of interest!

What did we learn today?

  • How to load a pre-trained general-purpose language model called GPT-2
  • How to fine-tune this model for a specific purpose using our own dataset

What will we learn the next time?

  • How to synthesise speech from text
  • How to match syllables to a fixed beat
  • How to use deep learning to transfer the audio style of your favourite rapper

I realise that the goals of the next lesson are a bit harder to accomplish than what has been done so far in this series, but slow and steady wins the race!

In other words, stay tuned…
