How to build and deploy a lyrics generation model — framework agnostic
You’ll find tons of articles about how to build a machine learning model. You’ll find fewer articles on how to consume one intelligently. And you’ll find almost none about how to serve one from scratch.
I’ll detail the steps that led us to the product you can see above: raplyrics.eu
All the code is open source and available on GitHub.
- RapLyrics-Scraper
- RapLyrics-Back
- RapLyrics-Front
Update: We propose a more cost-effective and more straightforward way to serve your machine learning project in this post.
What?
A good friend of mine and I love listening to rap music. Rap is powerful because it can create a savage punchline with only a few words.
Since it is still hard to generate long texts with RNNs, we believed rap lyrics were a great candidate.
How?
The big picture
I won’t be too descriptive about implementation in this post since we tried to be exhaustive inside the code repositories (see the READMEs). Instead, I will insist on the tipping points that were challenging for us.
Basic sysadmin knowledge and Unix proficiency will help.
1- Data Extraction and processing
GitHub repository: RapLyrics-Scraper
— scraping
First, we need a dataset to train our neural network.
Luckily enough, Genius.com has tons of lyrics available online and even a nice API.
The API isn’t designed for scraping lyrics, but with some workarounds we managed to build a lyrics scraper on top of it.
Check the source code or reach out in comments if you need technical details.
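As an illustration, here is a minimal sketch of the approach, not our actual scraper: the API token placeholder is hypothetical, and the `data-lyrics-container` markup is what Genius song pages used at the time of writing. The official API only locates a song; the lyrics themselves have to be pulled from the song page HTML.

```python
import re
import json
from html.parser import HTMLParser
from urllib import request as urlrequest, parse as urlparse

GENIUS_API = "https://api.genius.com"
TOKEN = "YOUR_GENIUS_API_TOKEN"  # hypothetical: create one at genius.com/api-clients

def search_song(title, artist):
    """Use the official API to find a song page URL (the API itself
    does not return lyrics, hence the HTML-scraping workaround below)."""
    query = urlparse.urlencode({"q": f"{title} {artist}"})
    req = urlrequest.Request(
        f"{GENIUS_API}/search?{query}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urlrequest.urlopen(req) as resp:
        hits = json.load(resp)["response"]["hits"]
    return hits[0]["result"]["url"] if hits else None

class LyricsExtractor(HTMLParser):
    """Collect the text inside tags marked data-lyrics-container="true"
    (the markup Genius used at the time of writing)."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # >0 while inside a lyrics container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "br":               # line breaks inside lyrics
                self.chunks.append("\n")
            else:
                self.depth += 1
        elif dict(attrs).get("data-lyrics-container") == "true":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag != "br":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_lyrics(html):
    parser = LyricsExtractor()
    parser.feed(html)
    text = "".join(parser.chunks)
    # drop section markers like [Verse 1] or [Chorus]
    return re.sub(r"\[.*?\]\n?", "", text).strip()
```

The same idea works with any HTML parsing library; the stdlib parser is used here only to keep the sketch dependency-free.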
After multiple attempts, we realized that it’s really important to focus on a high-quality dataset for natural language processing. We decided to focus on the 60 most popular songs of 40 US artists.
✔ That’s it for the scraping.
— pre-processing
The scraping part provides us with a .txt dataset. We now have to clean it, i.e. remove non-lyrical content: ©, ®, credits, typos, and different spellings of the same word. Think about gettin', getting and the like.
Methodology we followed:
1. Identify patterns to eliminate
2. Craft regex catching those patterns — resource for regex testing: pythex.org
3. Use a text editor to perform those regexes directly on the dataset
If you want to automate the regex cleaning, be aware that it is risky: you’ll have to consider carefully the order in which the regular expressions are applied.
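The manual steps above can be sketched in Python as an ordered rule list; the patterns here are illustrative, not the exact rules we used:

```python
import re

# Ordered list of (pattern, replacement) rules. Order matters:
# each rule sees the output of the previous one.
CLEANING_RULES = [
    (r"[©®]", ""),                    # editorial symbols
    (r"(?m)^\[.*?\]\s*$\n?", ""),     # section tags like [Chorus]
    (r"(?i)\bgettin'", "getting"),    # normalize spelling variants
    (r"(?m)[ \t]+$", ""),             # trailing whitespace
    (r"\n{3,}", "\n\n"),              # collapse runs of blank lines
]

def clean_lyrics(text):
    for pattern, repl in CLEANING_RULES:
        text = re.sub(pattern, repl, text)
    return text.strip()
```

For example, `clean_lyrics("[Chorus]\ngettin' money ©\n\n\n\nyeah")` yields `"getting money\n\nyeah"`.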
— augmenting the dataset [optional]
We chose only artists with really meaningful lyrics and selected their most popular songs. That does not make a huge corpus. Hence, we decided to perform a data augmentation step to virtually increase the size of our dataset.
📖 Data augmentation means increasing the number of data points. In our context, it means increasing the number of sentences.
We copied our dataset, shuffled all the verses and pasted this back at the end of the original dataset.
You can find a snippet on how to shuffle paragraphs here.
With this trick we doubled the size of our dataset, which has a positive impact on the training of the neural network: since each new batch is different due to the shuffling, the network weights are updated with different inputs.
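A minimal sketch of the shuffle-and-append trick, assuming verses are separated by blank lines:

```python
import random

def augment(text, seed=1337):
    """Double the dataset by appending a verse-shuffled copy of it.
    Verses are assumed to be separated by blank lines."""
    verses = [v.strip() for v in text.split("\n\n") if v.strip()]
    shuffled = verses[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible runs
    return "\n\n".join(verses + shuffled)
```

The output keeps the original corpus first, then the shuffled copy, so every verse appears exactly twice.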
✔ That’s it for the data augmentation.
2- Building a lyrics generative model
- GitHub repository: RapLyrics-Back
— dimensioning the text generative model
Many neural network implementations are available online. We chose one and fine-tuned it to fit our needs: textgenrnn, a Python project for text generation using neural networks.
You can find a basic description of the model’s hyperparameters and the training settings in our code repositories’ READMEs.
The purpose of this article is not to deep-dive into neural network design, so the implementation won’t be detailed here. You can check the source code or ping us in the comments.
— training the text generative model
Depending on your dataset and your configuration, you may consider cloud computing to speed up the training. We used AWS (Amazon Web Services).
If you train your model locally, you can skip this part. Otherwise, be aware that the following gets a bit technical.
I will describe our training setup in some detail since it took us time to get right.
We launched an aws ec2 spot instance to reduce our costs. We needed at least 3 GB of RAM, and the default 8 GB SSD is enough. The training was not GPU-accelerated (a possible improvement).
How is an ec2 spot instance different from a classical ec2 instance?
You bid for an ec2 instance with certain specs and, as long as your bid is above the market price, you get an instance behaving like a classic ec2. If your bid falls below the market price, your instance is terminated after a short notice. More info on spot instances.
We made a spot request, which was fulfilled in no time; we then cloned our repo and installed a Python 3 virtual env with all the project requirements.
Note: You need to enable your instance to write to an s3 bucket if you want to save your model checkpoints (as seen 👇)
textgenrnn saves a model checkpoint at each epoch.
- To cope with the risk of instance termination and keep our checkpoints in a safe place, we use the aws cli to copy them to an aws s3 bucket: cd to your checkpoint files and copy them to your s3 bucket.
# run `pip install awscli` beforehand
aws s3 cp my-checkpoint-file.ckpt s3://my-s3-bucket/model-saves/
Note: To make this possible, you need to grant your ec2 instance write access to s3. To do this, add a role to your ec2 instance with the s3 full access and ec2 full access policies, as described in the screenshot below.
There are many sneaky details with policy handling, don’t hesitate to ask us in comments.
— testing the text-generation
Once you have trained your model, you can use the Jupyter notebook RapLyrics-Back/exploration/sample_explorator.ipynb to generate your first AI-powered lyrics.
3- Serving the text generative model
To provide users with better lyrics, we use a custom generation function.
We serve the app using gunicorn over Flask. The idea is not to reload the model at each API call, which would lead to long response times.
We restore the session only once, at app initialization, and it persists between API calls.
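Sketched minimally, with a dummy model standing in for the real TensorFlow session restore (the actual logic lives in the get_model_api_us function in api/serve_us.py), the pattern looks like this:

```python
from flask import Flask, request, jsonify

def load_model():
    """Stand-in for the real model restore: the actual app restores a
    TensorFlow session here, once, when the module is imported."""
    return lambda seed: seed.upper()  # dummy "model" for illustration

app = Flask(__name__)
model = load_model()  # runs once per worker, not once per request

@app.route("/apiUS", methods=["POST"])
def generate():
    seed = request.form.get("input", "")
    # every call reuses the same in-memory model: no reload latency
    return jsonify({"lyrics": model(seed)})
```

Running gunicorn app:app imports the module once per worker, so load_model() executes a single time and every subsequent request reuses the same session.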
Demo of the call to the API and its response.
If you haven’t implemented the model yet, feel free to call our API:
curl 'https://raplyrics.eu/apiUS' -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "input=struggle"
See the get_model_api_us function in api/serve_us.py for how we set up a persistent TensorFlow session. Simply run gunicorn app:app from the shell to launch the app. The model is served on 127.0.0.1:8000 by default.
You can now clone RapLyrics-Back on the machine that will be used as a web-server.
4- Plugging in the front end
GitHub repository: RapLyrics-Front
This article describes the necessary steps for an apache web server. If you don’t have it: sudo apt-get install apache2.
Move all the files to be served from RapLyrics-Front to /var/www/html/. Remember to update the "url" setting of your endpoint in index.html.
That’s it, you’re done (kind of).
You can now access the website by entering your server’s IP in a web browser.
— production set-up [optional]
These are the next steps if you want to have the front-end and back-end on the same machine with an https connection.
1. Let’s encrypt our website 🔒 → follow the steps in How To Secure Apache with Let’s Encrypt (Digital Ocean has really awesome tutorials)
2. Our index.html, served by apache, calls raplyrics.eu/apiUS when the user submits an input. In fact, there is no /apiUS route on apache: we need to redirect this call to the gunicorn server running on this very same machine. This is what is called reverse proxying.
Let’s handle these two steps.
Since the code is related to apache configuration, it is not version controlled.
- Go to /etc/apache2/sites-available. You should see a 000-default.conf and a 000-default-le-ssl.conf file. They are template files configuring how apache serves your http and https (le-ssl) website.
We make a copy of them for our website (replace raplyrics.eu with your domain name 👇):
sudo cp 000-default.conf raplyrics.eu.conf
sudo cp 000-default-le-ssl.conf raplyrics.eu-le-ssl.conf
1. Redirect traffic from http to https
Edit raplyrics.eu.conf to include the rewrite conditions below:
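The original embedded snippet is not reproduced here; a minimal version of the rewrite block, close to what certbot itself generates, would look like the following (replace raplyrics.eu with your domain, and make sure mod_rewrite is enabled with sudo a2enmod rewrite):

```apache
<VirtualHost *:80>
    ServerName raplyrics.eu
    # send every plain-http request to its https equivalent
    RewriteEngine on
    RewriteCond %{SERVER_NAME} =raplyrics.eu
    RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent]
</VirtualHost>
```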
2. Reverse proxy the API call
Edit raplyrics.eu-le-ssl.conf to include the reverse proxy instructions. This is where we handle the proxy pass from raplyrics.eu/apiUS to the local gunicorn server at 127.0.0.1:8000
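As a hedged sketch of those instructions (mod_proxy and mod_proxy_http must be enabled: sudo a2enmod proxy proxy_http), the relevant lines inside the le-ssl virtual host would look like:

```apache
<VirtualHost *:443>
    ServerName raplyrics.eu
    # forward only the API route to the local gunicorn server;
    # everything else is served by apache as static files
    ProxyPreserveHost On
    ProxyPass        /apiUS http://127.0.0.1:8000/apiUS
    ProxyPassReverse /apiUS http://127.0.0.1:8000/apiUS
    # certbot's SSL directives stay as they are
</VirtualHost>
```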
Now we tell apache to update the website configuration:
sudo a2ensite raplyrics.eu.conf
sudo a2ensite raplyrics.eu-le-ssl.conf
Finally, sudo systemctl restart apache2
to take the changes into account.
That’s it, you’re in production. 🚀
You can check ours on raplyrics.eu
References 📚
Interesting blog post on serving a Python app on Heroku (Heroku dynos could not handle our app: not enough RAM); the steps are well described.
Digital Ocean on reverse proxying with apache; once again, Digital Ocean does a much appreciated and very detailed documentation job.
Very interesting SO post on how to build a CI/CD pipeline with GitLab and AWS.
Inspiration for the neural network parameters fine-tuning.
Interesting Cross Validated post on fine-tuning the batch-size training parameter.
Google Brain paper proposing a set of hyperparameters for a text-generative LSTM (see especially sections 4, Experiment, and 4.1, Language modeling, for our use case).