SchlagerAI — Automatically generating pop lyrics using language models

Daniela Heinz
Oct 11 · 14 min read

At Motius, we are working hard to expand our knowledge on the latest advancements in Natural Language Processing (NLP). We used our Discovery hackathon to experiment with pre-trained German language models (provided over Hugging Face by the MDZ Digital Library team at the Bayerische Staatsbibliothek). For this, we built an AI which promised to produce the next hit you’ll soon be partying to in the beer tent.

Here at Motius, we strive to stay up to date with current trends in technology. Our platform to do so is called Discovery, and it consists of a 4-day hackathon-style session where we do hands-on experiments with emerging technologies. This is where SchlagerAI was born.

Back in mid-2019, we made our first attempt at building a model for generating Schlager lyrics. For those of you who are not familiar with Schlager, it is a style of European pop music especially famous in Germany. You can normally find it at Oktoberfest, in Après-Ski bars, and on the beaches of Majorca. It is fun, catchy, and often has simple lyrics covering themes of love, partying, and mainstream current events. During our first SchlagerAI project, we spent less time on the model itself than on building up a training, evaluation, and deployment pipeline. In our most recent Discovery session, we wanted to revisit our work on SchlagerAI.

More specifically, we leveraged open-sourced, pre-trained language models published on HuggingFace’s model repository in order to generate Schlager songs in the German language. In the following sections, I will walk you through the process of building your own lyrics generation model.

Step 1: Find a pre-trained model

In recent years, the Transformers library by HuggingFace has gained a lot of popularity for its open-sourced implementations of state-of-the-art Natural Language Processing (NLP) architectures. In addition, HuggingFace provides a hub for sharing model weights and data sets. This has made building NLP applications that apply recent advancements more accessible to those in industry and research alike.

Due to the massive amount of data and compute power required to adequately train a language model, we decided to search the model hub for implementations of text generation models trained on a German corpus. Luckily, the model hub provides filters for tasks (Text Generation) and languages (de), making it quite easy to narrow down our search field.

After exploring our options, we decided to go with the model provided by the Digital Library team of the Bavarian State Library’s Munich Digitization Center (DBMDZ). Since we chose to work with the aitextgen library (see Steps 4 and 5), we needed a GPT-2 style model, as that is the kind of model the library works well with. The model weights provided by DBMDZ were produced by training on "a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl." The data set consists of 16GB of data and has over 2 billion tokens.

This language model acts as a starting point for building our SchlagerAI model. When we prompt the model as is, the output does not yet look much like a Schlager song.

<pre><code>
Prompt:

[Intro]
Wir stehen an der Bar

Text:

[Intro]
Wir stehen an der Bar:
- Das Lokal ist vollgestopft mit Musik und Getränken, aber wir können auch noch Bier trinken.
Es wird eine Menge neuer Freunde geben…
- Ein Pub hat sich für einen Besuch eingerichtet -
wir werden bald nach London reisen!
Und wo kommt das hin?
- Das Pub ist eine halbe Stunde entfernt von den Theatern in Brighton.
Am Samstag Abend haben wir einen sehr interessanten Vortrag über die Geschichte des Londoner Clubs gehalten;
er war ein hervorragender Redner, er hat alles gegeben, was man sich für Theaterunterricht wünschen konnte.
Das Essen sollte echt toll sein, die Bar wäre super organisiert gewesen und würde uns sicherlich gefallen.
Auf jeden Fall wird es wieder ein Konzert am Sonntag morgen machen.
Mit dem Zug zurück nach London
Als wir die Stadt verlassen, fährt sie Richtung Westen ab:
in dem Viertel von Brighton, das wir “The Roadman” nennen, fahren wir auf die A2 und dann weiter bis zu den U-Bahnhöfen Kensington & Chelsea.
Nach 15 Minuten sind wir auf der Kings Road angekommen.
Hier gibt’s eine ziemlich große Auswahl an großartigen Kneipen und Pubs.
Außerdem hat man hier einige der besten Restaurants auf der Welt und
</code></pre>

Clearly, we need to do some work to make the generated text look and sound like a song. In the next steps, I will walk you through the process of tuning the model to generate text which could pass as a song.

… but what is a language model?

In its simplest form, a language model is just a method for producing a probability distribution over a sequence of tokens. A token can be a word, a subword, or even just a character. When used for NLG, a language model takes a sequence of tokens as input and outputs a probability distribution for the next token(s). You can then leverage this distribution for stochastically generating text (see Step 5).
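To make the idea concrete, here is a toy example that is far simpler than the transformer models used in this article: a bigram model that estimates the distribution of the next token from word-pair counts. The function names and the mini corpus are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = line.lower().split()  # naive whitespace tokenizer
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def next_token_distribution(counts, token):
    """Turn the counts for `token` into a probability distribution."""
    following = counts.get(token.lower(), Counter())
    total = sum(following.values())
    if total == 0:
        return {}  # token never seen with a successor
    return {t: c / total for t, c in following.items()}


corpus = [
    "wir stehen an der bar",
    "wir feiern an der bar",
    "wir stehen an der theke",
]
model = train_bigram_model(corpus)
print(next_token_distribution(model, "der"))
```

A GPT-2 style model does the same job in principle, but it conditions on the whole preceding sequence instead of only the last token.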

Current state-of-the-art approaches for creating language models leverage the transformer architecture (examples include the BERT and GPT architectures and their derivative works). For more information about how they work, check out this illustrated blog by Jay Alammar which provides an exceptionally good visual introduction to the topic.

Step 2: Build a fine-tuning data set

In order to make our output more Schlager-like, we need to gather a large set of lyrics from Schlager songs to help fine-tune our model. However, if you do not already have in mind the complete discography of the top Schlager hits from the past decades, it is going to be difficult to manually generate a list of songs large enough to work as a data set. To handle this, we turn to Spotify and Genius for help.

Spotify provides developers with a Web API that allows us to automatically scrape information about playlists, artists, and songs. Using spotipy (a Python wrapper around the developer API), we can build a scraper that generates a set of Schlager artists. We then pass these artists to Genius to get the lyrics of their top songs using LyricsGenius. The following walks you step by step through generating your own lyrics data set.

A. Register for Spotify Web API

Register for the Spotify Web API and create an app to get the login credentials. Follow the steps in the documentation for more info.

B. In a python script, initialize the Spotipy API

<pre><code>
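import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = "<insert client_id>"
secret = "<insert_secret>"

# Authenticate with the client credentials of the app created in step A
sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id=client_id,
        client_secret=secret,
    )
)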
</code></pre>

C. Find a playlist on Spotify and get the playlist id

In the Spotify web player, navigate to the playlist that you want to scrape. The playlist id can be read off the URL of the playlist page.

D. Scrape the songs and artists from a playlist

<pre><code>
def scrape_spotify_playlist(playlist_id):
    pl = sp.playlist(playlist_id)

    songs = []
    for item in pl["tracks"]["items"]:
        # Skip playlist entries that are not flagged as regular tracks
        if not item["track"]["track"]:
            continue
        songs.append(
            {
                "artists": [artist["name"] for artist in item["track"]["artists"]],
                "name": item["track"]["name"],
            }
        )

    return songs


songs = scrape_spotify_playlist("<insert_playlist_id>")
</code></pre>

E. Register for Genius API

Register to get your access token at https://genius.com/api-clients

F. Initialize LyricsGenius

<pre><code>
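from lyricsgenius import Genius

token = "<insert_token>"

genius = Genius(token)
genius.skip_non_songs = True  # ignore results that are not songs (e.g. track lists)
genius.timeout = 10  # seconds to wait for a response
genius.retries = 3  # number of retries for failed requests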
</code></pre>

G. Scrape songs for an artist

<pre><code>
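def scrape_genius_artist(artist_name, num_songs=7):
    artist = genius.search_artist(
        artist_name,
        max_songs=num_songs,
        include_features=False,
        get_full_info=False,
    )
    # Nothing to save if Genius does not find the artist
    if artist is None:
        return

    # Save each song as a JSON file named after its Genius id
    for song in artist.songs:
        song.save_lyrics(f"data/raw/{song.id}", sanitize=False, overwrite=True)


# Use the first listed artist of each scraped song, without duplicates
artists = {song["artists"][0] for song in songs}
for artist_name in artists:
    scrape_genius_artist(artist_name)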
</code></pre>

… but what makes a good data set?

  • Relevance. Lyrics should be from German Schlager songs. Of course, some Schlager does include lines/phrases in other languages, so it is not the worst thing if there are some songs with English/another language. Nevertheless, you do not want to include lyrics from other genres like German metal or Swedish folk music, as they aren’t representative of Schlager lyrics.
  • Quantity. In the ML field, there is a rough rule of thumb stating that your model should train on at least an order of magnitude more examples than it has trainable parameters. For our problem, this is challenging, as our model has over 125 million trainable parameters. Luckily, since the model is already trained on a data set of over 2 billion tokens, our fine-tuning data set can be magnitudes smaller, given we freeze enough parameters in our model (see Step 4 for an intro to fine-tuning). However, assuming you follow the other data set quality rules, a larger data set is not going to harm the performance of the final model.
  • Breadth. Only having songs from one specific artist would be interesting if we were building a model for a single artist. But this project is called SchlagerAI, not HeleneFischerAI. Therefore, we need examples from many different artists in order to imitate the industry as a whole.
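To get a feeling for the breadth criterion, one could look at how the scraped songs are distributed over artists. The following is only a rough sketch: it assumes the `songs` list built in step D of Step 2, and the function name is made up for illustration.

```python
from collections import Counter


def artist_share(songs):
    """Return (artist, share of songs) pairs, largest share first."""
    # Count each song once, under its first listed artist
    counts = Counter(song["artists"][0] for song in songs if song["artists"])
    total = sum(counts.values())
    return [(artist, n / total) for artist, n in counts.most_common()]


example_songs = [
    {"artists": ["Artist A", "Artist B"], "name": "Song 1"},
    {"artists": ["Artist A"], "name": "Song 2"},
    {"artists": ["Artist C"], "name": "Song 3"},
]
for artist, share in artist_share(example_songs):
    print(f"{artist}: {share:.0%}")
```

If a single artist accounts for a large share of the list, adding further playlists is an easy way to broaden the data set.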

Step 3: Clean our data set

Now that we have the lyrics to all the top Schlager songs, we need to do some cleaning of our data set to ensure high-quality results.

A. JSON to text

Our model cannot take raw JSON files as input, so we first need to extract the lyrics from each file.

<pre><code>
import json
from pathlib import Path

raw_dir = Path("data/raw")
for filename in raw_dir.glob("*.json"):
    # Each file holds one song as saved by LyricsGenius
    with open(filename, encoding="utf-8") as f:
        song = json.load(f)
    lyrics = song["lyrics"]

    print(lyrics)
</code></pre>

Finally, save all lyrics to a single txt file.
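One possible way to do this is sketched below. It assumes that every JSON file has a "lyrics" field, as in the snippet above. The function name, the output path, and the blank-line separator are illustrative choices, not necessarily what our own code uses.

```python
import json
from pathlib import Path


def merge_lyrics(raw_dir, out_file, separator="\n\n"):
    """Write the lyrics of all JSON files in raw_dir into one text file."""
    raw_dir = Path(raw_dir)
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)

    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        # Sort the files so that the result is the same on every run
        for filename in sorted(raw_dir.glob("*.json")):
            with open(filename, encoding="utf-8") as f:
                song = json.load(f)
            lyrics = (song.get("lyrics") or "").strip()
            if not lyrics:
                continue  # skip songs without lyrics
            out.write(lyrics + separator)
            count += 1
    return count
```

Calling `merge_lyrics("data/raw", "data/lyrics.txt")` returns the number of songs written, which is a quick sanity check on the size of the data set.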

A song will look like the following:

<pre><code>
[Songtext zu „Ischgl-Fieber (Husti Husti Heh!)“]

[Intro]
Holla-ladi, jadi-jodi-jäh
Holla-ladi, jodi-jäh
Holla-ladi, jadi-jodi-jäh
Holla-ladi, jodi-jäh

[Strophe 1]
Komm, Baby, gib mia an Zung-Zung-Zungenkuss
Hier in Tirol gibt’s koa Sünd’, do ist niemals Schluss
Hey honey, Dirty Dancing all night long
I bin deine Kontaktperson
Paznaun-Girl, i leck di ob
I hob di lieb bis nei ins Grob
Im Lift zur Greitspitz ist’s passiert
Du host mi sexy infiziert

[Pre-Refrain]
Hey DJ, leg des Liadl auf
A Jeda hot es eh, husti-husti-heh!
41 Grad, Schatzi, mir wird worm
Und jetzt steckt ma olle an

</code></pre>

B. Clean tags

As can be seen in the above song, Genius will sometimes add headers at the beginning of each section of the song, indicating if it is a verse, refrain, bridge, etc. However, Genius is not consistent in the tagging of different sections. For example, some songs use [Strophe] and others use [Verse]. As a result, it is more difficult for the model to learn the connection between different sections. Therefore, we look to standardize the different tags using regex substitutions.

<pre><code>
import re

# Clean tags: map the different spellings of "verse" and "bridge" to one tag each
lyrics = re.sub(
    r"^(\[|\(|).*(Strophe|Stophe|Strofa|Strohe|Schtrofä|Verse|Vers).*(\]|\)|:)",
    "[Verse]",
    lyrics,
    flags=re.MULTILINE,
)
lyrics = re.sub(
    r"^(\[|\(|).*(Bridge|Brdge|Brugg).*(\]|\)|:)", "[Bridge]", lyrics, flags=re.MULTILINE
)
</code></pre>

We use many more regular expressions than are shown here. See our source code for the full set.

C. Remove lyric headings

Genius will sometimes add a heading (such as the [Songtext zu …] line in the example song above) to the start of a song. We don't want that in our training data, so we also use regex to remove such instances.
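As a minimal sketch, and assuming the heading always sits on its own line and starts with "[Songtext zu" as in the example song, the removal could look like this. The pattern and the function name are illustrative and not taken from our source code.

```python
import re

# Assumed heading format, based on the example song: [Songtext zu „<title>“]
HEADING_PATTERN = re.compile(r"^\[Songtext zu .*\]\s*", flags=re.MULTILINE)


def remove_heading(lyrics):
    """Drop "[Songtext zu ...]" heading lines and the blank lines after them."""
    return HEADING_PATTERN.sub("", lyrics)


sample = "[Songtext zu „Ischgl-Fieber (Husti Husti Heh!)“]\n\n[Intro]\nHolla-ladi, jodi-jäh\n"
print(remove_heading(sample))
```

Section tags such as [Intro] are left untouched, because the pattern only matches lines that begin with the heading text.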

We now have a nicely clean data set which can be used in the next step for fine-tuning the model.

… but why do we need to clean the data set?

Raw data that you collect from external sources is not always going to be immediately usable by a model. Normally you must do some preprocessing to get it into a standard format which is expected by the model. For example, most models only work on real-valued inputs. Furthermore, you can make it easier for the model to learn on your data set if you remove some variance/noise in the data set before passing it through the system.

For example, by standardizing the song tags in the above cleaning steps, we were able to use our domain expertise to make it easier for the model to learn that [Strophe] and [Verse] or [Refrain] and [Chorus] are referring to similar concepts. Instead of first having to learn the connection between [Strophe] and [Verse] and then connecting the different verses together, the model could focus on learning the structure of different verses.

Step 4: Fine-tune the model using our data set

Now that we have a high-quality data set of Schlager lyrics, we can finally start to fine-tune our model. For this process, we turn to the library aitextgen which is built upon HuggingFace Transformers, PyTorch, and PyTorch-Lightning.

<pre><code>
from aitextgen import aitextgen
from pytorch_lightning.loggers import TensorBoardLogger

hf_model = "dbmdz/german-gpt2"

# Load model with aitextgen
ai = aitextgen(model=hf_model, verbose=True)
ai.to_gpu()

out_dir = "results/"
train_data = "<insert_path_to_lyrics_txt>"  # the cleaned txt file from Step 3

ai.train(
    train_data,
    n_gpu=1,
    seed=27,
    num_steps=5000,
    generate_every=100,  # print sample output every 100 steps
    output_dir=out_dir,
    loggers=[TensorBoardLogger(out_dir)],
    freeze_layers=True,
    num_layers_freeze=10,  # 12 layers in GPT-2 model
    line_by_line=False,
    header=False,
)
</code></pre>

Depending on your fine-tuning data set, you may need to play around with hyperparameters such as the number of steps, the number of layers to freeze, etc. Use the TensorBoard logger to track the different experiments and keep the model which performs best.

… but what is fine-tuning?

Training a model from scratch using only Schlager Lyrics is challenging, especially since German is a complicated language with many different syntax and grammar rules. You need to expose your model to billions of examples of the German language to learn all these rules. However, constructing a data set only consisting of German Schlager will not get you enough examples to work with, and the variety may not be enough for the model to pick up the more subtle nuances of the language. Luckily, there is an approach called transfer learning which can help with this problem.

The idea of transfer learning is that we first train a model on a more general, all-purpose data set from a wide variety of sources. In our case, this helps the language model form a strong base understanding of the language. We can then fine-tune this pre-trained model by further training it on a more task-specific data set. In doing so, we can shape the output to be focused on our domain without losing the underlying understanding of the language.

Normally, this process is accomplished by freezing most of the lower layers of the network and only allowing the weights in the top couple of layers to be adjusted during fine-tuning. As a result, we have a much more powerful model than if we were to just train from scratch on our own data set; not to mention we save a lot of time and compute resources.
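The freezing idea can be sketched as follows. The sketch only decides, based on a parameter's name, whether it belongs to one of the lower transformer blocks. It assumes the HuggingFace GPT-2 naming scheme, where block i is prefixed with "transformer.h.i.". The helper name is hypothetical, and this is not necessarily how aitextgen implements freeze_layers internally.

```python
import re

# HuggingFace GPT-2 parameters of block i are named "transformer.h.<i>. ..."
LAYER_PATTERN = re.compile(r"^transformer\.h\.(\d+)\.")


def should_freeze(param_name, num_layers_freeze):
    """True if the parameter belongs to one of the first N transformer blocks."""
    match = LAYER_PATTERN.match(param_name)
    return match is not None and int(match.group(1)) < num_layers_freeze


names = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.9.mlp.c_fc.weight",
    "transformer.h.11.mlp.c_fc.weight",
    "transformer.ln_f.weight",
]
for name in names:
    print(name, "frozen" if should_freeze(name, 10) else "trainable")

# With a PyTorch model, this could be applied as follows (not run here):
#   for name, param in model.named_parameters():
#       param.requires_grad = not should_freeze(name, num_layers_freeze=10)
```

In this sketch, the embeddings and the final layer norm stay trainable. Whether to freeze them as well is a design choice worth experimenting with.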

Step 5: Generate Text

With our fine-tuned model, we can again use the aitextgen library for quick and easy text generation.

<pre><code>
from aitextgen import aitextgen

out_dir = "results/"

ai = aitextgen(model_folder=out_dir, verbose=True)
# ai = aitextgen(model="dbmdz/german-gpt2", verbose=True)  # HuggingFace Model
ai.to_gpu()

prompt = input("Prompt:\n\n[Intro]\n")

output = ai.generate(
    prompt="[Intro]\n" + prompt,
    seed=27,
    # Model params
    n=10,
    min_length=None,
    max_length=256,
    temperature=0.8,
    do_sample=True,
    use_cache=True,
    # Custom model params
    early_stopping=False,  # whether to stop beam search when at least num_beams sentences are finished
    num_beams=1,  # number of beams for beam search, 1 = no beam search
    top_k=50,  # number of highest-probability tokens to keep for top-k filtering
    top_p=0.95,  # if < 1, keep only the most probable tokens whose probabilities add up to top_p
    repetition_penalty=1.2,  # penalty for repetition, 1.0 = no penalty
    length_penalty=1.0,  # < 1.0 shorter, > 1.0 longer
    no_repeat_ngram_size=0,  # if > 0, all n-grams of that size can only occur once
    num_beam_groups=1,  # number of groups to divide num_beams into to ensure diversity
    diversity_penalty=0.0,  # value subtracted from a beam's score if it generates the same token as a beam from another group
    remove_invalid_values=True,
    # Output
    return_as_list=True,
    lstrip=False,
    skip_special_tokens=False,
)

for i, text in enumerate(output):
    print("\n==============")
    print(f"Text: {i}\n\n")
    print(text)

print("\n==============")
print("==============\n")
</code></pre>

Using the above code snippet, we can generate results such as the following, which looks a lot more like an actual song:

<pre><code>
Prompt:

[Intro]
Wir stehen an der Bar

Text:

[Intro]
Wir stehen an der Bar, in der Nacht tanzen bis alles rein brennt
Doch dann feiern wir noch lange auf die Party drauf
Und vergessen all den alten Scheiß
Er ist gut und drüber weg
Nur du weißt wie schön das Leben ist
Ich seh’ nur dich, nur dich

[Refrain]
Wie immer freut er sich darauf!
Wenn ich am Boden liege
Bis gestern war’n wir Freunde
Haben soviel durchgemacht und viel verlor’n
Alles kann man nur so sparen
Das Schlimmste sind wir hier
Alle Sachen hat sie gesehn
Die Achtung vor jedem einzelnenBeat
Verzweifelt ihr niemals sich im Kreise drehen
Wenn sie ihre alten Lieder hört
Dann glaub’n mit uns zusammen

[Bridge]
Man sagt dir helle Augen
Du siehst aus wie ein Sternenzelt
Dein wunderschönes Mund der mir sagen: „Was soll’s?“ Wir sehen beides
Ganz einfach und charmant aus

[Refrain]
Wie immer gefreut er sich auf jeden Fall
Wenn ich am Boden lag
Bis heute war’n wir Freunde
Haben viel riskiert und viel verlieren
Alles kann man nur so sparen
Das Schlimmste sind wir hier
Alle Dinge hat siegesehn
</code></pre>

Note: due to a bug that we found in the model, our base model was not properly trained using the EOS token, meaning the model never knows when to stop generating text. This issue has been addressed and was recently fixed by the maintainers. However, because of this issue, our final product at the end of the 4-day Discovery session could only produce either never-ending songs or songs which abruptly stop mid-sentence after hitting a token limit.

… but how do you generate text using a language model?

As mentioned previously, the output of language models is a probability distribution over the different tokens in the vocabulary. To generate text, we follow a basic loop.

In the loop, the prompt is first passed through a tokenizer and the output is then passed to the language model, resulting in a probability distribution. We then decode the distribution (i.e., select the next token) and then append the new token to the sequence. The process repeats until we hit some end criteria.
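The loop can be sketched in a few lines. Everything in this example is a stand-in: the "model" is a hand-written lookup table instead of a neural network, the tokenizer just splits on whitespace, and "<eos>" is an assumed end-of-sequence marker.

```python
import random


def toy_language_model(tokens):
    """Stand-in for a real model: returns P(next token | tokens) as a dict."""
    table = {
        "wir": {"stehen": 0.7, "feiern": 0.3},
        "stehen": {"an": 1.0},
        "an": {"der": 1.0},
        "der": {"bar": 0.8, "nacht": 0.2},
        "feiern": {"an": 0.5, "<eos>": 0.5},
    }
    last = tokens[-1] if tokens else ""
    return table.get(last, {"<eos>": 1.0})


def generate(prompt, max_tokens=10, seed=27):
    rng = random.Random(seed)
    tokens = prompt.lower().split()  # 1. tokenize the prompt
    for _ in range(max_tokens):  # end criterion 1: token limit
        probs = toy_language_model(tokens)  # 2. get the distribution
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights)[0]  # 3. decode
        if next_token == "<eos>":  # end criterion 2: end-of-sequence token
            break
        tokens.append(next_token)  # 4. append and repeat
    return " ".join(tokens)


print(generate("Wir stehen an"))
```

Step 3 is where the decoding strategies described next come in. Here it is plain random sampling.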

There are many ways to decode the probability distribution, ranging from the simple to the quite complex. A few of the most common approaches are as follows:

  • Greedy. The next token is chosen by taking the argmax of the probability distribution (select the token with the highest probability). This is the simplest of approaches and results in a deterministic output.
  • Beam Search. Keep a set of “most probable” sequences and select the one at the end with the highest probability. This approach can be improved by enforcing diversity among the different sequences. Like the greedy approach, the results here are also deterministic.
  • Sampling. Randomly sample the probability distribution. You can also shape the probability distribution using a temperature parameter to make the distribution flatter (high temperature) or more peaky (low temperature).
  • Top-K. Randomly sample from the K most probable tokens in the distribution. This makes the output more predictable. However, it is difficult to select a value of K that works for every distribution.
  • Top-P. Randomly sample from the smallest set of most probable tokens whose probabilities add up to at least P. This helps deal with the problems Top-K has with super peaky or super flat distributions.
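The sampling-based strategies above are easy to try out on a small, made-up distribution. The following is a sketch in plain Python with hypothetical helper names. Real libraries apply the same ideas to the model's logits over the whole vocabulary.

```python
import math
import random


def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)


def apply_temperature(probs, temperature):
    """Rescale p to p^(1/T) and renormalize: T > 1 flattens, T < 1 sharpens."""
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}


def top_k(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {t: p / total for t, p in kept}


def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {t: prob / total for t, prob in kept}


def sample(probs, rng=random):
    """Randomly draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]


probs = {"bar": 0.5, "nacht": 0.3, "party": 0.15, "uni": 0.05}
print(greedy(probs))
print(sample(top_p(top_k(apply_temperature(probs, 0.8), k=3), p=0.95)))
```

The last line chains temperature, Top-K, and Top-P before sampling, which mirrors the combination of temperature, top_k, and top_p used in the generation snippet in Step 5.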

Furthermore, you can create a custom decoding method by adjusting the probability distribution to encode domain-specific knowledge. Some applications of this in SchlagerAI can be seen in the next section.

The future of SchlagerAI

After our 4 days of working on SchlagerAI, the results were quite promising. However, there is still a long way to go to make the perfect songwriting AI. Possible improvements include the following:

  • Better integrate song structure into the generation method. Thanks to the tags provided by Genius, our model appeared to start to understand the ideas of a [Verse] or a [Refrain]. However, there is still a lot more expert knowledge we could provide it with. For instance, we know that a [Pre-Refrain] should always come before a [Refrain], or that a [Refrain] usually shows up multiple times with similar lyrics.
  • Rhyming and meter. In order to fit musically, the different lines in a song normally follow some sort of meter/syllable structure. There are also rhyming schemes to make the song flow together better. We can encode this knowledge into our generation method by adjusting the probabilities of tokens if they fit into the predefined structures of the song. An example of this approach can be found in https://github.com/summerstay/true_poetry.
  • Setting themes. As mentioned in the intro, Schlager music normally follows certain themes/topics. We can help shape the generation output in those directions by integrating keyword generation, an example of which can be found in https://github.com/minimaxir/gpt-2-keyword-generation.
  • Integrating current events. Schlager songs sometimes reference mainstream current events in the lyrics. Our model is trained on a static, not-so-current data set, meaning it has no knowledge of recent events. Facebook AI recently announced BlenderBot 2.0, which can integrate information from current events/internet searches into the text generation process. It may be possible to take some ideas from that and similar research to allow the generated lyrics to be more topical.
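To illustrate what adjusting the probabilities could mean for the rhyming idea, here is a deliberately crude sketch. It treats two words as rhyming if they share their last two letters, which is only a rough heuristic for German. The function name, the boost factor, and the example distribution are made up for illustration.

```python
def boost_rhymes(probs, rhyme_with, boost=5.0, suffix_len=2):
    """Raise the probability of tokens that end like `rhyme_with`, then renormalize."""
    suffix = rhyme_with.lower()[-suffix_len:]
    weighted = {
        token: p * (boost if token.lower().endswith(suffix) else 1.0)
        for token, p in probs.items()
    }
    total = sum(weighted.values())
    return {token: w / total for token, w in weighted.items()}


# Candidate last words for a line that should rhyme with "Herz"
probs = {"Nacht": 0.5, "Schmerz": 0.3, "Herz": 0.2}
print(boost_rhymes(probs, "Herz"))
```

The adjusted distribution can then be passed to any of the decoding methods from Step 5. A real implementation would need a proper phonetic rhyme check and would have to work on subword tokens instead of whole words.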

You can find the code in our GitHub repository if you want deeper insights into how we built SchlagerAI.

If you find this in-depth article about SchlagerAI interesting, then go and check out the recorded clip of the live SchlagerAI presentation at our most recent Discovery Conference! Furthermore, if you are interested in keeping up to date with similar projects, look out for our next Discovery Conference!

Motius.de