AI Lyric Generation: How I learned to rap like Mao

Michael Barron · Published in Indulgentsia · 6 min read · Jan 30, 2020


Lyric Generation, Social Commentary, and Text-Based Style Transfer

Whereas before we asked what Maoist Jesus would look like from an AI’s perspective, we now wished to examine the equally pressing question: what would Maoist Jesus sing like?

Previously, we explored the parallels between the Chinese social credit system and medieval indulgences. In particular, we sought to express that relationship through image generation. In this post, we take a familiar but distinct approach using lyric generation.

The Goal

At a high level, the intent of this piece was to examine and blend themes from different bodies of work. What kind of text can one generate to symbolize ideas such as the social credit system, social influence, and manipulation? Similarly, how could one thematically represent the advent of AI and its influence on our society?

AI taking our jobs

Technologies Explored

Once again we returned to neural networks, this time for text generation. Much like the image generation covered in our post “GANs and Plans”, neural networks have been used for text generation for quite some time, and have become more compelling with examples such as GPT-2.

In the context of text generation, neural networks follow much of the same fundamental design, with one key distinction: capturing and representing time. Whereas for image generation one might sample some images to create a new one, for text there must be some form of memory.

Abstraction of an LSTM unit

While there are various architectures and structures (LSTMs, RNNs, Seq2Seq, autoencoders, attentional networks — see this review paper if you want a technical overview), the essence of the idea is: “to know what I should say now, I need to know what was said before”. In this way, these networks take thousands or millions of documents, lyrics, or any input with consistent syntactic and semantic structure, and learn to emulate and create similar patterns.
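For the curious, this is roughly what such a word-level LSTM looks like. The sketch below uses Keras; the corpus file name, sequence length, and layer sizes are illustrative assumptions rather than our exact setup.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

SEQ_LEN = 10  # how many previous words the model gets to "remember"

# Stand-in for the blended lyrics/transcripts corpus (file name is hypothetical).
text = open("blended_corpus.txt").read().lower().split()
vocab = sorted(set(text))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = np.array([word_to_id[w] for w in text])

# Training pairs: the previous SEQ_LEN words predict the next word.
X = np.stack([ids[i:i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])
y = ids[SEQ_LEN:]

model = Sequential([
    Embedding(len(vocab), 128),               # word ids -> dense vectors
    LSTM(256),                                # the "memory" over preceding words
    Dense(len(vocab), activation="softmax"),  # distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=20)
```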

Why we chose it

Our interest in using neural networks for text generation stems, ironically, from their imperfections. Without extensive tuning and careful architecture design, these networks typically capture only short-term dependencies.

Rather than create a coherent piece of text, our idea was to capture a feeling along with a distinct style. To that end, we used an LSTM network to create new text that sounds syntactically correct, with allusions to particular themes — the result has an almost aphasic quality. In other words, the syntax and style are largely captured, but the meaning is not.

For example, we generated the following blurb:

the surplus — value of the industry of the old fashioned appears to be a rise in the value of the of the world.

An explanation of what Aphasia does

Data

As the famous rapper Data D once said, “Mo data, mo problems.” In our case, however, the issue was what data we should use to express and mix different themes.

For instance, if one were to select a large corpus of R&B, the generated text would have an R&B flavor, yet we also wanted to incorporate references to AI-centralized control or, say, China’s social credit system.

Snippets of different text corpora we sought to mix

To create this effect, we examined various sources of text such as:

  1. Video Transcripts of B.F. Skinner: A psychologist focused on methods of operant conditioning to influence behavior
  2. Communist Manifesto: Objectives of the communist party
  3. Chinese Social Credit System Benefits: The goals and objectives of the Chinese Credit System
  4. Kanye West: A famous musical producer and songwriter
  5. R&B Lyrics: A data set of popular R&B lyrics from various artists
  6. Trump Speeches: Speeches by President Trump
  7. Bossy Boss: A melodramatic Chinese drama

Each of the listed sources provides a distinct flavor, both in what it discusses thematically and in how it sounds and feels.

Mixing the Texts

A neural network can emulate how two texts sound, but if the datasets are too distinct it won’t necessarily learn how to combine them. It only learns how to transition from the style and content of one text to the style and content of the other.

Here you can see how the beginning source lyrics transition abruptly into more rigid political thought

Instead, we sought to create a blend. To create the effect of mixing the style of one author/artist with the content of another, we extracted relevant noun objects and verbs from one text and swapped them into the other text based on their similarity (sketched below).

Word2Vec example
*NOTE* There are ways to achieve this effect that lead to a more coherent mapping, but this was not our goal.
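For illustration, here is a rough sketch of that swap. It uses spaCy for part-of-speech tags and pre-trained GloVe vectors via gensim; those particular tools, and the similarity threshold, are assumptions for the example rather than a record of our exact pipeline.

```python
# Rough sketch of the swap: pull nouns/verbs out of a "content" text and
# substitute them into a "style" text wherever the word vectors are close.
# spaCy and GloVe-via-gensim are assumptions here, not a record of our pipeline.
import gensim.downloader as api
import spacy

nlp = spacy.load("en_core_web_sm")
vectors = api.load("glove-wiki-gigaword-100")  # pre-trained word vectors

def content_words(text):
    """Collect the nouns and verbs that carry a text's themes."""
    return {t.text.lower() for t in nlp(text) if t.pos_ in ("NOUN", "VERB")}

def swap_by_similarity(style_text, content_text, threshold=0.4):
    """Replace nouns/verbs in the style text with their closest themed counterpart."""
    pool = [w for w in content_words(content_text) if w in vectors]
    out = []
    for tok in nlp(style_text):
        word = tok.text.lower()
        if tok.pos_ in ("NOUN", "VERB") and word in vectors and pool:
            best = max(pool, key=lambda c: vectors.similarity(word, c))
            if vectors.similarity(word, best) > threshold:
                out.append(best)
                continue
        out.append(tok.text)
    return " ".join(out)

print(swap_by_similarity("We are beautiful like diamonds in the sky",
                         "Operant conditioning reinforces behavior through stimuli"))
```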

With this blended dataset/corpus, we then trained a word-level LSTM to produce new variations.
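To give a sense of where those new variations come from, here is a sketch of temperature-based sampling from a model like the one above. It reuses the names from the earlier training sketch, and the seed words and temperature are arbitrary.

```python
import numpy as np

# Reuses model, vocab, word_to_id, and SEQ_LEN from the training sketch above.
def generate(model, seed_words, n_words=30, temperature=0.8):
    """Sample new text word by word from the trained LSTM."""
    words = list(seed_words)  # seed words must appear in the corpus vocabulary
    for _ in range(n_words):
        window = [word_to_id[w] for w in words[-SEQ_LEN:]]
        probs = model.predict(np.array([window]), verbose=0)[0].astype("float64")
        # Temperature reshapes the distribution: lower stays safe, higher gets stranger.
        probs = np.exp(np.log(probs + 1e-8) / temperature)
        probs /= probs.sum()
        words.append(vocab[np.random.choice(len(vocab), p=probs)])
    return " ".join(words)

print(generate(model, ["love", "is"]))
```

Lower temperatures keep the output close to the training text; higher ones drift further toward the aphasic quality described above.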

Results:

After exhaustively examining the output of various mixtures trained via different LSTM networks, we began the process of selecting generated text that could be compelling.

A small example of generated text trained from R&B lyrics mixed with B.F Skinner transcripts

We ended up selecting a fairly interesting blend consisting of R&B and B.F. Skinner transcripts:

Lyrics

Intro:
We are beautiful like diamonds in the sky, we might fall.
So don’t you say no tonight
You know you belong to me
Your individual physiology
shows your beautiful factors
I have you sequenced
You’re keeping me alive
I’m Keeping you alive
The stimuli is your eye
Your stimuli is my dissonance

Verse 1:
You changed my sympathy
I can’t face the questions
I’m just a behavior
I don’t think about her cause I believe in love forever
So darling I love you and I don’t need no conditioning

Chorus:
I want to wake up from me
I think I’m a process
This is the purpose
I will always be your kind contrast
Love is on the selection

Verse 2:
Stay with me
With your practical fire
Contact the culture that we used to be

Bridge:
Shine your autonomy on me
I hear the initiation coming
You made me feel culture

Chorus:
I want to wake up from me
I think I’m a process
This is the purpose
I will always be your kind contrast
Love is the only selection

Finally, by merging these lyrics with a speech RNN, we generated audio! So wow. But, also, not true. We actually had a singer put their take on it. And without further ado... *This had to be taken down until after another gallery release, so look forward to it again in late April.*

Lyrics performed and the final video: https://vimeo.com/375265410

Final Thoughts

There were a lot more ideas and technologies we wished to explore, but, due to the great equalizer of time, we had to stop. Still, the full piece is currently being shown at the UCCA Centre of Contemporary Art.
