How AI created a Eurovision song and music video (with the help of some humans)

Annelies Termeer
VPRO Broadcast
May 12, 2020 · 16 min read

A look into the process of making Abbus

Watch the end result here

At VPRO Medialab, we are all about exploring the creative and narrative potential of new technologies. And on the list of technologies that are here to stay (or, some would say, to take over the world), AI pretty much sits at the top. So we thought it was about time to explore the creative potential of AI. Could AI inspire people? Could it collaborate with artists? Or would AI create stuff so cool that these same artists could be out of a job soon?

To find out, we decided to take part in the AI Song Contest: an international contest in which teams from all over Europe and Australia compete, attempting to create the next Eurovision hit with the help of artificial intelligence.

We would try to make a Eurovision song with AI. And then, of course, an AI music video to match it. In this article, we will share what we’ve done and what we learned during the process. To quote from Abbus: ‘Are you ready to go, let’s go!’

Willie Wartaal, Arran and Janne at work in the studio

1. THE AI SONG

We assembled our song team in the fall of 2019. Our team, named Can AI Kick It, consisted of computational musicologists from the University of Amsterdam and Utrecht University, plus Dutch rapper Willie Wartaal, well known in The Netherlands as part of De Jeugd van Tegenwoordig (full credits are at the bottom of this article). To start, the AI Song Contest organisation gave all participating teams a data set of 250 Eurovision songs.

A few things we decided on early on in the process:

  • We really wanted to see if AI could create a cool, popular-sounding song (so not an obscure ‘AI sound’)
  • We needed to add extra data sets for the music and lyrics generation. The extra data sets we chose were a mix of popular pop/rock songs (the Lakh MIDI dataset) and typical Dutch folk songs (Meertens Liederenbank), because when you say Eurovision, you say folk songs.
  • The music would lead in the creation of the song; the lyrics would follow
  • We would leave the matching of the text to the music up to our human artist Willie Wartaal

Team Can AI Kick It marvelling at generated music samples

The music

Team member and UvA AI student Arran Lyon had already developed Musaic, a custom deep neural music-generating algorithm, for several performances. For our Eurovision song, this model was retrained on some 5000 songs from the Lakh pop MIDI dataset, as well as the 250 Eurovision song MIDI files. Extra weight was given to the Eurovision data during training.
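
As an illustration of that weighting: one common way to give one source extra weight is simply to oversample it during training. Below is a minimal PyTorch sketch of the idea; the dummy feature tensors and the 5x factor are placeholders, not Musaic’s actual setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy stand-in for the combined MIDI dataset: 250 Eurovision items + 5000 Lakh items.
n_euro, n_lakh = 250, 5000
features = torch.randn(n_euro + n_lakh, 64)          # placeholder note features
is_euro = torch.tensor([1] * n_euro + [0] * n_lakh)  # source tag per item
dataset = TensorDataset(features, is_euro)

# Give each Eurovision item (here) 5x the sampling probability of a Lakh item,
# so roughly 20% of every batch is Eurovision despite being only 5% of the data.
weights = [5.0 if flag else 1.0 for flag in is_euro.tolist()]
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

melodies, tags = next(iter(loader))
print(f"Eurovision share of this batch: {tags.float().mean().item():.0%}")
```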

After training on this data set, we generated 450 main melodies and basslines, to be ranked for catchiness/Eurovision-ness by the Hit Predictor algorithm (see below). The highest-ranked musical segments were the ones we made available for Willie Wartaal to use in the final song.

Screenshot from Musaic

The music was produced according to the rules we had set ourselves for creating the final song: no notes were muted, all notes from the algorithm were kept intact, and no cuts were made (we received 4-bar samples). The aim of these self-imposed rules was ‘not to lose the AI’ in the process of creating the final song.

The lyrics

For the song’s lyrics, team members Yannick Gregoire and Janne Spijkervet worked together to fine-tune OpenAI’s GPT-2 “345M” model on this metrolyrics dataset (250,473 unique songs from 18,231 unique artists) for 40,000 steps. After that, the model was briefly fine-tuned again on the lyrics in the Eurovision dataset, which Janne created last year. The code for fine-tuning the model can be found here.
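
As a rough illustration of this two-stage recipe (our actual code is linked above), here is a minimal sketch using the gpt-2-simple library; the corpus file names and the stage-2 step count are placeholders.

```python
# Minimal two-stage fine-tuning sketch with gpt-2-simple; the linked code
# may differ. File names and stage-2 step count are placeholders.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="345M")
sess = gpt2.start_tf_sess()

# Stage 1: long fine-tune on the large general lyrics corpus (40,000 steps)
gpt2.finetune(sess, "metrolyrics.txt", model_name="345M",
              steps=40000, run_name="lyrics")

# Stage 2: brief extra fine-tune on the Eurovision lyrics, continuing from
# the stage-1 checkpoint
gpt2.finetune(sess, "eurovision_lyrics.txt", model_name="345M",
              steps=500, run_name="lyrics",
              restore_from="latest", overwrite=True)

# Generate from a prompt, as in the web interface
gpt2.generate(sess, run_name="lyrics",
              prefix="What would love be", temperature=0.8)
```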

Janne also created a web interface for Willie Wartaal to intuitively compose the lyrics with the AI.

Given a title, or a single line as a prompt, the model generates the rest of the song according to the parameters. The “Lijpe slider” is actually the sampling temperature: it scales the logits before the softmax and the sampling step, pushing the model towards more surprising predictions at higher values, while values approaching 0 make the model more deterministic.
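
For the curious, that is really all the slider does under the hood. A minimal numpy sketch of one temperature-scaled sampling step:

```python
# What the "Lijpe slider" does: divide the logits by the temperature before
# the softmax, then sample. Minimal numpy sketch of one sampling step.
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(42)
logits = [2.0, 1.0, 0.1]  # the model's raw scores for three candidate tokens
print([sample_with_temperature(logits, 0.1, rng) for _ in range(8)])  # near-deterministic
print([sample_with_temperature(logits, 2.0, rng) for _ in range(8)])  # much more varied
```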

An example input in the Lyrics generator. The prompt was the title ‘Love Is In The Air’, and the first sentence, ‘What would love be’

And the theme of the song? Willie Wartaal had always liked the classic “robot wants to be human” scenario, so he picked out some generated sentences that fit that subject. Janne and Willie then put these sentences into Janne’s Lyrics Generator (see above), and then magic happened. It turns out that AI wants to kill the government and make sure everyone on Earth has food…

These lyrics caused The Financial Times to write this about Abbus:

“Imagine assembling a crack team of musicologists to compose the perfect Eurovision hit, only to end up with a song that crescendos as a robotic voice urges listeners to “kill the government, kill the system”. That was the experience of a team of Dutch academics who […] inadvertently created a new musical genre: Eurovision Technofear.”

Read the full lyrics at the bottom of this article.

Our team’s progress was followed in this YouTube series for NPO 3FM. See how AI came up with the word Abbus in episode 4!

Predicting hit potential

To judge the endless stream of outputs from the music-generating algorithm, we needed an AI tool that could rank that output. Team members John Ashley Burgoyne and Berit Janssen therefore built a hit predictor based on Eurovision voting data, using the files from the Eurovision dataset. The extracted lead and bass lines were fed into the FANTASTIC feature toolbox, and voting data was pulled from all of the finals for the years represented in the Eurovision dataset. Using each country’s individual rankings, they computed a preference score for every song based on the Plackett–Luce model. Finally, we trained a LASSO regression model to predict preference scores from the FANTASTIC features.
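
A minimal sketch of that final regression step, assuming the FANTASTIC features and Plackett–Luce scores have already been computed; all arrays below are random placeholders, and scikit-learn’s LassoCV stands in for whatever LASSO implementation was actually used.

```python
# Sketch of the hit predictor's last stage: LASSO regression from melodic
# features to preference scores. Placeholder data, not the team's.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_songs = rng.normal(size=(250, 30))   # FANTASTIC features, one row per Eurovision song
y_pref = rng.normal(size=250)          # Plackett-Luce preference score per song

model = LassoCV(cv=5).fit(X_songs, y_pref)  # L1 penalty chosen by cross-validation

# Rank the 450 generated melody/bass segments by predicted preference
X_generated = rng.normal(size=(450, 30))
ranking = np.argsort(model.predict(X_generated))[::-1]
print("most promising segments:", ranking[:10])
```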

All together now

It was up to human artist Willie Wartaal and human producer Janne Spijkervet to assemble all the AI building blocks into a song: the generated melodies, the generated bass lines, a tool to generate lyrics, a kit of AI-generated drum kicks and a synthesised voice of Willie. The generated melodies and bass lines had been judged by the custom-built AI hit predictor tool, so we knew which of the hundreds of them were the most promising.

The melody of the vocals was improvised by Willie, inspired by the material from the AI. Janne composed the beat to fit the AI-generated material and the genre of the production. She also used drum samples that were generated by an AI model called WaveGAN.

At the end, the ‘computer Willie’ voice (as we like to call it) was added. These vocals (you can hear them at 2’33 in the song) are entirely generated by AI. The voice was created by team member Bence Halpern using the Mellotron text-to-speech synthesiser in combination with style-content-based voice conversion.

And… Abbus?

Abbus was a word invented by the AI that stuck in our heads straight away. After the song was done, we thought it would be fun to ask our lyrics generator “What is Abbus?” to see what it thinks the word means (since it invented it!). Apparently it means ‘a nascent cloud’. Google defines nascent as “just coming into existence and beginning to display signs of future potential”, or “freshly generated in a reactive form”. Rather appropriate.

2. THE AI MUSIC VIDEO

Needless to say: an AI song needs an AI music video. But what would that look like? To generate concepts, VPRO Medialab organized the 2-day hackathon But AI am an artist! in February 2020 at the amazing, just-opened Forum Groningen. Five teams of designers, filmmakers, journalists and coders spent two full days working with AI tools like Runway, Processing and PoseNet to come up with visuals for our AI Eurovision song (which was far from done at that point, so we provided the teams with sample AI music from our algorithm to work with). At the end of day 2, the teams pitched their concepts and prototypes to a professional jury, and team Untapped Potential was chosen as the winner.

Team Untapped Potential at work during the But AI am an artist! hackathon

Their concept combined an AI performer that learned its moves from dances on TikTok with an AI stage built from crowd-sourced images of how other countries perceive The Netherlands.

In the two months following the hackathon, we produced the actual music video. The team tried to answer these questions: ‘What elements do we need for an awesome AI music video in the style of the Eurovision Song Contest? And how can AI create these elements for us?’

We came up with the following three elements:
1. A super candidate
2. A dream stage
3. A killer dance performance

The story of the video would be about AI learning from humans and from data, evolving more and more throughout the video.

  1. A super candidate: 3D Willie

We knew our perfect candidate was going to be embodied by artist Willie Wartaal. But we did not need him physically — all our computer model needed was a 3D scan of his body so that we could manipulate it freely. So we sent Willie to a scanning studio on a Friday morning and had two 3D scans made — one with a neutral face and one with a smiling face.

Smiling Willie 3D scan

Face: In order to get the most out of our virtual Willie, we sourced as many images of the actual Willie as we could from the web. These were later used to train a StyleGAN model, in the hope that it would give us something workable.

The images that we managed to find amounted to a total of around 500, which, as it turned out, was far from enough. Although training with these files rendered some interesting visual results, the images produced by this model did not look much like Willie (or any other human, for that matter).

We then tried a different approach by training a new model with more consistent images (5000 still frames from an interview with Willie Wartaal). This worked a bit better, as it generated output that did resemble the original Willie, but the output was not very varied or visually interesting.

Willie generated from interview

Hence we ran more training sessions, in which we experimented with mixing various forms of image content and training cycles. In the end we arrived at an output that we deemed consistent and interesting enough to use.

The images produced by this model were then used as the texture of our 3D model’s face, as well as processed in a number of ways in order to make it interact and sing.

Singing face: By tracking our own singing faces with a facial tracking model (Runway: first-order-motion model), we made Willie’s GAN face sing the most important words of the song: “Telling the world that I was living, living, living”.

The output of these models was very low in resolution, so we tried to use other AI models to upscale the material. A model called SRGAN promised to increase the resolution fourfold, which would make our images razor sharp. The test results of the model were promising.

Image from https://github.com/tensorlayer/srgan illustrating what SRGAN can do.

Unfortunately, our images generated by the first-order-motion model did not gain any resolution from the SRGAN. A possible explanation lies in the nature of the first-order-motion output compared to real photographs: after all, SRGAN is trained on realistic photos, not on absurd images of cheap CGI faces with abstract mouth positions. It was a reminder of the importance of understanding the characteristics of the training data. AI models are generally trained for very specific tasks, and any deviation can result in a useless tool.

We then tried some pre-existing StyleGANs in the hope that this would give us something more workable, but the results seemed either far too slick and cartoon-like for our taste, or too chunky, or just plain weird.

In the end we managed to get around this problem through some good old manual labour and ‘video voodoo’: upscaling the image to 4K, adding some high-resolution noise, downscaling the image again to our desired resolution, adding blur, and processing the whole thing through Google’s DeepDream algorithm. This worked beautifully, and it also seemed fitting, as DeepDream is something of ‘the original StyleGAN’.
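
Per frame, that voodoo looks roughly like the sketch below, written with Pillow and numpy; the DeepDream pass is omitted and all parameters are illustrative guesses, not our exact settings.

```python
# Per-frame "video voodoo": upscale to 4K, add high-res noise, downscale, blur.
# Illustrative parameters only; the DeepDream pass happens afterwards.
import numpy as np
from PIL import Image, ImageFilter

def video_voodoo(frame: Image.Image, out_size=(1280, 720)) -> Image.Image:
    big = frame.resize((3840, 2160), Image.BICUBIC)    # upscale to 4K
    arr = np.asarray(big).astype(np.int16)
    noise = np.random.normal(0, 8, arr.shape)          # high-resolution noise
    noisy = np.clip(arr + noise, 0, 255).astype(np.uint8)
    small = Image.fromarray(noisy).resize(out_size, Image.LANCZOS)  # back down
    return small.filter(ImageFilter.GaussianBlur(radius=1))         # final blur

# frame = Image.open("gan_face_frame.png").convert("RGB")  # hypothetical file
# video_voodoo(frame).save("frame_voodoo.png")
```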

2. A dream stage: Amster-GAN

In the tradition of the Eurovision Song Contest, we also wanted to represent our country in the music video: the beautiful Netherlands. What better stage for 3D Willie to perform in than the most beautiful locations of our country? To decide which locations to use, we asked contacts in different countries to send us their top 10 Google image results when searching for ‘the Netherlands’. The results were (surprise, surprise) highly saturated images of Amsterdam and tulip fields, and hence this became our chosen direction for the virtual environment.

Colourful Google images for ‘The Netherlands’

In order to create the virtual environment, we needed far more images. But since a Google search for ‘Amsterdam’ renders millions of them, we had a much easier time training GAN models than we had with the face. We quickly gathered around 5000 images to train on and, playing with the settings, arrived at some interesting StyleGAN models.

As we already had our virtual Willie in 3D, it also seemed suitable for his world to have three dimensions. This was achieved by extracting 3D geometry from Google Maps using a debugging tool for the graphics processor, which gave us scale models of actual locations in Amsterdam.

From a storytelling perspective, we had agreed on a story in which the AI is learning all the time. We wanted to do the same with the city environment: visualise that the computer is learning what a world should look like. Since many real-world AI applications (e.g. self-driving cars) use a point-cloud representation of the world to help them navigate, we decided to use that as a visual metaphor.
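
Turning geometry into that point-cloud look is conceptually simple: sample points from the mesh surface. A sketch with Open3D (purely illustrative, with a hypothetical file name; the actual transition scenes were built in Unity):

```python
# Sample the extracted city geometry into a grey point cloud.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("amsterdam_block.obj")  # hypothetical Maps extract
pcd = mesh.sample_points_uniformly(number_of_points=200_000)
pcd.paint_uniform_color([0.6, 0.6, 0.6])                 # the grey "learning" world
o3d.visualization.draw_geometries([pcd])
```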

The transition scenes from point cloud to photorealistic city were made in Unity. Multiple layers were animated and transitioned in the final edit.

The specific locations were chosen based on the geolocation metadata of our original Amsterdam images.

So now we had a world for our digital Willie, but it was kind of grey and not nearly as attractive as the images of Amsterdam we had collected. But after we applied our previously created StyleGAN, Willie’s world turned sunny and full of colour.

Grey vs coloured city scape

3. A killer dance performance: TikTok-style

For our 3D model to learn how to dance, we needed to find some great dance moves. And the place to find those nowadays is TikTok. So we cut the Abbus song into 5 parts and asked people to film themselves dancing to the song, TikTok-style.

We made a choreography from these moves and tracked their bodies with PoseNet, a vision model that can be used to estimate the pose of a person in any image or video by locating key points of body joints in a 2D plane. In this way, our model of Willie could learn to dance from them.
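
In code, extracting a single frame’s keypoints looks roughly like this. PoseNet itself ships as a TensorFlow.js model, so this Python sketch substitutes Google’s related MoveNet model from TF Hub; it illustrates the idea rather than our exact pipeline, and the file name is hypothetical.

```python
# 2D keypoint extraction from one dance-video frame, using MoveNet as a
# Python stand-in for PoseNet.
import tensorflow as tf
import tensorflow_hub as hub

movenet = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")

image = tf.io.decode_jpeg(tf.io.read_file("tiktok_dance_frame.jpg"))
batch = tf.cast(tf.image.resize_with_pad(tf.expand_dims(image, 0), 192, 192),
                tf.int32)

# Output shape [1, 1, 17, 3]: 17 joints as (y, x, confidence) in the 2D plane
keypoints = movenet.signatures["serving_default"](batch)["output_0"]
print(keypoints.shape)
```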

Pose lifting: Although the locations of the joints looked very promising, they say hardly anything about the 3D position of a joint. Applying these key points to the joints of the 3D model of Willie resulted in extremely wonky poses. Not an option…

To fix this, we had to somehow lift the poses from 2D to 3D. Luckily, this is a common problem in pose estimation, so there has been a lot of research on this so-called pose lifting.
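
A typical lifter in that line of research is a small fully connected network mapping the 17 2D keypoints to 17 3D joints. The untrained PyTorch sketch below shows the shape of such a model; it is illustrative only, trained in practice on motion-capture data, and not necessarily the exact method we used.

```python
# Sketch of a 2D-to-3D pose lifter in the style of the well-known
# "simple baseline" approach: an MLP from 17 (x, y) points to 17 (x, y, z).
import torch
import torch.nn as nn

lifter = nn.Sequential(
    nn.Linear(17 * 2, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 17 * 3),
)

pose_2d = torch.randn(1, 17 * 2)           # flattened 2D keypoints from PoseNet
pose_3d = lifter(pose_2d).view(1, 17, 3)   # lifted 3D joints for the stick figure
print(pose_3d.shape)                       # torch.Size([1, 17, 3])
```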

This pose lifting resulted in quite accurate 3D pose estimations when applied to a stick figure.

Unity: Now that we had a dancing stick figure, the next step was to link its movements to the movements of the 3D model. This was easier said than done. The stick figure is generated from the positions of the key points, whereas our 3D model of Willie was generated separately, so the key points had to be made to comply with this body. In the end, by minimally simulating the physical motion of rotating human joints, the following result was achieved. There are probably cleverer ways of depicting the pose estimations, especially by taking into account the accuracy of our stick figure’s movements, but these jittery, incidental head-spinning movements of Willie come with a particular aesthetic, which we happily embraced.

Human skills

After all this hard work, we combined all the AI elements into the final music video. As with creating the AI song, this final step in the process turned out to be a very human task. Creating the storyboard, choosing context and meaning, combining the elements, building up a dynamic, defining the tempo, deleting, filtering, editing… all of this still took a lot of human work and creativity.

TO CLOSE: WHAT WE LEARNED FROM ABBUS

If you ask — can AI write a Eurovision song?
Our answer, based on this project, would be: NO, not yet
(even though, with the recent release of OpenAI’s Jukebox, this answer might change soon; so far, it doesn’t have the Eurovision genre available).

If you ask — can AI create a music video?
Our answer would also be: NO, not yet.

A lot of hard work from creative professionals went into arranging and combining all the AI generated building blocks into an appealing, nice-sounding song and good-looking music video.

However.

If you ask: can AI create inspiring, unexpected material for artists to work with, to spark ideas, to have professionals think outside the box? The answer is YES. Can working with AI challenge your fixed ideas and make you think afresh about your own creative process and even your human-ness? YES, absolutely.

And with the speed we see these AI tools developing, it is probably worth asking these questions again very soon. At VPRO Medialab, we sure will.

CREDITS

AI Song Team: Can AI Kick It

Utrecht University: Anja Volk, Iris Ren, Manon Blanke, Anne van Ede, Thijs Hendrickx, Otto Mättas, Thijs Ratsma

University of Amsterdam: John Ashley Burgoyne, Janne Spijkervet, Arran Lyon, Bence Halpern, Berit Janssen

VPRO Medialab: Yannick Gregoire, Annelies Termeer

Artist: Willie Wartaal

AI Music Video Team: Untapped Potential

Hannes Arvid Andersson: audio-visual artist, 3D animator and electronic music producer (as “othernode”). http://hannes.world

Chantalla Pleiter: media artist, VR and interactive installations and immersive theatrical experiences. www.chantallapleiter.nl

Teackele Soepboer: creative programmer http://teackelesoepboer.nl/

Vincent Bockstael: producer / musician / mathematician

Ties Gijzel: journalist/economist

VPRO Medialab

Creative director: Annelies Termeer
Producer: Rens Mevissen
Intern: Viviënne de Wolff

www.vpro.nl/medialab

Lyrics Abbus

[Intro]

Look at me
Revolution, ya
Ey ey (ey ey)

[Verse 1]

It’s gonna feel good (2x)
It’s gonna feel good, good, good
We want revolution

[Pre-chorus]

There will be a day
I look you in your eyes
And I’ll hold you, hold you, hold you closer than I ever did
(There will be a day)

[Chorus]

Look at me
Look at me
We’re coming with the
Look at me
Look at me
Coming with the
Abbus abbus
Abbus abbus
Abbus
Abbus abbus
Abbus abbus
We’re coming with the

[Verse 2]

I am so sick of being lied to
But the Lord is not a saint
Is there something to believe
I am asking you
Are you right

[Pre-chorus]

Headed for something
We’re headed for something
Are you ready to go
Let’s go
Headed for something
We’re headed for something
Are you ready to go
Let’s go

[Chorus]

Look at me
Look at me
We’re coming with the
Look at me
Look at me
Coming with the
Abbus abbus
Abbus abbus
Abbus
Abbus abbus
Abbus abbus
We’re coming with the

[Bridge]

I tried to write an honest song
About the things that I do
And I pray to God that I be a success
And they all said
That the Lord would soon answer
But it wasn’t to be
So I took the Lord’s advice
And I went on my way with the songs that I had written
Telling the world that I was
Living, living, living

[Chorus]

Look at me
Look at me
We’re coming with the
Look at me
Look at me
Coming with the
Abbus abbus
Abbus abbus
Abbus
Abbus abbus
Abbus abbus
We’re coming with the

[Outro]

We want revolution
Constant change
Give to everyone
Food and clothes
We want revolution
Revolution
We want revolution
Kill the government
Kill the system
Kill the government
Kill the system
