Bewildering Brain

Writing songs like Bob Dylan using machine learning.

Alex Ingberg
Towards Data Science
13 min read · Apr 24, 2019


By, in the dead just before the child,
Everybody gotta have to come see her hand,
But there’s too much to hum,
Father, when I could tell you down…

These verses were not written by a poet or a musician. They were written by a neural network trained on the complete lyrics of Bob Dylan.
Can a robot create art? I leave that question to the philosophers. In this little experiment I'll only try to imitate the style of Dylan's songs as accurately as I can using machine learning.

Regarded as an idol by many and an incomparable bastion of American music, Robert Allen Zimmerman, a.k.a. Bob Dylan, is arguably the biggest popular musician ever to have come out of the States. With a career spanning 6 (!!) decades and 38 studio albums, it's easy to see why. Oh, and don't forget that he has sold more than 100 million records, making him one of the best-selling music artists of all time.
What brings us here today, though, are his lyrics. According to Wikipedia, they incorporate a wide range of political, social, philosophical, and literary influences, defying pop-music conventions and appealing to the burgeoning counterculture.
Sounds interesting, right?

This particular background and his unique, characteristic style have brought him worldwide recognition and a myriad of awards, the most prestigious being the 2016 Nobel Prize in Literature “for having created new poetic expressions within the great American song tradition”.

His signature style is unmistakable.

Or is it?

“He not busy being born is busy dying.” It’s Alright, Ma (I’m Only Bleeding) is one of his most celebrated masterpieces.

Technology

For this little experiment I built everything around TensorFlow, a Python framework for working with neural networks.

It is also worth mentioning that I used Jupyter Notebook and PyCharm as development environments, and that everything was written in Python 3.6.

The dataset includes all the songs from the self-titled debut album up to 2012's Tempest. I took the file from Matt Mulholland's website, which saved me a lot of time, so thanks Matt!

I used two techniques in particular: Markov chains and RNNs (recurrent neural networks). Below I compare how each performs and which one produces the better results.

Markov Chain

Markov chains are stochastic processes whose main characteristic is that the probability of an event depends purely and exclusively on the previous event. This lack of memory is known as the Markov property.
It sounds complicated, but it’s super easy to understand.

If we model rain as a Markov process, the probability of rain today depends only on whether it rained yesterday.

There are two states: it rains or it doesn't.

If today is sunny, the probability that it rains the next day is 40%, so there's a 60% chance the next day will be sunny too.

If it's raining, there's a 30% chance that it'll rain again the next day and a 70% chance it'll be sunny.

And just like that, day after day, we can calculate the probability of rain based only on the previous day: no matter the wet season, no matter global warming, no matter the geographical region.
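To make the example concrete, here is a minimal sketch of that two-state weather chain in Python, using the same 40%/30% transition probabilities as above (the state names and the 10-day simulation length are arbitrary choices for illustration):

```python
import random

# Transition probabilities from the example above:
# P(rain tomorrow | sunny today) = 0.4, P(rain tomorrow | rainy today) = 0.3
TRANSITIONS = {
    "sunny": {"sunny": 0.6, "rainy": 0.4},
    "rainy": {"sunny": 0.7, "rainy": 0.3},
}

def next_state(current):
    """Sample tomorrow's weather given only today's weather (the Markov property)."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

# Simulate 10 days starting from a sunny day.
day = "sunny"
forecast = [day]
for _ in range(10):
    day = next_state(day)
    forecast.append(day)
print(forecast)
```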

Back on topic: we can understand text as a Markov process, where the probability of each word appearing depends only on the word written before it. This makes “predicting text” possible.

To apply Markov chains to Dylan's lyrics, I leaned heavily on Hans Kamin's work predicting Logic's lyrics. The idea was to create a bigram Markov chain representing the English language: more specifically, a dictionary keyed by tuples of consecutive words. Because we use bigrams rather than single words (unigrams), we get better precision in our predictions and, above all, better readability in the lyrics the model creates. In other words, the next word of a sentence is predicted from the two previous words instead of just the last one.

Using a Pandas Series, I iterated over all the lyrics in the dataset and used special tags for the line start, line end, and newlines (\n): "<START>", "<END>", and "<N>" respectively.

To generate text, we start in the dictionary at the key (None, "<START>"), which must be the first link of the chain, and sample at random, but respecting the distribution, a word from the list connected to that key; we then shift the key to include the word we just sampled. We continue this process until we reach "<END>".
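As a rough sketch of the mechanics (this is not the project's actual code: the two lyric lines are made up, and the "<N>" newline tag is left out for brevity), building and sampling the bigram chain looks roughly like this:

```python
import random
from collections import defaultdict

import pandas as pd

# Hypothetical dataset: one lyric line per row in a Pandas Series.
lyrics = pd.Series([
    "How many roads must a man walk down",
    "The answer is blowin' in the wind",
])

# Build a bigram chain: (prev_word, word) -> list of possible next words.
chain = defaultdict(list)
for line in lyrics:
    words = ["<START>"] + line.split() + ["<END>"]
    # Seed the chain so generation can begin at (None, "<START>").
    chain[(None, "<START>")].append(words[1])
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        chain[(w1, w2)].append(w3)

# Sample: start at (None, "<START>") and walk until "<END>" is reached.
key = (None, "<START>")
output = []
while True:
    word = random.choice(chain[key])  # sampling respects the empirical distribution
    if word == "<END>":
        break
    output.append(word)
    key = (key[1], word)              # slide the bigram window forward

print(" ".join(output))
```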

Here are some of my favorite phrases that the algorithm spat out:

You angel, you, you’re as f — got me under your wing, The way you walk into the room where my pencil is at.

I’m bound to get up in the Great North Woods, working as a parting gift. Summer days, summer nights are gone, I know I just wouldn’t have a temperature?

I went to tell you, baby, Late last night I dream of bells in the dirt in the wind. I walked out to the higher calling of my confession,

I went into the cup above a blind man at the mouth, he began to think I could, I can’t let go of my lambs to the heart

How I wandered again to my knees, Don’t need a woman I meet, Putting her in hope not to me, baby, I’ll do anything in this telephone wire.

How’m I supposed to get ill. Then they bring them clothes! Woo-hoo! Well, I don’t know by now, It ain’t me you’re looking for, babe.

Come, Look out your window ledge. How long’s it gonna take away my highway shoes.

To anyone passing by, There was no more letters, no! Not unless you mail them from Desolation Row.

As is obvious, the lyrics, even though they have a clear Dylanesque style, feel like a cutout of reality: a copy-paste of his work, a collage of different verses assembled into a new song. Many phrases are identical to ones written by Bob.
This was to be expected: by using bigrams to gain readability, we also reduce the variance of the predicted words, and as a result we often get three or more consecutive words taken from the same verse. Using unigrams isn't the answer either, since the meaning would be lost by not respecting the syntactic and morphological order of the words: we'd end up with a word soup in random order.

Markov chains have the advantage of being easy to implement and of using fewer variables to predict results, but that comes hand in hand with weaker predictions. To counter this, I jumped to a more complex model: a recurrent neural network, or RNN.
Another detail worth mentioning: the algorithm keeps predicting, regardless of length, until it reaches the end of the chain (the <END> tag). This means one run of the algorithm can output 2 verses and the next one 50.

“There must be some kind of way outta here” — Said the joker to the thief.

RNN (Recurrent Neural Network)

Recurrent networks are a type of artificial neural network that recognizes patterns in sequences of data such as text, genomes, handwriting, spoken language or numerical time series originating from sensors, stock markets or government agencies. These algorithms take time and sequence into account; they have a temporal dimension.

In comparison to Markov Chains, recurrent networks have memory.

To understand recurrent networks, you first need to understand the basics of ordinary, feedforward networks. Both types are called neural because of the way they channel information through a series of mathematical operations carried out in the nodes of the network. One carries information straight through to the end without touching the same node more than once, while the other runs cycles over the same network. The latter is the recurrent one.

In the case of feedforward networks, you feed them an input and get an output. In supervised learning the output is a label: raw data gets mapped to a category by recognizing patterns that decide, for example, whether an input image should be classified as a cat or a dog.

Recurrent networks, on the other hand, take as input not only the current example, but also what they have perceived previously in time.

The decision taken at time t-1 affects the decision taken a moment later, at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, just like a human brain.

We also need to look at the concept of Long Short-Term Memory (LSTM) to understand the complete process.
What would happen if I fed my network all of Dylan's songs and asked it to finish the title of this song?

Subterranean Homesick ….

We know the next word is Blues (if you don't know the song, it's of the utmost importance that you listen to it now). The network won't, since this information doesn't repeat many times and its occurrences aren't close enough together to be remembered.

For humans, it's pretty obvious and intuitive that if a phrase appears in the title of a book, it must be an important part of the plot, even when it's the only time it appears. Unlike an RNN, we would definitely remember it. Where plain RNNs fail to remember, techniques like LSTM networks handle this kind of situation successfully.

While RNNs remember everything up until a certain limited depth, LSTM networks learn what to remember and what to forget.

This allows LSTM networks to reach and use memories that are beyond an RNN's range: memories that, because of their perceived importance, were retained in the first place.

How do we achieve this? Recurrent networks, in general, have a simple structure of repeated modules through which the data flows. The simplest ones are commonly built from layers that use just a hyperbolic tangent (tanh) as the activation function.

LSTM networks, on the other hand, have a more complicated structure, combining tanh with several other functions, among them sigmoids. They not only have input and output gates, but also a third gate we could call the forget gate. It decides whether a piece of information is worth remembering; if not, the information is discarded.

How does the decision making work? Each gate is associated with a set of weights. At each step, the sigmoid function is applied to the input and outputs a value between 0 and 1.
0 means nothing gets through and 1 means everything does.
Afterwards, the weights of each layer are updated through back-propagation. This allows the gates to learn, over time, which information is important and which is not.
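To make the gating idea concrete, here is a toy single-step LSTM cell written from scratch with NumPy. This is the textbook formulation, not the code used in the project, and the dimensions and random weights are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: x is the current input, h_prev/c_prev are the
    previous hidden and cell states. W, U, b hold the weights of the gates."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate: 0 = drop, 1 = keep
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate cell state
    c = f * c_prev + i * g                              # new cell state (the memory)
    h = o * np.tanh(c)                                  # new hidden state
    return h, c

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state, random weights.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_in)) for k in "fiog"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h, c)
```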

To implement all this, I based my work on the text-predictor by Greg Surma. I made some small changes to the model, adapted it to Python 3 and played a bit with the hyperparameters until I got satisfactory results.

The model is character-based: all the unique characters and their frequencies are computed, and the input tensors are built by replacing each character with its numeric index. The length of the output is fixed by a predefined parameter.
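As an illustration only (the project itself follows Greg Surma's text-predictor, which is written in lower-level TensorFlow; the placeholder corpus, sequence length and layer sizes below are my own assumptions), a character-level model of this kind can be sketched in tf.keras like so:

```python
import numpy as np
import tensorflow as tf

# Placeholder corpus; in the project this would be the full Dylan lyrics file.
text = "subterranean homesick blues\n" * 100

# Map each unique character to an integer index.
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
encoded = np.array([char_to_idx[c] for c in text])

seq_len = 40
# Build (input sequence, next character) training pairs.
X = np.stack([encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)])
y = encoded[seq_len:]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), 64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=64, epochs=1)  # in practice, many more epochs
```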

For more detail, you can check my code in my GitHub account.

Enough with technicalities: let’s check the results!

The interesting part is not only the final result, but the learning process the algorithm goes through at each iteration: how it moves from a concoction of characters to fully formed verses in just a few cycles.

We can also appreciate the learning curve, where we can see how the loss is minimized until it settles around an asymptotic value near 0.6 in fewer than 90k iterations.

Iteration 0

5p4:HStdgoTxFtBy/IBBDJe!l5KT
HldE42:lt(-cdbA2lB2avwshp-w,M)cKyP]8 1arOsfsJSA
cWU6sU6E"JV54X9bfrxFV1EEnajkozd'Tk3iQkuUp02oekrQI-,,rAt-(PyfE6z-v7]8utBnD/Nxv:m;0Mw!)cYbnug
qo7t MXQhnq?X7qBTgKp9dJAojO2.87cN?:/0SJq:k
BS
yKDaj5G
0"U466;y8'7cxNLsYXVTUIxqbW0i0bZh8okns) Hf2?2R2hxddb;zXfv3J4iLfv-qOK4y[gaQuImW!XUyyBch9)GgcFB5f[Ri6?FaGno pBMQl hD ;tPUnyWuxg!B Qd6ot30tAlnLg2n?tLctfXTaz:9pIC3Z;fnA]A?q9k"B2r
m"eHTI"miA!d/iimz!/ndfzSKd.W[ALoLxE[l;PQI:PG ]EtUM4?(x4zBB-[wH;
GJT/JYA
zFGK9x05J1Ch[z2(/L4P?Ki
TYNK,7m

You know nothing, RNN model. After a cold start, the model gets initialized with trash and random characters.

Iteration 1000

temple at 
I hand you up to laby,
You set you on always hole as madoo and use unknear,
And thinking
I want his dista,
Coom on you make you." "What want
Everybody was on
Ira," ain't may bold by you.
And the pend.
Honey, day you don't eway you say"
I mad in
Game,
No, contaw woman,
How, way,
Pryie you don't know, and couse
I love are stone is sute curt suck block on
Haye?
Now, a make for etcide is lord,
Walles
And he lad feel,
Take, blace
And mave wease with nothing,
But you

In a few iterations, the model learned which characters are used to create a Dylan song, as well as its overall shape: the length of the verses and basic punctuation rules like capital letters at the start of sentences and the use of commas.

Iteration 2000

how. 
You never you been todred,
Just crying her face to the night.
Oh, uh, sang to time in you.
Timb friend carbed as lace.
We'll be the better does of my beantains,
The mightenmed to cheat twist and you'll asy dressed them loves?
With the mough seen of the facing gold,
Take er can
Man, wanded like mind for your morning the night up the feet the wond pring,
Take did a grost ever neum.
Pounsta fleason just comeless, them bads of me see there a womes of as too lotten up to turn,
You

Some words are already real words, and the morphological relations between words start to show: adjectives and articles as modifiers of nouns; circumstantial modifiers, objects and predicates after verbs.

Iteration 4000

I world must be lady, babe, 
Ither didn't matked, don't remember helled things.
They'll eter came life, mamber
And the company together
That
I thinking for you, though protaured in the dance please
Follower,
I ain't never the one?
Well, it air awa paries because a north that her in day you only think cannot the ground, her not a roll a mause where they're looked awhile,
Can the
Lad-eyes and the confesed white wiced to come me.
You're in two if it is, slele noners,
Ain't mes was blow

Errors in word prediction become less frequent: fewer vocabulary mistakes.

Iteration 35000

with their highway.  
I cannon cloaked in a picture,
It diamondy his man
I'll see you even day he'd come there across, the moon after the parking,
I'm dressed,
I'm a bad line.
Sanalured like a coller standing in a woman.
I'll be banked inside,
She sees -
Shere road-luck in the dust he's well never know.
With degreeing on a whole farms, but don't think twice and
I took forwlet
Johanna
I never crash.
I'm going to the jelf.
All
I never been don't know what
I
Night -
Don't mean

The model reaches an excellent morphological understanding of the verses: even if the words don't make much sense, the output has the shape of poetry or song lyrics.

Iteration 102000

guess he wat hope this nose in the last table if 
I don't mean to know.
Well,
I'm puts some dirty trouble for you in the law.
Go dishes -
And all the way from a cow,
They'll stone you when you knew that, you are, you're gonna have to guess that

I after and flowing on you, laws are dead, might as
I read is changed.
I've taking for you, yesterday, a
Martin
Luther game was tried.
He was without a home,
Let the basement deep about hall."
Well,
I'm lose again the real land,
Throw my

The number of errors has diminished noticeably. It definitely looks like it was written by a human. Maybe that human is over the recommended dose of Xanax, but a human after all.

Iteration 259000

guess in the confusion come with nothing in here together wrong. 
I saw sold, he puts in my bed.
Going through the branchummy,
There's an ended on the factiful longer and pierce of
Blind expense
And the wind blow you went,
I've shine,
Bent
Before the world is on the snowfur warn - now, handled, your daughters are full of old, goes for dignity.
Oh, you got some kid no reason to game
Just and it's light and it evonces, 'round up
Indian.
Well, the bright the truth was a man patty

Our network learned to write songs like Bob Dylan.

OK, alright, I admit it: we still have some vocabulary errors, and the lyrics may not make much sense.
Even if lyrics generated by an artificial intelligence still have these little flaws, we can certainly see that the model correctly learned to copy the style of the provided dataset.

If we consider that the network learned to do all this from scratch, starting with no notion of what a letter or a word is (let alone English grammar), we can agree that the results are surprising. We were able to detect logical patterns in a dataset and reproduce them, and at no point was the network told what the language was, what its rules were, or even whether it was processing clinical images from medical patients or the corpus of Shakespeare's works.

Conclusions and future steps

In this article I set out to compare two very different methods of predicting text. On the one hand, Markov chains have the advantage of being easy to implement; no deep theoretical or technical knowledge is needed to develop them, but the predictions come out fairly basic and fall short of expectations. The future of this field clearly lies with RNNs, even though implementing, running and testing them demands considerable time, processing power, disk space and memory for the tensors, and above all more advanced technical and theoretical knowledge.

To further improve precision, we could check the output against a dictionary in a post-processing stage. That dictionary could be built from the unique words in the dataset or from an English-language dictionary. If a predicted word is not present, we could drop it or swap it for the most similar word (the one with the smallest distance).
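Here is a minimal sketch of that post-processing idea, using Python's standard difflib to find the closest in-vocabulary word; the vocabulary and the generated line below are made up for illustration:

```python
import difflib

# Hypothetical vocabulary: unique words taken from the lyrics dataset.
vocabulary = {"the", "answer", "is", "blowin'", "in", "wind", "how", "many", "roads"}

def clean_line(line, vocab):
    """Replace out-of-vocabulary words with their closest match, or drop them."""
    cleaned = []
    for word in line.split():
        if word.lower() in vocab:
            cleaned.append(word)
            continue
        matches = difflib.get_close_matches(word.lower(), list(vocab), n=1, cutoff=0.7)
        if matches:                      # swap in the most similar known word
            cleaned.append(matches[0])
        # otherwise the word is simply dropped
    return " ".join(cleaned)

print(clean_line("The answr is blowin in the wnd", vocabulary))
```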

Again, if you wish, you can check my code in my public GitHub account.

Let me ask you one question, is your money that good?
Will it buy you forgiveness, do you think that it could?

How much longer until artificial lyrics arrive? Who will be the first to squeeze the juice out of this market? There have already been holographic tours of late musicians: Roy Orbison, Michael Jackson and Tupac are just some examples worth mentioning. Will this new era of “music after death” be the perfect excuse for artificial lyrics?

Thanks for reading!
