Teaching machines to speak and improvise blues jazz

For this week’s piece, I’d like to focus on examples of long short-term memory (LSTM) models, as they’re one of the most elegant concepts I’ve come across in deep learning. They also power Google Translate, help Uber predict demand during extreme events, make Amazon Echo sound like a real human, teach robotic surgeons to tie knots, and even compose new blues jazz.

Challenge: My dog has four…

Suppose I asked you to guess the next word in the sentence “My dog has four…” You’d probably say “legs”. In this case, the context for the missing word sits in the immediately preceding words. You wouldn’t need any context other than knowing that the subject of the sentence is “dog”. Recurrent neural networks (RNNs) are particularly good at making this kind of prediction, where the gap between context and prediction is small.

But what if instead of the above, I said:

“I have a dog called Charlie. Charlie likes to run after sticks, chase cats, and eat my shoes while I’m out. Charlie has four…”

You’d also guess “legs”. But that’s only because you remember the relevant context, namely that “Charlie” is my “dog”. That relevant context wasn’t part of the immediately preceding words, but sat further back, at the beginning of the story. For these kinds of problems – where the gap between context and prediction is large – RNNs break down quickly. This is where long short-term memory models come in.
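To make that concrete, here’s a rough sketch of what a next-word predictor built around an LSTM layer might look like in Keras. The vocabulary size, sequence length, and layer sizes are all made up for illustration – this isn’t a tuned model.

```python
# A minimal next-word prediction sketch (hypothetical sizes, no real data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000   # assumed vocabulary size
seq_len = 30        # how many preceding words the model gets to see

model = keras.Sequential([
    # Map each word index to a dense vector.
    layers.Embedding(input_dim=vocab_size, output_dim=64),
    # The LSTM carries a memory (cell state) across all 30 steps, so context
    # from the start of the passage ("dog") can survive until the prediction.
    layers.LSTM(128),
    # Probability distribution over the next word.
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Toy usage: x holds word-index sequences, y the word that follows each one.
x = np.random.randint(0, vocab_size, size=(8, seq_len))
y = np.random.randint(0, vocab_size, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```

The LSTM layer is there for exactly the “Charlie” problem above: its cell state lets the word “dog” from the first sentence still influence the prediction many words later.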

What does your memory do?

Think about your own memory for a moment. It effectively does three things:

  1. Records new information (input gate) – “I came home and put my keys by the oven”
  2. Forgets some information (forget gate) – forgets that the keys are by the oven
  3. Passes remaining information forward (output gate) – “I came home and put my keys somewhere”

An LSTM uses these three functions to provide context for the thing it’s trying to predict. It takes in small groups of words at a time (e.g. “I have a dog…”) to (a) predict the next words (“called Charlie”) and (b) remember the context of the sentence (“dog”). Then, when it needs to predict the next word in a later part of the passage (“Charlie has four…”), it relies on that memory to tell it that we’re talking about a dog here, and hence that the likely answer is “legs”.

LSTMs have proven to be extremely effective at retaining relevant contextual information over long periods of time.

A memory cell. It takes as input 1) new information and 2) the memory output by an earlier cell. It then forgets some of its information. Finally, it outputs 1) a prediction and 2) the memory passed on to the next cell. Source: deeplearning.net
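Here’s what one step of that cell looks like in plain NumPy – a minimal sketch of the standard LSTM equations, with the input, forget, and output gates mapped onto the three memory functions above (weight shapes and initialisation are purely illustrative):

```python
# One step of an LSTM memory cell, written out in NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """x_t: new information; h_prev, c_prev: memory handed over by the
    previous cell. Returns the new output h_t and cell state c_t."""
    # Four gate pre-activations stacked together: input, forget, output, candidate.
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)

    i = sigmoid(i)   # input gate: how much new information to record
    f = sigmoid(f)   # forget gate: how much old memory to drop
    o = sigmoid(o)   # output gate: how much memory to pass forward
    g = np.tanh(g)   # candidate memory built from the new input

    c_t = f * c_prev + i * g    # forget some things, record some things
    h_t = o * np.tanh(c_t)      # what gets handed on to the next cell
    return h_t, c_t

# Toy dimensions: 10-dimensional input, 20-dimensional memory.
n_in, n_hid = 10, 20
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # walk the cell through 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```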

Giving human speech to machines

If you’re reading this on a Mac or an iOS device, try this: highlight this paragraph and go to Edit -> Speech -> Start Speaking (macOS) or tap Speak (iOS).

Alternatively, here’s an example of what you’d hear:

A basic text-to-speech sample (not using LSTMs). Notice how it sounds monotone, not like a human. Source

While you can understand the literal words, it clearly doesn’t sound like a human. It’s monotone, and the voice doesn’t capture the intonation a human would use. At a high level, you can think of human speech as a combination of:

  1. the words you say
  2. the pitch you use
  3. the rhythm with which you enunciate

(1) is easy to do, as it usually isn’t contextual. The caveat is heteronyms (words with the same spelling but different pronunciations and meanings, e.g. “We must polish the Polish furniture” or “Please close the door you are close to”).

But (2) and (3) (pitch/rhythm) are highly contextual, depending on what you’re trying to communicate (imagine Martin Luther King, Jr.’s “I have a dream” speech read by Apple’s monotone Safari reader). Advanced speech systems, such as Baidu’s or Amazon’s Polly (the voice behind Alexa), solve this by encoding the pitch and rhythm of the human voice and applying LSTMs to predict not just the next word, but the pitch and rhythm of that word too.
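As a rough illustration of that idea – and only an illustration, not the actual architecture of Polly, Baidu’s system, or any shipping product – here’s how a single LSTM could be wired to predict the next sound along with its pitch and duration. Every name and size below is an assumption:

```python
# Hypothetical multi-output setup: one LSTM reads the text so far and predicts
# the next token plus its pitch and duration. Sizes and names are made up.
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 100          # assumed phoneme/word inventory
inputs = keras.Input(shape=(None,), dtype="int32")

x = layers.Embedding(vocab_size, 64)(inputs)
x = layers.LSTM(128)(x)   # contextual memory over the sentence so far

next_token = layers.Dense(vocab_size, activation="softmax", name="next_token")(x)
pitch = layers.Dense(1, name="pitch")(x)        # e.g. fundamental frequency
duration = layers.Dense(1, name="duration")(x)  # e.g. how long the sound lasts

model = keras.Model(inputs, [next_token, pitch, duration])
model.compile(
    optimizer="adam",
    loss={
        "next_token": "sparse_categorical_crossentropy",
        "pitch": "mse",
        "duration": "mse",
    },
)
```

The design point is that the same memory that tracks what is being said also informs how it should be said.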

Here’s an example of two English sentences read using different text-to-speech systems of increasing complexity. You’ll notice the third – Google’s WaveNet – sounds much more humanlike.

Composing blues jazz with LSTMs

Eck and Schmidhuber applied LSTMs to music composition by training their models on randomly selected melodies that fit the blues musical form. Specifically, they trained their LSTM to learn how to generate new melodic structures that “fit” with a chord structure. This was the first time a neural network was used to capture the global structure of a piece of music (i.e. contextual memory from earlier parts of the song) rather than just its local structure (which standard RNNs can already do).

Here is a sample output. While it’s no B. B. King, it’s pretty good, and goes to show how close we are to generating highly advanced jazz with LSTMs.
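For a sense of how such a model might be wired up, here’s a toy sketch in the spirit of that experiment (not a reproduction of Eck and Schmidhuber’s actual setup – the note range, chord vocabulary, and sampling loop are all assumptions): an LSTM learns to predict the next note given the melody so far and the chord underneath it, and is then sampled from to improvise new melodies.

```python
# Toy chord-conditioned melody model (all sizes and encodings are made up).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_notes = 36    # assumed melodic pitch range, one-hot encoded
n_chords = 12   # assumed chord vocabulary (e.g. the 12-bar blues changes)

melody_in = keras.Input(shape=(None, n_notes))   # melody so far
chord_in = keras.Input(shape=(None, n_chords))   # chord under each step

x = layers.Concatenate()([melody_in, chord_in])
x = layers.LSTM(128, return_sequences=True)(x)   # remembers the global form
next_note = layers.Dense(n_notes, activation="softmax")(x)

model = keras.Model([melody_in, chord_in], next_note)
model.compile(optimizer="adam", loss="categorical_crossentropy")

def improvise(seed_melody, chords, steps=32):
    """Feed the melody generated so far back in and sample the next note.
    `chords` must cover len(seed_melody) + steps time steps."""
    melody = list(seed_melody)
    for _ in range(steps):
        m = np.array(melody)[None, ...]                   # (1, T, n_notes)
        c = np.array(chords[: len(melody)])[None, ...]    # (1, T, n_chords)
        probs = model.predict([m, c], verbose=0)[0, -1].astype("float64")
        probs /= probs.sum()                              # guard against float32 rounding
        note = np.random.choice(n_notes, p=probs)
        melody.append(np.eye(n_notes)[note])
    return melody
```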