In our Western languages, speech melody hovers over all our conversations, giving them fine emotional meaning — “It’s not what she said, it’s how she said it.” We are, with speech melody, in an area of human behavior where music, meaning, and feelings are completely fused.
- Steve Reich
While transcribing The New York Public Library’s Community Oral Histories, I became preoccupied with the question of what is lost in the translation of spoken word to text: the subtle pauses, speech cadences, evolving dynamics, the speeding up and slowing down, the stumbles, the stammers, the ums, and the ahs.
This led me to pursue the idea of translating audio of human speech into a sheet of music with “lyrics”, which may help preserve the rhythms, pitches, and dynamics of speech.
This idea isn’t new, by the way. Composers throughout history, including Béla Bartók, Leoš Janáček, and (most famously) Steve Reich (see: It’s Gonna Rain, Come Out, Different Trains, The Cave), have used spoken language as a compositional basis for their music. Since I am a computer programmer and not a musician or composer, I will use software rather than my ears to transcribe speech to musical notes. I will outline my process in this post.
Still I Rise
Steve Reich, an American composer and pioneer of minimal music, first became interested in spoken language as a basis for music while trying to set a poem by William Carlos Williams to music. He was unable to do so, stating that:
The failure was due to the fact that this poetry is rooted in American speech rhythms, and to “set” poems like this to music with a fixed meter is to destroy that speech quality.
This impediment led Reich to conclude that speech quality (its melody, timbre, and dynamics) is inextricably tied to the meaning of the words being spoken.
I decided to start my experiment with the poem Still I Rise by Maya Angelou, as performed by Angelou herself. Poetry often makes playful use of language, which can lead to multiple interpretations of meaning. After listening to Angelou’s animated performance of one of her best-known and most revered poems, I believed it would be an ideal candidate to capture as sheet music. Perhaps by analyzing Angelou’s whimsical speech rhythms and dynamics, we can glean additional meanings and insight into how Angelou intended the poem to be experienced.
I would like to think of this as not only a translation of her poem, but a portrait of a performance and celebration of an inspirational and influential individual and voice. Here’s the performance that I will start with:
For those who have some programming experience and would like to replicate my process exactly, head over to my code repository (everything is free and open source). The first thing I do is align the audio to the words of the poem. For this task I use the open-source Gentle Forced Aligner, which takes an audio file and its text and aligns them down to the phoneme. Now I know exactly where in the audio she says a particular phrase, word, or syllable.
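To make the alignment step concrete, here is a minimal sketch of turning aligner output into per-phoneme start and end times. It is an illustration, not the code in my repository: the JSON shape (a "words" list whose entries carry "start", "end", "case", and a "phones" list with "duration") is an assumption about what a forced aligner returns, and the sample data and the phone_timings helper are hypothetical.

```python
import json

# Hypothetical sample shaped like a forced aligner's JSON output.
# The field names are an assumption and may differ in practice.
SAMPLE = json.loads("""
{"words": [
  {"word": "You", "case": "success", "start": 0.50, "end": 0.70,
   "phones": [{"phone": "y_B", "duration": 0.08},
              {"phone": "uw_E", "duration": 0.12}]},
  {"word": "may", "case": "success", "start": 0.70, "end": 0.95,
   "phones": [{"phone": "m_B", "duration": 0.10},
              {"phone": "ey_E", "duration": 0.15}]}
]}
""")


def phone_timings(alignment):
    """Return (word, phone, start, end) tuples for every aligned phoneme."""
    rows = []
    for word in alignment["words"]:
        if word.get("case") != "success":
            continue  # skip words the aligner could not place in the audio
        t = word["start"]
        for phone in word.get("phones", []):
            end = t + phone["duration"]
            rows.append((word["word"], phone["phone"], round(t, 3), round(end, 3)))
            t = end  # the next phoneme starts where this one ends
    return rows


for row in phone_timings(SAMPLE):
    print(row)
```

Accumulating the phoneme durations from each word's start time gives absolute timestamps, which is what the later pitch and intensity lookups need.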
The next step is to analyze the audio itself and extract pitch data (for musical notes) and volume/intensity (for note selection and dynamics). I use Praat, a free computer program for analyzing speech in phonetics. With it I can run pitch detection, which acts as the basis of the musical notes.
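A detected pitch is a frequency in hertz, so it has to be mapped to the nearest note before it can appear on a staff. The sketch below uses the standard equal-temperament relation (MIDI note 69 is A4 at 440 Hz, and each semitone is a factor of 2^(1/12)). The hz_to_note helper is a hypothetical name for illustration, and rounding to the nearest semitone is a simplifying assumption.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = MIDI 69)."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))


def hz_to_note(freq_hz, a4_hz=440.0):
    """Nearest note name with octave, e.g. 440.0 -> 'A4'."""
    midi = hz_to_midi(freq_hz, a4_hz)
    octave = midi // 12 - 1  # MIDI 60 is C4 (middle C)
    return f"{NOTE_NAMES[midi % 12]}{octave}"


print(hz_to_note(440.0))   # A4
print(hz_to_note(155.56))  # D#3
```

Speech rarely lands exactly on a semitone, so the rounding here already discards some detail; that trade-off comes up again when fitting notes to a rhythm.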
The next step is a bit tricky. I would like to convert each spoken syllable into a note or set of notes. Let’s start with the simplest case. The syllable “may” is short and spoken with little pitch variation, so simply taking the pitch with the highest intensity (usually towards the beginning of the syllable) will suffice. In this case, it’s D-sharp.
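For this simplest case, the selection rule can be sketched in a few lines. The frame format here, a list of (time, pitch, intensity) tuples with None for unvoiced frames, is a hypothetical layout chosen for illustration, as are the sample numbers.

```python
def loudest_pitch(frames):
    """Pick the pitch (Hz) of the loudest voiced frame in a syllable.

    frames: list of (time_s, pitch_hz or None, intensity_db) tuples.
    Returns None if the syllable has no voiced frames.
    """
    voiced = [f for f in frames if f[1] is not None]
    if not voiced:
        return None
    return max(voiced, key=lambda f: f[2])[1]


# Made-up frames for one short syllable: loudest near the start.
syllable = [
    (0.70, None, 48.0),   # unvoiced onset, ignored
    (0.72, 311.0, 71.5),
    (0.74, 309.5, 70.2),
    (0.76, 306.0, 66.0),
]
print(loudest_pitch(syllable))  # 311.0
```

Filtering out unvoiced frames first matters, because a loud consonant with no detectable pitch would otherwise win the comparison.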
But sometimes Angelou “slurs” a syllable, i.e. she glides from a high note to a low note (or the reverse) in one breath. To use a slur in musical notation, I have to look for syllables that clearly increase or decrease in pitch. The syllable “down”, for example, is spoken with a falling pitch.
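One simple way to flag such syllables is to compare the pitch at the start and end of the syllable in semitones. This is a rough sketch rather than my actual algorithm: the two-semitone threshold and the classify_contour helper are assumptions made for illustration.

```python
import math


def semitones_between(f_start_hz, f_end_hz):
    """Signed pitch change in semitones (negative means falling)."""
    return 12 * math.log2(f_end_hz / f_start_hz)


def classify_contour(pitches_hz, threshold=2.0):
    """Label a syllable's pitch track as 'rising', 'falling', or 'flat'.

    pitches_hz: pitch samples in time order; None marks unvoiced frames.
    threshold: minimum change in semitones to count as a slur (assumed value).
    """
    voiced = [p for p in pitches_hz if p]
    if len(voiced) < 2:
        return "flat"
    change = semitones_between(voiced[0], voiced[-1])
    if change >= threshold:
        return "rising"
    if change <= -threshold:
        return "falling"
    return "flat"


print(classify_contour([220.0, 200.0, 180.0, 165.0]))  # falling
print(classify_contour([200.0, 202.0, 201.0]))         # flat
```

A syllable labeled rising or falling would get two notes joined by a slur (its first and last pitch), while a flat one keeps a single note.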
I tweaked the note-selection algorithm to take into consideration vocal intensity, the peaks and valleys in pitch, and vocal continuity. Assuming that I now have note candidates for each of the syllables, the next step is to finally put these notes into musical notation. This is where one would have to find a good balance between the aesthetics and accuracy of the sheet music.
One issue with converting speech to sheet music is that humans don’t speak neatly in measures, quarter notes, or eighth notes (as Reich found). To be completely accurate, one would probably need to break the song into 1/32nd notes or smaller, which would be quite unreadable.
To get around this, I will set the tempo (bpm or metronome mark) to be very fast. For example, with a bpm of 120, each quarter note is half a second, but a bpm of 240 will make each quarter note a quarter second. This would allow me to make more readable yet still fairly accurate sheet music.
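The arithmetic behind this is simply that a quarter note lasts 60 / bpm seconds. A small sketch, with hypothetical helper names:

```python
def quarter_note_seconds(bpm):
    """Length of one quarter note in seconds at the given tempo."""
    return 60.0 / bpm


def duration_in_beats(seconds, bpm):
    """Express a duration in quarter-note beats at the given tempo."""
    return seconds / quarter_note_seconds(bpm)


print(quarter_note_seconds(120))  # 0.5
print(quarter_note_seconds(240))  # 0.25
```

Doubling the tempo halves the length of every note value, so a syllable that would need a sixteenth note at 120 bpm fits an eighth note at 240 bpm.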
To make it even more readable, I made the eighth note the smallest note allowed in the sheet music. I then did some basic analysis of the audio dynamics, which translate into dynamic marks such as forte or pianissimo. This I can do rather naively by just looking at the intensity data of each syllable.
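Both ideas can be sketched briefly: snapping each syllable's duration to a whole number of eighth notes, and bucketing its intensity into a dynamic mark. The decibel thresholds below are made-up values for illustration (real ones would depend on the recording), and the helper names are hypothetical.

```python
def to_eighths(seconds, bpm):
    """Round a duration to a whole number of eighth notes (minimum one)."""
    eighth_seconds = 60.0 / bpm / 2  # an eighth note is half a quarter note
    return max(1, round(seconds / eighth_seconds))


# (upper bound in dB, mark) pairs; thresholds are assumed, not measured.
DYNAMICS = [(50, "pp"), (58, "p"), (64, "mp"), (70, "mf"), (76, "f")]


def dynamic_mark(intensity_db):
    """Map a syllable's intensity to a dynamic mark by simple thresholds."""
    for upper_bound, mark in DYNAMICS:
        if intensity_db < upper_bound:
            return mark
    return "ff"


print(to_eighths(0.4, 240))  # 3 eighth notes, i.e. a dotted quarter
print(dynamic_mark(72.0))    # f
```

Forcing a minimum of one eighth note keeps very short syllables from vanishing, at the cost of slightly stretching them.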
I thought the final step of rendering the actual sheet music would be the hardest, but it turned out to be one of the easiest thanks to the amazing and free music engraving program LilyPond. LilyPond lets you use a special text-based syntax to generate professional-looking sheet music. It figures out all the complicated spacing issues automatically, and even splits notes across measures when applicable. So all I needed to do was convert my musical note data into LilyPond’s syntax, and let it do the rest. Here it is!
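To give a feel for that conversion, here is a minimal sketch that turns (MIDI note, length in eighth notes, syllable) triples into LilyPond text with lyrics attached to the melody. It is a simplified illustration, not my actual converter: the helper names and input format are hypothetical, it assumes every length is at least one eighth note, and it leaves out dynamics, slurs, and any quoting that punctuation in lyrics may need.

```python
LILY_NAMES = ["c", "cis", "d", "dis", "e", "f", "fis", "g", "gis", "a", "ais", "b"]

# Note lengths expressed in eighth notes -> LilyPond duration suffix.
EIGHTHS_TO_DURATION = {8: "1", 6: "2.", 4: "2", 3: "4.", 2: "4", 1: "8"}


def lily_pitch(midi):
    """MIDI number -> LilyPond absolute pitch (c' is middle C, MIDI 60)."""
    octave = midi // 12 - 1
    marks = "'" * (octave - 3) if octave >= 3 else "," * (3 - octave)
    return LILY_NAMES[midi % 12] + marks


def lily_note(midi, eighths):
    """One note; lengths with no single symbol become tied notes."""
    pitch = lily_pitch(midi)
    parts, remaining = [], eighths
    for size in sorted(EIGHTHS_TO_DURATION, reverse=True):
        while remaining >= size:
            parts.append(pitch + EIGHTHS_TO_DURATION[size])
            remaining -= size
    return "~ ".join(parts)


def lily_score(notes, bpm=240):
    """notes: list of (midi, eighths, syllable) -> LilyPond source text."""
    melody = " ".join(lily_note(midi, eighths) for midi, eighths, _ in notes)
    lyrics = " ".join(syllable for _, _, syllable in notes)
    lines = [
        "\\score {",
        "  <<",
        '    \\new Voice = "melody" { \\tempo 4 = ' + str(bpm) + " " + melody + " }",
        '    \\new Lyrics \\lyricsto "melody" { ' + lyrics + " }",
        "  >>",
        "  \\layout { }",
        "}",
    ]
    return "\n".join(lines)


print(lily_score([(63, 1, "You"), (63, 2, "may")]))
```

Saving the printed text to a .ly file and running LilyPond on it should produce a staff with the syllables set under the notes.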
It’s not perfect, but I think one can get a better idea of how Angelou performed this poem compared to just the text. There’s a rabbit hole of additional improvements one could make: crescendos, articulation marks, or chords to name a few.
The Next Steps
For me, this opens up other interesting avenues to pursue. As someone without a music composition background, I find that this makes process-driven musical composition accessible to me, allowing me to generate melodies and rhythms from any audio of spoken language.
Perhaps this can be a bridge between oral history, sound archives, and the modern DJ. Loops and catchy beats dominate and propagate today’s culture through hip-hop and electronic music. What happens if we can make the history embedded in the melodies of spoken word as infectious and influential as a Beyoncé single?