GSoC 2019 Phase 2: Hyphenation & Alignment

Chris Wang
Jul 27, 2019 · 8 min read


Now that the second phase of GSoC has just concluded, here are some of the things I’ve been working on and some of the things I’m looking toward in the final push.

Overview

The general aim of this phase was to start aligning audio with lyrics. I’m looking for a line-to-line alignment for now, meaning lines of lyrics from Genius.com get marked with timestamps denoting when in the song those lyrics are sung. Doing this is a multistep process:

  • First, we do self-similarity comparisons and divide the audio into sections, then we label these sections depending on how similar they are to each other. In this way, we can more easily identify major structures like choruses, verses, and so on.
  • Next, we strip the audio of its background instrumentals, leaving us with just the vocals or as close as we can get to it.
  • Third, we count the number of syllables in each section of the song both in the audio and in the lyrics.
  • Finally, we use this syllable data as well as the segmentation results to match each section in audio with each section in the lyrics.

I also added unit tests and integration with Travis CI and Codecov, but I won’t get into that here.

Vocal Isolation

Since I had set up MaD TwinNet in phase 1, vocal isolation was pretty much ready to the degree we wanted. However, the authors’ model did pretty poorly on a lot of the songs I tested it on, so I set about retraining it on MedleyDB V2, a dataset of relatively newer songs. I ran the training on CCExtractor’s server, since it took over 10 days without a GPU. This new model does better, although there are some things we can look to improve:

  • When voices are autotuned or otherwise electronically processed, the model tends to think those voices are instruments and cuts them out. It also tends to think the high frequencies of snares and hi-hats are voices, so it leaves them in. There are hyperparameters used to train the model that I had left the same as in the paper; these could be optimized.
  • In general, the training process ran over 100 batches, which makes me wonder if the model is now overtrained and biased toward the types of tracks in MedleyDB. Perhaps a more diverse range of tracks, as well as a more careful training process, would improve it as well.

TwinNet is one major area of improvement, since the rest of alignment basically hinges on how well TwinNet can separate the vocals.
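To make TwinNet’s role a bit more concrete: mask-based separators like MaD TwinNet estimate a magnitude mask over the mixture’s spectrogram, and the vocals are recovered by applying that mask and inverting the STFT. Below is a minimal sketch of just that masking step; the `predict_mask` callable is a hypothetical stand-in for TwinNet’s actual inference, and everything else is generic rather than the project’s real code.

```python
# Sketch: applying a (hypothetical) vocal magnitude mask to a mixture.
# This is not TwinNet's actual API; it only illustrates the masking step
# that mask-based separators like MaD TwinNet rely on.
import librosa
import numpy as np
import soundfile as sf

def separate_vocals(mix_path, predict_mask, sr=44100, n_fft=2048, hop=512):
    """predict_mask: callable mapping a magnitude spectrogram to a [0, 1] mask."""
    y, _ = librosa.load(mix_path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)

    mask = predict_mask(mag)                      # e.g. an estimated soft mask
    vocal_stft = mask * mag * np.exp(1j * phase)  # keep the mixture's phase

    vocals = librosa.istft(vocal_stft, hop_length=hop, length=len(y))
    sf.write("vocals.wav", vocals, sr)
    return vocals
```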

Syllable Counting

There are two major sides to syllable counting: one is syllable nuclei detection, or SND, which I talked about in my last post, and the other is counting the number of syllables in the words in the lyrics.

For the former, we made a little headway by modifying the parameters used by the Praat script, especially by raising the decibel threshold for what noise counted as silence from -25 dB to -15 dB. This allowed me to filter out a lot of the background music that TwinNet wasn’t able to get rid of. I also modified the script to return a list of timestamps that each syllable fell on, rather than just a single count of how many syllables there were.
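As a rough illustration of the idea (the real work happens in the modified Praat script), here is a simplified sketch in which librosa’s RMS energy stands in for Praat’s intensity contour: loudness peaks above a threshold relative to the track’s loudest frame are treated as syllable nuclei, and their timestamps are returned. The -15 dB figure mirrors the silence threshold mentioned above; everything else is simplified.

```python
# Heavily simplified sketch of syllable nuclei detection: peaks in the
# loudness contour above (max dB - 15 dB) are treated as nuclei and their
# timestamps returned. The real detector is the modified Praat script,
# which also checks for dips between peaks and for voicing; here librosa's
# RMS energy stands in for Praat's intensity contour.
import librosa
import numpy as np
from scipy.signal import find_peaks

def syllable_timestamps(vocal_path, silence_db=-15.0, hop=512):
    y, sr = librosa.load(vocal_path, sr=None, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)   # 0 dB at the loudest frame
    peaks, _ = find_peaks(db, height=silence_db)    # keep peaks above -15 dB
    return librosa.frames_to_time(peaks, sr=sr, hop_length=hop).tolist()
```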

For the latter, I had to start from scratch. I checked out a ton of hyphenation programs on GitHub, including Pyphen and Syllabipy, as well as a more naive version that counted syllables based on some fundamental rules regarding vowels and diphthongs. All three did decently on common words, but they struggled as soon as they hit stranger words like “subpoena.” In the end, I settled on my own implementation of Marchand and Damper’s syllabification-by-analogy algorithm (SbA), which requires huge corpuses of previously syllabified words. Fortunately, these corpuses can also double as a dictionary to check whether we already have syllabification data on a word. The current textual syllable counting mechanism first checks the massive CMUdict and NETTALK datasets to see if we have prior data, performs SbA if not, and falls back to the naive version if SbA fails. It then adds that word to the dictionary.

SbA is quite successful and accurate in a large majority of cases, but truth be told, most of the words we need are already found in our corpuses. SbA kicks in when words are abbreviated, like “shakin’,” or are slang. It’s not very fast, but in most songs it isn’t needed very often.
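To make the fallback chain concrete, here is a rough sketch of the first and last steps: a lookup in a prebuilt syllable dictionary (standing in for the CMUdict/NETTALK data) and the naive vowel-group counter used when both the dictionary and SbA fail. The SbA step itself is too involved to reproduce here, and the helper names are illustrative rather than taken from the repository.

```python
# Sketch of the dictionary lookup + naive fallback for counting syllables
# in a word. `syllable_dict` is a stand-in for the CMUdict/NETTALK corpuses;
# SbA (the middle step) is omitted.
import re

def naive_syllable_count(word):
    """Count groups of consecutive vowels, with a rough silent-'e' fix."""
    word = word.lower().strip("'")
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1  # e.g. "promise" -> 2, not 3
    return max(count, 1)

def count_syllables(word, syllable_dict):
    if word.lower() in syllable_dict:
        return syllable_dict[word.lower()]
    # ... SbA would run here; fall back to the naive rule if it fails ...
    count = naive_syllable_count(word)
    syllable_dict[word.lower()] = count  # cache for next time
    return count
```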

Segmentation/Labelling

Originally, my plan was to use Spotify’s API, which can quickly return some audio analysis of a track, including predictions as to where the track can be split into different sections. The main problem with this was that no labelling was available; Spotify wouldn’t tell us which sections were most similar to each other. The second problem was that, just from listening to a few samples, Spotify isn’t that good at splitting a song into sections.

Thus, I turned to MSAF, an actively developed collection of segmentation and labelling algorithms that the author had either compiled from their specific repositories or implemented himself. With a few tests, I concluded that the best segmentation algorithm was OLDA, which stands for ordinal linear discriminant analysis, and the best labelling algorithm was FMC2D, which stands for 2-dimensional Fourier magnitude coefficients. Given the set of timeframes returned by the segmentation algorithm, the labelling algorithm returns a list of numerical labels (think 0, 1, 2, …) denoting which timeframes are most similar to each other: segments that were all labelled 3, for instance, would have similar sounds.
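For reference, switching algorithms in MSAF is essentially a one-liner; something like the sketch below, with the `boundaries_id`/`labels_id` parameters as documented by MSAF (worth double-checking against whichever version you install).

```python
# Sketch of segmentation + labelling via MSAF with OLDA boundaries and
# FMC2D labels. Parameter names follow MSAF's documented interface, but
# treat this as illustrative rather than exact.
import msaf

def segment_and_label(audio_path):
    boundaries, labels = msaf.process(audio_path,
                                      boundaries_id="olda",
                                      labels_id="fmc2d")
    # boundaries: segment boundary times in seconds (N+1 values)
    # labels: one numeric label per segment; equal labels ~ similar sections
    sections = list(zip(boundaries[:-1], boundaries[1:], labels))
    return sections  # [(start, end, label), ...]
```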

I don’t know exactly how these algorithms work, to be honest, and I don’t have a very rigorous way of demonstrating that OLDA and FMC2D are in fact the best algorithms for this project, but MSAF makes it easy to switch between different algorithms for testing. For now, I settled on this pair, and they work impressively well across the wide variety of genres I tested them on and tagged manually.

Alignment

I decided to focus on just segmentation-based alignment; that is, rather than matching each line from the song to each line in the lyrics, I would first identify which portions of the audio corresponded to sections in the lyrics like choruses, verses, bridges, and so on. If this kind of alignment is successful, it should be a lot easier to match the lines together within each segment. In the end, I didn’t have time this phase to go any further than this.

The algorithm I settled on was rather straightforward: a dynamic programming (DP) approach proposed by researchers at Gracenote. Essentially, I have to segment both the audio and the lyrics, make a guess as to which sections are choruses, verses, and so on, and then traverse both lists of segments given the probability that those earlier guesses were correct or incorrect.
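The scoring details follow the paper, but the skeleton is a standard sequence-alignment DP over the two lists of sections. The toy sketch below shows that skeleton only; the match probabilities and penalties are placeholders, not the Gracenote formulation.

```python
# Stripped-down sketch of aligning audio sections to lyric sections with
# dynamic programming. `match_prob(a, l)` would encode how likely audio
# section `a` really is the lyric section type `l` (chorus, verse, ...);
# the exact scoring in the Gracenote paper differs from this toy version.
import math

def align_sections(audio_secs, lyric_secs, match_prob, skip_penalty=-2.0):
    n, m = len(audio_secs), len(lyric_secs)
    # dp[i][j]: best log-score aligning first i audio secs with first j lyric secs
    dp = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:  # match audio sec i with lyric sec j
                score = dp[i][j] + math.log(match_prob(audio_secs[i], lyric_secs[j]))
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1], score)
            if i < n:            # skip an audio section (e.g. instrumental)
                dp[i + 1][j] = max(dp[i + 1][j], dp[i][j] + skip_penalty)
            if j < m:            # skip a lyric section
                dp[i][j + 1] = max(dp[i][j + 1], dp[i][j] + skip_penalty)
    return dp[n][m]  # backtracking over dp recovers the actual alignment
```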

For the audio, I’ve already detailed MSAF and its precision in splitting the audio into sections. For the lyrics, Genius fortunately has a formatting policy of using [brackets] to head each section with “Chorus,” “Verse,” and so on, so I didn’t have to do any processing on that end myself. Some problems I ran into with this, though, and which I have yet to address: many less popular songs don’t have these headers since not many people care, and even when headers are present, they might be labelled with odd things like “Breakdown” in Daniel Powter’s “Bad Day.” Eventually, I’ll have to develop a more robust parser for Genius’ lyrics.
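Pulling those headers out is simple string matching for now; a minimal sketch (the regex and helper name here are illustrative, not the final parser):

```python
# Minimal sketch of pulling section headers like "[Chorus]" or "[Verse 2]"
# out of Genius-formatted lyrics. A real parser needs to handle missing or
# unusual headers ("Breakdown", "Refrain", ...).
import re

HEADER_RE = re.compile(r"^\[([^\]\n]+)\]\s*$", re.MULTILINE)

def split_lyric_sections(lyrics):
    """Return a list of (header, text) pairs for each bracketed section."""
    parts = HEADER_RE.split(lyrics)
    # parts looks like: [preamble, header1, text1, header2, text2, ...]
    sections = []
    for header, text in zip(parts[1::2], parts[2::2]):
        sections.append((header.strip(), text.strip()))
    return sections
```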

After figuring out segmentation, the rest of the algorithm was relatively simple to implement. The most difficult part was deciding which MSAF label to designate as the “chorus.” I decided to go with a twofold metric. First, for each label, I generate a syllables-per-second density for each audio section carrying that label, using the syllable timestamps from SND. Sections with a density < 0.7 are marked as “instrumental” instead and not considered in the DP alignment process. From these densities I calculate their standard deviation, and thus how similar the sections within a particular label are to each other on a lyrical level. Second, I normalize the syllable counts both for each section given by Genius and for each audio section under a particular label, and see which label has a normalized syllable count closest to the normalized syllable count in the lyrics. I can then combine both metrics into a Euclidean distance measure; the label with the smallest distance is designated the chorus.
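In code, the chorus-picking metric looks roughly like the sketch below. The 0.7 syllables-per-second cutoff matches the description above; the data layout and helper name are schematic rather than the repository’s actual implementation.

```python
# Sketch of choosing which MSAF label is the "chorus". For each label we
# combine (1) the standard deviation of syllables-per-second densities
# across its sections and (2) the gap between its normalized syllable
# counts and the normalized chorus syllable count from the lyrics.
import math
import statistics

def pick_chorus_label(sections_by_label, lyric_chorus_norm_count):
    """sections_by_label: {label: [(duration_sec, syllable_count), ...]}"""
    best_label, best_dist = None, math.inf
    for label, secs in sections_by_label.items():
        densities = [count / dur for dur, count in secs]
        # Sections with < 0.7 syllables/sec are treated as instrumental.
        vocal = [(d, cnt) for d, (_, cnt) in zip(densities, secs) if d >= 0.7]
        if not vocal:
            continue
        density_std = statistics.pstdev([d for d, _ in vocal])
        total = sum(cnt for _, cnt in vocal)
        norm_counts = [cnt / total for _, cnt in vocal]
        count_gap = abs(statistics.mean(norm_counts) - lyric_chorus_norm_count)
        dist = math.hypot(density_std, count_gap)  # Euclidean combination
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```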

Results

Putting everything together, the basic pipeline for a song would look something like this:
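In lieu of a diagram, here is a rough sketch that ties together the pieces sketched in earlier sections; `isolate_vocals` and `align` are passed-in stand-ins for the MaD TwinNet and DP-alignment components, and the real driver lives in autosynch/align.py.

```python
# Rough sketch of the full pipeline; reuses the helper sketches from earlier
# sections. `isolate_vocals` and `align` are callables standing in for the
# MaD TwinNet and DP-alignment components.
def process_song(audio_path, lyrics_text, isolate_vocals, align):
    sections = segment_and_label(audio_path)              # MSAF: OLDA + FMC2D
    vocal_path = isolate_vocals(audio_path)               # MaD TwinNet
    syllable_times = syllable_timestamps(vocal_path)      # modified Praat SND
    lyric_sections = split_lyric_sections(lyrics_text)    # Genius [headers]
    syllable_dict = {}
    lyric_counts = [sum(count_syllables(w, syllable_dict) for w in text.split())
                    for _, text in lyric_sections]
    return align(sections, syllable_times, lyric_sections, lyric_counts)
```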

I realize that it’s a bit hard to explain exactly what’s going on, but you can check out the code at my repository, in autosynch/align.py.

For an example test case, done on the Avett Brothers’ “Head Full of Doubt/Road Full of Promise,” we can get an output that looks like this:

Song: Head Full of Doubt, Road Full of Promise
Artist: The Avett Brothers
Genre: folk
Segmentation/labeling: olda/fmc2d
Section boundaries (sec):
[0.0 - 0.09287981859410431]
Instrumental
[0.09287981859410431 - 14.675011337868481]
Instrumental
[14.675011337868481 - 35.80517006802721]
Verse 1
[35.80517006802721 - 63.85487528344671]
Verse 2
[63.85487528344671 - 95.66621315192744]
Chorus
[95.66621315192744 - 130.21750566893425]
Verse 3
[130.21750566893425 - 159.79972789115646]
Chorus
[159.79972789115646 - 181.44072562358278]
Instrumental
[181.44072562358278 - 195.2798185941043]
Instrumental
[195.2798185941043 - 201.73496598639457]
Instrumental
[201.73496598639457 - 236.9364172335601]
Chorus
[236.9364172335601 - 248.77859410430838]
Instrumental
[248.77859410430838 - 288.25251700680275]
Outro
Time taken: 168.76940298080444

Now, I haven’t fully formatted the output to look like this yet, but essentially this is the result of running the full process on the mixed audio file of this song. If you listen to the song on Spotify, you can hear that the alignment here is almost perfect, which is really exciting! The time taken, as you can see, is roughly 2 minutes and 49 seconds for the alignment process alone; adding the time taken by TwinNet, the total comes to around 3 minutes with a GPU (without one, it can take more like 4 minutes and 40 seconds). The song itself is 4 minutes and 48 seconds, which is promising: it suggests that it’s possible to perform this task in near real-time if it’s divided into smaller batches, and thus apply it to Spotify’s streaming.

Next Steps

Unfortunately, there is a long way to go. Although this song happened to do well, on some of the other songs I tested, like Ariana Grande’s “God is a woman” or ScHoolboy Q’s “Man of the Year,” the results are confusing. The unsatisfactory results are largely a combination of TwinNet not separating out vocals cleanly, SND overcounting syllables, mistakes in segmentation and labelling, and poor formatting on Genius’ part. These things can all be improved, however, and one of my major goals for the final phase is to optimize all the components of my current segmentation-based alignment process. Once this process becomes more trustworthy, I can start moving on to line-by-line alignment, which seems very doable. My initial thoughts are to use SND’s results to roughly split each section into lines and then use phoneme analysis to line things up more finely. We’ll see how that goes!

Relevant Links

autosynch repository: https://github.com/chriswang030/autosynch/tree/phase2

MaD TwinNet weights: https://zenodo.org/record/3351632
