GSoC 2019 Phase 3: Overview & Optimization

Chris Wang
Aug 24

So now we’re finally at the end of GSoC! It’s been a crazy and incredibly educational experience, and I’m both sad and lowkey relieved that it’s about to conclude. Here’s a summary of what I worked on during this last phase.

Overview

After phase 2, the biggest problem was one of optimization: how can we align the segments of the song as precisely as possible? Once that’s optimized, line alignment follows pretty easily; for now, it basically means dividing each section’s duration among its lines in proportion to each line’s syllable count (see the sketch after this list). This is a list I made at the beginning of phase 3 of the optimizations I wanted to implement eventually:

  1. Retrain TwinNet with different parameters, perhaps fewer epochs so as not to overtrain it on the training set
  2. Test all the different combinations of segmentation/labeling algorithms to find the best ones
  3. Improve instrumental categorization by making the syllable density cutoff relative to the rest of the sections
  4. Improve the distance metric to make the two axes of measurement better normalized against each other
  5. Add verse/intro prediction using syllable counts on top of chorus prediction
  6. Optimize error matrix for DP traversal
  7. Test different SND parameters, perhaps relative to song
  8. Display the alignment output in a readable format
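For concreteness, here’s a minimal sketch of that syllable-ratio line alignment. The function and the numbers are mine for illustration, not autosynch’s actual code:

```python
# Minimal sketch of syllable-ratio line alignment: each line of a section
# gets a slice of the section's duration proportional to its syllable count.
# Names and numbers here are illustrative, not autosynch's actual API.

def align_lines(section_start, section_end, lines, syllable_counts):
    duration = section_end - section_start
    total_syllables = sum(syllable_counts)
    timestamps, t = [], section_start
    for line, count in zip(lines, syllable_counts):
        line_duration = duration * count / total_syllables
        timestamps.append((t, t + line_duration, line))
        t += line_duration
    return timestamps

# e.g. a 12-second section with lines of 8, 8, and 4 syllables
print(align_lines(30.0, 42.0,
                  ["first line", "second line", "third line"],
                  [8, 8, 4]))
# [(30.0, 34.8, 'first line'), (34.8, 39.6, 'second line'), (39.6, 42.0, 'third line')]
```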

Out of this list, I accomplished #1, 2, 4, 6, and 8.

Optimization

The biggest component of optimization was vocal isolation. TwinNet was good, but it wasn’t good enough. I rented a GPU-equipped VM on Google Cloud Platform to retrain our model on a bigger and better dataset, with varying numbers of training epochs. My concern was that we had either undertrained the model or overtrained it on our dataset, so I wanted to see what worked. Getting PyTorch to work on this GPU was a huge pain, and in the end the improvement was marginal. TwinNet now does a much better job of cutting out the background instruments, but it also cuts out a lot of female voices and “oo” vowels, which is problematic. I suspect it has something to do with autotune, at the very least, but I’m not certain. I plan on training it on better datasets like iKala, which seemed to work well for some authors. I also want to try out a totally new isolation algorithm that was just published and looks promising. TwinNet is definitely the biggest source of error, so any improvement here will be a big step up.

Our other optimizations were relatively esoteric, consisting of small but influential tweaks to the parameters of the DP alignment algorithm and the syllable nuclei detection algorithm. These tweaks were far more successful than retraining TwinNet. Our evaluation metric compares the alignment output to a manually tagged alignment: for each section, it calculates the percentage of time that our predicted timing correctly overlaps with the actual timing. By that measure, these tweaks produced a jump from an average of 30% coverage to more than 47%. I was testing on a random list of 8 popular hits from the last couple of years, but when I chose songs that play to TwinNet’s strengths (male voice, fewer empty sections), coverage reached 70–80%, which is phenomenal.
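To make that metric concrete, here’s a rough sketch of the coverage calculation as I’ve described it. The names are mine, not our actual evaluation script’s:

```python
# Rough sketch of the percent-coverage metric: for each section, how much
# of its actual duration does our predicted timing overlap? Names here are
# illustrative, not from autosynch's actual evaluation code.

def overlap(pred, actual):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(pred[1], actual[1]) - max(pred[0], actual[0]))

def percent_coverage(predicted, actual):
    """predicted/actual: index-aligned lists of (start, end) per section."""
    covered = sum(overlap(p, a) for p, a in zip(predicted, actual))
    total = sum(end - start for start, end in actual)
    return 100.0 * covered / total

# e.g. a chorus predicted at 10-30s whose actual timing is 12-32s:
# 18s of the actual 20s is covered, i.e. 90% coverage
print(percent_coverage([(10.0, 30.0)], [(12.0, 32.0)]))  # 90.0
```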

Playback

The last thing I did this phase was write a simple command-line user interface for autosynch. Given an audio file of any song popular enough to have its lyrics available on Genius, autosynch will play the song and display the line of lyrics predicted to match that point in the song, as well as the name of the current section (‘chorus,’ ‘verse,’ etc.). It uses the PyAudio package to stream the sound data, and it can now take .mp3 files as input by using SoX internally to convert them to .wav files with the correct specifications. Here are two short demos of autosynch in action.

Demo on Finesse by Bruno Mars
Demo on We Are Young by Fun.
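Under the hood, the playback loop works conceptually like the sketch below. The alignment structure, the timestamps, and the SoX parameters are hypothetical stand-ins, not autosynch’s actual internals:

```python
import subprocess
import wave

import pyaudio

# Convert the .mp3 to a .wav with SoX (the rate/channels here are
# placeholders; autosynch uses whatever specifications it needs).
subprocess.run(["sox", "song.mp3", "-r", "44100", "-c", "2", "song.wav"],
               check=True)

# Hypothetical aligner output: (start_seconds, lyric_line) pairs.
alignment = [(0.0, "(instrumental)"),
             (12.3, "first line of the verse"),
             (18.7, "second line of the verse")]

wf = wave.open("song.wav", "rb")
pa = pyaudio.PyAudio()
stream = pa.open(format=pa.get_format_from_width(wf.getsampwidth()),
                 channels=wf.getnchannels(),
                 rate=wf.getframerate(),
                 output=True)

# Stream the audio chunk by chunk, printing each lyric line once
# playback passes its predicted start time.
frames_played = 0
frame_size = wf.getsampwidth() * wf.getnchannels()
data = wf.readframes(1024)
while data:
    stream.write(data)
    frames_played += len(data) // frame_size
    elapsed = frames_played / wf.getframerate()
    while alignment and elapsed >= alignment[0][0]:
        print(alignment.pop(0)[1])
    data = wf.readframes(1024)

stream.stop_stream()
stream.close()
pa.terminate()
```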

As you can see in the demos, the alignment is not perfect, but it gets pretty close. What’s mostly throwing it off right now are instrumental sections. The procedure for detecting whether a section is instrumental is still being developed, so the bit of instrumental music at the beginning of both demos initially messes up the line alignment, but after a few lines the audio and the lyrics synchronize relatively well, if I do say so myself!

What’s Next?

At the end of the day, I think I roughly achieved what I set out to do for GSoC, which was to develop a line-by-line alignment process for any arbitrary song. The project went in a very different direction than what I initially proposed, but I think that’s okay. And there is still a lot of work to do.

  1. Vocal isolation: The algorithm just has to be better. Failing on songs with female singers is a massive shortcoming, but I think improving TwinNet with smarter training is definitely doable.
  2. Instrumental sections: There needs to be a better way of detecting if a section is instrumental or not, so that autosynch doesn’t incorrectly try to shove a lyric into that time slot. Fixing this will also be a big step, since a single error at the beginning of a song can throw off the entire thing.
  3. Smarter line alignment: Currently, lines are given timestamps through a simple ratio-of-syllables calculation. We can do better by using the phonemes in the singing voice, similar to what Saurabh Shrivastava did with CCAligner for GSoC 2017.

One final thing to consider is linking this up with SwagLyrics and Spotify. Ideally we could compute all this alignment info over a short buffer so that the lyrics display as the song is being streamed from Spotify, but right now we’re still way too slow for that (alignment takes anywhere from 2 to 5 minutes per song). We could set up a database that saves each processed song’s alignment info so it doesn’t have to be recalculated when someone else plays that song later. Or we could offer autosynch as a service for playlists/queues, using the time the previous song takes to play to process the next song for alignment.

The groundwork has basically all been laid. autosynch is fully usable as is (see the GitHub README to find out how). What’s next is improving our model so it works more accurately on more songs.

Relevant Links

autosynch repo: https://github.com/SwagLyrics/autosynch (it’s changed from the last post!)

demo YouTube playlist: https://www.youtube.com/playlist?list=PLTgeK9gogQ4RM7jO_xZP_YoXD19aPV8Tb
