GSoC 2019 Phase 1: Vocal Isolation

Chris Wang
6 min read · Jun 20, 2019


The first phase of GSoC 2019 is about to wrap up at the end of this week, and I wanted to document some of the progress I’ve made as well as the process leading up to this point.

Background

The organization I’m working for is CCExtractor, an open-source tool that extracts subtitles from videos. I work on a small offshoot of the organization, the brainchild of my mentor Aadi Bajpai, called SwagLyrics for Spotify, which displays the lyrics of whatever song you’re listening to on the Spotify app by pulling them from the lyrics website Genius.

My project, which I’m currently calling autosynch, attempts to synchronize those lyrics to their location in the audio. In essence, we want to display lines of those lyrics at the same time they are being sung, and ideally, this could be done in real-time regardless of the song being played. Imagine the old Musixmatch service Spotify used to provide, except now the synchronization is done automatically instead of manually by Musixmatch users.

The basic idea I had was to use a syllable nuclei detection (SND) algorithm to roughly estimate the number of sung syllables in a chunk of music, run a hyphenation algorithm on the lyrics themselves to count the written syllables, and finally align the two.
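
For the lyrics side, the sketch below shows what a first pass at syllable counting could look like. The project doesn’t commit to a specific hyphenation library, so pyphen here is just an illustrative choice, and hyphenation points are only a rough proxy for sung syllables.

```python
# Rough sketch: count written syllables per lyric line via hyphenation.
# pyphen is one possible library choice, not necessarily the one
# autosynch will end up using.
import re
import pyphen

dic = pyphen.Pyphen(lang="en_US")

def syllables_per_line(lyrics):
    """Estimate a syllable count for each line of lyrics."""
    counts = []
    for line in lyrics.splitlines():
        words = re.findall(r"[A-Za-z']+", line)
        # Each hyphenation point adds one chunk, so chunks ~ syllables
        # (a crude approximation for English).
        counts.append(sum(len(dic.inserted(w).split("-")) for w in words))
    return counts
```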

Vocal Isolation

Most of the work I’ve done during the first phase of GSoC has been research. My original plan for this phase was to implement SND in Python. This task proved much easier than I thought: the algorithm, originally written as a script for Praat, a standalone audio analysis program, was easily translated into Python through a wrapper called Parselmouth. It took me about two days to fine-tune and complete this task.
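
Conceptually, the detector looks for intensity peaks that also fall in voiced regions. The snippet below is a rough sketch of that idea using Parselmouth; the threshold is illustrative, not the value from the original Praat script or from my tuned implementation.

```python
# Minimal sketch of syllable nuclei detection with Parselmouth: count
# intensity peaks above a threshold that occur where pitch is defined.
import numpy as np
import parselmouth
from scipy.signal import find_peaks

def count_syllable_nuclei(wav_path, db_above_median=2.0):
    snd = parselmouth.Sound(wav_path)
    intensity = snd.to_intensity()
    pitch = snd.to_pitch()

    db = intensity.values[0]                      # intensity contour (dB)
    times = intensity.xs()                        # frame times (s)
    threshold = np.median(db) + db_above_median   # illustrative threshold

    # Candidate nuclei: local intensity peaks above the threshold.
    peaks, _ = find_peaks(db, height=threshold)

    # Keep only peaks that are voiced (pitch defined at that time).
    count = 0
    for t in times[peaks]:
        f0 = pitch.get_value_at_time(t)
        if not np.isnan(f0) and f0 > 0:
            count += 1
    return count
```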

The more difficult problem, I soon discovered, was getting the inputs for SND. The raw audio file of a song would not fit the bill, as background elements like percussion, bass, and guitar would badly confuse my SND algorithm. So the problem is one of vocal isolation — how can we remove all or nearly all of the background instrumentation so that we get a relatively clean, isolated track of just the vocal, lyrical part of the music, which we can then feed into SND?

Algorithms

I had anticipated this problem before I began the project, though not how difficult it would be. I had originally planned to use an algorithm called REPET. It looked promising — it was designed for full-length tracks, faster than many alternatives, and could possibly be integrated into real-time processing.

Results were not looking so hot. I had a few test files I found on YouTube (more on finding a test set later), and none of the multiple versions of REPET seemed to isolate the vocals to the degree necessary for SND to be remotely accurate. Furthermore, REPET took more time than I would have thought — for real-time processing to be a possibility, the processing time would have to be less than the duration of the song itself, but this was not the case with REPET. My initial naivete regarding the vocal isolation process turned out to be a deep rabbit hole, and I began researching dozens of other vocal isolation algorithms online.

I won’t go into every algorithm I looked into, but this process was rife with obstacles. For one, most of the better-looking algorithms did not have open-source implementations available. I did not think this was a huge problem, but I found myself having a frustrating time translating the technical language of signal processing papers into code. Even when I could implement an algorithm, it required hyperparameters that the papers never specified, and if I chose some default values, the outputs would be horrible.

I finally found an article providing a broad overview of the vocal isolation research community. I learned which types of algorithms were good for full-length tracks (which we needed), which ones were faster, and which ones introduced more distortion or fewer audio artifacts. I also found nussl, a library implementing several of the most popular algorithms published so far.

Nussl was a godsend, although it wasn’t perfect. One major problem was deprecation: nussl was written in Python 2 and used older versions of NumPy, SciPy, and PyTorch. I spent a long time updating those functions and changing division operators from true division back to classic division (thanks, Python 3). Some algorithms seemed to break completely no matter how much I updated — the deep clustering and PROJET algorithms would hang for hours without me knowing whether they were still running or whether something was wrong. Finally, after sorting all that out, I wrote some short Python scripts, which you can find in the master branch of my repository, to run each of those algorithms on my few test files and compare results.
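
For reference, running one of these algorithms through nussl takes only a few lines. The sketch below assumes nussl’s current (1.x) API rather than the Python 2 version I started from, and uses REPET as the example; the other algorithms follow the same construct / run / make_audio_signals pattern.

```python
import nussl

# Load the mixed track and run REPET; other nussl algorithms follow
# the same pattern (construct, run, make_audio_signals).
mix = nussl.AudioSignal("mixture.wav")
repet = nussl.separation.primitive.Repet(mix)
repet.run()
background, foreground = repet.make_audio_signals()

# The foreground estimate is the (hopefully) isolated vocal track.
foreground.write_audio_to_file("vocals_estimate.wav")
```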

However, the algorithm that really impressed me was not even in the nussl library. While nussl made those algorithms easy to run, the results they gave weren’t what I was looking for either. I found what I needed in a deep learning module called MSS PyTorch, along with an enhancement called MaD TwinNet. TwinNet is the one I’ve been working with most recently. It requires a model — the authors provide a pre-trained one, but I’m currently training my own — and it’s extraordinarily fast and much more accurate than anything I had tried earlier. I tested it on Liz Nelson’s Rainfall, admittedly a lowball with clear vocals and only acoustic guitar in the background, but it far outperformed previous algorithms, and it ran 7 times faster than MSS PyTorch, taking a little over 100 seconds to process a 284-second song. After modifying the code a bit, I created a pipeline from the original mixed audio file → MaD TwinNet → isolated vocals → SND. Although these vocals still aren’t perfect, with some optimization I believe they can be enough to satisfy our needs. You can find this code and specifications in the phase1 branch of my repository.
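
Conceptually, that pipeline boils down to two steps. The sketch below is only schematic: `separate_vocals` stands in for a hypothetical wrapper around MaD TwinNet’s inference code (the repo is a research codebase rather than a library), and `count_syllable_nuclei` is the Parselmouth sketch from earlier.

```python
# Schematic of the phase-1 pipeline: mixed audio -> MaD TwinNet ->
# isolated vocals -> SND.
def estimate_sung_syllables(mix_path, separate_vocals,
                            vocals_path="vocals_estimate.wav"):
    separate_vocals(mix_path, vocals_path)     # write isolated vocals to disk
    return count_syllable_nuclei(vocals_path)  # run SND on the vocal track
```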

Datasets

I also want to talk briefly about the datasets I used. I wanted a large set with both full-length mixed tracks (since those would best simulate what we would get from Spotify) and the corresponding unmixed vocals. At first, I was using a freely downloadable dataset called the MUSDB100, which contains 100 full mixed tracks and their vocal files, among other things, but I realized that the tracks were not actually popular songs but tracks the authors had gotten from SoundCloud. While this is mostly acceptable, and this is the dataset on which the authors of MaD TwinNet had trained their model, I wanted a model trained on commercially popular songs so that we could automatically get lyric data.

Thus, I turned toward MedleyDB, a dataset compiled by professors at NYU of commercially available, licensed songs and their isolated vocals. Access to the dataset requires an approved request, however, because of the license under which these songs are made available for research purposes. Currently, I’m training the MaD TwinNet module on the first version of MedleyDB, which includes about 100 full-length tracks. I’m awaiting access to the second version, which has 55 additional, more recently released tracks. Hopefully, the model resulting from this set can produce better results for modern music.

Next Steps

The next phase of autosynch, I think, will focus primarily on optimizing these vocal isolation algorithms, implementing a textual hyphenation algorithm, and creating a rough model to synchronize the outputs of the two. Besides testing the different models we’re generating for MaD TwinNet, I also want to try the other vocal isolation algorithms that had good results according to the journal article I linked above. I’m not too sure yet how I plan to mesh the audio syllabic analysis with the textual syllabic analysis, since the data I have is so sparse and poorly documented. I might look into getting a short Musixmatch subscription so I can get manually tagged data. In any case, I haven’t thought too hard about it yet, but I’m sure CCExtractor has resources to help me with that. Here’s to a good phase 2!
