Vocal reduction and isolation are always in high demand among DJs, sound producers, and musicians of all kinds, professional and amateur alike. Where there is demand, there will be supply: the market is full of vocal removal services. However, their isolation process often takes a long time, and the results are frequently subpar. Steady advances in technology and a strong need for better solutions prompted the advent of vocal extraction software powered by artificial intelligence. Spleeter, by the French streaming platform Deezer, is one of the best-known examples of this AI wave, and for good reason.
For one, Spleeter demonstrated a significant improvement over DAW (digital audio workstation) plugins. The introduction of neural networks and AI was, without a doubt, a step forward, but Spleeter still leaves much to be desired, especially in terms of convenience. It’s hardly a user-friendly vocal isolator, since it’s a Python program that has to be installed in a certain way and launched as a command-line tool within a proper Python environment. Moreover, it always outputs 44.1kHz/16bit WAV files no matter the format of the input file. Users are basically forced to use an additional service to convert the output to the format they intended to receive from the vocal remover.
We wanted to step in and create an audio splitting service that provides both superior performance and ease of use for everybody. This is how Lalal.ai, the first user-friendly AI-powered online vocal remover, came to life. With no tedious installation steps and no third-party software involved, we’ve made sure Lalal.ai is a no-brainer to use. Open it in your browser, drag and drop an audio file, then download the separated vocal and instrumental tracks or listen to them right on the page. The audio format you feed Lalal.ai is the format you receive after processing. If you upload an MP3 file for splitting, you get the results in MP3; the same goes for FLAC and other formats.
To substantiate our statements regarding Lalal.ai’s superiority over Spleeter, we’ve run several tests that illustrate which AI does a better job at splitting audio tracks. The processes and results of these tests are described in detail in this article.
Disclaimer: The quality and precision of vocal removal don’t depend solely on the services in question, or on any audio splitting service in general. The results can have imperfections caused by poor mastering. In the mixed track, you most likely won’t notice the issues (otherwise sound producers would be fired), but in separate stems they are distinct.
Several songs were picked to be test-split by Spleeter and Lalal.ai. We decided to go only with losslessly compressed audio sources to avoid any influence of lossy formats. Audacity was used for the frequency spectrum analysis. The specifications and original frequency of the chosen tracks are given below.
The last one is purposefully an instrumental track. We were curious to see how the splitters would handle a track with no vocals. The song was also trimmed to one minute; otherwise the file would be 130MB, which is too big to be uploaded to Lalal.ai.
We also resampled this track to 44100Hz just for the sake of analysis; otherwise, the frequency spectrum would look almost empty.
Let’s take a look at the frequency spectrums of each track.
Again, we need to mention that no matter what, Spleeter always outputs 44.1kHz 16bit WAV files. Lalal.ai preserves not only the original container but also encoding parameters. If you feed 192kHz/24bit as input, you get 192kHz/24bit as output! Let’s look at what’s inside though.
For the sake of clarity, we arrange the frequency spectrums side by side and add the conclusions we could draw from what we see.
Right from these figures, we see that although Spleeter outputs at 44.1kHz, the effective frequency range is below 11kHz, which corresponds to a 22kHz sampling rate. Lalal.ai doesn’t keep up with the original sampling rate either: it resamples to 44.1kHz, and therefore the maximum frequency is 22kHz. But this is much more than Spleeter’s 11kHz.
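The link between a sampling rate and the highest frequency it can carry is the Nyquist limit: half the sampling rate. The snippet below is only a small illustration of that arithmetic for the rates discussed in this article; the function name `nyquist_hz` is our own and not part of either tool.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


# Rates mentioned in the article (the labels are descriptive only).
rates = {
    "22.05kHz (what Spleeter's output suggests)": 22050,
    "44.1kHz (Lalal.ai's internal rate)": 44100,
    "192kHz (hi-res source)": 192000,
}

for label, rate in rates.items():
    print(f"{label}: content up to {nyquist_hz(rate):.0f} Hz")
```

So a file stored at 44.1kHz can still hold nothing above 11kHz if it passed through a 22kHz stage somewhere along the way.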
Basically, the same thing: an obvious drop at 11kHz in Spleeter. This drop is subjectively perceived as ‘muffling’ and a lack of air in the output.
This comparison shows results that are very much expected at this point. Spleeter drops all the frequencies above 11kHz and Lalal.ai cuts off at 22kHz, whereas the original frequency content continues up to 35kHz.
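The cutoffs we read off the Audacity spectrums can also be estimated numerically. The following is a minimal sketch, not the procedure used for the figures: it runs a naive DFT on a short clip and reports the highest frequency whose level is within a chosen threshold of the strongest bin. The function name, the -60dB threshold, and the synthetic test tone are all assumptions made for illustration.

```python
import cmath
import math


def effective_bandwidth_hz(samples, sample_rate_hz, threshold_db=-60.0):
    """Highest frequency whose DFT magnitude is within threshold_db of the
    strongest bin. Naive O(n^2) DFT, so suitable for short clips only."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist limit
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        magnitudes.append(abs(acc))
    peak = max(magnitudes)
    if peak == 0.0:
        return 0.0  # silent clip: no bandwidth to report
    floor = peak * 10.0 ** (threshold_db / 20.0)
    highest_bin = max(k for k, m in enumerate(magnitudes) if m >= floor)
    return highest_bin * sample_rate_hz / n


# Synthetic example: a 1kHz tone plus a quieter 3kHz tone, sampled at 8kHz.
sr, n = 8000, 256
full = [math.sin(2 * math.pi * 1000 * i / sr)
        + 0.5 * math.sin(2 * math.pi * 3000 * i / sr) for i in range(n)]
low_only = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(n)]

print(effective_bandwidth_hz(full, sr))      # upper tone still present
print(effective_bandwidth_hz(low_only, sr))  # upper tone removed
```

For real tracks, an FFT from a numerical library and windowed averaging over many frames would be the practical choice; the naive loop here only keeps the example dependency-free.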
As can be clearly seen in the Figure 9 spectrums, the original frequency range is below roughly 6kHz, which makes it manageable for both Spleeter and Lalal.ai. However, you may notice a strange frequency shift in the 6–8kHz range on the Spleeter spectrum, which is nowhere to be found on the spectrums of the original and Lalal.ai’s tracks. We would say that the shift was created artificially by Spleeter. As a listener, you won’t hear a distinct shift but rather weird squeaks that are still noticeable.
Initially, we intended to ignore the vocal part of the output because the composition only features piano and is instrumental. But we eventually decided to handle it the same way as the other parts. And boy, were we in for a surprise!
The vocals stem produced by Spleeter contained a signal. The levels are low, below -78dB, but they are there. This is a pure artifact. The stem produced by Lalal.ai, on the other hand, is empty, just as it should be.
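Checking whether a stem is truly empty comes down to measuring its peak level. Here is a minimal sketch of that measurement, assuming samples are floats normalized to the range -1.0 to 1.0 (so 0dB is full scale); the function name `peak_dbfs` is hypothetical.

```python
import math


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (1.0).
    Returns -inf for a completely silent (all-zero) stem."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return float("-inf")
    return 20.0 * math.log10(peak)


silent_stem = [0.0] * 1000                  # what an empty vocals stem looks like
faint_stem = [0.0001, -0.00005, 0.00002]    # very quiet residue, still a signal

print(peak_dbfs(silent_stem))  # -inf
print(peak_dbfs(faint_stem))   # about -80 dB
```

A residue at this level is far too quiet to hear on its own, but it is still distinguishable from true silence.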
As a result of our microstudy, we can draw three main conclusions after exploring the frequency charts of the test tracks:
- Spleeter cuts all the frequencies above 11kHz, which is audible as muffling and a lack of air. This is most likely a consequence of internal resampling to 22kHz.
- Lalal.ai cuts all the frequencies above 22kHz. This is due to internal resampling to 44.1kHz.
- Both services introduce some audible artifacts to the output stems; in the case of Spleeter, however, they are also visible on the frequency spectrums.
Gains and losses
After running the test, we couldn’t help but wonder what would happen if we mixed the separated stems back and compared them to the original track. The following steps were taken:
- Mixing the stems
- Inverting the original track
- Mixing the stem mix with the inverted original track, which is the same as subtracting the original from the stem mix.
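The three steps above amount to what is often called a null test. The sketch below illustrates the idea on plain lists of samples; it assumes all tracks are time-aligned, have the same length, and use the same sampling rate. The function names are our own, and in practice the same operations can be done in an audio editor.

```python
def mix(a, b):
    """Sum two tracks sample by sample."""
    if len(a) != len(b):
        raise ValueError("tracks must have the same length")
    return [x + y for x, y in zip(a, b)]


def invert(track):
    """Flip the polarity of a track."""
    return [-s for s in track]


def null_test(vocals, instrumental, original):
    """Mix the stems, then mix in the inverted original.
    The result is (vocals + instrumental) - original."""
    stem_mix = mix(vocals, instrumental)
    return mix(stem_mix, invert(original))


# Toy data: a split that loses nothing leaves pure silence.
original = [0.75, 0.0, -0.375]
vocals = [0.5, -0.25, 0.125]
instrumental = [0.25, 0.25, -0.5]
print(null_test(vocals, instrumental, original))  # all zeros

# A split that drops part of the first sample leaves a residue there.
lossy_vocals = [0.25, -0.25, 0.125]
print(null_test(lossy_vocals, instrumental, original))  # first value is -0.25
```

Whatever remains after the subtraction is exactly what the separation added to or removed from the original.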
We’ve used the ‘Right Left Wrong’ track. The frequency spectrums are given below.
The result is both interesting and expected. Since Spleeter seems to internally resample to 22050Hz, it misses all the frequencies above 11025Hz, which is clearly visible in Figure 12. At the same time, the graph below 11kHz is absolutely flat, which means the mix is identical to the original track in this frequency band.
With the naked eye, you may notice that the Diff levels for Lalal.ai are lower, but the waveform is distributed more evenly over time. Let’s take a look at what is in the frequency spectrum.
The energy is distributed more or less evenly over the whole frequency range, but the levels are much lower: -63dB max for Lalal.ai versus -50dB for Spleeter. The difference is induced by advanced filtering techniques aiming to improve the perceived quality of separated stems.
As we already mentioned, sound production is often done in a way that eliminates some information from mixed tracks. While this information loss almost never leads to audible issues in the mixed tracks, it engenders flaws in separated stems. The flaws may be made a bit less distinct with the previously mentioned advanced filtering.
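To put the gap between the two peak levels in perspective, decibels can be converted to a plain amplitude ratio. This is just a worked bit of arithmetic on the -63dB and -50dB figures quoted above; the helper name is our own.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10.0 ** (db / 20.0)


spleeter_peak_db = -50.0
lalal_peak_db = -63.0

gap_db = spleeter_peak_db - lalal_peak_db  # 13 dB
ratio = db_to_amplitude_ratio(gap_db)

print(f"{gap_db:.0f} dB gap = {ratio:.2f}x in amplitude")
```

In other words, a 13dB gap means the peak of one difference signal is roughly 4.5 times higher in amplitude than the other.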
If you spend a bit of time exploring what Spleeter can do, you may find that it has a more advanced mode called spleeter-16kHz. It does the same job, but this mode samples at 32kHz and therefore outputs frequencies up to 16kHz. The frequency chart is below, and it holds no surprises.
Subjectively, the 16kHz model’s output sounds much better than that of the default 11kHz one and is comparable to what Lalal.ai offers.
The short-term plans of Lalal.ai include not only improving the overall split quality but also giving our users more control over how their music is processed.
We understand that what sounds good to us may not sound good to others. We will let our users decide how those advanced filters apply to their compositions and see what works best for them.