Librosa: A Python Audio Library

by: David Kaspar, Alexander Bailey, Patrick Fuller

May 28, 2019

This is by no means a complete guide to Librosa, but it may be a helpful place to get started.

Installing Librosa:

I am using Anaconda and had no trouble installing Librosa with the following command, as per the instructions in Librosa’s documentation.

conda install -c conda-forge librosa

Loading in a song:

Librosa’s load function takes the path to an audio file and returns a tuple with two items. The first item is an ‘audio time series’ (type: array) corresponding to the audio track. The second item is the sampling rate that was used to process the audio. E.g.:

import librosa
data, sr = librosa.load('/Users/patrickfuller/Downloads/Led Zeppelin - Stairway To Heaven.mp3')
print(data.shape)
>>>(21188736,)
print(sr)
>>>44100

The resulting array plotted might look like the below:

Scaled amplitude plotted over ~10 million samples.
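As a quick sanity check, the length of the array and the sampling rate together give the track's duration. The sketch below is plain Python arithmetic, not a Librosa call. The sample count and rate are the ones printed above; the helper name samples_at, and the assumption that a resampled array has roughly n * new_rate / old_rate samples, are mine.

```python
import math

n_samples = 21188736   # data.shape[0] as printed above
sr = 44100             # sampling rate as printed above

# Duration in seconds is samples divided by samples-per-second.
duration_s = n_samples / sr
print(round(duration_s, 2), "seconds")
print(round(duration_s / 60, 2), "minutes")

# Assumption: resampling keeps the duration fixed, so the sample count
# scales with the rate (rounded up to a whole sample).
def samples_at(new_sr, n=n_samples, old_sr=sr):
    return math.ceil(n * new_sr / old_sr)

n_at_22050 = samples_at(22050)
print(n_at_22050, "samples at 22050")
```

Under that assumption the same track comes out to about 10.6 million samples at 22050, which is in line with the ~10 million samples in the plot above.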

The default sampling rate used by Librosa is 22050 Hz, but you can pass in almost any sampling rate you like. Some common sampling rates can be found here. Beware: resampling may add a non-negligible amount of run time to the load function (depending on your task). I found the run time for loading Stairway, an 8 min song, to be about 16 seconds with default settings.

import time
start = time.perf_counter()  # high-resolution timer for elapsed time
test_array_default, _ = librosa.load('/Users/patrickfuller/Downloads/Led Zeppelin - Stairway To Heaven.mp3')
print(time.perf_counter()-start)
>>> 15.907617

If you want to change the sample rate of the resulting audio time series, to 11000 for example, you can set the sr parameter in librosa.load():

# New Sample Rate
start = time.perf_counter()
test_array_default, _ = librosa.load('/Users/patrickfuller/Downloads/Led Zeppelin - Stairway To Heaven.mp3', sr=11000)
print(time.perf_counter()-start)
>>> 14.952034

If you need to reduce load time, you can pass an optional parameter to the load function, ‘res_type’, short for resample type (the default when I ran this was ‘kaiser_best’). Other options include ‘scipy’ and ‘kaiser_fast’.

start = time.perf_counter()
test_array_default, _ = librosa.load('/Users/patrickfuller/Downloads/Led Zeppelin - Stairway To Heaven.mp3', res_type='kaiser_fast')
print(time.perf_counter()-start)
>>> 4.631903

When using the default sample rate, ‘kaiser_fast’ noticeably reduced the load time (from ~16 seconds down to under 5), but ‘scipy’ actually added time (85 seconds!!). When passing in the sample rate as 11000, ‘scipy’ took about 67 seconds and ‘kaiser_fast’ still came in fastest at about 4.85 seconds.

The kaiser methods come from the ‘resampy’ module, whereas ‘scipy’ uses code from the ‘scipy’ module. ‘kaiser_best’ is billed as the higher-quality option. I cannot, as of yet, back that claim scientifically.
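If you want to repeat this kind of comparison on your own files, a small timing helper keeps the measurements consistent. This is only a sketch: time_call and fake_load are names I made up, and fake_load is a stand-in so the snippet runs without an audio file. In practice you would pass librosa.load and a real path instead.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-in for librosa.load so the sketch is self-contained.
def fake_load(path, res_type="kaiser_best"):
    time.sleep(0.01)
    return [0.0] * 10, 22050

for res_type in ("kaiser_best", "kaiser_fast", "scipy"):
    (samples, rate), elapsed = time_call(fake_load, "song.mp3", res_type=res_type)
    print(f"{res_type}: {elapsed:.3f} s")
```

Load times also depend on disk caching and on whatever else the machine is doing, so it is probably worth running each setting more than once before drawing conclusions.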

There is, however, another option for the sample rate that can potentially improve both load time and quality. If you pass the sample rate as None, then instead of resampling the audio, Librosa will use the file’s native sampling rate. This method resulted in a run time of about 1.4 seconds and had noticeably higher quality playback. The track I was testing this with was stored with a sample rate of 44100, and that information is encoded in the file. You can right click on a file in Finder and go to Get Info > More Info, or go to Song Info > File Info in iTunes, to find this. If using Windows, I believe right clicking in Explorer and selecting Properties might provide this information.

Loading at the native sampling rate keeps the original sonic features in the output array instead of mashing features together. The array gets smaller with a smaller sample rate, and a longer array (more detailed audio) can add to processing time later in the workflow. For example, computing a mel spectrogram on the largest array took almost 4 seconds versus about 1.6 for the smaller resampled array. This additional time is much smaller than the difference in load time (15 seconds versus 1.4), but if you’re running multiple operations it may add up. If you’re running 100 operations on the array, its size may run up a bigger tab than the load time, and you may want to resample to a smaller array.
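To get a rough feel for why the longer array costs more downstream, you can estimate how many spectrogram frames and how much memory each version needs. The numbers below rest on assumptions of mine: a hop of 512 samples between frames with centered (padded) framing, 4-byte float samples, and a 22050 array that is exactly half the length of the 44100 one. The helper names are hypothetical.

```python
HOP_LENGTH = 512        # assumed hop between spectrogram frames
BYTES_PER_SAMPLE = 4    # assumed 32-bit float samples

def n_frames(n_samples, hop_length=HOP_LENGTH):
    # With centered (padded) framing, one frame starts every hop_length samples.
    return 1 + n_samples // hop_length

def megabytes(n_samples, bytes_per_sample=BYTES_PER_SAMPLE):
    return n_samples * bytes_per_sample / 1e6

sizes = {"native 44100": 21188736, "resampled 22050": 10594368}
for label, n in sizes.items():
    print(f"{label}: {n_frames(n)} frames, {megabytes(n):.1f} MB")
```

Twice the samples means roughly twice the frames, which fits with the mel spectrogram taking a bit more than twice as long on the native-rate array.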

Mel Spectrogram:

What is a mel spectrogram? Well, first let’s start with the mel. A mel is a number that corresponds to a pitch, similar to how a frequency describes a pitch. If we consider a note, A4 for example, its frequency is 440 Hz. If we move up an octave to A5 its frequency doubles to 880 Hz, and doubles again to 1760 Hz at A6. So that’s a jump of 440 Hz between A4 and A5 and 880 Hz between A5 and A6, but the problem is that the human ear doesn’t hear that way. The difference between two notes feels the same whether we jump from C to D or from F to G, but because pitch relates to frequency logarithmically, these intervals span different numbers of Hz. The term mel comes from the word ‘melody’, and the mel scale is intended to regularize the intervals between notes. Unfortunately there does not seem to be one uniform mel scale. The code used by Librosa is a bit cryptic and can be found here. In principle the mel is used to display pitch in a more regularized distribution.
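To make the "no one uniform mel scale" point concrete, here is a sketch of two commonly cited conversions from Hz to mels: an HTK-style formula and a Slaney-style one that is linear below 1000 Hz and logarithmic above. To my understanding Librosa's hz_to_mel can produce either via its htk flag, but treat the constants below as my own transcription rather than the library's code. The function names are mine.

```python
import math

def hz_to_mel_htk(f_hz):
    # HTK-style formula: logarithmic over the whole range.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_mel_slaney(f_hz):
    # Slaney-style: linear below 1000 Hz, logarithmic above.
    f_sp = 200.0 / 3.0                    # Hz per mel in the linear region
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp       # 15 mels at the break point
    logstep = math.log(6.4) / 27.0
    if f_hz < min_log_hz:
        return f_hz / f_sp
    return min_log_mel + math.log(f_hz / min_log_hz) / logstep

for name, f in [("A4", 440.0), ("A5", 880.0), ("A6", 1760.0)]:
    print(name, f, round(hz_to_mel_htk(f), 1), round(hz_to_mel_slaney(f), 2))
```

The two variants give very different numbers for the same note, so mel values are only comparable when they come from the same formula. With the HTK-style formula, the A5 to A6 step is less than twice the A4 to A5 step, even though it is twice as wide in Hz.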

Librosa includes a function to extract the power spectrogram (amplitude squared) for each mel band over time, as well as a function for easy display of the resulting mel spectrogram. The display module is not automatically imported with librosa and must be imported on its own, like so:

import matplotlib.pyplot as plt
import librosa.display
spec = librosa.feature.melspectrogram(y=data, sr=sr)
librosa.display.specshow(spec, y_axis='mel', x_axis='s', sr=sr)
plt.colorbar()

Well, that’s not a lot of information. The Librosa documentation includes a tutorial that covers this very issue. Librosa has a function to convert the amplitude squared to decibels.

import numpy as np
db_spec = librosa.power_to_db(spec, ref=np.max)
librosa.display.specshow(db_spec, y_axis='mel', x_axis='s', sr=sr)
plt.colorbar()
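The reason the decibel version shows so much more detail is that it is a logarithmic rescaling: every factor of 10 in power becomes a step of 10 dB, so quiet parts are no longer squashed next to zero. Below is a minimal pure-Python sketch of that idea. It is not Librosa's implementation; the function name is mine, and the amin floor and the 80 dB range limit are values I am assuming for illustration.

```python
import math

def power_to_db_sketch(power, ref=None, amin=1e-10, top_db=80.0):
    """Convert a list of power values to decibels relative to ref."""
    if ref is None:
        ref = max(power)                      # plays the role of ref=np.max
    # amin guards against taking the log of zero.
    db = [10.0 * math.log10(max(p, amin) / max(ref, amin)) for p in power]
    floor = max(db) - top_db                  # keep at most top_db of range
    return [max(d, floor) for d in db]

power = [1.0, 0.1, 0.001, 1e-12]
db_values = power_to_db_sketch(power)
print([round(d, 1) for d in db_values])
```

With the maximum as the reference, the loudest value sits at 0 dB and everything else is negative, which is why the colorbar on the decibel plot tops out at 0.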

Harmonic Percussive Separation:

Librosa can also separate the initial audio series into harmonic and percussive components.

data_h, data_p = librosa.effects.hpss(data)
spec_h = librosa.feature.melspectrogram(y=data_h, sr=sr)
spec_p = librosa.feature.melspectrogram(y=data_p, sr=sr)
db_spec_h = librosa.power_to_db(spec_h, ref=np.max)
db_spec_p = librosa.power_to_db(spec_p, ref=np.max)

And when plotted we get:

Around 140 seconds into the song the intensities for both harmonic and percussive low frequencies increase. For the harmonic this is likely the bass guitar, and for percussive this would be the kick drum. Now that the percussive features are separated out we can extract which pitches are present as notes from the harmonic features.

chroma = librosa.feature.chroma_cqt(y=data_h, sr=sr)
plt.figure(figsize=(18,5))
librosa.display.specshow(chroma, sr=sr, x_axis='time', y_axis='chroma', vmin=0, vmax=1)
plt.title('Chromagram')
plt.colorbar()
plt.figure(figsize=(20,8))
plt.title('Stairway To Heaven: Chroma Spectrogram')
librosa.display.specshow(chroma, sr=sr, x_axis='s', y_axis='chroma')
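For reference, the twelve rows of a chromagram correspond to the twelve pitch classes (C, C#, D and so on), with every octave of a note folded into the same row. The sketch below shows that folding for single frequencies using the standard MIDI note formula. It is not how chroma_cqt computes its output, and the function name is mine.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_pitch_class(f_hz, a4_hz=440.0):
    # MIDI note number: A4 = 69, twelve semitones per doubling of frequency.
    midi = 69 + 12 * math.log2(f_hz / a4_hz)
    # Octaves are 12 semitones apart, so modulo 12 folds them together.
    return PITCH_CLASSES[round(midi) % 12]

for f in (220.0, 440.0, 880.0, 261.63, 329.63):
    print(f, hz_to_pitch_class(f))
```

So 220 Hz, 440 Hz and 880 Hz all land in the A row, which is why a bass line and a guitar playing the same note in different octaves light up the same band.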

It looks like this song has a lot of A. It seems likely that this is the key of the song (any guitar players out there?). It’s a bit too much information in one graph though. Let’s take a look at just the first thirty seconds.

first_thirty_seconds = librosa.time_to_samples(30, sr=sr)
intro = data[:first_thirty_seconds]
intro_harm = librosa.effects.harmonic(intro)
intro_chroma = librosa.feature.chroma_cqt(y=intro_harm, sr=sr)
plt.figure(figsize=(20,8))
plt.title('Stairway To Heaven: Chroma Spectrogram')
librosa.display.specshow(intro_chroma, sr=sr, x_axis='s', y_axis='chroma')
plt.colorbar()

How fast is this song?

Last but not least, Librosa’s beat.tempo function will estimate the tempo of the audio sample in beats per minute.

print(librosa.beat.tempo(y=data, sr=sr))
>>>[139.67483108]

This is just an estimate, as this is a long song with several different parts whose tempos speed up and slow down, but it is potentially interesting for comparing songs.
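To see why a single number can hide a lot in a song with several sections, here is a toy sketch that turns beat timestamps into beats per minute using the median gap between beats. This is a simple stand-in of my own, not the method Librosa uses, and the beat times are made up for illustration.

```python
from statistics import median

def tempo_from_beats(beat_times_s):
    """Estimate BPM from beat timestamps via the median inter-beat interval."""
    intervals = [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]
    return 60.0 / median(intervals)

# Hypothetical beat times: a slow section (0.75 s apart), then a fast one (0.4 s apart).
slow = [i * 0.75 for i in range(9)]
fast = [slow[-1] + (i + 1) * 0.4 for i in range(20)]
beats = slow + fast

print(round(tempo_from_beats(slow)), "BPM in the slow section")
print(round(tempo_from_beats(fast)), "BPM in the fast section")
print(round(tempo_from_beats(beats)), "BPM for the whole track")
```

The whole-track figure simply follows whichever section contributes the most beats, so the slow section disappears from the summary entirely.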

These are only some of the many operations that Librosa can perform, but hopefully they provide a jumping-off point for exploring the rest of them and/or other musical analyses as well!

References: