In Emotion Recognition, the voice is the second most important source of affective data after the face. It can be characterized by several parameters. Pitch is one of the main ones, although in acoustics the proper name for this parameter is the fundamental frequency.
The fundamental frequency is directly related to what we call intonation, and intonation in turn carries much of the expressive character of the voice.
However, estimating the fundamental frequency is not a trivial task, and it has its interesting ins and outs. In this article we discuss the main features of the estimation algorithms and compare existing solutions on audio recordings.
To begin with, let’s recall what the fundamental frequency is and which tasks need it. The fundamental frequency, also referred to as F0, is the frequency at which the vocal folds vibrate when voiced sounds are pronounced. When unvoiced sounds are pronounced, for example when whispering or producing hissing and whistling sounds, the vocal folds do not vibrate, so this characteristic is not defined.
*Note that the split into voiced and unvoiced sounds is not equivalent to the split into vowels and consonants.
F0 varies widely, not only between people (for lower male voices it is typically 70–200 Hz, while for women it can reach 400 Hz), but also within one speaker, especially in emotional speech.
F0 is used in a wide range of solutions:
- Emotion Recognition
- Sex determination (male/female voices)
- Speaker diarization, or splitting the speech into phrases
- In healthcare, detection of pathological characteristics of the voice (for example, using the acoustic parameters Jitter and Shimmer): F0 can be used to detect signs of Parkinson’s disease; Jitter and Shimmer can also be used for Emotion Recognition
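Jitter and Shimmer, mentioned in the last item, are both derived from the cycle-by-cycle output of an F0 estimator. As a minimal sketch, assuming the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean), they could be computed as follows. The function names and the sample values are illustrative only, not taken from any particular toolkit.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive periods, divided by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same ratio, computed on the peak amplitude of each cycle."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical cycle lengths (seconds) and peak amplitudes of five consecutive cycles
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80]
print(f"jitter:  {100 * local_jitter(periods):.2f} %")
print(f"shimmer: {100 * local_shimmer(amplitudes):.2f} %")
```

Both measures depend on how accurately the individual periods are located, which is one more reason why the quality of F0 estimation matters.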
However, there are a number of difficulties in estimating F0. For example, it is often confused with its harmonics, which leads to so-called pitch doubling or pitch halving. And in a poor-quality recording F0 can be hard to estimate because the desired peak at low frequencies almost disappears.
By the way, do you remember the Laurel and Yanny auditory illusion? People hear different words in the same recording because they perceive F0 differently, and this perception is influenced by many factors: the listener’s age, how tired the listener is, and the quality of the audio system. On systems with high-quality low-frequency reproduction you’ll hear Laurel, and on systems that reproduce low frequencies poorly, Yanny. The transition can be noticed on a single device as well, for example, here. In this article the listener of the audio track is a neural network (instead of a human); another article explains the Yanny/Laurel phenomenon from the speech production perspective.
Since a detailed analysis of F0 estimation would be quite extensive, this article is meant as a brief overview of the topic that will be useful for the reader’s further investigation.
F0 estimation methods
The methods of F0 estimation can be divided into three categories: time-domain methods, based on the temporal dynamics of the signal; frequency-domain methods, based on its frequency structure; and hybrid methods. This article provides a detailed account of the methods.
All of the discussed algorithms consist of 3 main stages:
1. Pre-processing (signal filtering, splitting it into frames)
2. Searching for possible values for F0 (candidates)
3. Tracking: choosing the most probable F0 trajectory (this matters because at every moment several candidates compete with each other)
First, the basics. Before time-domain methods are applied, the signal is filtered to keep only the low frequencies, and thresholds are set for the minimum and maximum frequencies, for example, from 75 to 500 Hz. F0 is estimated only for harmonic (voiced) speech: sections with pauses or noise are not only pointless to analyze, they can also distort the neighboring frames and cause errors when interpolation or smoothing is applied. The frame length is chosen so that it comprises at least three periods of the lowest expected F0.
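The pre-processing described above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the brick-wall FFT filter stands in for whatever low-pass filter a real tracker uses, and the function names, the 10 ms hop and the synthetic test signal are hypothetical.

```python
import numpy as np

def lowpass_fft(signal, sr, cutoff_hz):
    """Crude brick-wall low-pass: zero every FFT bin above cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def split_into_frames(signal, sr, f0_min=75.0, hop_ms=10.0, periods_per_frame=3):
    """Cut the signal into overlapping frames that each span several periods of f0_min."""
    frame_len = int(np.ceil(periods_per_frame * sr / f0_min))
    hop = int(round(sr * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

sr = 16000
t = np.arange(sr) / sr
# Synthetic test signal: a 120 Hz tone plus a 3 kHz component that the filter removes
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
frames = split_into_frames(lowpass_fft(x, sr, cutoff_hz=500.0), sr)
print(frames.shape)  # (number of frames, samples per frame)
```

With a 75 Hz floor, three periods correspond to a 40 ms frame, which is why frame lengths of this order appear in most pitch trackers.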
Autocorrelation is the basic algorithm, which gave rise to a whole family of algorithms. The approach is simple: you compute the autocorrelation function and take its first maximum after the zero lag, which reflects the most salient periodic component in the signal. So what are the obstacles to using autocorrelation, and why doesn’t the first maximum always correspond to the desired frequency? Even in near-perfect conditions and high-quality recordings, the algorithm is prone to errors because of the complex structure of the signal. In conditions close to real ones, the number of errors increases drastically. Moreover, in noisy recordings of poor quality the desired peak may be absent altogether.
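Here is a minimal sketch of the autocorrelation approach for a single frame. For simplicity it takes the highest peak inside the allowed lag range rather than strictly the first maximum, and it has no voicing decision; the function name and the test tone are our own.

```python
import numpy as np

def f0_autocorr(frame, sr, f0_min=75.0, f0_max=500.0):
    """Estimate F0 of one frame from the highest autocorrelation peak in the allowed lag range."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / f0_max)
    lag_max = int(sr / f0_min)
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return sr / best_lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr  # one 40 ms frame
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
print(f0_autocorr(frame, sr))
```

On a clean synthetic tone this works well; the errors discussed above appear once the peak at the true period is no longer the strongest one in the search range.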
Despite the errors, the method is convenient and appealing because of its basic simplicity and logic. It is used in many algorithms, including YIN. The name itself refers to the balance between the convenience and the inaccuracy of autocorrelation: “The name YIN from ‘‘yin’’ and ‘‘yang’’ of oriental philosophy alludes to the interplay between autocorrelation and cancellation that it involves.”
The creators of YIN tried to fix these problems. The first thing they changed was to use the Cumulative Mean Normalized Difference function, which is meant to lower the sensitivity to amplitude modulation and make the peaks more apparent.
YIN also tries to avoid the errors that arise when the period is not an exact multiple of the sampling interval. For this purpose, parabolic interpolation is applied to approximate the minimum. At the last step, the Best Local Estimate function is used to avoid rapid fluctuation of the values. (Whether that is good or bad is hard to tell.)
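To make these steps more concrete, below is a simplified single-frame sketch in the spirit of YIN: the difference function, its cumulative mean normalized version, an absolute threshold and parabolic interpolation. The Best Local Estimate step is left out, the threshold of 0.1 and the function name are our assumptions, and this is not the reference implementation.

```python
import numpy as np

def yin_f0(frame, sr, f0_min=75.0, f0_max=500.0, threshold=0.1):
    """Simplified YIN for one frame (no Best Local Estimate step)."""
    tau_max = int(sr / f0_min)
    tau_min = int(sr / f0_max)
    w = len(frame) - tau_max  # integration window length
    # Difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    d = np.array([np.sum((frame[:w] - frame[tau:tau + w]) ** 2) for tau in range(tau_max + 1)])
    # Cumulative mean normalized difference, with the value at lag 0 set to 1
    cmndf = np.ones_like(d)
    cmndf[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(np.cumsum(d[1:]), 1e-12)
    # First lag that dips below the threshold, otherwise the global minimum
    below = np.where(cmndf[tau_min:] < threshold)[0]
    tau = tau_min + below[0] if len(below) else tau_min + np.argmin(cmndf[tau_min:])
    while tau + 1 <= tau_max and cmndf[tau + 1] < cmndf[tau]:
        tau += 1  # walk down to the bottom of the dip
    # Parabolic interpolation around the minimum for sub-sample precision
    if 0 < tau < tau_max:
        a, b, c = cmndf[tau - 1], cmndf[tau], cmndf[tau + 1]
        denom = a - 2 * b + c
        if denom != 0:
            tau = tau + 0.5 * (a - c) / denom
    return sr / tau

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 210 * t)  # the period is about 76.19 samples, not an integer
print(round(yin_f0(frame, sr), 1))
```

The test tone is chosen so that its period falls between two samples, which is exactly the case the interpolation step is meant to handle.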
When we turn to the frequency domain, the most prominent feature is the harmonic structure of the signal, in other words, spectral peaks at frequencies that are multiples of F0. You can turn this periodic pattern into a single clear peak with the help of cepstrum analysis. The cepstrum is the Fourier transform (FFT) of the logarithm of the estimated power spectrum. The cepstral peak corresponds to the most periodic component of the spectrum. (Read about it here and there)
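A minimal cepstral estimator for one frame might look like the sketch below. We use the inverse FFT of the log power spectrum, which for a real, symmetric spectrum differs from the forward transform only by a scale factor. The Hann window, the small constant inside the logarithm and the pulse-train test signal are our own choices.

```python
import numpy as np

def f0_cepstrum(frame, sr, f0_min=75.0, f0_max=500.0):
    """Estimate F0 from the strongest cepstral peak in the allowed quefrency range."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    cepstrum = np.fft.irfft(np.log(power + 1e-12))  # small constant avoids log(0)
    q_min = int(sr / f0_max)
    q_max = int(sr / f0_min)
    peak = q_min + np.argmax(cepstrum[q_min:q_max + 1])
    return sr / peak

sr = 16000
frame = np.zeros(1024)
frame[::80] = 1.0  # pulse train with a period of 80 samples, rich in harmonics
print(f0_cepstrum(frame, sr))
```

The method relies on many harmonics being present, so it is a poor fit for nearly sinusoidal signals, where the spectrum has no periodic pattern to pick up.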
Hybrid F0 estimation methods
The next algorithm has quite a telling name: YAAPT, Yet Another Algorithm for Pitch Tracking. It is classified as hybrid because it uses both temporal and frequency information. You can find the full description in this article; here we cover only the main stages of the method.
YAAPT comprises several stages, starting with pre-processing. At this stage the values of the initial signal are squared to get a second version of the signal. This step pursues the same goal as the Cumulative Mean Normalized Difference function in YIN: amplifying and restoring the “jammed” peaks of autocorrelation. Both versions of the signal are filtered, usually to a band of 50–1500 Hz or 50–900 Hz.
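Why squaring helps can be seen on a toy example. In the sketch below (our own illustration, not part of YAAPT's code) the signal contains only harmonics 2 to 4 of a 150 Hz voice, as if the fundamental had been cut off by a telephone channel. Squaring produces difference frequencies between neighboring harmonics, so a component at 150 Hz reappears.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr  # one second of signal, so FFT bins are 1 Hz apart
# Harmonics 2, 3 and 4 of 150 Hz; the fundamental itself is missing
x = sum(np.sin(2 * np.pi * 150 * k * t) for k in (2, 3, 4))
x_squared = x ** 2

def magnitude_at(signal, freq_hz):
    """Normalized spectral magnitude at an integer frequency (valid for 1 Hz bins)."""
    return np.abs(np.fft.rfft(signal))[int(freq_hz)] / len(signal)

print("original at 150 Hz:", round(magnitude_at(x, 150), 3))
print("squared  at 150 Hz:", round(magnitude_at(x_squared, 150), 3))
```

The same product terms also create components at other sums and differences of the harmonics, which is one reason both versions of the signal are band-pass filtered afterwards.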
Then the basic F0 trajectory is calculated from the spectrum of the transformed signal. The F0 candidates are determined with the Spectral Harmonics Correlation (SHC) function.
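The idea behind SHC can be sketched as follows: for every candidate frequency, the magnitude spectrum is multiplied with itself at the candidate's harmonics, and the products are summed over a small frequency window. This is a simplified single-frame version based on our reading of [3]; the parameter values (three extra harmonics, a 40 Hz window, an 8192-point FFT) and the function name are assumptions for illustration.

```python
import numpy as np

def shc(frame, sr, f0_min=75.0, f0_max=500.0, n_harmonics=3, window_hz=40.0, n_fft=8192):
    """Simplified spectral harmonics correlation for one frame.

    For each candidate f, multiply the spectrum at f, 2f, ..., (n_harmonics + 1) f
    and sum the products over a small window of frequency offsets.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    hz_per_bin = sr / n_fft
    half_win = int(round(window_hz / 2 / hz_per_bin))
    offsets = np.arange(-half_win, half_win + 1)
    candidates = np.arange(int(f0_min / hz_per_bin), int(f0_max / hz_per_bin) + 1)
    scores = np.zeros(len(candidates))
    for i, k in enumerate(candidates):
        product = np.ones(len(offsets))
        for r in range(1, n_harmonics + 2):
            product *= spectrum[r * k + offsets]
        scores[i] = product.sum()
    return candidates * hz_per_bin, scores / scores.max()

sr = 16000
t = np.arange(640) / sr
frame = sum(np.sin(2 * np.pi * 180 * k * t) for k in range(1, 6))
freqs, scores = shc(frame, sr)
print(freqs[np.argmax(scores)])
```

Because a candidate scores high only when several of its harmonics are present at once, sub-harmonics and isolated spectral peaks are suppressed.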
Voiced and unvoiced frames are also determined from the power spectrum. Then we search for the optimal trajectory, taking into account the possibility of pitch doubling/pitch halving [3, Section II, C].
Then, for both the initial and the transformed signal, F0 candidates are determined with the Normalized Cross-Correlation Function (NCCF) instead of autocorrelation.
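The difference from plain autocorrelation is the normalization by the energy of both segments, which keeps the peak height close to 1 even when the amplitude changes within the frame. A minimal sketch, with our own function name and a decaying test tone:

```python
import numpy as np

def nccf(frame, max_lag):
    """Normalized cross-correlation between the start of the frame and its lagged copies."""
    n = len(frame) - max_lag  # length of the fixed reference window
    ref = frame[:n]
    e_ref = np.sum(ref ** 2)
    out = np.zeros(max_lag + 1)
    for lag in range(max_lag + 1):
        seg = frame[lag:lag + n]
        # Divide by the energies of both segments; the constant avoids division by zero
        out[lag] = np.dot(ref, seg) / np.sqrt(e_ref * np.sum(seg ** 2) + 1e-12)
    return out

sr = 16000
t = np.arange(640) / sr
frame = np.exp(-20 * t) * np.sin(2 * np.pi * 125 * t)  # 125 Hz tone with decaying amplitude
values = nccf(frame, max_lag=sr // 75)
lag_min = sr // 500
lag = lag_min + np.argmax(values[lag_min:])
print(sr / lag, round(values[lag], 3))
```

In a full tracker every sufficiently high NCCF peak, not only the largest one, would be kept as a candidate for the next stage.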
The next stage is to evaluate all the candidates and calculate their weights. The weight of a candidate depends not only on the amplitude of its NCCF peak but also on its proximity to the F0 trajectory determined from the spectrum. In other words, the frequency-domain estimate is treated as coarse but stable [3, Section II, D].
For all of the remaining candidate pairs a transition cost matrix is calculated; these transition costs make it possible to find the best trajectory [3, Section II, E].
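The tracking step can be illustrated with a generic dynamic-programming search. The sketch below is not YAAPT's actual cost function: the local cost (1 minus the candidate's merit) and the transition cost (the size of the F0 jump in octaves) are simplified assumptions, and all names and numbers are hypothetical.

```python
import numpy as np

def track_f0(candidates, merits, jump_weight=1.0):
    """Pick one candidate per frame by dynamic programming (Viterbi-style).

    candidates, merits: arrays of shape (n_frames, n_candidates), merits in [0, 1].
    """
    candidates = np.asarray(candidates, dtype=float)
    local = 1.0 - np.asarray(merits, dtype=float)
    n_frames, n_cand = candidates.shape
    cost = local[0].copy()
    back = np.zeros((n_frames, n_cand), dtype=int)
    for t in range(1, n_frames):
        # transition[i, j]: cost of moving from candidate i at t-1 to candidate j at t
        ratio = candidates[t][None, :] / candidates[t - 1][:, None]
        transition = jump_weight * np.abs(np.log2(ratio))
        total = cost[:, None] + transition
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(n_cand)] + local[t]
    # Trace the cheapest path back from the last frame
    path = [int(np.argmin(cost))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return candidates[np.arange(n_frames), path]

# In the middle frame the pitch-doubled candidate (410 Hz) has the higher merit
candidates = [[200, 400], [205, 410], [198, 396]]
merits = [[0.9, 0.5], [0.6, 0.7], [0.9, 0.4]]
print(track_f0(candidates, merits))
```

Choosing the best candidate in each frame independently would jump to 410 Hz in the middle frame; the transition cost keeps the trajectory on the lower candidates.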
Let’s put it into practice on actual recordings. We will use Praat, quite a popular tool among speech researchers, as our starting point. Then we’ll run YIN and YAAPT in Python and compare the results.
We took several excerpts with male and female voices, both neutral and emotionally colored, and combined them together. Let’s look at the spectrogram of our signal, its intensity (orange) and F0 (blue). You can open it in Praat with Ctrl+O (Open — Read from file), and then press View & Edit.
It’s clear from the recording that the pitch is higher in both male and female voices when the speakers are emotional. Besides, the F0 of male emotional speech is close to the female F0.
Choose Analyze periodicity, then To Pitch (ac) in the menu to estimate F0 by means of autocorrelation. The settings window lets you set 3 parameters for F0 candidate estimation, as well as 6 for the path-finder algorithm that builds the most probable F0 trajectory among all of the candidates.
A full description of the algorithm can be found in the article.
The path-finder results can be seen by clicking OK and then View & Edit for the resulting Pitch file. The picture shows that apart from the chosen trajectory there were other prominent candidates with lower frequencies.
What about Python?
Let’s take two libraries that offer pitch tracking: aubio, where we use the YIN algorithm, and AMFM_decompy, with the YAAPT algorithm. Save the F0 values from Praat into a separate file (PraatPitch.txt); you can do it manually: choose the audio file, click View & Edit, select the whole file, and choose Pitch, then Pitch listing in the menu above.
Now let’s compare the results from all three algorithms (YIN, YAAPT, Praat).
import amfm_decompy.basic_tools as basic
import amfm_decompy.pYAAPT as pYAAPT
import matplotlib.pyplot as plt
import numpy as np
from aubio import source, pitch

# load audio
filename = '/home/eva/Documents/papers/habr/media/audio.wav'
signal = basic.SignalObj(filename)

# YAAPT pitches
pitchY = pYAAPT.yaapt(signal, frame_length=40, tda_frame_length=40, f0_min=75, f0_max=600)

# YIN pitches
downsample = 1
samplerate = 0  # 0 tells aubio to use the file's own sample rate
win_s = 1764 // downsample  # fft size
hop_s = 441 // downsample  # hop size
s = source(filename, samplerate, hop_s)
samplerate = s.samplerate
tolerance = 0.8
pitch_o = pitch("yin", win_s, hop_s, samplerate)
pitch_o.set_unit("midi")  # values are MIDI note numbers; use "Hz" for hertz
pitch_o.set_tolerance(tolerance)

pitchesYIN = []
confidences = []
total_frames = 0
while True:
    samples, read = s()
    f0 = pitch_o(samples)[0]  # the pitch object returns a one-element array
    f0 = int(round(f0))
    confidence = pitch_o.get_confidence()
    pitchesYIN += [f0]
    confidences += [confidence]
    total_frames += read
    if read < hop_s:  # last, incomplete block: end of file
        break

# load PRAAT pitches
praat = np.genfromtxt('/home/eva/Documents/papers/habr/PraatPitch.txt', filling_values=0)
praat = praat[:, 1]

# plot
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, sharey=True, figsize=(12, 8))
ax1.plot(np.asarray(pitchesYIN), label='YIN', color='green')
ax1.legend(loc="upper right")
ax2.plot(pitchY.samp_values, label='YAAPT', color='blue')
ax2.legend(loc="upper right")
ax3.plot(praat, label='Praat', color='red')
ax3.legend(loc="upper right")
plt.show()
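One detail of the listing above is worth keeping in mind when reading the plots: with set_unit("midi"), aubio reports MIDI note numbers rather than hertz, while YAAPT and Praat report hertz. A small helper like the one below (our own addition, with a hypothetical name) converts such values with the standard formula in which note 69 corresponds to 440 Hz. Treating 0 as "no pitch found" is an assumption on our side.

```python
import numpy as np

def midi_to_hz(midi_notes):
    """Convert MIDI note numbers to frequency in Hz (A4 = note 69 = 440 Hz)."""
    midi_notes = np.asarray(midi_notes, dtype=float)
    hz = 440.0 * 2.0 ** ((midi_notes - 69.0) / 12.0)
    return np.where(midi_notes > 0, hz, 0.0)  # keep 0 as the "no pitch" marker

print(midi_to_hz([0, 45, 57, 69]))
```

Alternatively, calling set_unit("Hz") on the aubio pitch object gives values in hertz directly.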
As we can see, with the default parameters YIN falls behind, showing a very flat trajectory with lower values relative to Praat. Moreover, it completely loses the transitions between the male and female voices, as well as between emotional and non-emotional speech.
YAAPT failed at the high pitch of female emotional speech, but all in all it showed better results. We cannot say for sure why it works better, but we can assume that it has something to do with extracting candidates from three sources and calculating their weights more accurately than YIN does.
Since almost everyone who works with sound faces the question of fundamental frequency (F0) estimation in one way or another, there are many ways to approach it. The required accuracy and the specifics of the audio material in each situation determine whether you need to tune the settings carefully or can get away with a basic solution like YAAPT. Taking Praat as the reference for speech processing (as it is widely used among researchers), we can conclude that YAAPT is more reliable and accurate than YIN, even though our test was quite difficult for it.
Written by: Eva Kazimirova, Researcher at Neurodata Lab, Speech-Processing Specialist.
1. Rusz, J., Cmejla, R., Ruzickova, H., Ruzicka, E. Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson’s disease. The Journal of the Acoustical Society of America, vol. 129, issue 1 (2011), pp. 350–367.
2. Farrús, M., Hernando, J., Ejarque, P. Jitter and Shimmer Measurements for Speaker Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2 (2007), pp. 1153–1156.
3. Zahorian, S., Hu, H. A spectral/temporal method for robust fundamental frequency tracking. The Journal of the Acoustical Society of America, vol. 123, issue 6 (2008), pp. 4559–4571.
4. De Cheveigné, A., Kawahara, H. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, vol. 111, issue 4 (2002), pp. 1917–1930.