Speech Analytics Part 1: Basics of Speech Analytics

Priya Sarkar · Published in Analytics Vidhya · 8 min read · Aug 16, 2020

So what is a sound wave?

Speech signals are sound signals, defined as pressure variations travelling through the air. These variations in pressure can be described as waves and correspondingly they are often called sound waves.

A sound wave can be described by five characteristics: wavelength, amplitude, time period, frequency, and velocity (speed).

  1. Wavelength — The minimum distance in which a sound wave repeats itself is called its wavelength; that is, it is the length of one complete wave. It is denoted by the Greek letter λ.
  2. Amplitude — When a wave passes through a medium, the particles of the medium are temporarily displaced from their original undisturbed positions. The maximum displacement of the particles from those positions as the wave passes is called the amplitude of the wave.
  3. Time Period — The time required to produce one complete wave or cycle is called the time period of the wave.
  4. Frequency — The number of complete waves or cycles produced in one second is called the frequency of the wave. The S.I. unit of frequency is the hertz (Hz).
  5. Velocity — The distance traveled by a wave in one second is called the velocity (or speed) of the wave.
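These characteristics are tied together by v = f × λ and T = 1/f. A minimal sketch in Python (the 343 m/s speed of sound in air and the 440 Hz tone are assumed, illustrative values):

```python
# Relationship between the characteristics of a sound wave:
# velocity = frequency * wavelength, and time period = 1 / frequency.

def wave_properties(frequency_hz, speed_m_s=343.0):
    """Return (wavelength in metres, time period in seconds)."""
    wavelength = speed_m_s / frequency_hz
    period = 1.0 / frequency_hz
    return wavelength, period

# A 440 Hz tone (concert pitch A) travelling through air:
wavelength, period = wave_properties(440.0)
print(f"wavelength = {wavelength:.3f} m, period = {period * 1000:.3f} ms")
```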

What is Windowing and Sampling in Sound Data?

Sampling is the process of converting an analog signal to a digital signal by selecting a certain number of samples per second from it. Can you see what we are doing here? We are converting a continuous audio signal into a discrete one through sampling, so that it can be stored and processed efficiently in memory.

The key thing to take away is that we are able to reconstruct an almost identical audio wave from the sampled signal, provided the sampling rate is high enough. The sampling rate, or sampling frequency, is defined as the number of samples selected per second.
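For illustration, here is a minimal sketch (Python with NumPy; the 8000 Hz rate and 440 Hz tone are arbitrary choices) of sampling a sine wave:

```python
import numpy as np

sample_rate = 8000                       # samples selected per second (Hz)
duration = 1.0                           # seconds of audio
n_samples = int(sample_rate * duration)

# Discrete time instants at which the "analog" wave is sampled.
t = np.arange(n_samples) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)     # the sampled (digital) signal

print(signal.shape)                      # 8000 samples for 1 second of audio
```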

Windowing is a classical method in signal processing that refers to splitting the input signal into temporal segments. The borders of the segments would then be visible as discontinuities that are incongruent with the real-world signal, which is why each segment is typically multiplied by a windowing function that tapers smoothly to zero at its edges.

A spoken sentence is a sequence of phonemes. Speech signals are thus time-variant in character. To extract information from a signal, we must therefore split the signal into sufficiently short segments, such that each segment contains only one phoneme.
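A minimal framing sketch (NumPy; the 25 ms frame and 10 ms hop at 16 kHz are common but assumed choices), where each segment is multiplied by a Hann window to smooth its borders:

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing hop_len samples each time, with a Hann window applied."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    window = np.hanning(frame_len)             # tapers the frame edges
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * hop_len
        frames[i] = x[start:start + frame_len] * window
    return frames

x = np.random.randn(16000)                     # 1 s of audio at 16 kHz
frames = frame_signal(x, frame_len=400, hop_len=160)   # 25 ms / 10 ms
print(frames.shape)                            # (frames, samples per frame)
```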

In the next step we need to extract information from this waveform, which is accomplished by converting the speech waveform to a parametric representation at a relatively lower data rate for subsequent processing and analysis. This is usually called front-end signal processing.

Feature Extraction and Visualizing a sound wave -

  1. Time Domain — Here, the audio signal is represented by the amplitude as a function of time. In simple words, it is a plot between amplitude and time.
  2. Frequency Domain — In the frequency domain, the audio signal is represented by amplitude as a function of frequency. Simply put — it is a plot between frequency and amplitude.
  3. STFT and Spectrogram -

Pitch refers to our perception of the frequency of a tonal sound. The Fourier spectrum of a signal reveals such frequency content. It maps a length-N signal x_n into a complex-valued frequency-domain representation X_k of N coefficients as

X_k = Σ_{n=0}^{N−1} x_n e^{−i2πkn/N}, k = 0, 1, …, N−1

However, since spectra are complex-valued vectors, it is difficult to visualize them as such.

By windowing and taking the discrete Fourier transform (DFT) of each window, we obtain the Short-time Fourier transform (STFT) of the signal.

A further parallel with a spectrum is that the output of the STFT is complex-valued, though where the spectrum is a vector, the STFT output is a matrix. As a consequence, we cannot directly visualize the complex-valued output. Instead, STFTs are usually visualized using their log-spectra, 20 log10 |X(h, k)|. Such two-dimensional log-spectra can then be visualized with a heat-map known as a spectrogram.
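This pipeline can be sketched with SciPy's `stft` (the 1 kHz test tone, 16 kHz rate, and 512-sample window are illustrative choices):

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs                       # 1 second of time instants
x = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone

# STFT: window the signal and take the DFT of each window.
f, times, Zxx = signal.stft(x, fs=fs, nperseg=512)

# The STFT matrix is complex-valued, so visualize its log-spectrum instead.
eps = 1e-10                                  # avoid log(0)
log_spec = 20 * np.log10(np.abs(Zxx) + eps)  # in dB

print(log_spec.shape)                        # (frequency bins, time frames)
```

Plotting `log_spec` as a heat-map (e.g. with matplotlib's `pcolormesh`) gives the spectrogram, with a bright horizontal line at 1 kHz.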

Figure: a speech segment, the magnitude of its DFT converted to power, and the log-spectrum of the segment.

Spectrogram — It's a 2D plot between time and frequency, where each point represents the amplitude of a particular frequency at a particular time as an intensity of color. In simple terms, the spectrogram shows how the spectrum of frequencies varies with time.

Spectrogram

4. Cepstrum, Mel Spectrogram and MFCC -

We now see that the log-spectrum has plenty of structure. It is a more or less continuous signal, owing in large part to the smoothing effect of windowing. It also has a periodic structure, which corresponds to the harmonic structure of the signal caused by the fundamental frequency.

Specifically, we can take the discrete Fourier transform (DFT) or the discrete cosine transform (DCT) of the log-spectrum, to obtain a representation known as the Cepstrum.

It is worth repeating that the cepstrum involves two time-frequency transforms. The cepstrum of a time signal is therefore, in some sense, similar to the time domain. The x-axis of a cepstrum is known as the quefrency axis, and it is typically expressed in seconds. A second useful piece of information in the cepstrum is the harmonic structure of the log-spectrum. Recall that the fundamental frequency is visible as a comb structure in the log-spectrum.
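As a sketch (NumPy; the 200 Hz fundamental, 30 ms segment, and 16 kHz rate are assumed values), the cepstrum of a synthetic voiced segment shows a peak at a quefrency of 1/f0 = 5 ms:

```python
import numpy as np

fs = 16000
n = 480                                      # 30 ms segment at 16 kHz
t = np.arange(n) / fs
f0 = 200                                     # fundamental frequency (Hz)

# Crude "voiced" segment: a harmonic series at multiples of f0.
seg = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

spectrum = np.fft.rfft(seg * np.hanning(n))
log_spectrum = np.log(np.abs(spectrum) + 1e-10)

# Cepstrum: a second transform, taken of the log-spectrum.
cepstrum = np.fft.irfft(log_spectrum)

# The quefrency axis is in seconds; the comb structure of the log-spectrum
# shows up as a cepstral peak near quefrency 1/f0.
quefrency = np.arange(len(cepstrum)) / fs
```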

Sound Wave -> Spectrum -> Cepstrum

Importantly, it also has a macro-level structure: by connecting the peaks of the harmonic structure, we see that the signal forms peaks and valleys, which correspond to the resonances of the vocal tract. These peaks are known as formants, and they can be used to uniquely identify all vowels.

To further improve the cepstral representation, we can weight the information according to human perception. This can be done using perceptual frequency scales such as the equivalent rectangular bandwidth (ERB) scale, the Bark scale, and the mel scale.

Mel Spectrogram — the mel scale maps physical frequency f (in Hz) to perceived pitch m (in mels) as m = 2595 log10(1 + f/700). A mel spectrogram is obtained by passing the power spectrogram through a bank of triangular filters spaced evenly on the mel scale.

MFCC — Finally, by taking the discrete cosine transform (DCT) of the log mel-filterbank energies, we obtain the representation known as mel-frequency cepstral coefficients (MFCCs). The benefit of the DCT at the end is to approximately decorrelate the signal, so that the MFCC coefficients are not correlated with each other.
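The full chain can be sketched as follows (NumPy/SciPy; the 26 filters, 512-point FFT, and 13 retained coefficients are common but assumed choices):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# One 25 ms frame at 16 kHz (random here; real speech in practice).
fs, n_fft = 16000, 512
frame = np.random.randn(400) * np.hanning(400)
power = np.abs(np.fft.rfft(frame, n_fft)) ** 2

mel_energies = mel_filterbank(26, n_fft, fs) @ power
mfcc = dct(np.log(mel_energies + 1e-10), norm='ortho')[:13]
print(mfcc.shape)
```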

Other Feature Extraction Techniques — LPC, LPCC, PLP

(You can read more about them in reference 2 below.)

What are Deltas ?

We describe speech as a sequence of phonemes. So how do we capture transitions between phonemes in a sound wave? A common method for extracting information about such transitions is to take the first difference of signal features, known as the delta of a feature. Specifically, for a feature f_k at time instant k, the corresponding delta is defined as Δf_k = f_k − f_{k−1}.
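A sketch of delta computation (NumPy; the simple first-difference form, with the first frame's delta set to zero since it has no predecessor):

```python
import numpy as np

def deltas(features):
    """First difference along time: delta_k = f_k - f_{k-1}.
    features: array of shape (n_frames, n_coeffs), e.g. MFCCs per frame."""
    d = np.zeros_like(features)
    d[1:] = features[1:] - features[:-1]   # first frame has no predecessor
    return d

mfccs = np.random.randn(100, 13)           # 100 frames of 13 MFCCs
delta = deltas(mfccs)                      # same shape as the input
delta_delta = deltas(delta)                # "acceleration" features
```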

Now, once we have extracted features from the wave, it is very important to clean them.

Transforming and Pre-Processing a Speech Wave:

  1. Filters — As the word suggests, a filter only allows frequencies above or below a cutoff frequency to pass.

Low-pass — Low-pass filters pass frequencies below their cutoff frequency and progressively attenuate frequencies above it.

High-pass — A high-pass filter does the opposite, passing high frequencies above the cutoff frequency, and progressively attenuating frequencies below the cutoff frequency.
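Both can be sketched with SciPy's Butterworth filters (the 4th order, 1 kHz cutoff, and test tones are illustrative choices):

```python
import numpy as np
from scipy import signal

fs = 16000                        # sampling rate (Hz)
cutoff = 1000                     # cutoff frequency (Hz)

# 4th-order Butterworth low-pass and high-pass filter coefficients.
b_lo, a_lo = signal.butter(4, cutoff, btype='low', fs=fs)
b_hi, a_hi = signal.butter(4, cutoff, btype='high', fs=fs)

# Test signal: one tone below the cutoff and one above it.
t = np.arange(1600) / fs          # 0.1 s of time instants
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 4000 * t)

low_passed = signal.lfilter(b_lo, a_lo, x)    # keeps the 200 Hz component
high_passed = signal.lfilter(b_hi, a_hi, x)   # keeps the 4000 Hz component
```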

2. Masking — In perception, a masking sound (such as noise) reduces or eliminates our perception of another sound. As a spectrogram transform, masking zeroes out a block of the spectrogram: applying a mask along the time axis is called time masking, and applying it along the frequency axis is called frequency masking.
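Time and frequency masking of a spectrogram can be sketched as follows (NumPy; the spectrogram is a toy array and the mask positions and widths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((128, 100))     # toy spectrogram: 128 freq bins x 100 frames

def time_mask(spec, start, width):
    """Zero out `width` consecutive time frames starting at `start`."""
    out = spec.copy()
    out[:, start:start + width] = 0.0
    return out

def freq_mask(spec, start, width):
    """Zero out `width` consecutive frequency bins starting at `start`."""
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

masked = freq_mask(time_mask(spec, start=40, width=10), start=20, width=8)
```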

3. TimeStretch — Stretches a spectrogram in time, without modifying the pitch, for a given rate.

4. Amplification / Gain — Increases or attenuates the amplitude of the whole waveform. In short, it changes the loudness of your sound wave.

5. Dither — It increases the perceived dynamic range of audio stored at a particular bit-depth.
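A sketch of TPDF (triangular) dither applied before quantization (NumPy; the 8-bit depth and quiet test tone are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(400) / fs
x = 0.01 * np.sin(2 * np.pi * 440 * t)     # a very quiet signal

bits = 8
step = 2.0 / (2 ** bits)                   # quantization step for [-1, 1)

# Triangular-PDF dither: noise of roughly +/- one step added before rounding,
# which decorrelates the quantization error from the signal.
dither = (rng.random(len(x)) - rng.random(len(x))) * step
quantized_plain = np.round(x / step) * step
quantized_dithered = np.round((x + dither) / step) * step
```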

6. Equalize — It is used for normalising the sound wave within a running window frame. It helps to remove noise from the sound.

Thus, in this blog we have learned:

  1. What the important features of a sound wave are
  2. How to visualize and extract information from a sound wave
  3. How to pre-process a sound wave

These are the fundamental steps that have to be accomplished. Once we are finished with these tasks, the wave can be used for your speech analytics project, such as:

  1. Speaker Identification
  2. Speech To Text Conversion
  3. Speech Modulation
  4. Music Recommendation

We will look into the practical aspects using the TorchAudio library in the next part.

References:

1. https://wiki.aalto.fi/display/ITSP/Deltas+and+Delta-deltas
2. https://www.intechopen.com/books/from-natural-to-artificial-intelligence-algorithms-and-applications/some-commonly-used-speech-feature-extraction-algorithms
3. https://www.analyticsvidhya.com/blog/2019/07/learn-build-first-speech-to-text-model-python/

