Introduction to Speech processing: Deep Learning (Part 1)

Achira Isuru
7 min read · Aug 2, 2022

Speech is the most natural, efficient, and preferred mode of communication between humans. Therefore, it can be assumed that people are more comfortable using speech as an input mode for various machines rather than other modes of communication such as keypads and keyboards. If a machine or computer program can recognize the words and phrases of spoken language and transform them into a machine-readable form, this is, simply put, speech recognition, a core task of audio processing.

With recent developments, speech has become a common way for humans to interact with machines, through advances such as Google’s “Google Assistant”, Apple’s “Siri”, and Amazon’s “Alexa”. These technologies have also had a huge impact on industry areas such as home automation, handheld devices, content captioning for videos, and hands-free devices in automotive systems.

Sound and Sound Waveform

Sound is simply a form of energy that travels from one point to another through a medium (e.g., air, another gas, a solid, or a liquid) as a wave. It is a mechanical wave, meaning it requires a medium for propagation.

Figure 1 (source)

In detail, each sound wave is produced by a distinct source. As an example, consider a vibrating tuning fork (the source of the sound). The fork’s tines vibrate back and forth and push on neighboring air molecules. The forward motion of a tine pushes air molecules horizontally to the right, while the backward retraction of the tine creates a low-pressure area, allowing the air particles to move back to the left.

Because of the longitudinal motion of the air molecules, there are regions in the air where the molecules are compressed together, called compressions, and other regions where the particles are spread apart, called rarefactions. The compressions are regions of high air pressure, while the rarefactions are regions of low air pressure. Figure 1 shows a sound wave created by a speaker and propagated through the air in an open tube. A sound wave consists of a repeating pattern of high-pressure and low-pressure regions moving through a medium, so it can also be called a pressure wave.

Properties of Sound Wave

To gain a better understanding of the properties of a sound wave, I plotted a simple sine wave using Python. Keep in mind that natural sound waves do not follow such a simple sine pattern; real sound waves have much more complex repeating patterns. However, if you would like to know how to create a sine wave using Python, you can try the following code.

import numpy as np
import matplotlib.pyplot as plt
start_time = 0
end_time = 1
sample_rate = 1000  # samples per second
time = np.arange(start_time, end_time, 1 / sample_rate)
theta = 0           # phase offset
frequency = 8       # cycles per second (Hz)
amplitude = 1
# equation for generating a sine wave
sinewave = amplitude * np.sin(2 * np.pi * frequency * time + theta)
plt.figure(figsize=(10, 5), dpi=80)
plt.plot(time, sinewave)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
Figure 2: Sound Wave

Wavelength

The wavelength is the horizontal distance between any two successive equivalent points on a wave; in other words, it is the horizontal length of one complete wave cycle.

Period

The period of a wave is the time required for one complete cycle of the wave to pass by a point. So, the period is the amount of time it takes for a wave to travel a distance of one wavelength.

Frequency

The frequency is the total number of waves or wave cycles produced in one second, measured in Hertz (Hz).
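
Frequency and period are two sides of the same relationship: the period is simply 1 divided by the frequency. A quick check using the 8 Hz wave from the sine-wave code above:

frequency = 8            # Hz, same value used in the sine-wave code above
period = 1 / frequency   # duration of one complete cycle, in seconds
print(period)            # 0.125 -> each cycle lasts 125 ms, so 8 cycles fit into one second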

Pitch

How the brain interprets the frequency of an emitted sound is called the pitch. We already know that the number of sound waves passing a point per second is the frequency. The faster the vibrations the emitted sound makes (or the higher the frequency), the higher the pitch.

Amplitude(loudness)

The wave height represents the amplitude of a sound wave. The amplitude of a loud sound is high, while smaller amplitudes represent softer sounds. A decibel (dB) is the scientific unit for measuring sound amplitude.
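
Decibels describe amplitude on a logarithmic scale relative to a reference level. As a small illustration (the helper name and the reference value of 1.0 are just choices for this example):

import numpy as np

def amplitude_to_db(amplitude, reference=1.0):
    # decibels express the ratio between an amplitude and a reference amplitude
    # on a logarithmic scale: 20 * log10(A / A_ref)
    return 20 * np.log10(amplitude / reference)

print(amplitude_to_db(1.0))   # 0.0 dB: same as the reference
print(amplitude_to_db(0.5))   # about -6 dB: half the reference amplitude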

Timbre

Timbre distinguishes the sound produced by a voice or different instruments such as a guitar, a saxophone, and so on, even when they are playing the same frequency at the same amplitude.

Analog to Digital Conversion

Naturally, sound is a mechanical wave, or in other words an analog signal. You may be familiar with the sounds produced by a piano, a guitar, or our own human voice, and all of these are analog signals. But machines cannot understand those analog signals directly. So how can these analog signals be converted into a machine-understandable format? By looking at the previous sine wave, you can see that both the amplitude values and the time values are continuous.

Analog to Digital conversion consists of two processes: sampling and quantization.

Sampling

Sampling converts a time-varying continuous signal into a discrete sequence of real numbers. The original analog signal has infinitely many continuous values. In sampling, we pick sample data points at regular intervals defined by a sampling period, making the wave discontinuous. The time between two successive discrete sample points is called the sampling period (T), measured in seconds. The sampling rate (sr = 1/T) is measured in Hz and can be described as the number of samples per second. The most common sampling frequencies are 8000 Hz, 16000 Hz, and 44100 Hz.

As an example, let's take an audio file that is recorded (sampled) with a sampling frequency of 44.1 kHz. In other words, while recording this file we capture 44100 amplitude values every second.

So, obviously, a higher sampling frequency means less information loss (a lower error rate) but a higher computational expense, while a lower sampling frequency loses more information (a higher error rate) but is faster and cheaper to compute.
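
To make this concrete, here is a minimal sketch (reusing the 8 Hz tone from the sine-wave code above) that samples the same one-second tone at two different rates; the lower rate keeps far fewer amplitude values to describe the same second of sound:

import numpy as np

frequency = 8      # Hz, the tone being sampled
duration = 1       # seconds of audio

def sample_tone(sample_rate):
    # one amplitude value every 1/sample_rate seconds
    t = np.arange(0, duration, 1 / sample_rate)
    return np.sin(2 * np.pi * frequency * t)

high = sample_tone(1000)    # 1000 samples describe one second of the tone
low = sample_tone(50)       # only 50 samples describe the same second
print(len(high), len(low))  # 1000 50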

Quantization

Quantization replaces each real number in the sampled data with the closest available amplitude value. In other words, quantization is the process of reducing the infinite precision of an audio sample to a finite precision defined by a particular number of bits. For example, if we quantize the sampled data points into 8 amplitude levels, the resolution or bit depth of the signal is 3 bits, because 2³ = 8.
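
Here is a rough sketch of that idea in code: it snaps each sampled value in the range [-1, 1] to the nearest of 8 evenly spaced amplitude levels, i.e., a 3-bit quantizer (the helper function and its name are just for illustration):

import numpy as np

def quantize(samples, bit_depth):
    levels = 2 ** bit_depth                        # e.g. 3 bits -> 8 amplitude levels
    # scale [-1, 1] onto the integer levels 0 .. levels-1, round to the nearest
    # level, then scale back to [-1, 1]
    idx = np.round((samples + 1) / 2 * (levels - 1))
    return idx / (levels - 1) * 2 - 1

samples = np.sin(2 * np.pi * 8 * np.arange(0, 1, 1 / 50))   # a sampled 8 Hz tone
quantized = quantize(samples, bit_depth=3)
print(np.unique(quantized).size)   # at most 8 distinct amplitude values remain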

After Sampling and Quantization

For example, on a CD (sample rate = 44100 Hz, bit depth = 16 bits), samples are taken 44100 times per second, each with a 16-bit sample depth, i.e., there are 2¹⁶ = 65536 possible values of the signal: from -32768 to 32767.
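
The value range for a given bit depth follows directly from powers of two; a quick check:

bit_depth = 16
levels = 2 ** bit_depth                          # 65536 possible sample values
print(levels, -(levels // 2), levels // 2 - 1)   # 65536 -32768 32767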

Loading an Audio File with Python

Librosa is a well-known Python library heavily used in audio processing. A sampling rate of 22050 Hz (the default sample rate Librosa uses when loading an audio file) means that every second of audio contains 22050 data points. The audio file in this example contains only 13029 amplitude values, or samples. So we can calculate the duration of the audio by dividing the number of samples by the sample rate: 13029 / 22050 ≈ 0.59 seconds.
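
A minimal loading sketch looks roughly like the following; note that "audio.wav" is just a placeholder path for whatever file you want to load, and Librosa resamples to 22050 Hz by default:

import librosa

signal, sr = librosa.load("audio.wav")   # "audio.wav" is a placeholder file name
print(sr)                                # 22050, Librosa's default sample rate
print(len(signal))                       # number of samples, e.g. 13029 for the clip above
print(len(signal) / sr)                  # duration in seconds, e.g. 13029 / 22050 ≈ 0.59
# Librosa also provides a helper for the duration:
print(librosa.get_duration(y=signal, sr=sr))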

In the domain of speech recognition, we can identify three main types of speech recognition systems according to the type of speaker mode used, as described in many research studies.

a. Speaker-independent

Speaker-independent systems are developed to recognize multiple speakers. Such systems are not trained for a particular user and are one of the most complex types of systems to design. These systems may be less accurate than other methods, but they are more adaptable and have a wider range of applications in the real world.

b. Speaker-dependent

Speaker-dependent systems are designed to recognize a single user or a group of previously trained users. These systems are simple to train and have higher accuracy than the speaker-independent mode. However, they will not be able to produce the same level of accuracy for voices outside of the user pool on which they were trained.

c. Speaker adaptive

Speaker-adaptive mode lies somewhere in between speaker-independent and speaker-dependent. These systems are trained in such a way that they can learn new speech patterns whenever a new speaker is presented.

Also, we can categorize speech recognition systems according to their capability to recognize different speech units, from isolated words to larger collections of words (for example, a speaker-dependent continuous system). Some of these speech recognizers are as follows.

a. Isolated Speech

This is the simplest of all speech recognition systems. In this case, isolated words are recognized individually. For this, a pause between each word of the utterance is necessary. Endpoints or word boundaries become easy to recognize when using this type of system.

b. Connected Speech

The connected speech recognition system is analogous to the isolated speech recognition system. This method can be used in different continuous speech applications. Only a very small pause is allowed between individual units, and the whole sentence is decoded by concatenating the individual models for each word.

c. Continuous Speech

This recognizer is built on the natural stream of speech units. It can recognize speech even in the absence of pauses or other delimiters. Continuous speech recognizers can recognize a far larger set of utterances than the other methods.

d. Spontaneous speech

This recognizer is a result of the effective development of speech and information processing. It is a challenge for the automatic speech recognition process, as it has to deal with false starts, repetitions, extra pauses, and so on. As a whole, a spontaneous speech processing system has to work with impulsive, real-time speech that was not rehearsed beforehand.
