Basics of the Human Speech Production System & Some Important Speech Features (with recommended books, YouTube channels, and online materials)

Jul 10, 2021

Human speech production system — Image source: Carynannerisly.wikispaces.com

Human speech production starts in the lungs, which push air out through the glottis (the opening between the vocal cords, also called vocal folds). The airflow then passes through the vocal tract, which runs from the glottis to the lips, and exits through the oral and nasal cavities. The positions of the articulators (lips, jaw, tongue, etc.) change the shape of the vocal tract and so shape the airflow on its way out. The vocal tract thus acts as a filter on the sound generated at the glottis, and this filtering is responsible for producing different sounds.

Different shapes of the vocal tract give rise to different resonant frequencies in our speech. These resonant frequencies are called formants. Formants can be seen as fundamental components of human speech and are extremely useful for identifying the basic sound units called phonemes. Different sounds are produced by varying the shape of the vocal tract, so the spectral properties of the speech signal change over time as the vocal tract shape changes. Every language is built from such phonemes, which are considered the building blocks of language.
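
The filter idea can be made concrete with a toy source-filter sketch. Everything in it is an illustrative assumption rather than a model of real speech: the glottal source is reduced to an impulse train, the vocal tract to a single two-pole resonator standing in for one formant, and the names `impulse_train`, `resonator` and `magnitude_at` are made up for this example.

```python
import cmath
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Crude glottal source: one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Two-pole filter with a single resonance (a stand-in for one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * centre_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y1, y2 = y, y1
    return out


def magnitude_at(signal, freq_hz, sample_rate):
    """Magnitude of the signal's spectrum evaluated at one frequency."""
    w = -2j * math.pi * freq_hz / sample_rate
    return abs(sum(x * cmath.exp(w * n) for n, x in enumerate(signal)))


SAMPLE_RATE = 8000  # Hz, chosen only for this example
source = impulse_train(100, SAMPLE_RATE, 800)  # 100 Hz "voice"
vowel_like = resonator(source, centre_hz=500, bandwidth_hz=80,
                       sample_rate=SAMPLE_RATE)

# The source has equally strong harmonics; the filter boosts those near 500 Hz.
at_formant = magnitude_at(vowel_like, 500, SAMPLE_RATE)
off_formant = magnitude_at(vowel_like, 1500, SAMPLE_RATE)
print(at_formant > off_formant)
```

Changing `centre_hz` moves the energy peak without touching the source, which is the sense in which a change of vocal tract shape produces a different sound from the same airflow.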

The time-varying spectral characteristics of the speech signal can be displayed as a two-dimensional image known as a spectrogram.

In a spectrogram, the vertical axis corresponds to frequency and the horizontal axis to time, while the brightness or colour at each point indicates the signal energy at that time and frequency.
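
As a rough illustration of how such a time-frequency grid can be computed, the sketch below slices a signal into overlapping frames, applies a Hann window, and takes the energy of each frequency bin with a naive DFT. The frame length, hop size, test signal and the function name `spectrogram` are assumptions made for this example; a practical pipeline would normally use an FFT from a numerical library instead of this slow loop.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return one list of bin energies per frame (time index first)."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]  # Hann window
    n_bins = frame_len // 2 + 1           # bins from 0 Hz up to Nyquist
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(n_bins):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames


SAMPLE_RATE = 8000  # Hz, chosen only for this example


def tone(freq_hz, n_samples):
    return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


# Test signal: 500 Hz for 0.05 s, then 2000 Hz for 0.05 s.
signal = tone(500, 400) + tone(2000, 400)
spec = spectrogram(signal)

bin_hz = SAMPLE_RATE / 64  # frequency spacing between bins for frame_len=64
peak_hz = [bin_hz * max(range(len(f)), key=f.__getitem__) for f in spec]
print(peak_hz[0], peak_hz[-1])  # strongest frequency in first and last frame
```

Each inner list is one column of the picture (one moment in time), and each position within it is one row (one frequency). Plotting the energies as brightness gives the familiar image.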

Source: https://auditoryneuroscience.com/vocalizations-speech/formants-harmonics

This figure shows spectrograms of the words “hot”, “hat”, “hit” and “head”, each spoken once with a high-pitched voice (top) and once with a lower-pitched voice (bottom). The frequency regions where the speech sounds carry a lot of energy indicate the formants.

However, one problem with plain spectrograms is that human hearing resolves differences between low frequencies much better than differences between high frequencies. A more convenient representation therefore maps the frequency axis onto the mel scale (a roughly logarithmic transformation of frequency), giving what is known as a mel-spectrogram. Beyond this, numerous other speech features are used in a variety of speech processing tasks (such as speech classification, speaker classification, speech synthesis, etc.).
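
To see what the mel mapping does to the frequency axis, here is a small sketch using one widely used formula, mel = 2595 * log10(1 + f / 700). Several variants of the mel scale exist, so treat this particular formula, the band count and the helper names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`) as assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using one common formula (other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min_hz, f_max_hz, n_bands):
    """Band edges in Hz that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 4000.0, 8)
widths = [b - a for a, b in zip(edges, edges[1:])]

# Bands are narrow at low frequencies and widen towards high frequencies.
for low, high in zip(edges, edges[1:]):
    print(f"{low:7.1f} Hz to {high:7.1f} Hz")
```

A mel-spectrogram is then typically obtained by pooling the energy of an ordinary spectrogram within bands like these (usually with overlapping triangular weights), so low frequencies keep fine detail while high frequencies are summarised more coarsely.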

Other important speech features widely used in speech processing applications are the fundamental frequency (F0), log F0, the spectral envelope, mel-frequency cepstral coefficients (MFCCs), mel-cepstral coefficients (MCEPs), perceptual linear prediction (PLP) features, aperiodicities, etc. Speech features are broadly divided into two categories: temporal or time-domain features (such as zero-crossing rate, maximum amplitude, minimum energy, etc.) and spectral or frequency-domain features, such as those listed above.
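
Two of these features are simple enough to sketch directly. The example below computes the zero-crossing rate of a frame and a basic autocorrelation-based F0 estimate. This is only one of many ways to estimate F0; the search range of 60 to 400 Hz, the synthetic test frame and the function names are assumptions for illustration, and real speech usually calls for a more robust method.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Pick the lag with the largest autocorrelation in a plausible pitch range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # Hz, chosen only for this example

# Synthetic "voiced" frame: a 200 Hz fundamental plus a weaker second harmonic.
frame = [math.sin(2.0 * math.pi * 200 * n / SAMPLE_RATE)
         + 0.5 * math.sin(2.0 * math.pi * 400 * n / SAMPLE_RATE)
         for n in range(400)]

zcr = zero_crossing_rate(frame)
f0 = estimate_f0(frame, SAMPLE_RATE)
print(zcr, f0)
```

The zero-crossing rate works directly on the waveform, which makes it a time-domain feature. The F0 estimate here also works on the waveform, although F0 is commonly read off the spectrum as well.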