Basics of the Human Speech Production System & Some Important Speech Features (with recommended books, YouTube channels, and online materials)

Sandipan Dhar
Jul 10, 2021


Human speech production system
Image source: Carynannerisly.wikispaces.com

Human speech production starts in the lungs, from which air is pushed out through the glottis (the opening between the vocal folds, also called vocal cords). The airflow then continues through the vocal tract (which starts at the glottis and ends at the lips) and exits through the oral and nasal cavities. The position of the articulators (lips, jaw, tongue, etc.) alters the shape of the vocal tract and thus shapes the air flowing out from the lungs. The vocal tract therefore works as a filter and is responsible for the production of different sounds.

Different resonant frequencies in our speech are associated with the different shapes that the vocal tract takes on. These resonant frequencies are called formants. Formants can be seen as fundamental components of human speech and are extremely useful for identifying the basic sounds called phonemes. Different sounds are formed by varying the shape of the vocal tract, so the spectral properties of the speech signal vary with time as the vocal tract shape varies. Every language consists of phonemes, which are considered the building blocks of language construction.
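The idea of the vocal tract acting as a filter with resonances can be illustrated with a very small source-filter sketch. This is only an assumption-laden toy, not a model taken from the references: the glottal source is approximated by an impulse train, each formant by a two-pole resonator, and the formant values (700, 1200, 2600 Hz) and the 100 Hz bandwidth are illustrative numbers chosen for the example.

```python
import numpy as np

def impulse_train(f0_hz, duration_s, fs):
    """Crude glottal source: one unit impulse per pitch period."""
    n = int(duration_s * fs)
    period = int(round(fs / f0_hz))
    x = np.zeros(n)
    x[::period] = 1.0
    return x

def resonator(x, center_hz, bandwidth_hz, fs):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth_hz / fs)      # pole radius from bandwidth
    theta = 2.0 * np.pi * center_hz / fs        # pole angle from center frequency
    a1 = 2.0 * r * np.cos(theta)
    a2 = -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

def synthesize_vowel(f0_hz, formants_hz, fs=16000, duration_s=0.5,
                     bandwidth_hz=100.0):
    """Pass the source through one resonator per formant (cascade)."""
    signal = impulse_train(f0_hz, duration_s, fs)
    for formant in formants_hz:
        signal = resonator(signal, formant, bandwidth_hz, fs)
    return signal / np.max(np.abs(signal))

fs = 16000
# Illustrative formant values only (assumed for this example).
vowel = synthesize_vowel(100.0, (700.0, 1200.0, 2600.0), fs=fs)

# The strongest spectral component should sit near the first formant.
spectrum = np.abs(np.fft.rfft(vowel))
freqs = np.fft.rfftfreq(len(vowel), d=1.0 / fs)
peak_hz = freqs[np.argmax(spectrum)]
print("strongest component near", peak_hz, "Hz")
```

Changing the formant frequencies while keeping the same source changes which harmonics are emphasized, which is the filter view of how different vocal tract shapes produce different sounds.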

The time-varying spectral characteristics of the speech signal can be displayed in two dimensions; this display is known as a spectrogram.

Spectrogram

In a spectrogram, the vertical dimension corresponds to frequency and the horizontal dimension to time, and the brightness of the colour at each point indicates the signal energy at that time and frequency.
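A spectrogram of this kind can be computed with a short-time Fourier transform. The following is a minimal sketch using only NumPy; the frame length (400 samples), hop size (160 samples), Hann window, and the two-tone test signal are assumptions made for the example, not settings from the figure's source.

```python
import numpy as np

def spectrogram(signal, fs, frame_len=400, hop=160):
    """Power spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs
    return power.T, freqs, times

fs = 16000
t = np.arange(fs) / fs
# Test signal: 400 Hz for the first half second, 2000 Hz afterwards.
tone = np.where(t < 0.5,
                np.sin(2 * np.pi * 400 * t),
                np.sin(2 * np.pi * 2000 * t))

S, freqs, times = spectrogram(tone, fs)
log_S = 10.0 * np.log10(S + 1e-10)  # dB scale, as usually plotted

print("shape (frequency bins, frames):", S.shape)
print("dominant frequency, first frame:", freqs[np.argmax(S[:, 0])])
print("dominant frequency, last frame:", freqs[np.argmax(S[:, -1])])
```

Plotting `log_S` with time on the horizontal axis and frequency on the vertical axis gives the familiar picture, with the bright band moving from the lower to the higher tone halfway through.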

Source: https://auditoryneuroscience.com/vocalizations-speech/formants-harmonics

This figure shows spectrograms of the words “hot”, “hat”, “hit” and “head”, each spoken once with a high-pitched voice (top) and then again with a lower-pitched voice (bottom). The regions of frequency space where the speech sounds carry a lot of energy indicate the formants.

Mel-Spectrogram

However, a problem with working directly with spectrograms is that humans perceive frequency non-linearly: we easily distinguish differences between sounds at lower frequencies but much less so at higher frequencies. A more convenient representation therefore maps the frequency axis onto the mel scale (a roughly logarithmic transformation of frequency), giving what is known as a mel-spectrogram. Beyond this, numerous other speech features are used in a variety of speech processing tasks (such as speech classification, speaker classification, speech synthesis, etc.).
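As a rough sketch of how the mel mapping works, the code below uses one commonly used conversion formula, mel = 2595 * log10(1 + f / 700); other variants of the mel scale exist, so treat this as one possible choice. The triangular filterbank, the number of bands (40), and the 1000 Hz test frame are assumptions made for the illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One common mel formula (other variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs, f_min=0.0, f_max=None):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    if f_max is None:
        f_max = fs / 2.0
    mel_points = np.linspace(float(hz_to_mel(f_min)), float(hz_to_mel(f_max)),
                             n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, hz_points[1:-1]

fs, n_fft, n_mels = 16000, 400, 40
bank, centers = mel_filterbank(n_mels, n_fft, fs)

# Apply the filterbank to the power spectrum of one windowed 1000 Hz frame.
t = np.arange(n_fft) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2
mel_energies = bank @ power

print("1000 Hz in mel:", float(hz_to_mel(1000.0)))
print("first band centers (Hz):", np.round(centers[:5], 1))
print("last band centers (Hz):", np.round(centers[-5:], 1))
print("strongest band center (Hz):", centers[np.argmax(mel_energies)])
```

The band centers are packed closely at low frequencies and spread out at high frequencies, which mirrors the finer resolution of human hearing at lower frequencies. Applying `bank` to every column of a power spectrogram gives a mel-spectrogram.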

Other important speech features that are widely used in speech processing applications include the fundamental frequency (F0), log F0, spectral envelope, Mel-frequency cepstral coefficients (MFCCs), Mel-cepstral coefficients (MCEPs), Perceptual Linear Prediction (PLP) features, aperiodicities, etc. Speech features are broadly divided into two categories: temporal or time-domain features (such as zero-crossing rate, maximum amplitude, minimum energy, etc.) and spectral or frequency-domain features, such as those already mentioned.
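A few of these features are simple enough to sketch directly. The example below computes the zero-crossing rate and short-time energy per frame, and estimates F0 with a basic autocorrelation peak search. This is a simplified illustration: the frame sizes, the 60 to 400 Hz search range, and the synthetic test signals are assumptions, and practical F0 estimators are considerably more robust.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def estimate_f0_autocorr(frame, fs, f_min=60.0, f_max=400.0):
    """Pick the autocorrelation peak inside the allowed pitch-lag range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / best_lag

fs = 16000
t = np.arange(fs) / fs
# Voiced-like test signal: 200 Hz fundamental plus one harmonic.
voiced = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
# Noise-like test signal (seeded so the example is repeatable).
noise = np.random.default_rng(0).standard_normal(fs)

voiced_frames = frame_signal(voiced, frame_len=640, hop=160)
noise_frames = frame_signal(noise, frame_len=640, hop=160)

zcr_voiced = zero_crossing_rate(voiced_frames)
zcr_noise = zero_crossing_rate(noise_frames)
energy_voiced = short_time_energy(voiced_frames)
f0 = estimate_f0_autocorr(voiced_frames[0], fs)

print("mean ZCR voiced:", zcr_voiced.mean(), "noise:", zcr_noise.mean())
print("estimated F0 (Hz):", f0)
```

The noise-like signal has a much higher zero-crossing rate than the periodic one, which is why this feature is often used as a cheap cue for telling voiced from unvoiced segments.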

Different Important Videos related to GAN: Structure Preserving GAN (https://www.youtube.com/watch?v=z9ISUhCY-6I)

( Books )

The two most important books I recommend for exploring the domain of speech processing are Digital Processing of Speech Signals by Lawrence R. Rabiner and Ronald W. Schafer, and Discrete-Time Speech Signal Processing: Principles and Practice by Thomas F. Quatieri. An additional book that I would like to recommend is Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications by Meinard Müller (https://link.springer.com/content/pdf/10.1007/978-3-540-74048-3.pdf).

https://jontallen.ece.illinois.edu/uploads/537.F18/Book/main-all.pdf

Link to the Generated Samples: https://sites.google.com/view/clotgan-vc/home

(conferences, journals and workshops)

Most of the conferences, journals, and workshops on speech processing: https://docs.google.com/document/d/1Igbbqq7ThdR_QXlUPNn6F-ynoXU3SxZ97IJiHJ7Hz4Q/edit?usp=sharing

(Research Labs on Speech Processing )

https://www.isca-speech.org/iscaweb/index.php/liaison/speech-laboratories

The other suggested materials:

. Basic Understanding of Speech: https://www.youtube.com/watch?v=2f_4kxC4cOw

. [ mDOT Center: (https://youtu.be/OWc2slRAWcU) ]

( YouTube Channels )

(Important: https://www.youtube.com/@cognitivephonetician/playlists )

  1. NPTEL Videos of Dr. S.K Das Mandal ( Faculty of IIT Kharagpur ).
  2. The Sound of AI YouTube Channel of Dr. Valerio Velardo.
  3. Deep Learning for Audio Classification YouTube channel of Seth Adams.
  4. Krishna DN YouTube channel on Digital Speech/Sound Processing Papers.
  5. UNSW eLearning YouTube channel, lectures by Professor E. Ambikairajah.

6. Listen Lab YouTube channel of the University of Minnesota.

7. The Virtual Linguistics campus YouTube channel.

8. Madhav Lab IIT-K

9. Maziar Raissi

10. SANE2019 | Hirokazu Kameoka — Voice conversion with image-to-image translation and seq2seq learning

11. Jon Nordy

12. Audio Analysis Lab

13. MD2K Center (Abeer Alwan — Voice Feature Extraction from Smartphones)

14. Chris Tralie

15. UNSW eLearning (Lecture Notes are also Available)

16. Free Engineering Lectures (On digital signal processing)

17. Chief Speech TIFR

18. Enthought YouTube Channel

19. Short-time Fourier Transform and the Spectrogram (Barry Van Veen)

20. Maziar Raissi (Deep Learning Applications and Speech synthesis )

21. NII Yamagishi Lab (TTS-based models are discussed)

22. Digital Voice and Picture Communication by Prof. S. Sengupta, Department of E and ECE, IIT Kharagpur

23. https://www.youtube.com/watch?v=JZo8tn9QK4I&list=PLPpDNS1gEzNfBjrNr04i64qDSIAGMCzXb&index=3 (YouTube Videos of Digital Speech Processing)

24. https://www.youtube.com/watch?v=uVsuZJa-TCs (YouTube video on the human speech production system)

25. Speech features intro Channel Name (To Know More About Speech)

26. Prabhjot Gosal Signal and Speech Signal https://www.youtube.com/channel/UCtPdUJVOIobBDOKDGl32Ifw/videos

27. https://www.youtube.com/@HermanKamperML

28. Aze Linguistics: https://www.youtube.com/@AzeLinguistics/playlists

29. Deep Learning for Human Language Processing https://www.youtube.com/watch?v=nER51ZyJaCQ&list=PLJV_el3uVTsO07RpBYFsXg-bN5Lu0nhdG

30. Audio and Speech Processing https://www.youtube.com/@prabhjotgosal2489/playlists

(Blogs and Articles )

8. Speech Recognition, Medium blog by Jonathan Hui.

9. Understanding mel spectrogram, Medium blog by Leland Roberts.

10. Getting to know Mel spectrogram, Medium blog by Dalya Gartzman.

11. Speech Technology: A Practical Introduction by Kishore Prahallad (former scholar at IIIT Hyderabad).

12. Using instantaneous frequency, and aperiodicities detection to estimate F0 for high-quality speech synthesis. (paper)

13. Fourier Transform: https://allsignalprocessing.com/ASP-Downloads/Foundations/ComplexSinusoids/Concepts%20Slides%20Complex%20Sinusoids.pdf
