Implementation of Speech Recognition and Analysis using the Concept of Data Science

Zoya Ahmad
8 min read · Sep 6, 2020

This article discusses speech recognition and analysis. Virtual assistants like Siri and smart home devices like Google Home have become an integral part of our lives, and the speech recognition behind them is achieved through data science, one of the most sought-after domains today. Scientific methods, systems, algorithms and processes are applied to mine knowledge and insights from structured and unstructured data. For speech analysis, an audio file is characterized by its sample rate, the number of audio samples carried per second, measured in Hertz (Hz). Pitch, amplitude, voice quality, the speaker's gender, voice modulation, the sound of syllables, phonetics and similar factors are used to recognize and differentiate one voice from another. The analysis is performed on real-time speech by employing a self-learning model and applying the theories of machine learning.

INTRODUCTION

Speech recognition is a multifaceted subfield of computational linguistics that develops technologies and methods for recognizing spoken language and converting it to text by computer. It is often referred to as computer speech recognition, automatic speech recognition (ASR) or speech-to-text (STT). It combines research and knowledge from the fields of computer science, electrical engineering and linguistics.

Speaker recognition denotes identifying who is speaking, while speech recognition and analysis denote comprehending the speech and deciphering each word. Detecting the speaker can streamline the job of translating speech in software or systems designed particularly for this purpose. Speaker recognition is commonly used to verify, authenticate and differentiate the identity of a speaker in many settings, such as during a security check.

Recent years have seen an upswing in the development of speech recognition and analysis software, and major innovations have made these systems more efficient. Advances in technology have let developers reach new heights in this field, and the introduction of the latest trends in machine learning, big data analytics and deep learning has benefited the speech analysis arena.

These state-of-the-art trends are gradually being adopted by industries working in this field, and the surge in demand for these products shows that speech recognition and analysis are here to stay.

More and more companies are adopting this technology for their benefit, and a variety of methods is being applied to design, strategize, implement and deploy speech recognition systems.

AUDIO DATA

Using the concepts of big data, huge volumes of data are collected, integrated and stored for future use. This data can be sorted or unsorted, structured or unstructured, depending on how it is collected and what it is needed for. Audio is collected in the same way: a huge number of audio files and clips are gathered and stored. The common formats in which audio is stored on machines and electronic devices are —

· mp3 (MPEG-1 Audio Layer 3) — a coding format for digital audio

· wav (Waveform Audio File) — developed by Microsoft and IBM, an audio file format standard for storing audio bitstreams on PCs; its header is easy to inspect programmatically, as sketched after this list

· WMA (Windows Media Audio) — developed by Microsoft, a series of audio codecs and their corresponding audio coding formats
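The sketch below, assuming an uncompressed PCM file named sample.wav (the filename is a placeholder), uses Python's standard wave module to read the header fields, including the sample rate discussed in the introduction.

import wave

with wave.open("sample.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()  # samples carried per second, in Hz
    n_channels = wav_file.getnchannels()   # 1 = mono, 2 = stereo
    n_frames = wav_file.getnframes()       # total number of audio frames
    duration = n_frames / sample_rate      # clip length in seconds

print(f"Sample rate: {sample_rate} Hz")
print(f"Channels: {n_channels}")
print(f"Duration: {duration:.2f} s")

A 3-second mono clip sampled at 44,100 Hz, for instance, would report 132,300 frames.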

COLLECTION OF AUDIO DATA

Voice-enabled software applications are enhanced by the collection of multilingual audio data. Audio data collection is the procedure of gathering and measuring audio data from multiple sources. For Automatic Speech Recognition (ASR) systems and virtual assistants to recognize speech, they must be exposed to enormous quantities of audio data. This collection of data can be done in the following ways —

· Speed Data Collection — Data is acquired and organized into files containing standardized noise indicators, the measurement path, a noise description and other useful variables (speed, GPS accuracy, etc.)

· Acoustic Data Collection — Data is collected in the form of internal stress waves, vibrations in structures and structure–fluid interactions involving acoustic radiation.

· Natural Language Utterance Collection — It is a collection of assorted natural language text data from a vast range of domains and user demographics

STEPS IN DATA COLLECTION

The steps involved in collecting audio data are —

1. Obtain utterances as blocks from the audio files

2. Segment the audio files by syllable

3. Arrange the segments in serial order based on syllable

4. Determine the intensity, duration and pitch values for each syllable

5. Estimate spectral power and plot it in spectrogram-like form (steps 4 and 5 are sketched in code after this list)
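As a rough illustration of steps 4 and 5, the sketch below computes a simple per-frame intensity measure (RMS energy) and a spectral-power estimate with NumPy and SciPy. It assumes a mono NumPy signal audio and its sample rate fs are already loaded (e.g. with scipy.io.wavfile.read) and that syllable boundaries are given; a real pipeline would also add a dedicated pitch estimator.

import numpy as np
from scipy.signal import spectrogram

def frame_intensity(audio, frame_len=1024, hop=512):
    """Root-mean-square energy per frame, a simple intensity measure."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]
    return np.array([np.sqrt(np.mean(f.astype(np.float64) ** 2))
                     for f in frames])

def spectral_power(audio, fs):
    """Spectral power over time, in spectrogram form."""
    freqs, times, power = spectrogram(audio, fs=fs, nperseg=1024)
    return freqs, times, power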

WORKING OF SPEECH RECOGNITION AND ANALYSIS SYSTEM

Virtual assistants like Google Assistant and Siri are built mainly on two technologies: Speech Recognition and Natural Language Processing (NLP). Together, these convert speech into a textual format and from there into the sounds, words and ideas the system can act on.

A speech recognition and analysis system first records your speech. Since interpreting sounds requires a lot of computational power, the recording is sent to the servers of the particular speech analysis service, where it is broken down into individual sounds. These are then compared against a database containing pre-stored pronunciations of numerous words.
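A minimal sketch of this record-and-send flow uses the open-source SpeechRecognition package (pip install SpeechRecognition) and Google's free Web Speech API, one service choice among many; the filename is a placeholder.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # The recording is sent to a remote server and matched against
    # stored pronunciation models; the closest word sequence comes back.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)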

The words that most closely correspond to the combination of individual sounds are returned to the device. Keywords are then identified to comprehend the task, and the corresponding functions are carried out. For example, if the system observes words like “pictures” or “photographs”, it will open the gallery.
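A toy sketch of this keyword-to-action routing follows; the keyword lists and action names are illustrative, not taken from any real assistant.

KEYWORD_ACTIONS = {
    ("pictures", "photographs", "gallery"): "open_gallery",
    ("weather", "forecast"): "show_weather",
    ("call", "dial"): "start_call",
}

def route_command(transcript):
    """Return the action whose keywords appear in the transcript."""
    words = set(transcript.lower().split())
    for keywords, action in KEYWORD_ACTIONS.items():
        if words & set(keywords):
            return action
    return "no_match"

print(route_command("show me my holiday pictures"))  # -> open_gallery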

To create efficient speech-recognition models, machine learning models are trained on transcribed datasets. These datasets are highly miscellaneous, comprising voice samples taken from an enormous group of people, so that various accents can be catered for.

Over the past few years, deep learning has produced incredible results in speech recognition and analysis, largely thanks to the availability of large datasets as well as powerful hardware on which efficient speech recognition algorithms can be trained. The word error rate of speech recognition engines has gone down considerably, to less than 10%.
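Word error rate (WER) is the standard metric here: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. A small self-contained sketch —

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("open my photo gallery", "open a photo gallery"))  # 0.25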

Intent analysis, just like speech recognition and analysis, requires a huge amount of data to train Natural Language Processing (NLP) algorithms.

Another integral technology employing machine learning is contextual understanding combined with entity extraction, in which the system picks out entities such as names, dates and places from the user's utterance.
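For illustration, a minimal entity-extraction sketch using spaCy (the library choice is an assumption, since the article names no specific NLP toolkit); it requires pip install spacy and the en_core_web_sm model to be downloaded.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Remind me to call Alice at 5 pm on Friday in Berlin")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. Alice -> PERSON, Friday -> DATE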

APPLICATIONS OF SPEECH RECOGNITION AND ANALYSIS

The feature of speech recognition and analysis is applied in various fields such as —

· Telecommunications:

◦ Automation of Operator Services

◦ Voice Recognition Call Processing (VRCP) system

◦ Automated Alternate Billing System (AABS)

◦ Automation of Directory Assistance

◦ Voice Dialing

· Car Bluetooth Systems

· Military:

◦ Training air traffic controllers

◦ Helicopters

◦ High-Performance Fighter Aircraft

· Healthcare:

◦ Use in Therapeutics

◦ Medical Documentation

· Smartphones

· Virtual Assistants like Siri, Cortana, Google Assistant

· Smart Home Devices like Google Home, Amazon Alexa

· Device Control

· Voice Transcription

DRAWBACKS OF SPEECH RECOGNITION AND ANALYSIS SYSTEMS

The following are the obstructions that one comes across in the functioning of speech recognition and analysis systems —

· Emotions — Variations in emotion can obstruct the smooth flow of speech

· Spontaneous Speech — This type of speech is difficult for speech recognition and analysis systems to comprehend

· Prosody — Disparities in tone, phonetics, etc. can also become an obstruction

· Naturalness — Spontaneously spoken language is sometimes not clear enough for speech recognition and analysis systems to catch

· Sparsely Spoken Languages — Since these languages are rarely spoken, little of their data is fed into the systems alongside the commonly used languages

· Disadvantages Related to Different Types of Systems — Different systems take different approaches to speech recognition and analysis, and sometimes a glitch occurs that hampers the process

· Ambiguities — Even today, uncertainties exist regarding the functioning of these speech recognition and analysis systems

· Speech of Older People — The speech of the elderly can be ambiguous and slurred, which is a barrier for speech recognition and analysis systems

CONCLUSION

Most speech recognition and analysis systems require training, in which text or speech is stored in the system's isolated vocabulary. The system then analyzes a person's specific voice and uses it to refine and sharpen the recognition of that person's speech, resulting in increased accuracy. Systems that use such training are referred to as speaker-dependent, while systems that do not are referred to as speaker-independent.

Speech recognition and analysis finds its use in many areas, from day-to-day tasks to demanding, specialized settings. It is used in voice dialing (through voice user interfaces), call routing, simple data entry, the preparation of structured documents, determining the characteristics of the speaker (in high-level systems), speech-to-text (STT) processing and in aircraft, where it is usually termed direct voice input (DVI).

Fields such as signal processing, supervised and unsupervised machine learning and neuroscience-based methods are constantly being explored to solve the problems that occur during speech recognition and analysis.
