Speech Recognition with Deep Learning

Technocrat · Published in CoderHack.com · Sep 15, 2023

Speech recognition is the ability of a machine or program to identify and understand human speech. It has a wide range of applications, from virtual assistants like Siri and Alexa, to transcription of audio tracks, like generating subtitles for YouTube videos. Early speech recognition systems relied on hand-crafted algorithms and acoustic models. However, with the rise of deep learning, speech recognition systems have become far more accurate. Deep neural networks are able to learn speech representations directly from data, surpassing previous state-of-the-art models.

Overview of Deep Learning Models for Speech Recognition

Convolutional Neural Networks

Convolutional neural networks (CNNs) are effective for extracting speech features from raw audio or spectrograms. They apply filters over local regions of the input, which gives them useful properties such as shift invariance in time. For speech, filters spanning roughly 20–30 ms are common, enough to capture phoneme-level patterns, and applying filters at multiple timescales builds up a hierarchy of speech features. For example, models such as Baidu's Deep Speech 2 use convolutional layers for feature extraction.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout
model = Sequential()
model.add(Conv2D(32, kernel_size=3, activation='relu'))  # e.g. applied to spectrogram inputs
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(MaxPool2D(pool_size=2))
model.add(Dropout(0.2))

Recurrent Neural Networks

Recurrent neural networks (RNNs) are well suited to modeling sequential data like speech because they maintain a memory of past inputs. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are the variants most commonly used for speech recognition, since they capture long-term dependencies and mitigate the vanishing gradient problem.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(30, 161)))
model.add(LSTM(128, return_sequences=False))
model.add(Dense(28, activation='softmax'))

Hybrid Models

Hybrid models combine the temporal modeling of RNNs with the feature extraction of CNNs. A common pattern is to have a CNN convert raw audio into feature representations, which are fed into an RNN to generate speech predictions over time.

For example, a hybrid model could have 2 CNN layers to extract speech features, max pooling for dimensionality reduction, followed by 2 LSTM layers and a softmax output layer. The CNN acts as an encoder, with the LSTM decoding the learned representations into predicted transcriptions.
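
Below is a minimal Keras sketch of such a hybrid model. The input shape (100 frames × 80 mel bins) and the 29-class character output are illustrative assumptions, not values from any specific paper.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Reshape, LSTM, Dense, TimeDistributed

# Hypothetical input: 100 time frames x 80 mel bins x 1 channel
model = Sequential([
    Conv2D(32, kernel_size=3, padding='same', activation='relu', input_shape=(100, 80, 1)),
    Conv2D(32, kernel_size=3, padding='same', activation='relu'),
    MaxPool2D(pool_size=(1, 2)),                        # halve the frequency axis only
    Reshape((100, 40 * 32)),                            # flatten (freq, channels) per time step
    LSTM(128, return_sequences=True),
    LSTM(128, return_sequences=True),
    TimeDistributed(Dense(29, activation='softmax')),   # e.g. characters plus a blank for CTC
])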

Building a Speech Recognition Model

Gathering data

Public datasets for speech recognition include TED-LIUM, VoxForge, WSJ, and LibriSpeech. Raw audio is commonly distributed as WAV files paired with text transcriptions, and deep learning models need many hours of such paired data.
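
As a quick illustration, here is one way to load an audio/transcript pair with librosa; the file names are hypothetical placeholders.

import librosa

# Hypothetical file names; the same idea applies to any audio + text pairing.
audio, sr = librosa.load('sample.wav', sr=16000)   # resample to 16 kHz mono
with open('sample.txt') as f:
    transcript = f.read().strip()
print(audio.shape, sr, transcript)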

Preprocessing

Key preprocessing steps include:

• Removing noise/silence: filter out non-speech audio.
• Segmentation: split long recordings into shorter clips, each with a single speaker and consistent background noise.
• Extracting features: compute Mel-Frequency Cepstral Coefficients (MFCCs) or other spectral features; MFCCs are the most common choice for speech recognition (see the sketch below).
• Normalization: normalize audio amplitude/volume, e.g. with z-normalization.
• Padding/windowing: pad audio to a standard length or split it into fixed-length windows for CNNs.
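
A minimal sketch of the feature-extraction and normalization steps, assuming librosa and a hypothetical input file:

import librosa

audio, sr = librosa.load('sample.wav', sr=16000)
# 13 MFCCs per frame, using 25 ms windows with a 10 ms hop at 16 kHz; shape is (n_mfcc, frames)
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
# z-normalize each coefficient across time
mfccs = (mfccs - mfccs.mean(axis=1, keepdims=True)) / (mfccs.std(axis=1, keepdims=True) + 1e-8)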

Model training

Speech recognition models are usually trained to maximize the probability of the correct transcription given the audio features. Connectionist Temporal Classification (CTC) loss is a popular choice because it handles the mismatch between the number of audio frames and the length of the transcript without requiring a frame-level alignment (a short example follows the list below). Optimization is often done with Adam or SGD. Key challenges include:

• Limited labeled data: transcribed audio is scarce relative to raw audio, so oversampling and augmenting the labeled examples helps.
• Vanishing/exploding gradients: RNNs can struggle with very long sequences; gradient clipping helps.
• Overfitting: apply regularization techniques such as dropout, especially with limited data.
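
As a rough illustration of CTC loss, here is a sketch using tf.keras.backend.ctc_batch_cost with made-up shapes (2 utterances, 50 frames, 29 output classes); the last output class is treated as the CTC blank here.

import tensorflow as tf

# Fake model output: (batch, time, classes) softmax probabilities
y_pred = tf.nn.softmax(tf.random.normal((2, 50, 29)), axis=-1)
# Padded integer label sequences and their true lengths
y_true = tf.constant([[3, 5, 7, 0], [2, 4, 0, 0]])
input_length = tf.constant([[50], [50]])   # frames fed to CTC per utterance
label_length = tf.constant([[3], [2]])     # real label lengths before padding
loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)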

Post-processing

Post-processing improves model predictions:

• Language modeling: use a language model to rescore predictions based on the probability of word sequences.
• Beam search: maintain a “beam” of the n best partial predictions at each time step, expanding the beam while pruning unlikely sequences, and return the top prediction at the end.
• CTC decoding: collapse repeated characters and remove blank predictions to give a more readable final output (see the sketch after this list).
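
A small sketch of CTC beam-search decoding with tf.keras.backend.ctc_decode, again with made-up shapes; mapping the returned class indices back to characters is left out.

import tensorflow as tf

# Fake acoustic-model output: 1 utterance, 50 frames, 29 classes (softmax)
y_pred = tf.nn.softmax(tf.random.normal((1, 50, 29)), axis=-1)
input_length = tf.constant([50])

# Beam search keeps the top beam_width partial transcriptions at each step,
# then collapses repeats and removes blanks to produce the final sequence.
decoded, log_probs = tf.keras.backend.ctc_decode(
    y_pred, input_length, greedy=False, beam_width=10, top_paths=1)
best_path = decoded[0].numpy()   # class indices, padded with -1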

Applications and Use Cases

Speech recognition powers various applications, including:

• Voice assistants like Siri and Alexa, which understand voice commands and respond.
• Transcription of audio tracks or phone calls into text, used to generate subtitles and transcripts or to analyze call-center logs.
• Voice-enabled interfaces, for example voice commands to control media playback or set alarms.
• Accessibility tools such as automated speech-to-text for hearing-impaired users.
• Surveillance: analyzing and transcribing speech in video footage.

Challenges include accurately handling different languages, accents and noisy environments. As deep learning models get more advanced, speech recognition will continue to expand in capability and use cases.

Examples and References

• TensorFlow Speech Recognition Tutorial
• Kaldi Speech Recognition Toolkit: http://kaldi-asr.org/
• Papers on the Deep Speech architecture and sequence-to-sequence modeling for speech recognition: https://arxiv.org/pdf/1412.5567v4.pdf and https://arxiv.org/pdf/1211.3711.pdf
