Don’t Get Lost in Translation: Recognising Key Words in Speech without using Natural Language Processing

Alvin Wong
HTX S&S COE
16 min read · Aug 16, 2023

If you are reading this article, then you must know or at least have heard of ChatGPT (Generative Pre-trained Transformer) developed by OpenAI. Using Natural Language Processing (NLP), a subset of Artificial Intelligence, ChatGPT has demonstrated impressive capabilities in generating coherent and contextually relevant responses, making it an invaluable tool for building conversational agents and dialogue systems.

NLP encompasses a wide range of tasks, including text classification, sentiment analysis, machine translation, question answering, and dialogue systems. By leveraging NLP techniques such as tokenisation, Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformer models, ChatGPT can process user inputs and generate meaningful responses, creating more interactive and engaging conversational experiences.

Speech-to-text, or automatic speech recognition (ASR), technology sits at the intersection of NLP and speech processing, and is how spoken input can be fed into models like ChatGPT. Speech-to-text systems convert spoken language into written text, using NLP and speech-processing techniques to process and analyse audio data. Despite significant advancements in speech-to-text technology, a fundamental challenge remains: its accuracy is often affected by the spoken language and variations in pronunciation.

Different languages have unique phonetic characteristics, accents, and speech patterns, which makes accurate transcription of spoken language into text challenging. Currently, speech-to-text systems vary in performance across languages, with some languages better supported than others.

This is especially so for low-resource languages such as local dialects. Low-resource languages lack sufficient training data, making it difficult to train language models. As a result, generative models do not have the same level of accuracy and fluency. If the training data contains biases or lacks diversity, it can impact the generated output and limit the model’s ability to process and generate text in a nuanced and unbiased manner.

This article will discuss an alternative approach to keyword recognition, using acoustic methods that are language-independent and rely directly on the unique phonetic characteristics, accents, and speech patterns to recognise keywords.

So, how does NLP interpret and generate human language?

First, let us take a look at some of the NLP techniques used in processing and generating contextual responses.

1. Converting Human Language into machine-interpretable ‘tokens’

For machines to interpret human language, they perceive words through the process of tokenisation. This is the process of breaking down a text or a sequence of characters into smaller units called tokens. For English, the most common approach is to tokenise the text into individual words, known as word tokenisation or word-level tokenisation. This can be done using Python libraries such as NLTK or spaCy.

An example of tokenising an English sentence with Natural Language Toolkit (NLTK) is shown below:

[Image credits: Generated by HTX S&S CoE]
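
For readers who cannot view the image, a minimal sketch of word-level tokenisation with NLTK is shown here; the example sentence is chosen purely for illustration:

import nltk
from nltk.tokenize import word_tokenize

#download the Punkt tokeniser models on first use
nltk.download('punkt')

sentence = "I love new technologies!"

#split the sentence into word-level tokens
tokens = word_tokenize(sentence)
print(tokens)
#expected output: ['I', 'love', 'new', 'technologies', '!']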

Significant progress has been made in enhancing the conventional tokenisation methods. In 2018, a team of researchers at Google AI Language introduced BERT (Bidirectional Encoder Representations from Transformers). The creation of BERT was detailed in a research paper titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” published in October 2018.

BERT uses an approach to tokenisation called WordPiece. WordPiece tokenisation applies a sub-word approach where words are split into sub-word units, allowing the model to handle out-of-vocabulary (OOV) words and capture more fine-grained information.

An example of how BERT’s WordPiece tokenisation would tokenise the sentence “I love new technologies!” is shown below:

[Image credits: Generated by HTX S&S CoE]
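
As a rough sketch, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary, the same WordPiece tokenisation can be reproduced as follows; note that the exact sub-word splits depend on the vocabulary the tokeniser was trained with:

from transformers import BertTokenizer

#load the WordPiece tokeniser that ships with bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

#tokenise the example sentence into sub-word units
tokens = tokenizer.tokenize("I love new technologies!")
print(tokens)
#sub-words that continue a preceding word are prefixed with "##"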

The words are split into sub-word units using the WordPiece approach. The sub-word tokens are denoted with a “##” prefix to indicate that they are part of a larger word. This allows BERT to handle variations of words and capture more detailed information.

By using sub-word units, BERT can handle OOV words by breaking them down into smaller parts that it has been trained on. This enables the model to have a larger vocabulary to handle rare or unseen words during training and inference. The sub-word approach also helps to capture morphological information and improve the model’s ability to understand the structure and context of words in a sentence.

2. Encoding and Attention

After generating the tokens, they undergo an encoding process. This process involves multiple layers of self-attention and feed-forward neural networks. Each token’s representation is updated according to its context within the input sequence. The primary focus of attention during encoding is to ensure that the representations effectively capture the essence of the input. By mapping tokens to dense vector representations that encompass their semantic and contextual information, the model gains an understanding of the meaning and relationships between tokens in the sequence. This encoding step plays a crucial role in accurately interpreting human language and addressing various language understanding tasks.

Below is an example of an encoding process with the BERT model:

[Image credits: Generated by HTX S&S CoE]
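
A minimal sketch of this encoding step, again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, is shown here; the sequence length in the output shape depends on how the sentence is tokenised:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

#tokenise the sentence and add the special [CLS] and [SEP] tokens
inputs = tokenizer("I love new technologies!", return_tensors='pt')

#run the encoder without computing gradients
with torch.no_grad():
    outputs = model(**inputs)

#each token is mapped to a 768-dimensional contextual embedding,
#e.g. a shape of [1, 7, 768] for a 7-token input sequence
print(outputs.last_hidden_state.shape)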

The output vector size of [1, 7, 768] signifies that each tokenised word has been transformed into a vector of 768 values. This vector captures the contextual information of the word within a sentence comprising 7 sub-word tokens. The 3D structure of the output represents the encoding of multiple tokens within a single sentence; we refer to this vector representation as the token embedding. Each 768-value representation passes through a series of self-attention layers, which involves calculating attention scores between each pair of tokens to determine their contextual importance. The attention mechanism allows tokens to attend to other tokens, capturing dependencies and relationships within the sequence. The self-attention mechanism produces weighted representations for each token based on its importance in the context of the sequence: tokens with higher relevance receive higher weightage, while irrelevant tokens receive lower weightage.

3. Encode (BERT) and Decode (GPT)

The weighted representations obtained from the self-attention layer then undergo a transformation in a feed-forward neural network. This network applies non-linear operations to the token representations, amplifying their expressive capabilities and capturing intricate patterns in the data. The figure below provides an overview of utilising BERT as an encoder to convert human language into machine-level representation. GPT functions as a generative decoder, learning from the encoded representation to facilitate a suitable decoding process that translates the information back into appropriate human language responses. This enables a meaningful conversational exchange.

[Image credits: By Niklas Heidloff https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/ ]

Overall, in this encoder-decoder view of a chat-bot system, a BERT-style model serves as the encoder that understands user input and generates token embeddings, while a GPT-style model acts as the decoder that generates appropriate responses based on the encoded input. This collaborative approach allows for more accurate and contextually relevant interactions in a chat-bot application.

Can we use ChatGPT in speech recognition?

Both ChatGPT and BERT are powerful language models that excel in natural language understanding and generation tasks. However, they primarily operate on text-based inputs and face challenges when it comes to processing audio or speech data. While these models can handle transcriptions of spoken language, they struggle with direct acoustic analysis or keyword recognition.

The need to explore acoustic methods for keyword recognition arises from the fact that speech carries valuable information that may not be effectively captured by text-based models alone. Acoustic methods involve analysing the audio signals directly, extracting features related to pitch, frequency, and other acoustic characteristics. By using this information, the system can better adapt to different accents, dialects, and pronunciation variations, leading to improved accuracy in detecting spoken words. By incorporating acoustic analysis techniques, we can enhance the performance of keyword recognition systems and enable more accurate and robust speech processing applications.

How do we recognise keywords in speech without translating them into text?

Companies like Apple, Google, and Microsoft use voice-activated wake-word detection, with Deep Neural Networks (DNNs) carrying out the necessary transformations. Apple, for example, uses the concept of a Multi-Layer Perceptron (MLP), where each cell utilises a sigmoid activation function (see diagram below). This allows the cells to perform feature engineering, enabling them to learn the unique acoustic patterns associated with phrases like ‘Hey Siri’. The final layer uses a softmax function to gauge the confidence level for initiating subsequent chat functionality. This arrangement processes the audio stream continuously, implementing wake-word detection.

[Image credits: Apple Machine Learning Research https://machinelearning.apple.com/research/hey-siri ]
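
As a conceptual sketch only, and not Apple’s actual implementation, a small wake-word MLP of this kind could look like the following in PyTorch, with sigmoid hidden layers and a softmax output over ‘wake word’ versus ‘background’; the feature dimension and layer sizes are assumptions for illustration:

import torch
import torch.nn as nn

class WakeWordMLP(nn.Module):
    """Illustrative wake-word detector: sigmoid hidden layers, softmax output."""
    def __init__(self, feature_dim=40, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Sigmoid(),
            nn.Linear(hidden_dim, 2),  #scores for 'wake word' vs 'background'
        )

    def forward(self, x):
        #softmax turns the scores into confidence values that sum to 1
        return torch.softmax(self.net(x), dim=-1)

#example: score a single frame of 40 acoustic features
model = WakeWordMLP()
frame = torch.randn(1, 40)
print(model(frame))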

This method conducts analysis exclusively when the phrase ‘Hey Siri’ is identified. The continuous monitoring for this wake word involves transforming the audio input into a feature vector, which then goes through the DNN. Within a sequence of hidden-state cells, the model’s task is to recognise the occurrence of ‘Hey Siri’ before allowing the conversation to move to a Large Language Model (LLM). The LLM takes care of additional chat-bot functions to generate suitable responses. The diagram below provides a visual representation of the process flow for wake word detection.

[Image credits: Apple Machine Learning Research https://machinelearning.apple.com/research/hey-siri]

Wake-word detection of this kind relies on an MLP as a key component. It is tailored to enhance the end-user experience on mobile devices, using wake words such as ‘Hey Siri’ or ‘Okay Google’. The efficiency of this system lies in its ability to swiftly recognise the specified wake word and trigger the desired voice assistant.

However, the more keywords are introduced, the more complex the detection process becomes. Relying on an MLP for feature engineering, although conceptually straightforward, requires considerable training of the hidden-state cells. This training is crucial to fine-tune the detector’s performance and ensure accurate wake-word recognition.

So, how else can we detect keywords in speech without using NLP?

[Image credits: Dhananjay Ram https://arxiv.org/pdf/1911.08332.pdf]

Taking reference from Dhananjay Ram’s November 2019 publication, “Neural Network based End-to-End Query by Example Spoken Term Detection”, we experimented with the Query by Example Spoken Term Detection (QbE-STD) technique in our Home Team context. QbE-STD is a technique used to search for specific spoken words or phrases within a large collection of audio data. It allows us to find instances of a particular spoken term by providing an example or reference of that term.

QbE-STD is like searching for a specific word or phrase in a big pool of spoken recordings by using a sample of how that word or phrase sounds. It helps to locate and identify instances where the word or phrase appears in the audio data.

For example, if we have a collection of recorded conversations and we want to find all instances when the phrase “Happy Birthday” was mentioned, we can provide a sample of the phrase “Happy Birthday” spoken by another speaker as a reference. The QbE-STD system will analyse the audio data and look for similar patterns or characteristics that match the sample, allowing us to identify the occurrences of “Happy Birthday” within the recordings.

The workflow of the QbE-STD approach is divided into three portions, as shown in the diagram below: (1) feature extraction; (2) similarity matrix computation; and (3) a Convolutional Neural Network (CNN) classifier to detect the presence of keywords.

[Image credits: Generated by HTX S&S CoE]

1. Converting Speech into Feature Representation using Mel-Frequency Cepstral Coefficients (MFCCs)

Recognising keywords through speech without the need for text translation entails harnessing acoustic features and employing pattern recognition techniques. In contrast to NLP models, acoustic-based models analyse the arrangement of specific utterance combinations through time-frequency representations of the speech waveform, known as Mel-spectrograms or MFCCs. MFCCs are commonly utilised as visual input features in most acoustic-based models.

An illustration of how to extract these acoustic features is shown below.

import librosa
import librosa.display
import matplotlib.pyplot as plt

#load the audio file
audio_file = '/home/alvinwong/Documents/sample/Happy_001.wav'
audio, sr = librosa.load(audio_file)

#plot the waveform
plt.figure(figsize=(10,4))
plt.plot(audio)
plt.title('Waveform')
plt.xlabel('Time')
plt.ylabel('Amplitude')

#generate the spectrogram
spectrogram = librosa.stft(audio)
spectrogram_db = librosa.amplitude_to_db(abs(spectrogram))
plt.figure(figsize=(10,4))
librosa.display.specshow(spectrogram_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.xlabel('Time')
plt.ylabel('Frequency')

#generate the MFCC
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
plt.figure(figsize=(10,4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.xlabel('Time')
plt.ylabel('MFCC Coefficients')

#display the three figures
plt.show()
[Image credits: Generated by HTX S&S CoE]
[Image credits: Generated by HTX S&S CoE]
[Image credits: Generated by HTX S&S CoE]

Our experiment used the SWS2013 Multilingual Database, as well as 13,572 audio recordings collected manually from our own data and environment for the testing and retraining process. These recordings include languages commonly spoken in Singapore, such as English, Malay, Mandarin, and Tamil. A small proportion of the data also included dialects such as Hokkien and Cantonese. The dataset was then divided into a training set and a test set with an 80/20 ratio for the experiment.

2. Using Cosine Similarity to identify similar phonetic characteristics

When we derive the Mel Frequency Cepstral Coefficients (MFCCs), we obtain a vector representation that captures the spectral characteristics of a word or speech segment. These MFCC vectors serve as a representation of the acoustic features.

To determine the similarity between two words or speech segments, we compute the cosine similarity between their respective MFCC vectors. Cosine similarity measures the cosine of the angle between two vectors and provides a value between -1 and 1. A cosine similarity of 1 indicates that the vectors point in the same direction (a perfect match), whilst values near 0 or below indicate that they are dissimilar.

By applying cosine similarity to the MFCC vectors, we can compare the phonetic characteristics of different words or speech segments. Words with similar pronunciation or phonetic properties tend to have higher cosine similarity values, indicating a closer match in terms of their acoustic features.

The scientific formula is presented below:

[Image credits: Formula taken from Kaggle]
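
In its standard form, the cosine similarity between two vectors A and B is cos(θ) = (A · B) / (||A|| ||B||). A minimal sketch of computing a frame-by-frame cosine similarity matrix between the MFCCs of a spoken query and a test recording is shown below; the file names are hypothetical:

import numpy as np
import librosa
import matplotlib.pyplot as plt

def mfcc_features(path, n_mfcc=13):
    #load the audio and extract MFCCs (one column per frame)
    audio, sr = librosa.load(path)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

def cosine_similarity_matrix(query_mfcc, test_mfcc):
    #normalise each frame vector to unit length, so that the dot product
    #of two normalised frames equals the cosine of the angle between them
    q = query_mfcc / (np.linalg.norm(query_mfcc, axis=0, keepdims=True) + 1e-10)
    t = test_mfcc / (np.linalg.norm(test_mfcc, axis=0, keepdims=True) + 1e-10)
    return q.T @ t  #shape: (query_frames, test_frames)

#hypothetical file paths for a spoken query and a longer test recording
query = mfcc_features('query_happy_birthday.wav')
test = mfcc_features('test_recording.wav')

similarity = cosine_similarity_matrix(query, test)

#visualise the similarity matrix as a heat-map
plt.figure(figsize=(10, 4))
plt.imshow(similarity, aspect='auto', origin='lower')
plt.colorbar()
plt.xlabel('Test frames')
plt.ylabel('Query frames')
plt.title('Cosine similarity matrix')
plt.show()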

The graphs below show three exemplary visualisations of the resulting output matrix with cosine similarity.

Left Image — Two identical audio segments, Centre Image — Two similar phonemes with high correlations, Right Image — Two different phonemes with low correlations
[Image credits: Generated by HTX S&S CoE]

In an ideal scenario, if both MFCC segments match precisely, we will see a solid diagonal line highlighted in red, as shown in the ‘Ideal Match’ visualisation (left image). However, in practice, we may only see small segments of high correlation between the phonemes, as depicted in the ‘Positive Outcome’ visualisation (centre image). Conversely, in the ‘Negative Outcome’ visualisation (right image), there is no significant correlation observed, as indicated by the absence of bright segments in the similarity matrix.

The difference between an ‘Ideal Match’ and a ‘Positive Outcome’ is due to the presence of background noise and time synchronisation issues. When there is background noise or problems with timing, it can affect the level of correlation between two signals.

In an ‘Ideal Match’ scenario (with no background noise or timing issues), the correlation between two signals is high. This means that the signals match well and align perfectly. However, in a ‘Positive Outcome’ scenario, there may have been some background noise or timing issues that caused a decrease in the correlation level. This means that although the signals are not a perfect match, they still have some similarity or positive aspects.

Whilst efforts can be made to reduce background noise, eliminating it totally is not possible. Even with some background noise, there can still be positive outcomes or similarities between the signals.

In short, the presence of background noise and time synchronisation issues do affect the level of correlation between signals, but it is still possible to have positive outcomes or similarities despite these issues.

3. Training a Binary Classifier with a Convolutional Neural Network (CNN)

In our experiment, we trained a binary classifier using a Convolutional Neural Network (CNN) on both ‘positive’ and ‘negative’ visualisation heat-maps as the training dataset. An important difference between vision-based CNN applications and audio applications lies in the preparation of the input dataset. In vision-based applications, we have a balanced dataset with samples from each class, and we can use raw images as input. In audio applications, we must first conduct a pre-processing step and then perform feature engineering before feeding the data into the classifier. Specifically for Automatic Speech Recognition (ASR), we define step 1 as pre-processing and step 2 as feature engineering. This approach in audio differs from vision-based applications, as it involves additional steps to adapt the data for analysis. Despite these differences, using CNNs for audio classification provides flexibility and adaptability in the process, thus allowing us to achieve accurate results in ASR tasks.

Below is the original backbone architecture diagram developed by the paper’s author.

Note: The MFCC input images are purely for illustration (Original QbE-STD Architecture)
[Image credits: Generated by HTX S&S CoE]

The architecture diagram defines a CNN classifier with multiple convolutional layers and a fully connected neural network. It takes an input tensor and applies convolutional operations with max-pooling and dropout layers, followed by a fully connected layer and a softmax activation. The model is designed for binary classification with 2 output classes. The number of filters in each convolutional layer is determined by the ‘depth’ parameter, which is set to 30, and a dropout rate of 0.1 is used for regularisation.

We modified the architecture to simplify the training process as shown below.

Note: The MFCC input images are purely for illustration (Modified QbE-STD Architecture)
[Image credits: Generated by HTX S&S CoE]

Considering input images as feature-maps of correlated signals, the decision to remove the first max pooling layer is based on experience. This is because applying max pooling as the first layer in a Convolutional Neural Network (CNN) may lead to a loss of spatial information. Max pooling reduces the spatial dimensions of feature maps by selecting the maximum value in each pooling region. While it helps with computational complexity and overfitting control, using it as the initial layer can be detrimental, especially when dealing with feature maps as input.

Similarly, the removal of dropout in CNNs is done to retain more information and increase model capacity. Dropout is a regularisation technique used to prevent overfitting during training, by randomly setting a fraction of neurons to zero temporarily. This enables the network to learn more robust representations and avoids over-reliance on specific neurons. While dropout can be beneficial in fully connected layers with many parameters, it may not be as effective in convolutional layers. CNNs already have a lower risk of overfitting due to their shared weight structure, which allows them to learn local patterns and invariances efficiently.

Additionally, the input dimension is adjusted from 200 × 750 to 200 × 200. The inclusion of padding in all convolutional layers is important for preserving spatial dimensions and avoiding border effects.
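
As an illustrative sketch only, and not the exact implementation, the modified classifier described above could be expressed in PyTorch roughly as follows, with 30 filters per convolutional layer, padded convolutions, no initial max-pooling, no dropout, and a two-class softmax output over a 200 × 200 similarity matrix; the number of layers and the kernel sizes are assumptions:

import torch
import torch.nn as nn

class QbeStdClassifier(nn.Module):
    """Illustrative binary classifier over a 200 x 200 similarity matrix."""
    def __init__(self, depth=30):
        super().__init__()
        self.features = nn.Sequential(
            #padded convolutions preserve the spatial dimensions;
            #no max-pooling or dropout before the first convolutions
            nn.Conv2d(1, depth, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(depth, depth, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  #200 -> 100
            nn.Conv2d(depth, depth, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  #100 -> 50
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(depth * 50 * 50, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  #'keyword present' vs 'keyword absent'
        )

    def forward(self, x):
        #x has shape (batch, 1, 200, 200): one similarity heat-map per sample
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=-1)

#example forward pass on a dummy similarity matrix
model = QbeStdClassifier()
dummy = torch.randn(1, 1, 200, 200)
print(model(dummy).shape)  #torch.Size([1, 2])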

Does QbE-STD work?

Our experiments in keyword recognition were divided into two parts: a lab-based environment and a real-time Home Team environment.

[Image Credits: Generated by HTX S&S CoE]

The results demonstrated that QbE-STD performs well in detecting most keywords, regardless of whether the environment is quiet or noisy. However, there were instances where certain words were not detected, for reasons that include:

1. Differences in Pronunciation: People can pronounce the same word differently depending on their culture or accent. For example, the word “Tomatoes” can be pronounced as ‘to’-’ma’-’toes’ or ‘to’-’matoes’.

2. Difficulties with Single-Syllable Keywords: There were challenges in recognising single-syllable keywords, leading to false triggering. These include words like ‘air’, ‘cent’, and ‘maid’, which consist of only a few phonemes.

3. Noisy Environment and Poor Audio Quality: In cases of noisy or echoey recordings, both QbE-STD and Speech-to-text systems using NLP methods encounter difficulties in accurate detection and transcription.

We also conducted experiments to analyse the correlation between all 72 keywords based on their phonetic pronunciation. The results below align with what was mentioned earlier, as all 72 keywords were compared against one another to measure their similarity. Certain words like ‘play’ and ‘day’ have a high correlation, as indicated by the red circle highlighted in the chart below. Such words tend to confuse the model classifier, leading to potential false triggering.

[Image credits: Generated by HTX S&S CoE]
[Image credits: Generated by HTX S&S CoE]

In the controlled laboratory environment, our model classifier achieved an accuracy of 83%. When deployed in a real-life setting, there was a decrease in performance, which was expected given the challenges stated above. In the real-life testing scenario, the decrease was 23% with the Original QbE-STD Architecture pre-trained on the SWS 2013 Multilingual Database. To address this, we used another dataset collected from the Home Team environment and retrained the modified QbE-STD model with it. As a result, we achieved an improvement of 7% in the F1-score with the modified QbE-STD model.

This demonstrates that QbE-STD can work in real-life settings, and we have shown how QbE-STD can perform keyword detection without going through the process of transcribing speech into text. Whilst there are limitations to using acoustic methods to recognise keywords, they can be used as an augmentation to NLP. Some possible use cases include detecting specific keywords in long hours of recordings, reviewing audio messages that would otherwise require manual intervention, or monitoring for keywords in an emergency or elderly-care facility. There is still some way to go in terms of actual deployments, but the results of our experiment show that this can be a viable method to be used alongside speech-to-text models.

HTX S&S COE

If you’ve been following our articles, you will notice that we thrive on exploring different possibilities and approaches to tackle a problem. Our work involves rigorous research and experimentation to validate our hypotheses and develop viable solutions to real-world problems. Our expertise lies in the integration and processing of data from different sensory devices, including visual, acoustic, lidar, Wi-Fi, sonar, etc. Our primary focus is on applying advanced computer vision techniques and machine learning algorithms to extract meaningful insights and valuable information from a diverse range of data sources. Our latest posts include ‘Breaking Through the Darkness: How to Enhance Low-Light Images with Deep Learning Techniques’ by Ben Cham and how our intern, Tiong Kai, spent his time with the team in ‘An Intern’s Journal with HTX Sense-making & Surveillance Centre of Expertise (S&S CoE)’.

If you want to stay updated on our projects in different AI and sensor engineering fields, consider subscribing to our Medium channel. Likewise, feel free to reach out to me at Alvin_WONG@htx.gov.sg if you want to learn more or discuss ideas related to Speech Recognition and NLP.


Alvin Wong
HTX S&S COE

I love to explore the unknown and look for possibilities in a challenge.