Acoustic Word Embeddings
This blog post is intended as an introduction to the field of acoustic word embeddings (AWEs) for those with a background in speech processing, NLP, or DL/ML. I noticed that the field of acoustic word embeddings is barely covered in the media, which is why I decided to write this post. In it, I explain the notion and importance of AWEs and give an overview of various approaches to obtaining them. I use diverse illustrations to back up my explanations throughout the post, so I encourage you to pay attention to them.
Intro to AWEs
Let’s imagine two different speakers pronouncing the exact same word. Their speech segments will never be identical because they will vary in duration, pitch, accent, etc. Nevertheless, these speech segments will be acoustically alike because the word that they represent is the same. This phonetic information defining the identity of a pronounced word is what AWEs are modelling.
What is AWE?
An acoustic word embedding (AWE) is a fixed-dimensional representation of a variable-length audio signal in an embedding space. The idea of AWEs is quite close to the well-known textual word embeddings, which create similar vector representations for semantically similar words. However, AWEs aim to model acoustic similarity rather than semantic similarity.
The task of the AWE space is to represent spoken word segments in such a way that similarly sounding segments cluster together. The AWE space has to accurately discriminate between different word types (vocabulary entries): embeddings of speech segments corresponding to the same word type should lie close to each other, while embeddings of segments corresponding to different word types should lie far apart.
Why are AWEs important?
The major advantage of AWEs is that mapping a variable-length speech signal to a fixed-dimensional space enables distance computation between vectors, such as Euclidean or cosine distance. This distance tells us how similar two pronounced words are and whether they belong to the same word type. Distance computation makes a wide variety of machine learning algorithms applicable and suggests a new generation of computationally efficient algorithms for speech applications such as voice search, wake-up word detection (“Ok Google”), end-to-end ASR and many other tasks.
Why are AWEs hard to learn?
Even though AWEs represent a very powerful tool for different speech tasks, learning them is quite challenging. First of all, speech signals can greatly differ in duration. This requires algorithms that can handle input features of variable lengths.
Secondly, a speech signal is characterized not only by the phonetic information inherent to a word but also by speaker information. The task of AWEs is to learn the phonetic information defining the word identity while staying invariant to speaker characteristics such as gender, age, accent/dialect, emotional state, physiology or speaking rate. Moreover, an audio signal is quite sensitive to environmental variability such as reverberation and noise, which makes it even harder to learn the information that matters.
AWE approaches
So, let’s dive into how researchers tackled the above-mentioned problems in order to obtain AWEs. I present the timeline of the most prominent approaches to learning AWEs in the format of a table overview. In the table, you can find short descriptions as well as some useful information on approaches.
If you are interested in more information about an AWE approach, you can jump to the corresponding approach review presented later in the post. I wrote the reviews in the form of concise highlights with multiple illustrations for the sake of clarity and put them after the tables in this post.
If you are here only to get familiar with AWEs and are not interested in a thorough overview of all possible AWE approaches, I recommend taking a look at just a few of them, such as downsampling (the starting point of AWEs) and the Siamese CNN (a widely used setting in AWE extraction), and scrolling past the rest.
Below you can see a comparison of average precision (AP) scores for different AWE approaches. The AP scores were extracted from the relevant papers and organized into groups based on the evaluation task (see the evaluation information below the table). Some AWE approaches had their own evaluation tasks and metrics, so they are not comparable with the other approaches and are not included in this comparison.
Eval 1 — AP score, Switchboard corpus, word discrimination task
Eval 2 — AP score, Buckeye corpus, word discrimination task
Eval 3 — AP score, WSJ corpus, query-by-example task
Eval 4 — AP score, WSJ corpus, word discrimination task
Heuristic AWEs
The first AWEs were obtained with simple heuristic approaches designed to produce fixed-dimensional speech representations.
1. Downsampling
The first attempts to create fixed-dimensional acoustic word embeddings were discussed in [Levin et al., 2013]. Previously existing techniques for handling variability in speech segments were either too task-specific or computationally costly.
One of the first approaches introduced for moving a variable-length audio signal to a fixed-dimensional space was simply to downsample the acoustic features. Given an audio signal represented by extracted acoustic features, the task is to split the signal into k segments and uniformly sample features from these k segments. To obtain the AWE, the k samples are simply concatenated into one vector.
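To make the idea concrete, here is a minimal sketch of the uniform variant in Python/NumPy; the feature dimensionality, the value of k and the exact sampling scheme are illustrative assumptions rather than the precise recipe from the paper.

```python
import numpy as np

def downsample_awe(features: np.ndarray, k: int = 10) -> np.ndarray:
    """Uniformly pick k frames from a (num_frames, feat_dim) matrix and
    concatenate them into a single fixed-dimensional vector."""
    num_frames = features.shape[0]
    idx = np.linspace(0, num_frames - 1, num=k).astype(int)  # k evenly spread frame indices
    return features[idx].reshape(-1)                         # shape: (k * feat_dim,)

# Two segments of different lengths map to vectors of the same size.
seg_a = np.random.randn(63, 39)    # e.g. 63 frames of 39-dimensional MFCCs
seg_b = np.random.randn(105, 39)
print(downsample_awe(seg_a).shape, downsample_awe(seg_b).shape)  # (390,) (390,)
```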
For more sophisticated AWEs, non-uniform downsampling can be used, where the word acoustics are modelled with a k-state HMM and the concatenation of the means of the k Gaussian distributions gives the AWE vector.
The downsampling approach is very simple, computationally feasible and does not require any data except for the audio signal itself. Despite its low average precision scores on the word discrimination task, this approach marked the beginning of the AWE expansion and is still considered one of the possible baselines in AWE research.
2. Reference vector
Another approach to fixed-dimensional embeddings proposed by [Levin et al., 2013] requires a number of word audio signals as a reference set. This reference set should be quite diverse, covering different word types and speakers, while not being very big in size. Given the reference set, a speech segment can be compared with every entry of the set, and the resulting similarity scores form a reference vector, which serves as the AWE. The similarity score is obtained with dynamic time warping (DTW), a popular algorithm in signal processing that measures the similarity between two differently paced temporal sequences.
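Here is a toy sketch of the reference-vector idea with a plain dynamic-programming DTW; treating the negated, length-normalised DTW cost as the similarity score is an assumption made for illustration, not the exact transform used in the paper.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classic DTW between two (frames, dims) sequences with Euclidean frame cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalised alignment cost

def reference_vector(segment, reference_set):
    """AWE = vector of similarity scores (here: negated DTW costs) to every reference entry."""
    return np.array([-dtw_distance(segment, ref) for ref in reference_set])

reference_set = [np.random.randn(np.random.randint(40, 80), 13) for _ in range(5)]
query = np.random.randn(60, 13)
print(reference_vector(query, reference_set).shape)  # (5,): one score per reference entry
```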
Neural AWEs
With the rise of neural networks, speech researchers started to pay particular attention to advances in neural networks in order to apply them to AWE extraction.
CNN approaches
The convolutional neural network (CNN) is a type of deep neural network that was successfully applied to image processing and was later adapted to AWE extraction.
1. Convolutional Vector Regression
The first attempt to apply CNNs to the modelling of acoustic word representations was undertaken in [Maas et al., 2012]. Using a CNN was motivated by the substantial improvements that deep neural networks had achieved in acoustic modelling for speech recognition.
The authors’ idea was to process the word acoustics with a CNN and obtain a fixed-dimensional representation of a word at the final layer of the network. The trick for converting variable-length representations into fixed-dimensional ones lies in the final pooling layer of the CNN: the pooling region size is scaled by a free parameter denoting the number of pooling regions. Thus, no matter the size of the speech input, it is pooled to a fixed-length output.
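The PyTorch sketch below shows the core trick: pooling into a fixed number of regions makes the output size independent of the number of input frames. The layer sizes and the use of nn.AdaptiveMaxPool1d as a stand-in for the paper's pooling scheme are my own assumptions.

```python
import torch
import torch.nn as nn

class ConvToFixedVector(nn.Module):
    """1-D CNN over acoustic frames; pooling into a fixed number of regions
    yields the same output size for any input length."""
    def __init__(self, feat_dim=39, channels=64, num_regions=8, embed_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(num_regions)  # num_regions is the free parameter
        self.out = nn.Linear(channels * num_regions, embed_dim)

    def forward(self, x):                              # x: (batch, frames, feat_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, channels, frames)
        h = self.pool(h).flatten(1)                    # (batch, channels * num_regions)
        return self.out(h)

model = ConvToFixedVector()
print(model(torch.randn(2, 63, 39)).shape, model(torch.randn(2, 140, 39)).shape)  # both (2, 256)
```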
For training, the authors chose a supervised setting in which the network learns to minimize the Euclidean distance between the predicted and true representations. The true word embedding is precomputed as a bag of phonemes.
In order to reduce model complexity, the authors decided to turn the problem into regression instead of classification by introducing a regression layer: classification over a large vocabulary is quite complex, whereas regressing to word vectors simplifies the model.
2. Letter-ngram embeddings
In [Bengio and Heigold, 2014], the authors set the goal of replacing the common phoneme-based ASR architecture with a word-based one. The idea was to use a CNN for word acoustic modelling, with the help of transcribed speech data processed by a DNN.
The CNN learns acoustic word representations from acoustic features and transcribed word types. To tackle the problem of variable-length input, the authors proposed fixing the input length, cutting larger inputs and padding smaller ones.
To improve the handling of out-of-vocabulary words (the main problem of word-based ASR), the authors proposed an additional DNN that creates word embeddings for words represented as a bag of all possible letter-ngrams. The authors trained the acoustic CNN separately and used its fixed AWEs for DNN training with the triplet loss¹.
This setting moves the letter-ngram embeddings closer to their acoustic representations. Thus, the letter-ngram embeddings gain the discriminative power of acoustic embeddings and can better handle out-of-vocabulary words, since they can be extracted from any given word.
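For reference, a tiny helper showing what a bag of letter-ngrams looks like; the chosen n-gram orders and the '#' word-boundary marker are illustrative assumptions, not the paper's exact configuration.

```python
def letter_ngrams(word: str, n_values=(2, 3)) -> set:
    """Bag of letter n-grams with word-boundary markers, e.g. 'cat' -> {'#c', 'ca', 'at', 't#', ...}."""
    padded = f"#{word}#"
    return {padded[i:i + n] for n in n_values for i in range(len(padded) - n + 1)}

print(sorted(letter_ngrams("cat")))  # ['#c', '#ca', 'at', 'at#', 'ca', 'cat', 't#']
```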
3. Siamese CNN
In [Kamper et al., 2016], the authors undertook the first known attempt to apply the Siamese setting directly to CNNs to obtain AWEs from speech. A Siamese network is a pair of networks with tied parameters trained so that the distance between the representations of two data instances reflects their similarity. In the case of AWEs, the Siamese CNN is trained to obtain acoustic word representations by learning to discriminate between different word types.
The Siamese CNN proposed by the authors is trained to output AWEs that minimize the cosine distance between audio segments of the same word type while maximizing the distance between segments of different word types. To achieve this, the tied CNNs use the triplet loss¹. The loss is minimized when, in the embedding space, word pairs of the same type end up closer together than word pairs of different types.
Since a CNN requires fixed-dimensional input, the authors zero-padded the word segments to the same length, equal to that of the longest word segment in the training data.
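Below is a minimal PyTorch sketch of this training objective; the margin value and the way same-word and different-word segments are sampled are assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def cos_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def triplet_loss(anchor, same, diff, margin=0.15):
    """Hinge loss: the same-word pair must be closer (in cosine distance)
    than the different-word pair by at least the margin."""
    return torch.clamp(margin + cos_distance(anchor, same) - cos_distance(anchor, diff),
                       min=0.0).mean()

# emb_net would be the CNN with tied weights applied to all three segments:
# loss = triplet_loss(emb_net(x_anchor), emb_net(x_same), emb_net(x_diff))
```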
RNN approaches
The recurrent neural network (RNN) is another type of deep neural network; it is designed to handle temporal sequences and thus gained attention in the AWE field.
1. LSTM embeddings
[Chen et al., 2015] tried to embed a variable-length speech input into a fixed-dimensional acoustic representation with an RNN. This choice was motivated by the success of RNNs in sequence modelling.
In their approach, an LSTM network is trained to recognize words from speech segments. To obtain the acoustic word representation of a variable-length word segment, the authors then extract the activations of the last hidden LSTM layer. As each hidden state of the LSTM encodes information about the past history, they decided to keep not all hidden states but only the last k of them, which creates a fixed-length representation. If a speech segment turns out to be shorter than k frames, the extracted acoustic representation is padded with zeros in front.
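A minimal PyTorch sketch of this extraction step; the hidden size, the value of k and the padding convention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LastKStatesEmbedder(nn.Module):
    """Run an LSTM over the segment and keep only the hidden states of the
    last k frames, zero-padded in front when the segment is shorter than k."""
    def __init__(self, feat_dim=39, hidden=128, k=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.k = k

    def forward(self, x):                  # x: (1, frames, feat_dim)
        h, _ = self.lstm(x)                # (1, frames, hidden)
        h = h[:, -self.k:, :]              # keep at most the last k hidden states
        pad = self.k - h.size(1)
        if pad > 0:                        # segment shorter than k frames
            h = F.pad(h, (0, 0, pad, 0))   # zero-pad in front along the time axis
        return h.flatten(1)                # (1, k * hidden)

emb = LastKStatesEmbedder()
print(emb(torch.randn(1, 60, 39)).shape, emb(torch.randn(1, 3, 39)).shape)  # both (1, 640)
```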
2. Siamese LSTM
In [Settle and Livescu, 2016], the authors developed the earlier idea of the Siamese CNN and proposed learning acoustic embeddings with a Siamese LSTM. The reasoning behind using RNNs instead of CNNs is the power of RNNs to model sequential data as well as their ability to handle arbitrary-length inputs.
As in the Siamese CNN, the Siamese LSTM is trained with weak supervision in the form of word pairs. The network minimizes or maximizes the output distance depending on whether a word pair comes from the same or a different word type, with the help of the triplet loss¹.
In contrast to the LSTM embeddings above, where the authors used the last k hidden states to extract AWEs, [Settle and Livescu, 2016] proposed applying a stack of fully-connected layers to the final LSTM hidden state. They claim that the fully-connected layers serve as a useful transformation that improves the word representation.
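A rough PyTorch sketch of one branch of such a Siamese LSTM; during training, three tied copies of this network would be combined with the triplet loss shown earlier. Layer counts and sizes are assumptions.

```python
import torch
import torch.nn as nn

class LSTMEmbedder(nn.Module):
    """LSTM encoder followed by a stack of fully-connected layers."""
    def __init__(self, feat_dim=39, hidden=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x, lengths):
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths, batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)    # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])            # transform the final hidden state of the top layer

net = LSTMEmbedder()
x = torch.randn(4, 120, 39)                              # zero-padded batch of 4 segments
print(net(x, torch.tensor([120, 80, 60, 33])).shape)     # (4, 128)
```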
3. Multi-view embeddings
In [He et al., 2017], the authors suggested that orthography reflects similarity in words’ pronunciations, which should also be reflected in AWEs. In other words, similarly written words are likely to be pronounced in a similar way. Thus, incorporating character-level information should improve the discriminative power of AWEs.
The authors proposed a multi-view approach to jointly learn acoustic and character word embeddings. Two bidirectional LSTM networks are trained together to minimize a multi-view contrastive loss function². The first BiLSTM processes a word audio signal and learns an acoustic word representation; the second BiLSTM receives the orthographic (character) sequence of a word and outputs a character word representation. The objective of the two networks is to produce embeddings such that the acoustic and character word embeddings (AWEs and CWEs) of the same word lie close together, while AWEs and CWEs corresponding to different word types are further from each other in the embedding space. Even though the role of the CWEs is to improve the AWEs, they can also be used as approximate AWEs extracted from text.
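The sketch below shows one contrastive term of such a multi-view objective in PyTorch: pull a word's AWE towards its own CWE and push it away from the CWE of a different word. The encoder sizes, the character-input dimensionality and the use of a single loss term are simplifying assumptions; the paper combines several such terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acoustic_enc = nn.LSTM(39, 256, batch_first=True, bidirectional=True)   # audio view
char_enc = nn.LSTM(32, 256, batch_first=True, bidirectional=True)       # character view

def embed(lstm, x):
    h, _ = lstm(x)
    return h[:, -1, :]                     # last time step of the BiLSTM output

def multiview_term(awe, cwe_same, cwe_other, margin=0.4):
    d_pos = 1 - F.cosine_similarity(awe, cwe_same)
    d_neg = 1 - F.cosine_similarity(awe, cwe_other)
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()

awe = embed(acoustic_enc, torch.randn(4, 80, 39))        # 4 word segments
cwe_same = embed(char_enc, torch.randn(4, 12, 32))       # their (embedded) character sequences
cwe_other = embed(char_enc, torch.randn(4, 9, 32))       # character sequences of other words
print(multiview_term(awe, cwe_same, cwe_other))
```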
Autoencoder approaches
The autoencoder (AE) is a type of neural network that aims at learning data representations and is therefore well suited to extracting AWEs.
1. Correspondence Autoencoder
[Kamper et al., 2015] considered the zero-resource setting, in which only a recorded audio corpus with no transcriptions is available. For AWE extraction in this setting, the authors used an autoencoder approach.
The proposed approach requires the data to be organized in pairs: audio segments of the same word type are aligned, giving pairs of the same length. These word pairs can be obtained in a supervised manner from transcribed data with a trained ASR system (as was done for all previous approaches) or in an unsupervised manner using an Unsupervised Term Discovery (UTD) approach. Exploiting UTD allows the AWE algorithm to be truly unsupervised and to be used in zero-resource downstream tasks.
[Kamper et al., 2015] called their approach the Correspondence Autoencoder (CAE), which can be seen as a denoising autoencoder. The CAE is trained on the speech segments of the obtained word pairs (one instance as the input, the other as the output) to minimize the reconstruction loss. Thus, the input segment is treated as a corrupted version of the output segment.
To give the model additional information about the nature of the acoustic features, the CAE is initialized with pre-trained weights obtained from a stacked autoencoder trained directly on the acoustic features of the speech corpus.
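As a rough illustration (in the original work the CAE operates on DTW-aligned frame pairs), here is a minimal frame-level sketch in PyTorch; the layer sizes and optimizer settings are assumptions, and the pre-training step is omitted.

```python
import torch
import torch.nn as nn

# Correspondence autoencoder: the input is a frame from one segment of a word pair,
# the reconstruction target is the aligned frame from the other segment.
cae = nn.Sequential(
    nn.Linear(39, 100), nn.Tanh(),   # in practice initialized from a pre-trained stacked AE
    nn.Linear(100, 39),
)
opt = torch.optim.Adam(cae.parameters(), lr=1e-3)

def train_step(frame_in, frame_target):
    opt.zero_grad()
    loss = nn.functional.mse_loss(cae(frame_in), frame_target)  # reconstruction loss
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(256, 39), torch.randn(256, 39)))   # one batch of aligned frame pairs
```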
2. Seq-to-Seq Autoencoder
In [Chung et al., 2016], the authors proposed using a Sequence-to-Sequence Autoencoder to extract acoustic word representations. While autoencoders are a powerful tool for learning the intrinsic structure of data, they need fixed-dimensional inputs, which is not characteristic of speech. To tackle this problem, the authors used a Seq2Seq Autoencoder, which overcomes this input limitation.
A Sequence-to-Sequence Autoencoder is an Encoder-Decoder system in which an LSTM Encoder is in charge of encoding a sequential input into a vector representation and an LSTM Decoder processes this representation and generates a sequential output. The main objective of the Seq2Seq Autoencoder is to produce an output that is as similar to the input as possible. The Encoder and Decoder are jointly trained to minimize the reconstruction loss. After training, the Encoder provides the learned AWEs.
To learn more robust embeddings, the authors used the Denoising Seq2Seq Autoencoder: some noise was added to the input to better learn the internal structure of the acoustic features.
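A compact PyTorch sketch of the (denoising) Seq2Seq autoencoder idea; feeding the embedding to the decoder at every time step, the noise level and the layer sizes are simplifications of the original model, not its exact design.

```python
import torch
import torch.nn as nn

class Seq2SeqAE(nn.Module):
    """LSTM encoder compresses a segment into one vector (the AWE);
    an LSTM decoder tries to reconstruct the clean input frames from it."""
    def __init__(self, feat_dim=39, embed_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.project = nn.Linear(embed_dim, feat_dim)

    def forward(self, x, noise_std=0.1):
        # Denoising variant: perturb the input, reconstruct the clean frames.
        _, (h, _) = self.encoder(x + noise_std * torch.randn_like(x))
        awe = h[-1]                                        # (batch, embed_dim)
        dec_in = awe.unsqueeze(1).repeat(1, x.size(1), 1)  # embedding fed at every step
        out, _ = self.decoder(dec_in)
        return self.project(out), awe

model = Seq2SeqAE()
x = torch.randn(8, 70, 39)
recon, awe = model(x)
loss = nn.functional.mse_loss(recon, x)                    # reconstruction loss
print(awe.shape, loss.item())                              # (8, 128)
```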
Further Improvements
Over time, various improvements and extensions of the core approaches presented above were proposed in AWE research. The newly introduced methods improved the quality of AWEs for different tasks and enabled new interesting AWE applications.
1. Phonetically-associated Siamese network
Even though Siamese networks are trained on acoustic features, they rely entirely on the relative relationships between words and lack phonetic reasoning. To incorporate more phonetic information, [Lim et al., 2018] exploit a multitask scheme: the lower layers of the network are trained to encode word acoustics with a cross-entropy loss, while the upper layers are in charge of describing relationships among word types with the triplet loss¹. This multitask learning brought a considerable improvement over the basic Siamese architecture.
2. Embeddings with temporal context
Another idea that improved the Siamese setting, but can be freely combined with other AWE learning strategies, was proposed by [Yuan et al., 2018]. For Siamese CNN training, acoustic features are usually padded with zeros to obtain a fixed-length input vector. The authors proposed incorporating the word's neighbouring acoustic context instead of zero-padding. This approach yielded better results than the basic Siamese CNN and was also successfully applied to the basic Siamese LSTM architecture in [Yuan et al., 2019].
3. Linguistically-informed embeddings
In [Yang et al., 2019], the authors proposed incorporating linguistic information into acoustic word embeddings to better address real-world applications. For example, in real query-by-example search, we expect to find all morphological variations of a query in the speech database (find “Tsunami in ….” with the query “tsunamis”). To enable AWEs to handle morphologically related words, the authors suggest training the basic Siamese networks on words grouped by stem and locating them in the embedding space based on their edit distance.
4. Seq-to-Seq Correspondence Autoencoder
In [Kamper, 2019], H. Kamper proposed a new autoencoder system for AWE extraction that incorporates many of the ideas introduced so far in the field. The author combined a Seq2Seq Autoencoder model for obtaining word representations with additional weak top-down supervision in the form of word pairs, as used in the Correspondence Autoencoder. This Seq2Seq Correspondence Autoencoder outperforms the other AWE autoencoder models and can be used in zero-resource settings if combined with UTD for word-pair discovery.
5. Multi-view Encoder-Decoder embeddings
In [Jung et al., 2019], the authors revisited the multi-view approach, which embeds character-level information into AWEs, and improved the multi-view setting further by incorporating an Encoder-Decoder architecture. A shared LSTM Decoder was introduced on top of the AWE and CWE Encoders to predict the word type from either the AWE or the CWE, with the two Encoders switching iteratively. This change in architecture gave a significant improvement over the previously introduced setup.
Conclusion
In this blog post, I introduced you to the promising and constantly developing field of Acoustic Word Embeddings. AWEs are a powerful tool for extracting acoustic word representations and are steadily gaining recognition in the speech technology field thanks to their applicability to diverse speech-related tasks.
The most successful AWE approach to date is the multi-view approach with an Encoder-Decoder architecture proposed by [Jung et al., 2019]. However, there is a range of techniques explored by other researchers that could push AWEs even further, such as UTD, temporal context, different types of sampling, etc.
The ability of AWEs to capture meaningful acoustic information enables many downstream tasks. The most widely used applications of AWEs are query-by-example search, where a spoken query is searched for in a large speech database, and spoken-term discovery, where a particular word or phrase is searched for in a given speech utterance. These two speech tasks power such familiar applications as voice search and wake-up word detection. Other interesting applications of AWEs are speech indexing (grouping together related utterances in a speech corpus) and spoken document clustering, which allows identification of different topics in speech recordings. In the ASR field, AWEs are useful for end-to-end speech recognition, discovering language lexicons and correcting ASR errors. Moreover, AWEs can be used as a supportive technique for cross-lingual transfer learning for low-resource languages or for data augmentation.
Finally, AWEs open up many interesting research directions. For example, unsupervised AWEs could help us understand how infants acquire language from speech. Such investigations could bring advances to the robotics area and enable robotic applications that learn a completely new language without any supervision.
References
Levin, K., Henry, K., Jansen, A., & Livescu, K. (2013, December). Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 410–415). IEEE.
Maas, A. L., Miller, S. D., O’Neil, T. M., Ng, A. Y., & Nguyen, P. (2012, July). Word-level acoustic modeling with convolutional vector regression. In Proc. ICML Workshop on Representation Learning.
Bengio, S., & Heigold, G. (2014). Word embeddings for speech recognition. In Proc. Interspeech 2014.
Kamper, H., Wang, W., & Livescu, K. (2016, March). Deep convolutional acoustic word embeddings using word-pair side information. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4950–4954). IEEE.
Chen, G., Parada, C., & Sainath, T. N. (2015, April). Query-by-example keyword spotting using long short-term memory networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5236–5240). IEEE.
Settle, S., & Livescu, K. (2016, December). Discriminative acoustic word embeddings: Recurrent neural network-based approaches. In 2016 IEEE Spoken Language Technology Workshop (SLT) (pp. 503–510). IEEE.
He, W., Wang, W., & Livescu, K. (2017). Multi-view recurrent neural acoustic word embeddings. In International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1611.04496.
Kamper, H., Elsner, M., Jansen, A., & Goldwater, S. (2015, April). Unsupervised neural network based feature extraction using weak top-down constraints. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5818–5822). IEEE.
Chung, Y. A., Wu, C. C., Shen, C. H., Lee, H. Y., & Lee, L. S. (2016). Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv preprint arXiv:1603.00982.
Lim, H., Kim, Y., Jung, Y., Jung, M., & Kim, H. (2018). Learning acoustic word embeddings with phonetically associated triplet network. arXiv preprint arXiv:1811.02736.
Yuan, Y., Leung, C. C., Xie, L., Chen, H., Ma, B., & Li, H. (2018). Learning acoustic word embeddings with temporal context for query-by-example speech search. arXiv preprint arXiv:1806.03621.
Yuan, Y., Leung, C. C., Xie, L., Chen, H., & Ma, B. (2019). Query-by-example speech search using recurrent neural acoustic word embeddings with temporal context. IEEE Access, 7, 67656–67665.
Yang, Z., & Hirschberg, J. (2019). Linguistically-informed Training of Acoustic Word Embeddings for Low-resource Languages. Proc. Interspeech 2019, 2678–2682.
Kamper, H. (2019, May). Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6535–6539). IEEE.
Jung, M., Lim, H., Goo, J., Jung, Y., & Kim, H. (2019). Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings. arXiv preprint arXiv:1910.00341.
Appendix
¹Triplet loss (the hinge form used, e.g., in [Kamper et al., 2016]): for an anchor segment x_a, a segment x_s of the same word type and a segment x_d of a different word type, l = max(0, m + d(f(x_a), f(x_s)) - d(f(x_a), f(x_d))), where f is the embedding network, d is the cosine distance and m is a margin.
²Multi-view triplet loss: the same hinge form applied across views; in one common formulation, l = max(0, m + d(a_w, c_w) - d(a_w, c_w′)) + max(0, m + d(a_w, c_w) - d(a_w′, c_w)), where a_w and c_w are the acoustic and character embeddings of word w and w′ is a different word, so that each AWE is pulled towards its own CWE and pushed away from embeddings of other words.