Attention in end-to-end Automatic Speech Recognition

Shreekantha Nadig
Sep 6, 2019 · 14 min read

TL;DR: This post gives an overview of how we progress from a traditional ASR system to one with an Encoder-Decoder architecture with Attention. It also mentions some of the drawbacks of the traditional architectures and how they can be overcome with the Encoder-Attention-Decoder architecture. It closes with a short introduction to the newer hybrid CTC/Attention architecture.

Automatic Speech Recognition: The aim of research in automatic speech recognition (ASR) is the development of a device/algorithm that transcribes natural speech automatically. The basic system consists of an acoustic processor and a linguistic decoder.¹

In a traditional ASR system, there is a source speaker who is speaking a sequence of words, which gets propagated through a channel (a noisy channel), and we are concerned with transcribing this speech signal to a sequence of words at the target.

Here, we are not concerned with the semantics of the words/sentences. We just want to transcribe what is there in the speech signal.

Traditionally, this is viewed in the framework of the noisy channel model². We would like to know: what is the most likely sentence, out of all sentences in the language L, given some acoustic input O?

Here, we treat the acoustic input O as a sequence of individual observations O = {o_1, o_2, o_3, …, o_T} and define a sentence as a sequence of words W = {w_1, w_2, w_3, …, w_U}.

Speech recognition as a noisy channel model

A speech signal is a 1-D signal, typically sampled at 8 kHz or 16 kHz, with each sample typically stored as 8 or 16 bits.

1-D speech signal

There are a few reasons we cannot use this 1-D signal directly to train a model.

  • The speech signal is quasi-stationary.
  • There is inter-speaker and intra-speaker variability, i.e. the same word spoken by different speakers gives different signals, and the same word spoken by the same speaker at different times also gives different signals.
  • Variation in the rate of speaking, pitch, and volume might affect the model.

For the above reasons, we typically do some feature extraction from the 1-D speech signal.

The simplest feature we can extract from a speech signal is the spectrogram³. We typically compute a short-time Fourier spectrum⁴ over a short window (about 25 ms). Thus, for every 25 ms window of the signal, we get a vector of features.

This is depicted in the figure below, where for each short window we take the log of the power spectrum to get one frame of the spectrogram. We repeat this over the entire signal to get its full spectrogram.

This feature is later used to train a machine learning model to transcribe speech.
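The feature extraction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `log_power_spectrogram`, the Hann window, and the 10 ms hop between windows are assumptions that do not come from the post, and a synthetic tone stands in for real speech.

```python
import numpy as np

def log_power_spectrogram(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Return a (num_frames, num_bins) log power spectrogram.

    win_ms is the 25 ms analysis window from the text; hop_ms (the shift
    between consecutive windows) is an assumed value.
    """
    win_len = int(sample_rate * win_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)  # 160 samples at 16 kHz
    window = np.hanning(win_len)
    num_frames = 1 + (len(signal) - win_len) // hop_len
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(num_frames)
    ])
    # Power spectrum of each frame, then log (small constant avoids log(0)).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)

# One second of a synthetic 440 Hz tone standing in for a speech signal.
t = np.arange(16000) / 16000
spec = log_power_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (num_frames, num_frequency_bins)
```

Each row of `spec` is one frame of the spectrogram, i.e. one feature vector of the kind the model is later trained on.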

Short-time Fourier transform
Spectrogram of a speech signal

The probabilistic implications of these points are:

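We would like to find the sequence of words W from the language L that maximizes the posterior probability given the acoustic input O:

$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$

We can use Bayes' rule to rewrite this as:

$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$$

Since the denominator P(O) is the same for each candidate sentence W, we can ignore it in the argmax:

$$\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$$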
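To make the argmax over P(O|W)·P(W) concrete, here is a toy sketch in the log domain. The candidate transcripts and all scores are made up purely for illustration; in a real system log P(O|W) would come from an acoustic model and log P(W) from a language model.

```python
# Hypothetical scores for three candidate transcripts of one utterance.
candidates = {
    "recognize speech":   {"log_p_o_given_w": -12.1, "log_p_w": -4.2},
    "wreck a nice beach": {"log_p_o_given_w": -11.8, "log_p_w": -9.5},
    "recognise peach":    {"log_p_o_given_w": -13.0, "log_p_w": -8.0},
}

def decode(cands):
    """Return argmax_W of log P(O|W) + log P(W); P(O) is constant and dropped."""
    return max(
        cands,
        key=lambda w: cands[w]["log_p_o_given_w"] + cands[w]["log_p_w"],
    )

best = decode(candidates)
print(best)
```

Note that the second candidate has the best acoustic score on its own; it is the language model term that tips the decision.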

Before end-to-end ASR

Traditionally, ASR systems consisted of many individual blocks which needed to be optimized separately. The figure below shows one example of a traditional ASR system. There is front-end feature extraction, followed by a GMM acoustic model, an HMM (pronunciation modeling), and a language model (LM).

Traditional ASR pipeline

Challenges in traditional ASR and Motivation for end-to-end

There are many disadvantages to such a system for ASR. As summarized by S. Watanabe et al. (2017):

  • Step-wise refinement: Many module-specific processes are required to build an accurate module. For example, when we build an acoustic model from scratch, we have to first build a hidden Markov model (HMM) and Gaussian mixture models (GMMs) to obtain the tied-state HMM structure and phonetic alignments, before we can train deep neural networks (DNNs).
  • Linguistic information: To factorize acoustic and language models well, we need to have a lexicon model, which is usually based on a handcrafted pronunciation dictionary to map a word to phoneme sequences. Since phonemes are designed using linguistic knowledge, they are subject to human error that a fully data-driven system might avoid. Finally, some languages do not explicitly have a word boundary and need tokenization modules.
  • Conditional independence assumptions: The current ASR systems often use conditional independence assumptions (especially Markov assumptions) during the above factorization and to make use of GMM, DNN, and n-gram models. Real-world data do not necessarily follow such assumptions leading to model misspecification.
  • Complex decoding: Inference/decoding has to be performed by integrating all modules. Although this integration is often efficiently handled by finite-state transducers, the construction and implementation of well-optimized transducers are very complicated.
  • Incoherence in optimization: The above modules are optimized separately with different objectives, which may result in incoherence in optimization, where each module is not trained to match the other modules.

In addition to the above points, the traditional pipeline, even though highly tweakable, is very difficult to get working well. Historically, each part of the system has had its own set of challenges (e.g. choosing the right feature representations).

Problem of Speech

In most of the problems we are trying to solve with Machine Learning/Deep Learning, we have a set of inputs that we would like to map to a set of outputs. Mostly, each input corresponds to an output.

We assume there is some function f() that can map all of these inputs to their corresponding outputs.

The problem with speech is that our feature vectors are taken from a short window of the speech signal, which means we have a feature vector for every 20–25 ms or so.

The problem is, we don’t know where the boundary is between one sound (phoneme) and another. If you open a speech signal in any audio editor, cut out ~200–300 ms of it, and listen to it without any context, it is not possible to tell which sound it is. Note that the data we have is the speech signal and the corresponding transcript at the word level. That is, we do not have data about where a word (or a phoneme) ends and where another begins.

How do we train a model when we don’t even know the x−y pairing?

Popular loss functions (cross-entropy, mean-squared error) assume a one-to-one correspondence between the network’s output and target labels. But, what if there is no one-to-one correspondence? (Speech Recognition, Machine Translation, handwriting recognition).

We need a way to find an alignment between network outputs and target labels. Traditionally, this is done with bootstrapping.

In the end-to-end context, this can be done in several ways, such as CTC, Attention, and seq2seq models.


In most of the end-to-end systems, there is a common module — the Encoder network.

If we take a bidirectional LSTM encoder, it takes the feature vector sequence and the previous hidden state and transforms them into a hidden (encoded) vector sequence through some Encoder function:

$$h_t = \mathrm{Encoder}(x_t, h_{t-1})$$

Here h_t is the hidden state at time t, and Encoder() is the function the Encoder implements to update its hidden representation. Each encoded representation h_t contains information about the input sequence with a focus on the t-th input of the sequence.

This encoder can be deep, i.e. we can have a deep BLSTM encoder, where every successive layer transforms the outputs of the previous layer. We would like to use a bidirectional network for speech because how we pronounce a word is also affected by the phonemes that come later.

For example, the phoneme k is pronounced differently in the words “car” and “Kleptomania”. Just the knowledge that in Kleptomania there is a phoneme l after the first k changes our pronunciation of the phoneme k. By using a bidirectional network, we can capture these effects from the future as well, which helps in modelling the speech signal better.
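The idea of a bidirectional encoder can be sketched with NumPy as follows. To keep the example short it uses a plain tanh recurrent cell rather than an LSTM, which is a simplification and not what the post's encoder uses; the function names, dimensions, and random weights are all illustrative assumptions.

```python
import numpy as np

def rnn_pass(x, W_x, W_h, reverse=False):
    """Run h_t = tanh(W_x x_t + W_h h_prev) over x of shape (T, d_in)."""
    T = x.shape[0]
    h = np.zeros(W_h.shape[0])
    out = np.zeros((T, W_h.shape[0]))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W_x @ x[t] + W_h @ h)
        out[t] = h
    return out

def bidirectional_encoder(x, params_fwd, params_bwd):
    """Concatenate a left-to-right pass and a right-to-left pass per frame."""
    h_fwd = rnn_pass(x, *params_fwd)                # sees the past of frame t
    h_bwd = rnn_pass(x, *params_bwd, reverse=True)  # sees the future of frame t
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init_params(rng, d_in, d_hid):
    return (rng.normal(scale=0.1, size=(d_hid, d_in)),
            rng.normal(scale=0.1, size=(d_hid, d_hid)))

rng = np.random.default_rng(0)
d_in, d_hid, T = 40, 8, 50
params_fwd = init_params(rng, d_in, d_hid)
params_bwd = init_params(rng, d_in, d_hid)
x = rng.normal(size=(T, d_in))       # stand-in for T spectrogram frames
h = bidirectional_encoder(x, params_fwd, params_bwd)
print(h.shape)  # (50, 16)
```

The first half of each row of `h` depends only on frames up to t, while the second half depends on frames from t onward, which is how the encoder picks up context such as the l following the k in “Kleptomania”.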

Bi-directional LSTM Encoder


Please refer to the Distill article “Sequence Modeling with CTC” (Hannun, 2017) for a detailed discussion of the topic.

Most of us know how a DNN works: given a feature vector x, it transforms it into a target vector y through some transformation at each layer. An LSTM Encoder works similarly. For every feature vector x, it generates a target vector y, but this vector is not completely dependent on the current input. It is also affected by the past (and by the future, in the bidirectional case).

The figure below shows a unidirectional LSTM which produces one target vector for every input feature vector. This is equivalent to getting repeated character outputs from the model. We need some way to collapse these repeated outputs into a human-readable form. That is where CTC comes into the picture.

The frame-level output from a model

In Connectionist Temporal Classification, the problem of alignment is solved by integrating over all possible time-character alignments.

Example: suppose the word is “hi” and the model outputs 3 target vectors. We want to map these 3 vectors to the transcript “hi”, which has 2 letters.

CTC considers every possible way of doing this mapping. The possible frame-level sequences C such that k(C) = W are: hhi, hii, _hi, h_i, hi_. Here “_” is a blank symbol, and k() merges repeated characters and then removes the blanks.
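The “hi” example can be checked by brute force. The sketch below enumerates every 3-frame sequence over {h, i, _} and keeps those that collapse to “hi”; the helper names `collapse` and `alignments` are made up for this illustration.

```python
from itertools import product

BLANK = "_"

def collapse(path):
    """CTC mapping k(C): merge repeated symbols, then drop blanks."""
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in merged if c != BLANK)

def alignments(word, num_frames):
    """All frame-level sequences of length num_frames that collapse to word."""
    symbols = sorted(set(word)) + [BLANK]
    return ["".join(p) for p in product(symbols, repeat=num_frames)
            if collapse(p) == word]

print(alignments("hi", 3))  # ['hhi', 'hii', 'hi_', 'h_i', '_hi']
```

Brute-force enumeration grows exponentially with the number of frames, so it is only useful for building intuition on tiny examples like this one.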

This is illustrated in the Distill article about CTC (Hannun, 2017), as shown below.

CTC, from the Distill article

Thus, in our LSTM encoder, each dimension of the target vector corresponds to the probability of occurrence of a character at that timestep. In the figure below, you can see the probability of occurrence of the character “B” at timestep 17.

Encoder output

Thus, computing the CTC likelihood means computing the likelihood of each possible frame-level label sequence and adding them together. If we are training on the utterance “Hello”, we compute the individual likelihoods of every label sequence that maps to “Hello” and sum them. This sum is the quantity we would like to maximize.
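As a rough sketch of "sum the likelihoods of all label sequences that map to the transcript", the code below computes the CTC likelihood of “hi” over 3 frames by brute force. The uniform per-frame probabilities are made up, and practical implementations use a dynamic-programming recursion instead of enumeration; this version only mirrors the definition.

```python
import numpy as np
from itertools import product

def collapse(path, blank=0):
    """Merge repeated labels, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != blank]

def ctc_likelihood_brute_force(probs, target, blank=0):
    """Sum P(C|x) over every frame-level path C that collapses to target.

    probs has shape (T, V): one distribution over V labels per frame.
    The probability of a path is the product of its per-frame probabilities.
    """
    T, V = probs.shape
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == list(target):
            total += np.prod(probs[np.arange(T), list(path)])
    return total

vocab = ["_", "h", "i"]              # index 0 is the blank symbol
probs = np.full((3, 3), 1.0 / 3.0)   # made-up uniform encoder outputs
likelihood = ctc_likelihood_brute_force(probs, target=[1, 2])
print(likelihood)  # 5 paths, each with probability (1/3)**3
```

Training then adjusts the network so that this summed likelihood (in practice its logarithm) increases for the correct transcript.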

CTC likelihood

We update the network parameters to maximize the likelihood of the correct label sequence.

CTC backprop


In the Encoder-Decoder framework, the Encoder tries to summarize the entire input sequence in a fixed-dimension vector h_t. The Encoder itself is a recurrent neural network (RNN/LSTM/BLSTM/GRU) which takes each input feature vector x_t and updates its internal state so that h_t represents (summarizes) the sequence up to that time.

We could take h_t at every time step to make a prediction (or not), but we shall wait until the end of the sequence at time T and take the representation h_T to start generating our output sequence. This is because we don’t know the word/letter/phoneme boundaries and we are hoping the Encoder can summarize the input sequence entirely inside h_T.

We feed a start-of-sequence token, <sos>, to the Decoder for consistency and to start generating output symbols. The Decoder is another recurrent neural network (not bidirectional) which updates its internal state at every step to predict the output.

At every time step, we feed the output from the previous time step to predict the current output:

$$s_i = \mathrm{Decoder}(s_{i-1}, y_{i-1})$$

Here s_i is the Decoder hidden state when predicting the i-th output symbol, and Decoder() is the function the Decoder LSTM implements to update its hidden representation.

The figure below shows the operation of an example Encoder-Decoder network.


We will stop generating the output symbol sequence when the Decoder generates an <eos> — end of sequence token.

Given the Decoder hidden representation s_{i−1} (from the previous output step) and the output symbol y_{i−1} (the previous output symbol), we can predict the output symbol at the current step as:

$$P(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_{i-1})$$

where g() is the entire Decoder function. The probability of the full output sequence y can be computed as:

$$P(y \mid x) = \prod_{i} P(y_i \mid y_1, \ldots, y_{i-1}, x)$$
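The decoding loop described in this section (start from <sos>, feed back the previous output, stop at <eos>) might look roughly like the sketch below. It uses a plain tanh cell with random, untrained weights and greedy selection of the most likely symbol; the token ids, sizes, and function names are assumptions made for illustration only.

```python
import numpy as np

SOS, EOS = 0, 1  # hypothetical token ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, params):
    """One application of g(): new state and P(y_i | y_<i, x)."""
    E, W_s, W_o = params
    s = np.tanh(W_s @ s_prev + E[y_prev])  # E[y_prev] embeds the previous symbol
    return s, softmax(W_o @ s)

def greedy_decode(h_T, params, max_len=20):
    """Generate symbols until <eos> or max_len, starting from encoder state h_T."""
    s, y = h_T, SOS
    output, log_prob = [], 0.0
    for _ in range(max_len):
        s, p = decoder_step(s, y, params)
        y = int(np.argmax(p))
        log_prob += np.log(p[y])  # accumulates the log of the product over steps
        if y == EOS:
            break
        output.append(y)
    return output, log_prob

rng = np.random.default_rng(0)
vocab_size, d = 6, 8
params = (rng.normal(size=(vocab_size, d)),   # embedding table E
          rng.normal(size=(d, d)),            # state-to-state weights W_s
          rng.normal(size=(vocab_size, d)))   # output projection W_o
h_T = rng.normal(size=d)                      # stand-in for the encoder summary
tokens, log_prob = greedy_decode(h_T, params)
print(tokens, log_prob)
```

The `max_len` cap is a practical safeguard: an untrained (or badly trained) decoder may never emit <eos>.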

Potential issues with the Encoder-Decoder

  • The neural network needs to be able to compress all the necessary information from the input feature vector sequence into a fixed-dimension vector.
  • When the sequence is long, especially when the input sequence at test time is significantly longer than the training ones, the performance of the basic Encoder-Decoder network degrades.
  • Also, in my opinion, how well the Encoder can summarize the entire feature vector sequence depends on the size of the fixed-dimension vector (the longer the sentence, the larger the vector needed), and we cannot fix that size in advance because the sequence length can vary significantly.


One of the solutions to the problems mentioned above that people have been proposing is the use of Attention. Attention is an extension of the Encoder-Decoder framework.

Each time the model needs to generate an output symbol, it (soft-) searches for a set of positions in the input feature vector sequence where the most relevant information is concentrated.

We are now concerned with making the model select the set of positions in the input sequence accurately. The main difference with the Encoder-Decoder framework is that here we are not trying to summarize the entire input sequence into a fixed dimension vector.

Here, instead of feeding only the final hidden representation h_T, we select the subset of the encoded representations h_t that are most relevant to the current context to help the Decoder network generate the output.

We linearly blend these relevant h_t to get what we refer to as the Context vector C_i.

Attention: In a way, the model is attending to a subset of the input features which are most relevant to the current context.

In deep learning, we would like our functions to be differentiable so that we can learn them using backprop. To make this technique of attending to a subset differentiable, we attend to all the input feature vectors, but with different weights.

How is Attention different from the Encoder-Decoder

  • In the Encoder-Decoder network discussed above, the Decoder hidden state is computed as: $s_i = \mathrm{Decoder}(s_{i-1}, y_{i-1})$
  • In the Attention extension, we also use the Context vector when computing the Decoder hidden state: $s_i = \mathrm{Decoder}(s_{i-1}, y_{i-1}, C_i)$
  • The Context vector is the summary of only the most relevant input feature vectors. To capture this relevance, consider a variable $\alpha$, where $\alpha_{i,j}$ represents the weight of the encoded representation (also referred to as the annotation) $h_j$ in the Context vector $C_i$ used for predicting the output at step $i$. Given $\alpha$, we can compute the Context vector as: $C_i = \sum_{j=1}^{T} \alpha_{i,j}\, h_j$
  • To compute $\alpha_{i,j}$, we need $e_{i,j}$, the importance of the j-th annotation vector for predicting the i-th output symbol. This is what the compatibility function produces. The weight $\alpha_{i,j}$ of each annotation $h_j$ is computed with a softmax over the input positions: $\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}$
  • Here $e_{i,j} = \mathrm{AttentionFunction}(s_{i-1}, h_j)$ is a compatibility function which scores the importance of each annotation $h_j$ given the Decoder hidden state $s_{i-1}$.
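The steps in the list above (score each annotation, normalize the scores with a softmax, blend the annotations into a Context vector) can be sketched with a pluggable scoring function. The additive scoring function used here is just one possible choice of AttentionFunction and is an assumption of this sketch, as are all names, sizes, and random weights.

```python
import numpy as np

def attend(s_prev, H, attention_function):
    """Return (alpha_i, C_i) for one decoder step.

    s_prev: previous decoder state, shape (d_s,)
    H:      annotations h_1..h_T stacked as rows, shape (T, d_h)
    """
    e = np.array([attention_function(s_prev, h_j) for h_j in H])
    e = e - e.max()                       # for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # softmax over input positions
    context = alpha @ H                   # weighted sum of annotations
    return alpha, context

def make_additive_attention(W_s, W_h, v):
    """One possible AttentionFunction: v . tanh(W_s s + W_h h)."""
    def score(s_prev, h_j):
        return v @ np.tanh(W_s @ s_prev + W_h @ h_j)
    return score

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 4, 5, 3
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
attention_function = make_additive_attention(
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)
)
alpha, context = attend(s_prev, H, attention_function)
print(alpha.round(3), context.shape)
```

Because every annotation gets a non-zero weight, the whole computation stays differentiable, and swapping in a different `attention_function` changes the type of attention without touching `attend`.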

In all our Attention models, it is this function AttentionFunction() that is going to be different. AttentionFunction() defines what type of Attention it is.

The figure below shows an example of Attention in Speech Recognition. The bottom-most panel shows the input feature vectors (spectrogram), the one above it shows where the model is concentrating at each output time step, the third panel shows the attention weights at each time step, and the fourth panel shows the full attention weight distribution over the entire input-output sequence.

Attention over an input sequence to generate the output

If you look at the Attention weights before the model is trained (at epoch 0), they are all random, and hence the Context vector C_i contains unnecessary noise from irrelevant input feature vectors. This leads to degraded performance of the model. A good Attention model produces a better Context vector, which in turn leads to better model performance.

This can be shown in an abstract form as follows. The encoded feature vectors are blended together with some weights and then fed to the Decoder network to make a decision. The weights with which these vectors are blended are decided by the AttentionFunction.

Attention — an abstract representation

We could also look at how the Attention weights progress over time (epochs) to get a deeper understanding of how the model is learning. I did just that, combining the Attention weights from each epoch into a GIF. Here’s what it looks like:

Attention weights progress over epochs during training

Many attention models are widely used in literature. A toolkit like ESPnet itself has around 13 attention mechanisms that we can easily start experimenting with.

Finally, let’s take an example of one of these attention mechanisms — dot product attention and see how attention works end-to-end to generate the context vector Ci.

Dot product attention
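A minimal sketch of dot-product attention, computed for all decoder steps at once, is shown below. It assumes the decoder states and the annotations have the same dimension (otherwise a projection would be needed first); the toy inputs are chosen so that the expected alignment is obvious and are not taken from any real model.

```python
import numpy as np

def dot_product_attention(S, H):
    """Dot-product attention for all decoder steps at once.

    S: decoder states s_{i-1} stacked as rows, shape (U, d)
    H: encoder annotations h_j stacked as rows, shape (T, d)
    Returns the weights A of shape (U, T) and the contexts C of shape (U, d).
    """
    E = S @ H.T                            # e_{i,j} = s_{i-1} . h_j
    E = E - E.max(axis=1, keepdims=True)   # for numerical stability
    A = np.exp(E)
    A /= A.sum(axis=1, keepdims=True)      # softmax over input positions j
    return A, A @ H                        # C_i = sum_j alpha_{i,j} h_j

# Toy annotations: four orthogonal vectors. Each decoder state is made to
# match one annotation, so its attention should peak on that position.
H = 5.0 * np.eye(4)
S = H[[2, 0]]
A, C = dot_product_attention(S, H)
print(A.argmax(axis=1))  # [2 0]
```

Plotting a matrix like `A` for a trained model gives exactly the kind of attention-weight picture shown in the figures above.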

Beyond Attention — Joint CTC/Attention architectures

Attention-based ASR may be prone to deletion and insertion errors because of its flexible alignment property, which lets it attend to any portion of the encoder state sequence to predict the next label, as discussed in Section II-C of Watanabe et al. (2017). Since attention is generated by the decoder network, it may prematurely predict the end-of-sequence even when it has not attended to all of the encoder frames, making the hypothesis too short. On the other hand, it may predict the next label with a high probability by attending to the same portions as those attended to before. In this case, the hypothesis becomes very long and includes repetitions of the same label sequence.

For this reason and many others, a joint CTC/Attention architecture was proposed, which effectively utilizes the advantages of both architectures in both training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments.
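The joint decoding idea can be sketched as a weighted sum of the two log-probabilities, with a weight λ between 0 and 1 on the CTC score. The hypotheses, their scores, and the value λ = 0.3 below are all made up for illustration; in a real system this scoring happens inside the beam search rather than on a finished list.

```python
def joint_score(log_p_ctc, log_p_att, lam=0.3):
    """Interpolate CTC and attention log-probabilities for one hypothesis."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

# Hypothetical hypotheses with made-up (log p_ctc, log p_att) scores.
# The attention decoder alone slightly prefers a hypothesis that stops
# too early; CTC penalizes it because it does not cover all the frames.
hypotheses = {
    "hello world":       (-3.0, -2.5),
    "hello":             (-15.0, -2.0),
    "hello world world": (-14.0, -2.2),
}

best_att = max(hypotheses, key=lambda h: hypotheses[h][1])
best_joint = max(hypotheses, key=lambda h: joint_score(*hypotheses[h]))
print(best_att, "|", best_joint)
```

The same kind of interpolation is used for the multiobjective training loss, where the CTC and attention objectives are combined with a weight.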

Joint CTC/Attention architecture


This post gave an overview of how we progress from a traditional ASR system to one with an Encoder-Decoder architecture with Attention. It also covered some of the drawbacks of the traditional architectures and how they can be overcome with the Encoder-Attention-Decoder architecture, and gave a short introduction to the newer hybrid CTC/Attention architecture.


[1] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A Maximum Likelihood Approach to Continuous Speech Recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 179–190, 1983.


[3] S. Watanabe et al., “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” IEEE J. Sel. Top. Signal Process., vol. 11, no. 8, pp. 1240–1253, 2017.

[4] A. Hannun, “Sequence Modeling with CTC,” Distill, 2017.

Intel Student Ambassadors

Stories from Intel Student Ambassadors working on innovative use cases and projects around Artificial Intelligence, Machine Learning & Deep Learning. Learn More Here:

Shreekantha Nadig

Written by

Research Scholar at IIIT–Bangalore. Interested in Machine Learning, Data Sciences, Speech Processing & everything else.
