Predict the next sentence for President’s Speech dataset.

Ankita Sinha
Jun 11 · 5 min read

Generate the next sentence using a Temporal Convolutional Network (TCN).

Sequence Model

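A Temporal Convolutional Network (TCN) is a variation of Convolutional Neural Networks for sequence modelling tasks. It is a strong alternative to RNNs (recurrent neural networks) and is much less prone to the vanishing or exploding gradient problems that affect recurrent models, because gradients do not have to flow back through time step by step.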

TCNs are implemented using Dilated Causal blocks. Causal Blocks are convolution blocks that can only look into the past and not into the future. TCN is thus an Auto-Regressive model. Causal Blocks prevent the model from cheating by directly looking at the next word!

Causal convolutions look only at the past.

With simple causal layers, the receptive field grows only linearly with depth, so they do not work well for tasks that require a long history. We solve this by using dilated convolutions, with the dilation increasing exponentially in each layer. Since the TCN’s receptive field depends heavily on network depth (dilations increase with each layer, so the deeper the model, the larger its receptive field), residual connections are used to guard against vanishing or exploding gradients.

A dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3. The receptive field is able to cover all values from the input sequence.
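To make the caption concrete, here is a small sketch of how the receptive field of stacked dilated causal convolutions can be computed. It assumes one convolution per layer; the function name `receptive_field` is mine and is not part of the original code.

```python
def receptive_field(kernel_size, dilations):
    """Number of input steps that can influence one output step.

    Assumes a stack with one causal convolution per layer.
    """
    # Each layer extends the visible history by (kernel_size - 1) * dilation steps.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# The setting from the figure: k = 3 and d = 1, 2, 4.
print(receptive_field(3, [1, 2, 4]))  # 15

# Same depth without dilation: the history grows only linearly.
print(receptive_field(3, [1, 1, 1]))  # 7
```

With the same three layers, doubling the dilation at each layer roughly doubles the history the model can see.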

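In this post, the TCN is used as a character-level generation model. At each step it outputs a probability for every character in its vocabulary: the 26 letters of the English alphabet plus, in my implementation, space and period.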


The probability of each output depends on the input and the previous outputs, going sequentially. We can use the same sentence as both our input and our label by shifting the letters by 1. Let’s assume for simplicity that the model has seen “th”. Our input is then “th” and the label is “e”. This is similar to how we train LSTMs.

The added benefit of a TCN over an LSTM is that there is no recurrent connection during training, so the output at one position does not have to wait for the output at the previous position. All positions can therefore be computed in parallel, unlike in an LSTM, and we can train our temporal network in a fully convolutional way. Convolutions also have the benefit of adding local information to the temporal information.

Character Level Language Model

There are several ways to generate a sentence from the outputs of your language model (TCN). A commonly preferred way is beam search.

At every step (for every character), beam search expands each candidate string with every possible next character and looks at the log-likelihood of each extension. For example, if we start with ‘a’, we look at the log-likelihood of ‘aa’, ‘ab’, ‘ac’, and so on. Keeping every extension would require exponential space. To solve this, we only keep the top few candidates (called beams) and discard the rest.

We generally use 2 criteria for selecting the substring:

  1. Per-Character Log-Likelihood
  2. Overall Log-Likelihood

Log-likelihood is the log probability of that string under the language model (here, the TCN).
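As a rough illustration of the two criteria, the sketch below scores a string under a language model. The `uniform_model` function is a hypothetical stand-in for the TCN (it treats all 28 characters as equally likely), and the names are mine, not from the original implementation.

```python
import math

VOCAB = "abcdefghijklmnopqrstuvwxyz ."


def uniform_model(prefix):
    # Hypothetical stand-in for the TCN: P(next char | prefix) is uniform.
    return {ch: 1.0 / len(VOCAB) for ch in VOCAB}


def log_likelihood(text, next_char_probs):
    """Overall log-likelihood: sum of log P(char | everything before it)."""
    total = 0.0
    for i, ch in enumerate(text):
        probs = next_char_probs(text[:i])
        total += math.log(probs[ch])
    return total


overall = log_likelihood("the", uniform_model)
per_char = overall / len("the")
print(overall, per_char)
```

The overall score always drops as the string gets longer, so it favours short strings. Dividing by the length gives the per-character score, which makes strings of different lengths comparable.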

Now let us look at how to implement a TCN.

TCNs are made of causal convolutions with a small number of channels (generally < 50), repeated to get a large depth. In 1D (we need only one dimension, to take care of time), we implement causal convolutions using Conv1d layers and shifting; padding is what shifts the network. My vocabulary size is 28: the 26 letters plus “ ” (space) and “.” (period).

Let us first look into the major differentiating components of the TCN class and then we will go into the building blocks.

  1. We have used many layers with a small number of channels (8 filters, repeated 10 times). It is preferred to keep the number of channels <= 50.
  2. The dilation at each level is exponential (2^i) to increase our receptive field.
  3. The padding at each level also changes according to the dilation: padding = (kernel_size - 1) * dilation_size.
  4. Because the padding grows exponentially from layer to layer, we have implemented a Chomp class to trim the output back down in size, and we are also using Dropout.
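The dilation and padding rules in points 2 and 3 can be tabulated with a few lines of Python. This is only a sketch: the post does not state the kernel size, so `kernel_size=3` is my assumption, and `layer_schedule` is a hypothetical helper name.

```python
def layer_schedule(num_levels, kernel_size):
    """Return (level, dilation, padding) for each level of the stack."""
    schedule = []
    for i in range(num_levels):
        dilation = 2 ** i                        # point 2: exponential dilation
        padding = (kernel_size - 1) * dilation   # point 3: padding follows dilation
        schedule.append((i, dilation, padding))
    return schedule


# 10 levels as in the post; kernel_size=3 is an assumed value.
for level, dilation, padding in layer_schedule(num_levels=10, kernel_size=3):
    print(f"level {level}: dilation={dilation}, padding={padding}")
```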

Our CausalConv1Block is thus a sequential block: a Conv1d layer with dilation and padding, followed by a non-linearity and dropout. I have used ConstantPad1d and added a padding of (kernel_size - 1) * dilation to the left only, with 0 padding on the right. This shifts the network.
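A minimal PyTorch sketch of such a block is shown below. It is not the original code: the class name `LeftPaddedConvBlock`, the choice of ReLU as the non-linearity, the dropout rate and the kernel size are all my assumptions. It uses PyTorch's `nn.ConstantPad1d` and `nn.Conv1d`, and checks that changing future inputs leaves earlier outputs untouched.

```python
import torch
from torch import nn


class LeftPaddedConvBlock(nn.Module):
    """Left padding -> dilated Conv1d -> ReLU -> dropout (illustrative sketch)."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation, dropout=0.1):
        super().__init__()
        left_pad = (kernel_size - 1) * dilation
        self.net = nn.Sequential(
            nn.ConstantPad1d((left_pad, 0), 0.0),  # pad the past only, nothing on the right
            nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation),
            nn.ReLU(),                             # assumed non-linearity
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x has shape (batch, channels, time)
        return self.net(x)


block = LeftPaddedConvBlock(in_channels=28, out_channels=8, kernel_size=3, dilation=2)
block.eval()  # switch dropout off so the comparison below is deterministic

x = torch.randn(1, 28, 12)
y = block(x)

# Causality check: perturb time steps 8 onward and compare the first 8 outputs.
x_future = x.clone()
x_future[:, :, 8:] += 1.0
y_future = block(x_future)

print(y.shape)  # same number of time steps as the input
print(torch.allclose(y[:, :, :8], y_future[:, :, :8]))
```

Because the padding is added only on the left, the output has the same number of time steps as the input, and the output at step t depends only on inputs at steps up to t.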

To train our model, we use the same sentence for both input and label, shifted by 1. For example, if our sentence is “The apple fell from the tree”, our input is “The apple fell from the tre” in one-hot encoded format, and our label is “The apple fell from the tree”. The model starts from an empty string, and its first prediction should be “T”. We achieve this by coding the first character as a torch.nn.Parameter, which automatically adds it to the list of module parameters. You can see this in line 21 of the first code block.
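The shifting can be sketched in plain Python as follows. I assume a 28-character vocabulary of lowercase letters, space and period, and I lowercase the sentence to fit it. The helper names `one_hot` and `make_example` are hypothetical.

```python
VOCAB = "abcdefghijklmnopqrstuvwxyz ."


def one_hot(text):
    """Encode text as a list of one-hot rows, one row per character."""
    rows = []
    for ch in text:
        row = [0] * len(VOCAB)
        row[VOCAB.index(ch)] = 1
        rows.append(row)
    return rows


def make_example(sentence):
    sentence = sentence.lower()                     # assumption: lowercase-only vocabulary
    inputs = one_hot(sentence[:-1])                 # every character except the last
    labels = [VOCAB.index(ch) for ch in sentence]   # every character, as class indices
    return inputs, labels


inputs, labels = make_example("The apple fell from the tree")
print(len(inputs), len(labels))
```

The input is one step shorter than the label. The learned first-character parameter described above fills that gap: prepending it to the input lines up position i of the input with label i, so the very first prediction is made from the "empty string".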

Now we will use beam search to generate the top sentences from our TCN. As we know, the TCN generates 1 character at a time. It outputs the log-probabilities of each of the 28 characters: the 26 letters of the English alphabet, plus space and period. At every step, beam search expands all possible characters and stores the top candidate substrings at each level based on the average per-character log-likelihood. Beam search stops when it predicts a “.” or reaches the specified maximum length. You can find a greedy Beam Search implementation here. You can learn more about beam search in this video by DeepLearningAI.
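The procedure described above can be sketched as follows. This is not the linked implementation. `toy_model` is a hypothetical stand-in for the TCN that strongly prefers to spell “hi.”, and the beam size and maximum length are arbitrary values chosen for the example.

```python
import math

VOCAB = "abcdefghijklmnopqrstuvwxyz ."


def toy_model(prefix):
    # Hypothetical stand-in for the TCN: returns log P(next char | prefix).
    target = "hi."
    favourite = target[len(prefix)] if len(prefix) < len(target) else "."
    probs = {ch: 0.01 / (len(VOCAB) - 1) for ch in VOCAB}
    probs[favourite] = 0.99
    return {ch: math.log(p) for ch, p in probs.items()}


def average_score(candidate):
    text, total = candidate
    return total / max(len(text), 1)  # per-character log-likelihood


def beam_search(log_probs_fn, beam_size=3, max_length=20):
    beams = [("", 0.0)]  # (string so far, overall log-likelihood)
    finished = []
    for _ in range(max_length):
        candidates = []
        for text, total in beams:
            log_probs = log_probs_fn(text)
            for ch in VOCAB:  # expand every possible next character
                candidates.append((text + ch, total + log_probs[ch]))
        candidates.sort(key=average_score, reverse=True)
        beams = []
        for candidate in candidates[:beam_size]:  # keep only the top few
            if candidate[0].endswith("."):
                finished.append(candidate)        # a period ends the sentence
            else:
                beams.append(candidate)
        if not beams:
            break
    finished.extend(beams)  # strings cut off at max_length
    finished.sort(key=average_score, reverse=True)
    return finished[:beam_size]


best_text, best_total = beam_search(toy_model)[0]
print(best_text)  # hi.
```

Ranking by the average rather than the overall log-likelihood keeps longer sentences from being penalised simply for being long.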

Output from Beam Search

You can find the code for the complete implementation here.

I have trained my model on Obama’s speech. You can find the dataset on Kaggle.

Analytics Vidhya
