Humor prediction of Spanish sentences by using Tensorflow 2.0 and Transformers
In this post we describe a transformer-like architecture we implemented at Umayux Labs (@UmayuxLabs) to predict whether a sentence is humorous or not, using a Spanish corpus from the HAHA@IberLEF2019 dataset. We based our implementation on this TensorFlow 2.0 transformer guide.
Dataset
The dataset used for training and testing the model consists of 24,000 crowd-annotated tweets from the 2019 HAHA dataset. Briefly, the tweets were scored and tagged using a voting strategy: users were asked to label each tweet as humorous (1) or not (0) and to give it a score from 1 (not funny) to 5 (excellent).
Tweets were labeled as humorous if they received at least three votes assigning a number of stars and at least five votes in total. Tweets were labeled as not humorous if they received at least three "not humor" votes.
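The voting rule above can be sketched as a small function. This is our own illustration, not code from the HAHA dataset release; the function and argument names are hypothetical.

```python
def label_tweet(star_votes, not_humor_votes):
    """Sketch of the labeling rule described above (names are ours).

    star_votes: list of 1-5 star ratings from annotators who marked
    the tweet as humorous; not_humor_votes: count of "not humor" votes.
    """
    total_votes = len(star_votes) + not_humor_votes
    if len(star_votes) >= 3 and total_votes >= 5:
        return 1  # humorous: >=3 star votes and >=5 votes in total
    if not_humor_votes >= 3:
        return 0  # not humorous: >=3 "not humor" votes
    return None  # otherwise undecided
```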
The structure of the dataset is described in the table below:
For more details about the dataset, please follow this link
Deep Learning Model
We built a neural network based on the transformer architecture with multi-head attention (see this post for details) to predict whether a tweet is humorous or not. However, as our goal was not to produce a sentence (as in the case of language translation), we dropped the decoder and added a classification module on top of the encoder.
The Encoder
- The input to the encoder is the sum of the input embedding and the positional encoding.
- The body of the encoder consists of multi-head attention followed by a normalization layer, a feed-forward layer, and a second normalization layer.
- Dropout is applied to the output of each sub-layer before normalization.
- The encoder also outputs its attention weights.
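The ordering in the bullets above (sub-layer, then dropout, then residual add and normalization) can be sketched framework-agnostically. This is a minimal NumPy illustration, not our TensorFlow implementation; `attn_fn` and `ffn` are stand-ins for the multi-head attention and feed-forward sub-layers.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Standard sinusoidal positional encoding, added to the embeddings."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(x, attn_fn, ffn, dropout_rate=0.1, training=False):
    """One encoder block: sub-layer -> dropout -> residual add -> norm."""
    def dropout(h):
        if not training:
            return h
        mask = np.random.binomial(1, 1 - dropout_rate, h.shape)
        return h * mask / (1 - dropout_rate)

    attn_out = dropout(attn_fn(x))     # multi-head attention sub-layer
    x = layer_norm(x + attn_out)       # first normalization
    ffn_out = dropout(ffn(x))          # feed-forward sub-layer
    return layer_norm(x + ffn_out)     # second normalization
```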
The classification layer
The encoder output is flattened and fed into a dense layer with a softmax activation function that predicts the humor/no-humor classes.
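As a sketch, the classification head is just a flatten followed by a dense softmax over two classes. The NumPy version below is our illustration of that computation, not the actual Keras layer from the model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def classification_head(enc_out, W, b):
    """Flatten the encoder output and apply one dense softmax layer.

    enc_out: (batch, seq_len, d_model) encoder output;
    W: (seq_len * d_model, 2) dense weights; b: (2,) bias.
    Returns (batch, 2) class probabilities (humor / no humor).
    """
    flat = enc_out.reshape(enc_out.shape[0], -1)
    return softmax(flat @ W + b)
```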
Training
To train and test the model, the HAHA dataset was first divided into training (64%), test (16%), and validation (20%) sets:
- Train: 15360
- Test: 3840
- Val: 4800
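The counts above are consistent with holding out 20% for validation and then splitting the remainder 80/20 into train and test. The authors' exact shuffling and stratification are not specified, so the helper below is only one scheme that reproduces those counts.

```python
import random

def split_dataset(examples, seed=0):
    """Sketch of a 20% validation hold-out followed by an 80/20
    train/test split of the remainder, which reproduces the counts
    above (15,360 / 3,840 / 4,800 out of 24,000)."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    n = len(examples)
    n_val = int(0.2 * n)              # 20% validation
    n_test = int(0.2 * (n - n_val))   # 20% of the remainder for test
    val = examples[:n_val]
    test = examples[n_val:n_val + n_test]
    train = examples[n_val + n_test:]
    return train, test, val
```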
The test set was used to check the performance of the model after each epoch. After 5 epochs, the checkpoint with the highest test-set performance was kept. This process was then repeated 3 times. We observed that with this strategy the model's performance improved faster than with 15 consecutive epochs. Finally, since the test set was involved in selecting the best parameters, the validation set was used to assess the performance of the best-fit model.
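The schedule above can be sketched as a restart loop: within each round of 5 epochs, keep the weights with the best test-set score, restore them, and start the next round from there. The `model` interface (`fit_one_epoch`, `evaluate`, `get_weights`, `set_weights`) is hypothetical, used only to illustrate the strategy.

```python
def train_with_restarts(model, train_data, test_data,
                        rounds=3, epochs_per_round=5):
    """Sketch of the checkpointing schedule described above.

    `model` is assumed to expose fit_one_epoch / evaluate /
    get_weights / set_weights (hypothetical names for illustration).
    """
    best_score, best_weights = float("-inf"), model.get_weights()
    for _ in range(rounds):
        model.set_weights(best_weights)     # restart from best checkpoint
        for _ in range(epochs_per_round):
            model.fit_one_epoch(train_data)
            score = model.evaluate(test_data)
            if score > best_score:          # keep best test-set iteration
                best_score, best_weights = score, model.get_weights()
    model.set_weights(best_weights)
    return model, best_score
```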
Results
Our model achieved a 0.83 accuracy and a 0.83 weighted averaged F1 score on the validation set.
As shown in the figure below, our model detected not humorous tweets better than humorous ones. Although it is not the same validation dataset as in the challenge, our model achieved an average performance comparable to other competitors from the HAHA challenge.
One of the outputs of our model is an embedding vector taken from the output of the transformer-like structure just before the classification layer. We used this embedding vector to project the tweets into a three-dimensional space with UMAP (using the Embedding Projector). The results are shown below (a video is also available here):
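To reproduce such a projection, the pre-classification embeddings can be exported as TSV files for the TensorFlow Embedding Projector (projector.tensorflow.org), which computes the UMAP projection in the browser. This helper is a minimal sketch assuming one embedding row and one label per tweet; with a single metadata column, the Projector expects no header row.

```python
import numpy as np

def export_for_projector(embeddings, labels, vectors_path, metadata_path):
    """Write embedding vectors and per-row labels as tab-separated
    files in the format the Embedding Projector loads."""
    np.savetxt(vectors_path, np.asarray(embeddings), delimiter="\t")
    with open(metadata_path, "w") as f:
        for label in labels:               # single column, no header
            f.write(f"{label}\n")
```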
As illustrated in the figure above, the model was able to differentiate the two classes of interest (humor/not humor) and group them into three dominant clusters. However, one cluster contained a mixture of humorous and not humorous tweets with both correct and wrong predictions. When analyzing this cluster, we found that some of these tweets are not labeled properly, which could hurt the performance of our model (and others). Overall, though, the curation of the dataset looks correct.
Understanding the model with attention
The transformer topology used in our model allows us to identify where the neural network is attending when it predicts humorous or non-humorous texts. Our model consists of a transformer encoder in which 10% of the input words are randomly dropped; it has 12 heads, 8 encoder layers, a vocabulary of 20,000 tokens, and a model dimension of 512. Each tweet is limited to a maximum length of 40 words, so the attention for each tweet forms a 12x40x40 matrix. Here are a few examples of attention:
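The 12x40x40 shape follows directly from scaled dot-product attention with 12 heads over 40 token positions, as the sketch below shows. Note that the standard formulation requires the model dimension to be divisible by the number of heads, so the example uses an illustrative dimension rather than the model's exact 512; this is our own shape-level illustration, not the model's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_weights(q, k, num_heads=12):
    """Per-tweet attention weights for one layer: with 12 heads and
    tweets padded to 40 tokens, the result is a (12, 40, 40) tensor.

    q, k: (seq_len, d_model) query and key matrices.
    """
    seq_len, d_model = q.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # split into heads: (num_heads, seq_len, d_head)
    qh = q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    kh = k.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores)  # each row sums to 1 over attended positions
```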
Attention to humorous tweets
In this section we visualize what makes a tweet funny, that is, what the network attends to when classifying funny tweets. The following figure shows the attention weights across all the attention heads.
Attention to non humorous tweets
The next few examples show how the model pays attention to non-humorous tweets. Interestingly, unlike for humorous tweets, attention head #12 does not attend to the first sentence; instead, attention head #4 seems to attend to the first sentence for non-humorous tweets. Also note that heads #3 and #12 attend to the last words of the sentences.
In general, for humorous tweets, the first sentence seems to be key to determining whether the tweet is funny (as seen in attention head #12). Because most words were split into smaller subword n-grams, it is not possible to determine the weight of each individual word. However, the results clearly suggest that attention head #12 captures the topology of the sentence. Note that for non-humorous tweets, this attention head discards the first sentence and weights the middle and last sentences instead.
Cases where the model does not work properly
The next example shows a humorous tweet classified as not humorous. Note that attention head #12 did not weight the first sentence, as this sentence does not have a funny context.
In the next example, a tweet labeled as humorous was classified as not humorous by the model. However, this tweet could also be considered not funny, or even offensive. For example, the words "sean tontas" ("they are dumb") may be considered unfunny and offensive by some people.
Overall, we can see that the attention given by the model varies according to the context of the sentences and their category. In addition, some tweets cannot be clearly labeled as humorous or non-humorous, which makes prediction harder; but in general the model achieves a good prediction score, and it is possible to peek into what the model is learning. This model has been applied to a humor dataset, but it can be applied to any other text dataset. We will be showing its performance on other tasks.