Humor prediction of Spanish sentences by using Tensorflow 2.0 and Transformers
In this post we describe a transformer-like architecture we implemented at Umayux Labs (@UmayuxLabs) to predict whether a sentence is humorous or not, using a Spanish corpus from the HAHA@IberLEF2019 dataset. We based our implementation on this TensorFlow 2.0 transformer guide.
Dataset
The dataset used for training and testing the model consists of 24,000 crowd-annotated tweets from the 2019 HAHA dataset. Briefly, the tweets were scored and tagged using a voting strategy: users were asked to label each tweet as humorous (1) or not (0) and to give it a score from 1 (not funny) to 5 (excellent).
Tweets were labeled as humorous if they received at least three votes assigning a number of stars and at least five votes in total. Tweets were labeled as not humorous if they received at least three "not humor" votes.
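The voting rule above can be sketched as a small function. This is our own illustration, not code from the HAHA dataset release; the function and argument names are hypothetical.

```python
def label_tweet(star_votes, not_humor_votes):
    """Sketch of the labeling rule described above (names are ours).

    star_votes: list of 1-5 star ratings from annotators who marked
    the tweet as humorous; not_humor_votes: count of "not humor" votes.
    """
    total_votes = len(star_votes) + not_humor_votes
    if len(star_votes) >= 3 and total_votes >= 5:
        return 1  # humorous: >=3 star votes and >=5 votes in total
    if not_humor_votes >= 3:
        return 0  # not humorous: >=3 "not humor" votes
    return None  # otherwise undecided
```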
The structure of the dataset is described in the table below:
For more details about the dataset, please follow this link
Deep Learning Model
We built a neural network based on the transformer architecture with multi-head attention (see this post for details) to predict whether a tweet is humorous or not. However, as our goal was not to produce a sentence (as in the case of language translation), we dropped the decoder and added a classification module on top of the encoder.
The Encoder
- The input to the encoder is the sum of the input embedding and the positional encoding.
- The body of the encoder consists of multi-head attention followed by a normalization layer, a feed-forward layer, and a second normalization layer.
- Dropout is applied to the output of each sub-layer before normalization.
- The encoder also outputs its attention weights.
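The ordering in the bullets above (sub-layer, then dropout, then residual add and normalization) can be sketched framework-agnostically. This is a minimal NumPy illustration, not our TensorFlow implementation; `attn_fn` and `ffn` are stand-ins for the multi-head attention and feed-forward sub-layers.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Standard sinusoidal positional encoding, added to the embeddings."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(x, attn_fn, ffn, dropout_rate=0.1, training=False):
    """One encoder block: sub-layer -> dropout -> residual add -> norm."""
    def dropout(h):
        if not training:
            return h
        mask = np.random.binomial(1, 1 - dropout_rate, h.shape)
        return h * mask / (1 - dropout_rate)

    attn_out = dropout(attn_fn(x))     # multi-head attention sub-layer
    x = layer_norm(x + attn_out)       # first normalization
    ffn_out = dropout(ffn(x))          # feed-forward sub-layer
    return layer_norm(x + ffn_out)     # second normalization
```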
The classification layer
The encoder output is flattened and fed into a dense layer with a softmax activation function that predicts the humor/no-humor classes.
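As a sketch, the classification head is just a flatten followed by a dense softmax over two classes. The NumPy version below is our illustration of that computation, not the actual Keras layer from the model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def classification_head(enc_out, W, b):
    """Flatten the encoder output and apply one dense softmax layer.

    enc_out: (batch, seq_len, d_model) encoder output;
    W: (seq_len * d_model, 2) dense weights; b: (2,) bias.
    Returns (batch, 2) class probabilities (humor / no humor).
    """
    flat = enc_out.reshape(enc_out.shape[0], -1)
    return softmax(flat @ W + b)
```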
Training
To train and test the model, the HAHA dataset was first divided into training (64%), test (16%), and validation (20%) sets:
- Train: 15360
- Test: 3840
- Val: 4800
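The counts above are consistent with holding out 20% for validation and then splitting the remainder 80/20 into train and test. The authors' exact shuffling and stratification are not specified, so the helper below is only one scheme that reproduces those counts.

```python
import random

def split_dataset(examples, seed=0):
    """Sketch of a 20% validation hold-out followed by an 80/20
    train/test split of the remainder, which reproduces the counts
    above (15,360 / 3,840 / 4,800 out of 24,000)."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    n = len(examples)
    n_val = int(0.2 * n)              # 20% validation
    n_test = int(0.2 * (n - n_val))   # 20% of the remainder for test
    val = examples[:n_val]
    test = examples[n_val:n_val + n_test]
    train = examples[n_val + n_test:]
    return train, test, val
```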
The test set was used to check the performance of the model after each epoch. After 5 epochs, the checkpoint with the highest test-set performance was kept. This process was then repeated 3 times. We observed that with this strategy the model's performance improved faster than with 15 consecutive epochs. Finally, since the test set was involved in selecting the best parameters, the validation set was used to assess the performance of the best-fit model.
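The schedule above can be sketched as a restart loop: within each round of 5 epochs, keep the weights with the best test-set score, restore them, and start the next round from there. The `model` interface (`fit_one_epoch`, `evaluate`, `get_weights`, `set_weights`) is hypothetical, used only to illustrate the strategy.

```python
def train_with_restarts(model, train_data, test_data,
                        rounds=3, epochs_per_round=5):
    """Sketch of the checkpointing schedule described above.

    `model` is assumed to expose fit_one_epoch / evaluate /
    get_weights / set_weights (hypothetical names for illustration).
    """
    best_score, best_weights = float("-inf"), model.get_weights()
    for _ in range(rounds):
        model.set_weights(best_weights)     # restart from best checkpoint
        for _ in range(epochs_per_round):
            model.fit_one_epoch(train_data)
            score = model.evaluate(test_data)
            if score > best_score:          # keep best test-set iteration
                best_score, best_weights = score, model.get_weights()
    model.set_weights(best_weights)
    return model, best_score
```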
Results
Our model achieved a 0.83 accuracy and a 0.83 weighted averaged F1 score on the validation set.
As shown in the figure below, our model detected not humorous tweets better than humorous ones. Although it is not the same validation dataset as in the challenge, our model achieved an average performance comparable to other competitors from the HAHA challenge.
One of the outputs of our model is an embedding vector taken from the output of the transformer-like structure just before the classification layer. We used this embedding vector to project the tweets into a three-dimensional space with UMAP (using the Embedding Projector). The results are shown below (a video is also available here):
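To reproduce such a projection, the pre-classification embeddings can be exported as TSV files for the TensorFlow Embedding Projector (projector.tensorflow.org), which computes the UMAP projection in the browser. This helper is a minimal sketch assuming one embedding row and one label per tweet; with a single metadata column, the Projector expects no header row.

```python
import numpy as np

def export_for_projector(embeddings, labels, vectors_path, metadata_path):
    """Write embedding vectors and per-row labels as tab-separated
    files in the format the Embedding Projector loads."""
    np.savetxt(vectors_path, np.asarray(embeddings), delimiter="\t")
    with open(metadata_path, "w") as f:
        for label in labels:               # single column, no header
            f.write(f"{label}\n")
```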
As illustrated in the figure above, the model was able to differentiate the two classes of interest (humor/not humor) and group them into three dominant clusters. However, one cluster contained a mixture of humorous and not humorous tweets with both correct and wrong predictions. When analyzing this cluster, we found that some of these tweets are not labeled properly, which could hurt the performance of our model (and others). Overall, though, the curation of the dataset looks correct.
Understanding the model with attention
The transformer topology used in our model allows us to identify where the neural network is attending when it predicts humorous or non-humorous texts. Our model consists of a transformer encoder in which 10% of the input words are randomly dropped; it has 12 heads, 8 encoder layers, a vocabulary of 20,000 tokens, and a model dimension of 512. Each tweet is limited to a maximum length of 40 words, so the attention for each tweet forms a 12x40x40 matrix. Here are a few examples of attention:
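The 12x40x40 shape follows directly from scaled dot-product attention with 12 heads over 40 token positions, as the sketch below shows. Note that the standard formulation requires the model dimension to be divisible by the number of heads, so the example uses an illustrative dimension rather than the model's exact 512; this is our own shape-level illustration, not the model's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_weights(q, k, num_heads=12):
    """Per-tweet attention weights for one layer: with 12 heads and
    tweets padded to 40 tokens, the result is a (12, 40, 40) tensor.

    q, k: (seq_len, d_model) query and key matrices.
    """
    seq_len, d_model = q.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # split into heads: (num_heads, seq_len, d_head)
    qh = q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    kh = k.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores)  # each row sums to 1 over attended positions
```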
Attention to humorous tweets
In this section we visualize what makes a tweet funny, that is, what the network attends to when classifying funny tweets. The following figure shows the attention weights across all the attention heads.
Attention to non humorous tweets
The next few examples show how the model pays attention to non-humorous tweets. Interestingly, unlike for humorous tweets, attention head #12 does not attend to the first sentence; instead, attention head #4 seems to attend to the first sentence for non-humorous tweets. Also note that heads #3 and #12 attend to the last words of the sentences.
In general, for humorous tweets, the first sentence seems to be key to determining whether the tweet is funny (as seen in attention head #12). Because most words were split into smaller subword n-grams, it is not possible to determine the weight of each individual word. However, the results clearly suggest that attention head #12 captures the topology of the sentence. Note that for non-humorous tweets, this attention head discards the first sentence and weights the middle and last sentences instead.
Cases where the model does not work properly
The next example shows a humorous tweet classified as not humorous. Note that attention head #12 did not weight the first sentence, as this sentence does not have a funny context.
In the next example, a tweet labeled as humorous was classified as not humorous by the model. However, this tweet could also be considered not funny, or even offensive. For example, the words "sean tontas" ("they are dumb") may be considered unfunny and offensive by some people.
Overall, we can see that the attention given by the model varies according to the context of the sentences and their category. In addition, some tweets cannot be clearly labeled as humorous or non-humorous, which makes prediction harder; but in general the model achieves a good prediction score, and it is possible to peek into what the model is learning. This model has been applied to a humor dataset, but it can be applied to any other text dataset. We will be showing its performance on other tasks.