1 line to XLNET Word Embeddings with NLU in Python

Christian Kasim Loan

Published in spark-nlp · 6 min read · Jan 17, 2021

Including Part of Speech, Named Entity Recognition, Emotion Classification in the same line! With Bonus t-SNE plots!

0. Introduction

0.1 What is NLU?

John Snow Labs' NLU library gives you 1000+ NLP models and 100+ word embeddings in 300+ languages, along with countless ways to explore your data and gain insights.

In this tutorial, we will cover how to get the powerful XLNet embeddings with 1 line of NLU code and then how to visualize them with t-SNE. We will compare sentiment with sarcasm and emotions!

0.2 What is t-SNE?

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. The t-SNE cost function is not convex, so different initializations can give different results.
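
To make the Kullback-Leibler divergence a bit more concrete, here is a minimal sketch in plain Python. It is not how t-SNE is implemented; the helper `kl_divergence` and the three toy probability lists are made up for illustration. It only shows that the divergence is zero when the low-dimensional probabilities match the high-dimensional ones and grows when they do not.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy joint probabilities over three point pairs (illustrative values only).
p_high = [0.5, 0.3, 0.2]  # similarities in the original high-dimensional space
q_good = [0.5, 0.3, 0.2]  # a low-dimensional layout that preserves them exactly
q_poor = [0.2, 0.3, 0.5]  # a layout that swaps the closest and the farthest pair

kl_good = kl_divergence(p_high, q_good)
kl_poor = kl_divergence(p_high, q_poor)

print('KL for the faithful layout:', kl_good)
print('KL for the distorted layout:', kl_poor)
```

t-SNE moves the low-dimensional points around to push this kind of value down, which is why a poor starting position can end in a different final layout.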

0.3 What are XLNET embeddings?

With the capability of modeling bidirectional contexts, denoising autoencoding-based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, XLNet was proposed, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
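
The idea of "all permutations of the factorization order" can be illustrated with a small toy sketch. This is not XLNet's implementation (the real model works with sampled orders and attention masks rather than enumerating every permutation); the sentence and the helper `visible_context` are assumptions made up for this example. It shows that in a single left-to-right order a token only sees its left context, while across all orders it gets to see every other position.

```python
from itertools import permutations

tokens = ["New", "York", "is", "a", "city"]  # hypothetical toy sentence

def visible_context(order, target):
    """Positions that precede `target` in the factorization order `order`."""
    return sorted(order[:order.index(target)])

# Standard left-to-right order: position 2 ("is") only sees positions 0 and 1.
left_to_right = (0, 1, 2, 3, 4)
ltr_context = visible_context(left_to_right, 2)

# A different order: position 2 is predicted after positions 4 and 3,
# so it is conditioned on tokens to its right.
other_context = visible_context((4, 3, 2, 0, 1), 2)

# Across all factorization orders, every other position shows up in the context.
orders = list(permutations(range(len(tokens))))
seen = set()
for order in orders:
    seen.update(visible_context(order, 2))

print('Left-to-right context:', [tokens[i] for i in ltr_context])
print('Permuted context:', [tokens[i] for i in other_context])
print('Positions seen over all', len(orders), 'orders:', sorted(seen))
```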

1. Import NLU, load Xlnet, and embed a sample string in 1 line

import nlu

nlu.load('xlnet').predict('He was surprised by the diversity of NLU')

2. Load a larger dataset

The following snippet downloads a Reddit sarcasm dataset and loads it into a pandas DataFrame.

import pandas as pd

# Download the dataset (the leading "!" runs a shell command from a notebook cell)
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

# Load the dataset into pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
df

3. Predict on the dataset and also add Part of Speech, Emotion and Sentiment Classifiers

Since adding additional classifiers and getting their predictions is so easy in NLU, we will extend our NLU pipeline with a POS, Emotion, and Sentiment classifier which all achieve results close to the state of the art.

Those extra predictions will also come in handy when plotting our results.
We limit ourselves to a subset of the dataset because our RAM is sadly limited and we are not running on a cluster. With Spark NLP you can take exactly the same models and run them in a scalable fashion inside a Spark cluster.

pipe = nlu.load('pos sentiment emotion xlnet')
df['text'] = df['comment']

# Specifying output_level='token' makes NLU give us one row per embedded word
predictions = pipe.predict(df[['text','label']], output_level='token')
predictions
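
The snippet above does not show how the subset is taken, so here is one possible way to do it with plain pandas. This is a sketch under assumptions: the small stand-in DataFrame replaces the real Reddit data, and `sample_size` is a hypothetical value you would tune to your available memory.

```python
import pandas as pd

# Stand-in for the Reddit sarcasm DataFrame; only the two columns used here.
df = pd.DataFrame({
    "comment": [f"comment number {i}" for i in range(1000)],
    "label": [i % 2 for i in range(1000)],
})
df["text"] = df["comment"]

sample_size = 100  # hypothetical value; pick what fits into your RAM
# A fixed random_state makes the random subset reproducible between runs.
subset = df[["text", "label"]].sample(n=sample_size, random_state=42)

print(subset.shape)
# The subset would then be passed on instead of the full frame, e.g.
# predictions = pipe.predict(subset, output_level='token')
```

Sampling at random rather than taking the first rows avoids picking up any ordering that may exist in the file, but slicing with `.iloc` would work as well.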

4. Emotion Plots

We can quickly plot the distribution of predicted emotions using pandas functions on the DataFrame.

# Some tokens are None, which we must drop first
predictions.dropna(how='any', inplace=True)

# Some emotions are 'na', which we must drop as well
predictions = predictions[predictions.emotion != 'na']
predictions.emotion.value_counts().plot.bar(title='Dataset emotion distribution')
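
If you prefer numbers over a bar chart, the same cleaning steps can feed `value_counts(normalize=True)` to get the share of each emotion. The following sketch uses a made-up Series as a stand-in for `predictions.emotion`, so the labels and counts are illustrative only.

```python
import pandas as pd

# Stand-in for predictions.emotion (made-up labels, including a missing value and 'na').
emotion = pd.Series(["joy", "fear", "joy", "na", None, "surprise", "joy"])

# Same cleaning as in the tutorial: drop missing values, then drop the 'na' label.
cleaned = emotion.dropna()
cleaned = cleaned[cleaned != "na"]

# normalize=True returns relative frequencies instead of raw counts.
shares = cleaned.value_counts(normalize=True)
print(shares)
```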

5. Prepare data for t-SNE

We prepare the data for the t-SNE algorithm by collecting the embedding vectors in a matrix with one row per token.

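import numpy as np

# Use a plain ndarray; np.matrix is discouraged by NumPy and rejected by recent scikit-learn versions
mat = np.array([x for x in predictions.xlnet_embeddings])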
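
Before handing the matrix to t-SNE it can be worth checking that every token really has a vector of the same length. A minimal sketch, assuming a short list of made-up 4-dimensional vectors as a stand-in for `predictions.xlnet_embeddings` (real XLNet vectors are much longer):

```python
import numpy as np

# Stand-in for predictions.xlnet_embeddings: one vector per token.
embeddings = [
    np.array([0.1, 0.2, 0.3, 0.4]),
    np.array([0.5, 0.6, 0.7, 0.8]),
    np.array([0.9, 1.0, 1.1, 1.2]),
]

# All vectors must share one length, otherwise they cannot form a matrix.
lengths = {len(v) for v in embeddings}
if len(lengths) != 1:
    raise ValueError(f"embedding vectors have different lengths: {lengths}")

# Stack the vectors row by row into a (n_tokens, n_dimensions) array.
toy_mat = np.vstack(embeddings)
print(toy_mat.shape)
```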

6. Fit t-SNE

Finally, we fit the t-SNE algorithm and get a 2-dimensional representation of our XLNet word embeddings.

from sklearn.manifold import TSNE

model = TSNE(n_components=2)
low_dim_data = model.fit_transform(mat)
print('Lower dim data has shape', low_dim_data.shape)
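
Because the t-SNE cost function is not convex, two runs can produce different layouts. If you want to compare plots between runs, one option is to fix `random_state`. The sketch below runs on a small random stand-in matrix instead of real embeddings; the helper `embed` and the chosen `perplexity` are assumptions for this toy example, not settings used in the tutorial.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the embedding matrix: 30 random 10-dimensional vectors.
rng = np.random.default_rng(0)
toy_mat = rng.normal(size=(30, 10))

def embed(seed):
    # perplexity has to be smaller than the number of samples,
    # so the default of 30 is lowered for this tiny dataset.
    model = TSNE(n_components=2, perplexity=5, init='random', random_state=seed)
    return model.fit_transform(toy_mat)

run_a = embed(seed=0)
run_b = embed(seed=0)
run_c = embed(seed=1)

print('Output shape:', run_a.shape)
print('Same seed gives the same layout:', np.allclose(run_a, run_b))
print('Different seed gives the same layout:', np.allclose(run_a, run_c))
```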

7. Plot XLNET Word Embeddings, colored by Part of Speech Tag

The following scatter plot shows the 2-D representation of the word embeddings. Each point represents a word in a sentence, and the color represents the POS class that word belongs to.

import seaborn as sns

# The POS tags become the index of the DataFrame and are used as the hue
tsne_df = pd.DataFrame(low_dim_data, predictions.pos)
ax = sns.scatterplot(data=tsne_df, x=0, y=1, hue=tsne_df.index)
ax.set_title('T-SNE XLNET Embeddings, colored by Part of Speech Tag')
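
The plotting code stores the labels in the DataFrame index. An alternative that some readers may find easier to follow is a tidy DataFrame with explicit `x`, `y`, and label columns. The sketch below uses made-up stand-ins for `low_dim_data` and `predictions.pos`, and it assumes that token-level output can contain repeated index values, which is why the labels are attached by position.

```python
import numpy as np
import pandas as pd

# Stand-ins for low_dim_data and predictions.pos (values are made up).
low_dim_data = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
pos_tags = pd.Series(["NN", "VB", "NN"], index=[0, 0, 1])  # repeated index values

# One row per token with named coordinate columns.
tidy = pd.DataFrame(low_dim_data, columns=["x", "y"])
# to_numpy() attaches the labels by position and ignores the repeated index.
tidy["pos"] = pos_tags.to_numpy()

print(tidy)
# With seaborn imported as sns, this frame could be plotted with
# sns.scatterplot(data=tidy, x='x', y='y', hue='pos')
```

With this layout, switching the coloring to another label only means adding another column and changing the `hue` argument.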

8. Plot XLNET Word Embeddings, colored by Emotion

The following scatter plot shows the 2-D representation of the word embeddings. Each point represents a word from a sentence that was classified with a particular emotion, which is reflected in the color.

tsne_df = pd.DataFrame(low_dim_data, predictions.emotion)
ax = sns.scatterplot(data=tsne_df, x=0, y=1, hue=tsne_df.index)
ax.set_title('T-SNE XLNET Embeddings, colored by Emotion')

9. Plot XLNET Word Embeddings, colored by Sarcasm

The following scatter plot shows the 2-D representation of the word embeddings. Each point represents a word from a sentence that was labeled as sarcastic or not, which is reflected in the color.

tsne_df = pd.DataFrame(low_dim_data, predictions.label.replace({1:'sarcasm',0:'normal'}))
tsne_df.columns = ['x','y']
ax = sns.scatterplot(data=tsne_df, x='x', y='y', hue=tsne_df.index)
ax.set_title('T-SNE XLNET Embeddings, colored by Sarcasm label')

10. Plot XLNET Word Embeddings, colored by Sentiment