Inferring User Emotions in Texts Using SparkNLP

Patrick Perrin, Ph.D.
Holler Developers
Published in AI Research at Holler Technologies
Apr 22, 2021 · 26 min read

1. Introduction

In this post, we demonstrate how to leverage SparkNLP and SparkML to quickly set up an experiment for testing initial discovery hypotheses towards inferring emotions in short texts. We will show how to transform the input short texts, how to train a multi-class text classifier that gives reasonable results given the input data, how to evaluate and compare such text classification models, how to set up an experiment to generate multiple models, and finally how to evaluate the outcomes.

SparkNLP (1) is provided by John Snow Labs (2) as a unified library of state-of-the-art NLP tools within the Spark environment that can be used in production. Even though the AI Research Labs at Holler Technologies develop their own proprietary AI solutions for production, we have found SparkNLP to be a valuable and useful set of resources to support the discovery process when evaluating hypotheses and ideas about text transformations and NLP. For any problem we want to solve, the discovery phase is important, as one wants to try as many possibilities as possible before deciding on any one solution to further study and implement for production. SparkNLP supports this for NLP tasks: it allows us to transform texts and extract features ready to feed to machine learning algorithms, with a selection of state-of-the-art technologies and pre-trained models, within the same Spark distributed environment and without having to integrate a range of technologies together. This is important, as the discovery phase should enable a data scientist to quickly evaluate what works and what does not work for a particular problem and the available training data.

We will illustrate this in this article with the task of inferring basic emotions in short texts, following these steps:

  1. Understand the problem and the machine learning task we want to solve, and generate the hypotheses we want to test in designing our experiment.

  2. Get or create supervised data to train and test our hypotheses.

  3. Study the data: is it balanced among all emotions, does it contain erroneous values, and what should be done with those? Make sure the data is in the desired language for the study, English here.

  4. Look at how we can transform text, normalized or not, into numerical vectors amenable to the chosen machine learning algorithm, and create models of emotion inference.

  5. Run the experiment, compute metrics to compare the models, then conclude with respect to the objectives set in the next section.

2. Problem Description and Hypotheses to Test

Our problem is to identify a given finite set of emotions in short English texts. One approach is to treat it as an NLP multi-class text classification task, in which the classes to infer are the emotions we want to identify in the text. The input is a short English text; the output is exactly one of a small set of emotions (4 in our case).

Note that we have oversimplified the problem: what if the text expresses no emotion at all, for example? Also, we will not discuss emotion research here, but it is important for a data scientist to estimate early on what type of performance one can reasonably attain, so one does not chase the impossible, at least initially. Most emotion research is done on visual or vocal input that contains cues such as facial expressions or tone of voice. Text does not carry such communication cues that our brain is used to evaluating, and these account for 70% of the cues in understanding conversations. The bottom line is that emotion recognition in text is hard, so we cannot reasonably expect to reach very high performance levels based only on statistical methods.

For the sake of simplicity in this tutorial, we consider 4 standard basic emotions: anger, fear, joy, and sadness, for which we have a sample training dataset. The objective here is to demonstrate that SparkNLP is a good resource to quickly test hypotheses in the discovery process and identify the promising solutions that need to be further studied. The goal is not to develop SOTA emotion classification. In an NLP multi-class text classification task, one needs examples of each class to predict. In our case, each example is in the form of a text and exactly one emotion that text expresses. We will use an out-of-the-box TensorFlow-based deep neural network algorithm provided by SparkNLP for this problem, as it has been fine-tuned for multi-class text classification tasks. We will use the same ML algorithm to train each model, as we want to compare the text transformation techniques.

Such machine learning algorithms require the input to be a numerical vector of the same size for each input text. There are many ways to transform a text into a fixed-size numerical vector. One of them is to use embedding techniques, of which there are many: Word2Vec, GloVe, BERT, Electra, XLNet, USE, and so on. Each text can also be pre-processed: should we keep the words the way they were typed? Should we lowercase everything? Should we remove undesirable words? These normalization techniques will be discussed in the Text Pre-Processing (TPP) section. Finally, the data may contain more examples of one emotion than another; does that affect our solutions? This is the question of balancing a training dataset. Our objective is to answer these types of questions. For this, we formulate 3 overall hypotheses we want to test, with the goal of quickly evaluating which strategy is most promising.

Hypothesis 1: Balanced vs Unbalanced Training Data. Should we balance our training dataset among the classes or use the data for training the way it is?

Hypothesis 2: Text Normalization (TPP) vs. Surface Form (no TPP). Should we pre-process our training data, or not?

Hypothesis 3: Which text embeddings technique is the best for our problem given the training data we have?

The outcome of this experiment is only valid for our problem of course, not as a general recommendation for any NLP multi-class text classification problem! However, it shows we can test these hypotheses very quickly using SparkNLP.

We will generate all combinations of these hypotheses in our experiment and compare them. Each combination is easily defined as a SparkML Pipeline. We will discuss in further sections the various possibilities for the 3 hypotheses, and in the experiment section we will show how to quickly create these ML Pipelines and run them in parallel to generate all the models.

Note that we will run the experiment with only one random sample. However, in a real-world environment we would run multiple samples, and search the space of all models by re-sampling and changing the various parameters of the learning algorithm as well as of the embedding techniques. We will not do that here; we will use the default settings of each. The outcome of the experiment is to identify the best text transformation solution(s) we should pursue further in the modeling process.

3. Data for Modeling

The data we use in solving any problem is critical. When we are given data we did not create, we need to pay close attention to the following:

  1. We need to understand the population from which the example dataset has been drawn and whether it is the same as, or similar to, the one we want to apply our model to. For example, modeling emotion recognition on short tweet texts during training and applying the model to a completely different population, say long scientific research papers, is highly likely not to give proper results. The machine learning algorithm will always give some output, but whether that output can be trusted is another matter. There are ways to transfer models from one domain to another, but that needs to be carefully tested for each problem being solved, and this is out of the scope of this tutorial.
  2. We need to check whether the sample data we use for modeling is a true random sample of that population, as we may otherwise generate a model for a subset of the population that does not represent the scenarios we will have in production. Continuing the short tweets example, if one takes, say, a sample dataset of political tweets during an election and then tries to apply the model to any tweets, the predictions are highly likely to be incorrect. Indeed, the terminology used in political tweets is very different from that of general tweets, and our model will be based only on the former.
  3. Similarly, we need to check whether the coverage of the linguistic phenomena we are trying to model correctly represents the population, for example, how emotions can be expressed in short texts. This step requires deeper understanding of emotion research, from linguistics, psycholinguistics, cognitive science, and socio-cultural studies, as emotions are not expressed the same way in every culture and every situation. This is out of the scope of this paper as well.

Our problem is thus a supervised machine learning task, so we need short input texts, each annotated with exactly one of the 4 emotions expressed by the writer. We already have such a proprietary dataset and we will use a random sample of it. One can use any of their favorite emotion datasets to follow the same methodology.

Let’s look at our dataset and whether it is balanced among all class values (the 4 emotions to predict). Let’s assume the annotated data is in CSV format, in a text file located in a folder on an S3 bucket, called data_folder here. Note that the original data we used for our production modeling contains millions of examples, with emotion intensity, multiple emotions per text, as well as VAD dimensions, for more than 20 emotions, already stored as a Dataframe on an S3 bucket. We use a CSV sample here for this article. Figure 1 depicts how to read the CSV file.

Figure 1: Loading Emotions Data
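As a rough sketch of what this loading step might look like (the bucket path and the column names text/emotion are placeholders for our proprietary sample, and we assume a running SparkSession named spark):

```python
# Hypothetical sketch: load the annotated CSV sample from S3.
# The path and the "text" / "emotion" column names are placeholders.
emotions_df = (
    spark.read
         .option("header", True)
         .option("escape", '"')
         .csv("s3://<bucket>/data_folder/emotions_sample.csv")
         .select("text", "emotion")
)
emotions_df.printSchema()
```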

We further restrict our dataset to contain only the 4 basic emotions we want to model. The dataset contains 367,283 training instances across these 4 emotions, whose distribution is shown below. As one can observe in Figure 2, the dataset is not balanced, meaning there is not the same number of instances in each class value. Whether the dataset should be balanced or not is a hypothesis to test and validate, and it highly depends on the type of data in the training set, the machine learning algorithm used for modeling, the problem we are solving, and so on. For example, a large disproportion in favor of one particular class is likely to make our classification model overfit on that class, and that might not be discovered if one only uses metrics like accuracy to test a model. In the code, we also split the raw data into a training set and a test set, with a typical 80-20 ratio.

Figure 2: Unbalanced Training Dataset
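A minimal sketch of the restriction to the 4 emotions, the class distribution check, and the 80-20 split could look like this (column names are the same placeholders as above):

```python
from pyspark.sql import functions as F

# Keep only the 4 basic emotions and inspect the class distribution.
basic_emotions = ["anger", "fear", "joy", "sadness"]
emotions_df = emotions_df.filter(F.col("emotion").isin(basic_emotions))
emotions_df.groupBy("emotion").count().orderBy("count", ascending=False).show()

# Typical 80-20 split into training and test sets.
train_df, test_df = emotions_df.randomSplit([0.8, 0.2], seed=42)
```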

Next, we create a balanced training dataset to compare and contrast with the models trained on unbalanced data. For this, we generate a balanced raw dataset with all class values having the same number of instances (we used the number of instances of the smallest class value). Figure 3 shows that we now have a balanced training dataset.

Figure 3: Balanced Training Dataset
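One simple way to build such a balanced set is to downsample every class to the size of the smallest one; the following is an illustrative sketch (the sample call is approximate, so counts will be close to, not exactly, the minimum):

```python
from pyspark.sql import functions as F

# Downsample every emotion to (approximately) the size of the smallest class.
counts = {row["emotion"]: row["count"]
          for row in train_df.groupBy("emotion").count().collect()}
min_count = min(counts.values())

balanced_train_df = None
for emotion, n in counts.items():
    sampled = (train_df
               .filter(F.col("emotion") == emotion)
               .sample(withReplacement=False, fraction=min_count / n, seed=42))
    balanced_train_df = (sampled if balanced_train_df is None
                         else balanced_train_df.union(sampled))

balanced_train_df.groupBy("emotion").count().show()
```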

4. Text Preprocessing for Normalization

Text preprocessing is the step in which each text is normalized through a sequence of transformations. SparkNLP provides many out-of-the-box tools, or annotators, that we can assemble using the concept of a Pipeline from SparkML. Some annotators need to be fit on our data, some use pretrained models.

The minimal transformation used by the learning algorithm in this article is the DocumentAssembler. This transformer is necessary as the first step in SparkNLP and transforms each text into a Document object that can be annotated. There is also a SentenceDetector annotator to go deeper in the granularity of analysis, but since our texts are very short, we decided that was not necessary.

Figure 4: Text Pre-Processing (TPP) Pipeline for Text Normalization

We use the following sequence of annotators to pre-process the texts, as depicted in Figure 4. Each annotator can be further specialized, and one can check all the available options here (3).

  • Tokenizer identifies tokens using open tokenization standards.
  • NorvigSweetingModel spell-checks each token, automatically correcting tokens not found in an English dictionary. We used the pre-trained model, so we do not need to train it on an English dictionary ourselves.
  • Then, we normalize each token using the Normalizer annotator. For this problem we only lowercase each token, rendering the text case insensitive.
  • We then remove all English stop words using the StopWordsCleaner annotator, specifying that the text is case insensitive due to the prior step.
  • The final normalization generates the lemma normal form of each token, combining various inflections together. We use the pretrained model here.

The Pipeline is implemented as a SparkML Pipeline where we specify the sequence of steps to execute on the source text data with the column names we want.

Running the tpp pipeline transforms the input Dataframe of raw texts through a succession of Dataframes, resulting in a final Dataframe with each annotation in a separate column. This gives us the flexibility to use any of them as the first step for text feature extraction, in order to test various combinations and hypotheses. For example, one might use the raw document or the normalized tokens, as will be shown later. The stages are executed in the specified order, and the input Dataframe is fitted (when not using a pre-trained model) and transformed sequentially. Defining a pipeline allows training and test data to go through the same sequence of transformations. One can save the Pipeline as a regular SparkML pipeline and load it again to re-apply on new data. We have split our pipeline into text pre-processing and text classification stages because we want to test various text classification approaches without re-running the text transformations. Otherwise, the entire pipeline would include text classification and any other steps discussed in the remaining sections.

First, we start SparkNLP in our PySpark cluster, as shown in Figure 5. We use the latest version available at the time of this tutorial.

Figure 5: Start SparkNLP on a Cluster
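A minimal sketch of this startup step (using the standard sparknlp.start() entry point; on a managed cluster the session may already be configured with the Spark NLP library):

```python
import sparknlp

# Start (or attach to) a Spark session with the Spark NLP jars loaded.
spark = sparknlp.start()
print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)
```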

Figure 6 shows how to implement the text pre-processing pipeline.

Figure 6: Text Pre-Processing (TPP) Pipeline Implementation
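A sketch of such a TPP pipeline, with the annotators described above and illustrative column names, could look like the following (the pretrained spell checker and lemmatizer model names are the library defaults):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    Tokenizer, NorvigSweetingModel, Normalizer,
    StopWordsCleaner, LemmatizerModel
)

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Pretrained English spell checker; corrects out-of-dictionary tokens.
spell_checker = NorvigSweetingModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("checked")

# Lowercase each token.
normalizer = Normalizer() \
    .setInputCols(["checked"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)

# Remove English stop words; case-insensitive because of the prior step.
stop_words_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("clean_tokens") \
    .setCaseSensitive(False)

# Pretrained lemmatizer to reduce inflected forms to lemmas.
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["clean_tokens"]) \
    .setOutputCol("lemma")

tpp = Pipeline(stages=[
    document_assembler, tokenizer, spell_checker,
    normalizer, stop_words_cleaner, lemmatizer
])
```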

To execute the pipeline, we first fit it to the input data, then transform the input data, as depicted in Figure 7. We get a lot of metadata that is very useful. We pre-process both the unbalanced and balanced training sets.

Figure 7a: Execute TPP Pipeline
Figure 7b: Execute TPP Pipeline (Outcomes)
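Continuing the sketch above, fitting and applying the tpp pipeline might look like this:

```python
# Fit the pre-processing pipeline, then transform the unbalanced and
# balanced training sets (and the test set) with the same fitted model.
tpp_model = tpp.fit(train_df)

train_tpp_df = tpp_model.transform(train_df)
balanced_train_tpp_df = tpp_model.transform(balanced_train_df)
test_tpp_df = tpp_model.transform(test_df)

# Inspect the lemmatized tokens next to the original text.
train_tpp_df.select("text", "lemma.result").show(5, truncate=80)
```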

Next, Figure 8 shows examples of the effects of text pre-processing, comparing the original text with the resulting pre-processed text. As one can observe, many tokens were transformed and/or dropped. Pre-processing raw texts may or may not improve the accuracy of the models; it is important to test text pre-processing on each problem instead of automatically applying it. The code also shows how to get the outcomes of a particular column generated by the SparkNLP annotators.

Figure 8: Text Pre-Processing Examples

5. Text Features Extraction

Text feature extraction is a necessary step in text classification tasks, as the kind of machine learning algorithm we are using here does not take full text as input. We are using a TensorFlow deep learning multi-class text classifier (ClassifierDL), and such neural networks take fixed-size numerical vectors as input. Each text needs to be transformed into a numerical vector of the same size, independently of the number of tokens in the source text. There are many ways to do so. In this tutorial, we are using text embedding techniques, at the token level (word embeddings) but also at the full text level (sentence embeddings), both with various types of contexts. We will look at the following techniques to extract text features, try each on our problem, and compare the results for a quick approximation of which technique(s) would be the most promising to continue experimenting with for the problem at hand.

  • Global Vectors for Word Representation (GloVe)
  • Universal Sentence Encoders (USE)
  • Bidirectional Encoder Representations from Transformers (BERT)
  • Generalized Autoregressive Pretraining with Transformer-XL (XLNet)

It is important to try several techniques, as one that is reported to be best on some problems may not be the best one for another problem, such as the one you are trying to solve. Each problem requires its own full analysis. However, in this tutorial we will only run one model per type of machine learning algorithm; in a production environment, you would use techniques such as resampling and model space search. We will explain each text feature extraction approach in the subsequent sections.

Most of these techniques generate embeddings at the word level, with some variation of what constitutes the context around each word. That means each word in the input text will have its own vector of some dimension. Assume we are using the Twitter pre-trained GloVe model described in the next section. Word embeddings then generate one vector of size 100 for each token of the source text. Figure 9 depicts our basic pipeline to generate word embeddings without pre-processing. It transforms each source text into a document, splits it into sentences, and then into tokens, for which we generate a word embedding vector of size 100. One can observe that each text becomes a sequence of n vectors, where n is the number of tokens in the source text. The first text has 15 tokens, and thus will be represented by 15 independent vectors. Each text will then have a different total vector size.

Figure 9: Basic Word Embeddings
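A sketch of this basic word embeddings pipeline, using the glove_100d pretrained model and illustrative column names, could be:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Pretrained 100-dimensional GloVe vectors: one vector per token.
glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

glove_word_pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer, glove_embeddings
])
```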

Recall that we seek to have a single vector of fixed size for all our texts. We can resolve this by padding: set some fixed size of n tokens, use the concatenation of the first n token vectors of the source text, ignore the remaining ones for texts longer than n, and fill with some default vector for shorter texts. Alternatively, and this is the approach taken here, we can take the mean of all the word embeddings in a text, independently of their total number, and use that average vector to represent all of them. This is implemented as depicted in Figure 10. This technique statistically approximates the overall meaning of the sentence by a vector in an n-dimensional space from the independent meanings of all words in the sentence in the same space. Taking the mean average is a common technique that shows good results, but not the only one. For short texts, which is our case here, it is not a bad assumption. Remember that all these techniques are only statistical approximations of underlying semantic phenomena, not purely symbolic psycholinguistics-based semantic and pragmatic analyses. Continuing the GloVe example above, we use SparkNLP's SentenceEmbeddings, which calculates the mean average vector of all word vectors automatically for us. Note that we could use other pooling strategies.

Figure 10: Sentence Embeddings by Mean Average of Word Embeddings
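Adding the averaging step on top of the previous sketch (the stage variables are the ones defined above) might look like:

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import SentenceEmbeddings

# Average the per-token GloVe vectors into one 100-dimensional vector
# per text; other pooling strategies (e.g. SUM) are also available.
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

glove_sentence_pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer,
    glove_embeddings, sentence_embeddings
])
```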

Now, each text is represented by exactly one vector of the same size, 100 dimensions. It is the average vector of all word vectors.

Now that we know how to transform word embeddings into a sentence embedding vector, let's look at all the choices we want to evaluate.

5.1. Representing Texts as GloVe Sentence Embeddings Vectors

GloVe is an unsupervised learning algorithm for obtaining vector representations of single words. It was developed at Stanford University. GloVe works at the word/token level and has very limited contextual information, that is, only aggregated global word-word co-occurrence statistics from the corpus used to generate the pre-trained models. It is a good technique that has shown good results on some problems. Several GloVe pre-trained models are publicly available. They have been trained on various types of source data, such as Twitter, Wikipedia, and Common Crawl. Which model to use depends on the population from which the texts of a specific problem have been drawn, and one needs to choose carefully, or try them all when the population is not well understood. SparkNLP provides 3 of the available GloVe models, which allows us to test GloVe embeddings for particular problems. They are:

  • glove_100d, which may be the original Twitter model trained on 2B tweets and 27B tokens over a 1.2M uncased vocabulary.
  • glove_6B_300 was trained on a Wikipedia corpus with 6B tokens and a 400K uncased vocabulary; each token is represented by a vector of 300 numbers.
  • glove_840B_300 is currently the largest GloVe model and has been trained on Common Crawl, representing 840B tokens over a 2.2M cased vocabulary.

We will be using the Twitter pre-trained GloVe model above, as it is the closest to the type of input texts we are using here. Word embeddings then generate one vector of size 100 for each token of the source text. We have already seen above how to create a pipeline using GloVe word embeddings and then generate the mean average vector for the entire text.

5.2. Representing Texts as BERT Sentence Embeddings Vectors

BERT stands for Bidirectional Encoder Representations from Transformers and is currently the state of the art in word vectors, developed and researched by Google since 2018:

  • Word context is bidirectional, meaning the context of a word is estimated both from left to right and from right to left; there are several topologies we can experiment with (varying numbers of hidden layers, etc.)
  • Encoder means that each input is encoded into a vector; there are various sizes available
  • Representations means a word is represented as a vector of real numbers
  • Transformers are a novel neural architecture based on self-attention
  • One of the main advantages of techniques such as BERT, or the earlier similar technique ELMo, is that the vector of a word changes depending on how it is used in a sentence. This allows for much richer meanings of embedded words. Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to poorly capture the semantic meaning of sentences. Research is ongoing to improve BERT.
SparkNLP provides 5 main BERT models we can experiment with to quickly estimate the effects on our problem. We can deepen the analysis by getting more models and pretrained transformers from various sources. SparkNLP provides 4 word embeddings models (cased/uncased and small/large vector representations) and one small BERT pre-trained sentence embeddings model. For the first 4, we have to generate a sentence embeddings vector from the individual word embeddings vectors, as we did for GloVe previously. We use the same sentence_embeddings annotator as before so we can readily compare with the other word embedding techniques, isolating the process that generates an average sentence embeddings vector for a given text. Two of the word embeddings models take into account the case found in a text, the other 2 ignore case, just like we did for TPP. Two models are small with 768-dimensional vectors, and 2 are large with 1024-dimensional vectors. We can easily compare these 4 models on our problem with minimal code, reusing the annotators we have defined earlier.

The last model is a pre-trained sentence embeddings model, for which there is no need to average the word vectors for each text. Figure 11 depicts the pipelines used in our experiment.

Figure 11: Sentence Embeddings with BERT
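As an illustrative sketch of these two kinds of BERT pipelines (the exact pretrained model names, in particular the sentence-level one, depend on the SparkNLP version and are assumptions here), the embedding stages could be defined as:

```python
from sparknlp.annotator import (
    BertEmbeddings, BertSentenceEmbeddings, SentenceEmbeddings
)

# Word-level BERT: per-token vectors averaged into a sentence vector,
# exactly as with GloVe. "bert_base_cased" is one of the 4 word models
# (cased/uncased x base/large); swap the name to test the others.
bert_word = BertEmbeddings.pretrained("bert_base_cased") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

bert_avg = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

# Sentence-level BERT: a pretrained sentence embeddings model that
# skips the averaging step. The model name below is an assumption;
# check the Models Hub for the identifier available in your version.
bert_sentence = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")
```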

5.3. Representing Texts as USE Sentence Embeddings Vectors

USE stands for Universal Sentence Encoder; it transforms each text directly into a high-dimensional vector. Pre-trained models are publicly available and come in two variations:

  • One model is trained with a Deep Averaging Network (DAN), and is referred to in SparkNLP as tfhub_use
  • One model is trained with a Transformer encoder, and is referred to in SparkNLP as tfhub_use_lg

The two trade off accuracy against computational resource requirements. While the Transformer encoder achieves higher accuracy, it is computationally more intensive. The DAN encoder is computationally less expensive, with slightly lower accuracy. USE generates embeddings at the text level, so there is no need to compute the mean average as before. Figure 12 depicts the two pipelines we will use in the experiment.

Figure 12: Sentence Embeddings with USE
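A sketch of the two USE annotators (model names as listed above) could be:

```python
from sparknlp.annotator import UniversalSentenceEncoder

# DAN-based USE: lighter, slightly less accurate.
use_dan = UniversalSentenceEncoder.pretrained("tfhub_use") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Transformer-based USE: heavier, more accurate.
use_transformer = UniversalSentenceEncoder.pretrained("tfhub_use_lg") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")
```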

5.4. Representing Texts as XLNet Sentence Embeddings Vectors

For our final text representation technique, we look at XLNet word embeddings. XLNet is the most recent bidirectional contextual embeddings technique at the time of this article. It was motivated by observed discrepancies of BERT on some problems, and is worth trying on this kind of problem. XLNet is a generalized autoregressive pre-training method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and it overcomes the limitations of BERT thanks to its autoregressive formulation. XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pre-training. Empirically, under comparable experimental settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking. XLNet is a word embeddings technique, whose word vectors need to be averaged to get the sentence embeddings, as we did for all the other word-based embeddings techniques in this experiment, as shown in Figure 13.

Figure 13: Sentence Embeddings with XLNet
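A sketch of the XLNet stages, reusing the same averaging annotator as before (the pretrained model name is one of the two provided by SparkNLP):

```python
from sparknlp.annotator import XlnetEmbeddings, SentenceEmbeddings

# XLNet word embeddings ("xlnet_base_cased" or "xlnet_large_cased"),
# averaged into a sentence vector exactly as for GloVe and BERT.
xlnet_embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

xlnet_avg = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")
```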

6. Generating an Emotion Inference Model

We use the general-purpose multi-class text classifier learning algorithm provided by SparkNLP for all our pipelines, so we can isolate the comparison of the various text transformations, which is the objective of the experiment. It is also close to what we have in mind for production, namely a TensorFlow-based neural network model. In the next phase of the process, we should further explore these models in TensorFlow directly to evaluate the types of networks that give the best results.

We need to add the ClassifierDL multi-class classification algorithm to the pipeline and fit the pipeline to generate an ML model for our problem on the chosen training data. The example in Figure 14 is a machine learning pipeline combining feature extraction using GloVe with text classification, without text pre-processing, on the unbalanced training dataset.

Figure 14: Multi-Class Text Classification using Classifier DL
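A sketch of this classification pipeline, reusing the GloVe stages defined earlier and with illustrative, untuned hyperparameters, could look like this:

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import ClassifierDLApproach

# Multi-class classifier trained on the sentence_embeddings column.
# Hyperparameters are illustrative, not tuned values.
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("emotion") \
    .setMaxEpochs(20) \
    .setEnableOutputLogs(True)

# GloVe features, no TPP, unbalanced training data.
glove_clf_pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer,
    glove_embeddings, sentence_embeddings, classifier
])

glove_clf_model = glove_clf_pipeline.fit(train_df)
```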

We obtain a model, that is, an approximation of our problem based on all the assumptions we have made so far: the training data, whether to balance it, whether to pre-process it, which features to extract to abstract the texts in numerical form, which machine learning algorithm to use, and overall which parameters we have set for all these decisions!

7. Measuring How Good a Model Is

Now that we have a model, we need to evaluate its “goodness,” or how well it solves our problem. There are many classification metrics one can use; here we will use accuracy and the F1 score. We generate predictions on unseen texts (the test dataset), then calculate our 2 chosen metrics as well as the confusion matrix to get a good sense of which class values under-perform or are overfitted in our model.

First, we need to generate predictions using a model. Figure 15 shows predictions on the test data. The true values are what we expect (each text annotated with exactly one emotion), and the prediction is what the model estimates the emotion to be.

Figure 15: Inferring Emotions with a Model
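Continuing the sketch above, generating predictions is a single transform call:

```python
# Apply the fitted pipeline to the held-out test set.
predictions = glove_clf_model.transform(test_df)

predictions.select("text", "emotion", "class.result").show(5, truncate=60)
```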

Next, we need to calculate the goodness of the model. The Scikit-Learn ML library provides some options here. The first is to use classification metrics, such as accuracy and the F1 score. These are top-level indicators of performance. We found an F1 score of 0.67, which is okay as a start: not great, but not below chance. It is important to note that emotion identification in text is a difficult, subjective problem that we are attempting to resolve statistically with simple text features. We will try to get better at this. Remember, this is for the unbalanced training dataset, no text pre-processing, and GloVe word embeddings averaged to estimate the sentence embeddings.

Second, there is the classification report, which provides at a glance the classification results per class value. It gives an appreciation of the classification results one step deeper. Accuracy will not capture whether a class value was overfitted. The F1 score is better than accuracy, but for multi-class problems it is an average when computed as above, and may hide some issues in our model. Figure 16 is an example of a classification report.

Figure 16: Classification Report Example
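A sketch of computing these metrics with Scikit-Learn from the predictions above (assuming the label column is named emotion and ClassifierDL writes its output to the class column):

```python
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Collect true labels and predicted labels to the driver.
pdf = predictions.select("emotion", "class.result").toPandas()
y_true = pdf["emotion"]
y_pred = pdf["result"].apply(lambda r: r[0])  # ClassifierDL returns a list per row

print("Accuracy   :", accuracy_score(y_true, y_pred))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print(classification_report(y_true, y_pred))
```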

In our case, we can observe that the emotion ‘joy’ performs best with this model and training data, while ‘fear’ is the least satisfactory. At this point, we could go back to our data and study the types of messages in each class.

Finally, we use the confusion matrix, which gives an even deeper appreciation of the classification model by showing the counts for each class, as depicted in Figure 17. It also allows us to pinpoint class values that correlate, and so on. We show below how to calculate and display this matrix. The goal is to have the diagonal hold the exact count of each class (the darker the color the better) and all other cells be 0, which would mean the model correctly identifies each emotion and does not misclassify any. For our first model, we can see that there are a lot of false positives and false negatives for all emotions! There does not seem to be a major problem such as 2 classes being systematically confused with each other. We hope to generate better models with better sentence embeddings and maybe with a balanced training dataset! See the experiment section.

Figure 17: Confusion Matrix Example
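A sketch of computing and plotting the confusion matrix, reusing y_true and y_pred from the previous sketch (seaborn/matplotlib are one possible way to display it):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

labels = ["anger", "fear", "joy", "sadness"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted emotion")
plt.ylabel("True emotion")
plt.show()
```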

These 3 levels of granularity are very often sufficient to evaluate the performance of a multi-class text classifier and, most importantly, to pinpoint the areas to look at in more detail to improve our modeling process: getting more data, getting better training data, realizing that maybe 2 classes should be combined, identifying which classes under-performed and need to be studied, and so on.

Let’s compare the 4 possibilities for GloVe sentence embeddings: balanced vs. unbalanced, with or without text pre-processing, as depicted in Figure 18.

Figure 18: Confusion Matrices for the 4 GloVe Models

With respect to using GloVe as text embeddings for this problem, these 4 models indicate, as an initial result, that a balanced dataset seems to improve emotion classification over a non-balanced one. Furthermore, not using text pre-processing on a balanced training dataset shows the best results with respect to average F1 score, but the balanced dataset with TPP shows fewer false positives and negatives. An inspection of the classification report clearly shows that the balanced dataset has better individual F1 scores for all classes than the weighted average score would suggest. For the unbalanced one (left figure in Figure 19), it is obvious that the “joy” emotion has a better F1 score and pulls the average F1 score up. However, for this problem, it is better to classify all emotions reasonably well than one emotion very well and the others poorly. It is important to look at the outcomes of a model from various angles, as different metrics show different views of the results. The balanced dataset with TPP would be our choice of best model because it classifies all classes best, instead of just one very well and not the others. However, in production, we would run many more tests, generating more models, to properly conclude on this hypothesis. Nevertheless, both models give acceptable initial results in the mid-0.70s.

Figure 19: Classification Reports for the GloVe Models

8. Experiment Results and Analysis

At this point, we are ready to run our experiment in parallel on our EC2 clusters. Recall that we want to test 3 hypotheses, each with several possibilities:

  • H1: balanced vs unbalanced training data: 2 training datasets
  • H2: text normalization vs. surface form: 2 possibilities
  • H3: text embeddings: GloVe (1), BERT (5), USE (2), XLNet (2): 10 possibilities

The experiment will generate 2*2*10 = 40 emotion inference models, and for each we measure overall accuracy and F1 score, detailed accuracy and F1 score per class, and detailed true/false positives/negatives. We have defined all the necessary pipelines fairly easily in the sections above. We put all this together in a Databricks notebook and generate all the models and evaluation data.
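As an illustrative sketch of such an experiment driver (the stage lists such as glove_stages or tpp_glove_stages are placeholders for the pipelines assembled in the previous sections, and the column wiring between TPP and embedding stages is omitted):

```python
from pyspark.ml import Pipeline

# Placeholder stage lists, each built as in the previous sections and
# ending in a "sentence_embeddings" column (column wiring with/without
# TPP is omitted here for brevity).
pipeline_variants = {
    ("no_tpp", "glove_100d"): glove_stages,
    ("tpp", "glove_100d"):    tpp_glove_stages,
    ("no_tpp", "tfhub_use"):  use_stages,
    # ... remaining pre-processing / embedding combinations
}
datasets = {"unbalanced": train_df, "balanced": balanced_train_df}

results = {}
for data_name, data in datasets.items():
    for (tpp_name, emb_name), stages in pipeline_variants.items():
        model = Pipeline(stages=stages + [classifier]).fit(data)
        results[(data_name, tpp_name, emb_name)] = model.transform(test_df)
```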

Here are the results for each text embeddings technique used on balanced vs unbalanced training data and text normalization vs. surface form.

8.1. GloVe Models

We generated 4 models with GloVe. The results, shown in Figure 20, indicate that, when using GloVe word embeddings, balancing the training data and normalizing the text is better.

Figure 20: GloVe Models Performance

8.2. BERT Models

We generated 20 BERT models that performed very differently, as shown in Figure 21. First, it seems that balanced training data with normalization is best. Second, word embeddings with mean averaging seem better for our problem than the pre-trained sentence embeddings model. This is surprising, as our intuition was that a model trained for sentence embeddings would be better than the mean average of word embeddings. Third, normalization of text while keeping the original case is best. Fourth, adding more dimensions to the representation (from 768 to 1024) only marginally improved the model, and in a real-world environment where cost is a factor it might not be justified. Our best BERT model thus uses 768 dimensions to represent the texts, a cased BERT model, balanced training data, and text normalization through text pre-processing.

Figure 21: BERT Models Performance

8.3. USE Models

We generated 8 USE models, as shown in Figure 22. All models are promising, the Transformer encoder ones being the best based on averaged F1 score. The larger, more computationally intensive one gave better overall results. It seems that text normalization improved the models. However, whether to balance the training data or not is not that clear. Even though the overall F1 score is better for the model using the unbalanced training dataset, an inspection of the confusion matrices favors the balanced one for each class: there are many more true positives than for the unbalanced one. Again, on the unbalanced one, a single class seems to lift all the others. Deciding between the two is a matter of how the models will be used, favoring recall or precision, for example. This is problem specific.

Figure 22: USE Models Performance

When we inspect the classification reports for the 3 best models identified (see Figure 23), the macro average F1 score shows that the two models, balanced with TPP and unbalanced with TPP, both have F1=0.78, more than unbalanced without TPP at F1=0.76, which we now rule out. When inspecting both the true positive counts and the F1 score of each class in the classification report, we would choose the balanced-with-TPP model over the unbalanced-with-TPP one, even though the latter has a marginally higher weighted average but the same micro average. Again, the final choice depends on how you intend to use your model.

Figure 23: Classification Reports for the Best USE Models

8.4. XLNet Models

We generated 8 XLNet models, as shown in Figure 24. XLNet was obviously not a good choice for this problem, at least the way we have used it. Remember we want to compare the same methodologies. We will disregard this technique.

Figure 24: XLNet Models Performance

8.5. What is the Best Model?

In conclusion, the most promising techniques for transforming text into numerical vectors for our problem are:

  • H1: balanced training dataset
  • H2: using text normalization, cased models when applicable
  • H3: USE DAN representation

The next step will be to continue the investigation with the best models identified previously for USE, BERT, and GloVe. SparkNLP has allowed us to quickly identify where to concentrate our efforts, the most promising models to fine-tune, but also the worst models, so we do not waste time and compute cost on them.

9. Conclusions

We have shown that using SparkML and SparkNLP, we can quickly evaluate various approaches to get a sense of which pipelines seem the most promising. We have shown how to quickly use various word and text embedding approaches, how to normalize text, how to use a deep learning classifier to generate models, and how to compare them. We have also shown that it is not sufficient to just look at the overall accuracy or F1 score for multi-class classification problems; one should look at the evaluation of the model from various points of view: how each class performs, whether recall or precision is more important, and also the confusion matrix with the details of true/false positives/negatives.

This experiment required almost no major coding on our part. SparkNLP is a valuable set of libraries for discovery and modeling, and also, after proper thorough experiments, for use in a production environment. We have demonstrated this on the problem of inferring the 4 basic emotions in short texts.

Hope you have learned something to apply in your own projects! More to follow from Holler AI Research.

References

  1. https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c8
  2. https://www.johnsnowlabs.com
  3. https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer
