One of the problems our Innovations team is working on at Socialbakers is sentiment analysis. Sentiment analysis is the automated process of recognizing how an audience feels about a given subject from written or spoken language. It is one of the most common classification tools exploiting artificial intelligence. A sentiment analysis algorithm analyzes a given text and predicts whether the underlying sentiment is positive, neutral or negative.
Socialbakers is a global AI-powered social media marketing company and there are many use cases for sentiment analysis in our marketing software-as-a-service platform, called the Socialbakers Suite. For example, clients need to understand how people feel about their brand or campaign and we can analyze user conversations related to their campaign posts to discern how they feel. Another example is customer relationship management. Imagine thousands of messages filling a brand’s inbox every day — it’s extremely difficult to read and properly respond to each of them. This is when sentiment analysis can help them to filter and prioritize the importance of each case, and decide which must be dealt with first.
Knowing how important this feature is, we decided to deliver corresponding functionality to the Socialbakers Suite in the shortest possible time. We researched available 3rd-party solutions and realized that none of them fit our needs completely:
- They typically cover only a very limited set of languages (especially open-source projects) but our clients are spread all around the world
- They can be quite expensive, considering the tens of millions of messages we need to process every day
- They are not well-suited to social media messages: short texts on a wide range of topics, containing slang and practically no context, which are a nightmare for most NLP (natural language processing) algorithms in general
Therefore, we decided to build our sentiment analysis solution in-house. The Innovations team has six members and is responsible for the design, research and development of “smart features” for Socialbakers products. This involves exploiting big data analysis and machine learning techniques. It’s our job to create concepts and turn them into working prototypes and production implementations. So, just another day in paradise.
By the way, an entertaining demonstration of our solution was recently published by Socialbakers’ Data Analysis Team on LinkedIn. They applied our sentiment analysis algorithm to user comments published on the Game of Thrones Facebook page, one day before and one day after the premiere of each episode of season 8. The results correlate with the feelings of Game of Thrones fans across the internet during the same time period. See image 1 below.
A state-of-the-art approach to sentiment analysis exploits recurrent neural networks. This blog post presents an architecture of our solution and briefly discusses the theoretical background necessary to understand our approach. The following blog posts will focus on how we gathered the data necessary for training our neural network and how we built and deployed the solution into production. Getting the right data and fine-tuning the system to the data are, in our experience, the crucial parts of such projects.
Introduction to Recurrent Neural Networks
Before we explain HOW we built our sentiment analysis algorithm, let’s look at the very basic theory about WHAT we built first. A recurrent neural network is an essential part of our solution and like any other neural network, it consists of neurons.
Every neuron has one or more inputs and one output, and performs two elementary operations (see image 2 below). First, it reads data from the inputs and calculates a weighted sum of all input values. Second, it applies the so-called activation function to the sum and exposes the result at its output. The value at the output of a neuron represents the activation (strength of response) of the neuron caused by the given input. The magnitude of the activation is affected by the specific weights applied in the weighted sum (the weights are set by training the neural network) and by the choice of activation function, which transforms numbers from a potentially infinite range (resulting from the sum) to a very narrow range, typically between 0 and 1¹. Neurons also typically have a so-called bias term that is trained together with the weights, although it is often not shown in diagrams. This term is added to the sum and shifts the whole input of the activation function. In this context, you will sometimes read about it as shifting the decision boundary.
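The two operations of a neuron can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the sigmoid activation is just one common choice that maps to the range between 0 and 1, and the input values, weights and bias are made up (real weights and bias come from training).

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Illustrative values only; in a real network they are learned.
activation = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=0.1)
print(activation)
```

Changing the bias moves the point at which the neuron starts to respond, without changing how strongly each individual input counts.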
(Deep) Neural Networks
A neural network is a set of neurons, typically arranged into layers², see image 3 below. The first layer stores input values of a network (e.g. representation of words of a sentence, see Word Embeddings) and it is called an input layer. The last layer stores the result of the network (e.g. predicted sentiment classes) and it’s called an output layer. The rest of the layers are so-called hidden layers and that’s where most of the artificial intelligence of a neural network is hidden. The more neurons a neural network contains, the more expressive it is and the more complex problems it is capable of solving.
Arranging the neurons into a larger number of narrower hidden layers (in contrast to shallow networks with wide hidden layers) enables data scientists to significantly decrease the number of neurons in the network while maintaining the original expressiveness. Thus, deeper networks allow us to build solutions to more complex problems without needing to consume significantly more computational resources. Once the network has 2 or more hidden layers, we say it is a deep neural network.
The Weakness of Feed-forward Neural Networks
As you can see in image 3 above, the architecture of such a neural network allows the information to flow in only one direction, straight from the input to the output layer (see the arrows). The input and output layers have a fixed size and there is no a priori structure in the input: everything is processed at once. Such an architecture has a poor notion of order and generally doesn’t perform well when processing sequential data (time series, speech, text, audio). While processing a sequence of numbers, this architecture can grasp that a value in neuron 2 has to be followed by a larger value in neuron 3 to detect “growth” in the data (which is a very specific rule just for neurons 2 and 3). But it cannot easily grasp that a value in neuron n has to be followed by a larger value in neuron n+1, which would be a general rule for recognizing “growth”. To achieve decent performance with such an architecture, a network needs an unnecessarily large number of neurons and a larger training data set.
Recurrent Neural Networks
Recurrent neural networks (RNNs) overcome the weakness of feed-forward neural networks by introducing a recurrent element with memory into the architecture. RNNs don’t process the whole input sequence at once (in contrast to the previously mentioned architecture); they process it entry by entry. In our case, instead of processing the whole sentence at once, an RNN consumes the input sentence iteratively, word by word. In every iteration:
- a word is read by the recurrent unit (e.g. recurrent hidden layer)
- last known state of the recurrent unit is loaded from memory
- the word and the loaded state are combined, resulting in a new state of the recurrent unit
- the result is stored in memory (replacing last known state) and can be forwarded deeper in the network
An a priori structure or order is introduced into data this way. Simply, an RNN reads a word while keeping a representation of a sequence of previously read words in memory and considering them during the analysis of the word, see image 4 below.
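The iteration described above can be sketched as a simple loop. This is a deliberately simplified illustration: the input entries and the state are single numbers instead of vectors, the tanh activation is an assumed (though common) choice, and the weights w_x, w_h and b are made-up values rather than trained ones.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """Combine the current entry with the last known state into a new state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Consume the sequence entry by entry and return the final state."""
    h = 0.0  # the memory starts empty
    for x in sequence:
        # read the entry, load the last state, combine them, store the result
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# The same entries in a different order lead to a different final state.
print(run_rnn([1.0, -1.0]))
print(run_rnn([-1.0, 1.0]))
```

Because the same weights are reused in every iteration, the loop works for sequences of any length, and the final state depends on the order of the entries.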
An RNN can, for example, read a sequence of words of arbitrary length and produce a single label at the output (N:1, N words are read and one sentiment class label is returned), read a single word and produce a whole sentence (a sequence of words) at the output (1:N, one word is read and a whole poem is generated), or read a sequence of words and produce another sequence of words (N:M, an English sentence of four words is translated into a French sentence of six words).
Word Embeddings
In general, neural networks are made to process numbers, not words. Therefore, applying them to NLP tasks, such as sentiment analysis, requires preprocessing of the input data to represent the analyzed texts with scalars (numbers) or vectors (lists of numbers).
A simple and widespread approach to a word representation is to represent a limited set of acceptable words with a set of orthonormal vectors — every word is represented with a unique vector consisting of all 0s and only one 1, it is called one-hot encoding. For example, for a set of four words: (good, bad, and, ugly), we get a set of four one-hot encoded word vectors: ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]).
As you can see, every word occupies its own dimension (position in a vector), which makes the individual word vectors independent from the mathematical point of view. That comes in handy when processing them with machine learning algorithms. The dimension of the vectors equals the number of words in the dictionary (only words recognized by the network), which is the key drawback of this approach. As the number of words grows, so do the dimension of the vectors, the size of the input layer and the number of neurons in the neural network. This leads to a higher demand for computational resources during the training and application of the network.
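A minimal sketch of one-hot encoding for the four-word example above. The helper names one_hot_encode and dot are our own, introduced only for this illustration.

```python
def one_hot_encode(vocabulary):
    """Map every word to a vector of 0s with a single 1 at the word's index."""
    size = len(vocabulary)
    return {
        word: [1 if i == index else 0 for i in range(size)]
        for index, word in enumerate(vocabulary)
    }

def dot(u, v):
    """Dot product; it is 0 for any two different one-hot vectors."""
    return sum(a * b for a, b in zip(u, v))

vectors = one_hot_encode(["good", "bad", "and", "ugly"])
print(vectors["bad"])                         # [0, 1, 0, 0]
print(dot(vectors["good"], vectors["bad"]))   # 0
```

The vector length always equals the vocabulary size, so a dictionary of tens of thousands of words produces vectors with tens of thousands of dimensions.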
Word embedding can overcome the high-dimensionality issue at the cost of losing the independence of the word vectors. It’s a process of reprojection of (one-hot encoded) word vectors from high-dimensional space into low-dimensional space.
This transformation is frequently designed not only to significantly reduce the dimensionality of the data but also to place similar words close to each other in the new space (what counts as similar depends on the definition of similarity). In other words, we move from a representation where every word occupies its own dimension and word vectors have nothing in common to a representation where words share dimensions and similar words have similar vectors. Thanks to this new property, neural networks can more successfully process unseen combinations of words.
For example, suppose we want to build a neural network that reads a sentence and outputs “true” or “false”, based on whether the sentence is or isn’t a fact. The sentence “A banana is a fruit.” is used during training of the network and the network learns it is “true”. Then, it has to process the sentence “An orange is a fruit.”, which it has never seen before. Thanks to word embeddings, the network knows “banana” and “orange” are similar to each other (they have very similar word vectors) and it evaluates the sentence in a similar way: “true”.
The calculation of word embeddings is a lengthy, computationally expensive process requiring a lot of data. Therefore, a common practice is to use precalculated embeddings (word vectors), such as word2vec³ or GloVe⁴, prepared by research teams who have the necessary resources. Most of these word embeddings use a similarity metric based on the context within a sentence or the frequency of co-occurring words, that is, a semantic similarity. There are also more advanced approaches that consider subwords or even generate embeddings on the fly based on the given combination of a word and its actual context (ELMo⁵, BERT⁶). But for all of them, similar words are words with similar meanings, which is not what we need.
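To make “very similar word vectors” concrete, here is a small sketch that compares vectors with cosine similarity, one common way to measure closeness. The three-dimensional vectors are invented for this illustration and do not come from any real embedding.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "banana": [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

fruit_pair = cosine_similarity(embeddings["banana"], embeddings["orange"])
mixed_pair = cosine_similarity(embeddings["banana"], embeddings["car"])
print(fruit_pair, mixed_pair)
```

With these toy values the two fruits score much closer to 1.0 than the banana and the car, which is the property the network can exploit for sentences it has never seen.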
Sentiment Word Embeddings
We decided to build our own word embedding in-house because we think that using the embeddings mentioned previously could confuse our algorithm. They use similarity metrics based on the similarity of meaning or context of words. Considering our use case, we suppose that a similarity metric based on the similarity of the sentiment of given words will lead to a better separation of positive and negative words. This helps with better identification of what kinds of things make whole sentences positive, negative or neutral.
For example, the words “good” and “bad” have a similar context in sentences but very different sentiments. On the other hand, the words “science” and “politics” have a different context in sentences but the sentiment is rather neutral in both cases.
You can see an example of a sentiment-based embedding in image 5 below. We have selected the 50,000 most frequent tokens (words, emojis, symbols, etc.) and created an embedding into 300-dimensional space. Then, only a subset of emojis was selected and their word vectors were projected onto a 2D plane using the t-SNE⁷ algorithm.
A number of experiments led us to a recurrent neural network with surprisingly simple architecture. In fact, it was the first baseline architecture we came up with and no following modification of the architecture brought a significant improvement to its performance. Our initial theory was that the task is so simple that the word embedding itself largely solved it, or we just didn’t have enough data to fully exploit the potential of the network. A diagram of the network can be seen in image 6 below.
The input layer of the neural network can consume a sequence of up to 50 one-hot encoded tokens⁸ (word vectors) from a set of 50,000 tokens (a dictionary) most frequently appearing in our training data set. We expect most of the processed texts to be very short — they’re social media messages. Longer messages are truncated to the first 50 tokens found in the dictionary.
The sequence of up to 50 word vectors is sent into the embedding layer. There, the 50,000-dimensional word vectors are embedded into 300-dimensional space. This layer is trained together with the rest of the neural network to help the network do its job in the best possible way. The example results in image 5 were generated right from this layer once we finished training the network.
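The input and embedding steps can be sketched as follows. This is a simplified illustration under several assumptions: the whitespace tokenizer, the four-token toy dictionary and the random embedding table are all hypothetical stand-ins, and the helper names encode and embed are ours. Tokens are represented by their dictionary indices instead of explicit one-hot vectors, which is the table lookup optimization mentioned in note 8.

```python
import random

MAX_TOKENS = 50      # sequence length limit from the text
EMBEDDING_DIM = 300  # embedding size from the text

def encode(tokens, dictionary, max_tokens=MAX_TOKENS):
    """Keep only tokens found in the dictionary, then truncate to max_tokens."""
    indices = [dictionary[t] for t in tokens if t in dictionary]
    return indices[:max_tokens]

def embed(indices, table):
    """Look up one embedding row per token index."""
    return [table[i] for i in indices]

# Hypothetical toy dictionary; the real one holds the 50,000 most frequent tokens.
dictionary = {"i": 0, "love": 1, "this": 2, "😍": 3}

# Untrained random table; in the real network these rows are learned.
random.seed(0)
table = [[random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in dictionary]

indices = encode("i love this unknownword 😍".split(), dictionary)
vectors = embed(indices, table)
print(indices)                          # [0, 1, 2, 3]
print(len(vectors), len(vectors[0]))    # 4 300
```

Looking up row i of the table gives the same result as multiplying the one-hot vector for token i by the embedding matrix, only much cheaper.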
The recurrent layer consumes the sequence of 300-dimensional word vectors and outputs a single 300-dimensional vector representing the whole input of the network. Here, we decided to use the unidirectional gated recurrent unit (GRU⁹). It’s much faster to train than the long short-term memory unit (LSTM¹⁰) or the bidirectional¹¹ versions of either. When we experimented with them, we didn’t see any significant change in the performance of the network, only much longer training times.
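For readers curious about what a GRU does in each iteration, here is a rough sketch of a single update step following the formulation in the GRU paper referenced in note 9. It is heavily simplified and should not be read as our implementation: the input and the state are single numbers instead of 300-dimensional vectors, the candidate state uses tanh as in that paper, and all parameter values are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar previous state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate decides how much of the old state is used.
    candidate = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the old state with the candidate.
    return z * h_prev + (1.0 - z) * candidate

# Made-up parameters; in a real network they are learned.
params = {
    "w_z": 0.5, "u_z": 0.5, "b_z": 0.0,
    "w_r": 0.5, "u_r": 0.5, "b_r": 0.0,
    "w_h": 1.0, "u_h": 1.0, "b_h": 0.0,
}

h = 0.0
for x in [1.0, -0.5, 2.0]:
    h = gru_step(x, h, params)
print(h)
```

The two gates are what let the unit decide, entry by entry, how much of its memory to keep and how much to overwrite.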
Next, the 300-dimensional representation of the input is processed with a fully connected layer. This is the very basic neural network layer we presented in image 3 and its only job is to derive more complex features from the 300-dimensional representation of the input sequence. We use the ReLU activation function in this and all previous layers. Again, it helps to train the network faster and didn’t inhibit its performance during our experiments.
Finally, the output layer combines all 300 features from the previous layer and calculates values for the 3 output neurons. Here, we use the softmax activation function. Its key property is that it considers all neurons in the given layer and normalizes their output values such that their sum is equal to 1.0. The values can then be interpreted as probabilities. In other words, the output of the network consists of the probabilities of the input text being positive (neuron #1), neutral (neuron #2) and negative (neuron #3). The neuron with the strongest activation is used to assign a sentiment to the analyzed text.
We have presented the simple architecture of the recurrent neural network we use for sentiment analysis of social media messages. It’s not bad to end up with such a simple model if you need to efficiently analyze hundreds of millions of messages a day. Moreover, it works very well considering the type of texts you’ll encounter.
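The output step can be sketched like this. The raw output values are made up for the illustration, the label order follows the neuron numbering in the text, and the helper name predict is ours.

```python
import math

LABELS = ["positive", "neutral", "negative"]  # neuron order from the text

def softmax(values):
    """Normalize raw output values into probabilities that sum to 1.0."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(raw_outputs):
    """Return the label of the strongest neuron together with all probabilities."""
    probabilities = softmax(raw_outputs)
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities

label, probabilities = predict([2.0, 0.5, -1.0])  # made-up raw outputs
print(label, probabilities)
```

Subtracting the largest value before exponentiating does not change the result; it only avoids overflow for large inputs.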
We have also briefly introduced some basic concepts from the area of neural networks to make it easier to understand what exactly we have built and to explain some decisions we made.
But the largest building block of any successful deep learning model is the data. In the next blog post, we will focus on this topic: where and how to get the right data and how to prepare it for training our model. That, and finally some implementation details.
Notes and References
¹: There are activation functions, such as ReLU, that don’t map values into finite range. For more details, see: https://arxiv.org/pdf/1811.03378.pdf
²: Neural networks may have more advanced architecture, see examples here: https://bura.brunel.ac.uk/bitstream/2438/14221/1/FullText.pdf
³: Word2vec word embedding
Efficient Estimation of Word Representations in Vector Space, 2013
⁴: GloVe word embedding
GloVe: Global Vectors for Word Representation, 2014
⁵: ELMo word embedding
Deep contextualized word representations, 2018
⁶: BERT word embedding
BERT: Pre-training of Deep Bidirectional Transformers for Language…, 2019
⁷: The t-SNE is an ML algorithm performing a nonlinear dimensionality reduction, see: http://www.cs.toronto.edu/~hinton/absps/tsne.pdf
⁸: One-hot encoded vectors are frequently stored in a table and calculations refer to entries in this table using indices instead of using the original vectors. It is one of many optimizations we can see in the implementations of neural networks.
⁹: GRU recurrent unit
Learning Phrase Representations using RNN Encoder–Decoder for St…, 2014
¹⁰: LSTM recurrent unit
Long short-term memory, 1997
¹¹: A bidirectional recurrent layer is basically a combination of two unidirectional layers reading an input from the opposing directions. When analyzing an entry of the input, it knows not just what preceded the entry but also what follows it.