A Method for Building a Strong Baseline Text Classifier

Daniel
Jun 7, 2019 · 12 min read


This post deals with domain-specific embeddings and how you can train your own powerful distributional model to solve whatever NLP downstream task you want. First, we'll go a bit deeper into the theory of word embeddings, and later I'll show you how you can easily transform them into powerful paragraph embeddings. I also want to discuss why transfer learning is not always the best option when it comes to domain-specific text data.

Structure of the post

  1. Domain Specific Text Data
  2. Word Embeddings
  3. FastText
  4. The Issue About Domain Specific Text
  5. Deep Averaging Models
  6. The Effect of DANs on Your Embeddings

Domain Specific Text Data

Let's talk about what is so special about domain-specific text data and why it's so important to distinguish between "normal" and domain-specific text. First, I refer to a domain as a definable problem field that may use its own terminology to share knowledge and to build a common understanding of its concepts. For example, the field of machine learning is a domain of its own, using specific words and terms that are given a specific meaning in a specific context.

It's important to understand that those terms and words can mean something completely different in another domain. If we speak about neural networks within the domain of machine learning, we would probably associate something like non-linear functions, loss functions or maybe different neural network architectures. If you are an expert in neurosurgery, your first associations would probably be something else. In linguistics, this is referred to as polysemy: the coexistence of many possible meanings for a word or phrase.

When it comes down to building a state-of-the-art text classifier, it basically reduces to two things: the distributed representation and the classification algorithm you apply. Distributed representations are also known as embeddings, and they are mainly based on the distributional hypothesis, which states that the meaning of a word is characterized by the company it keeps. In other words, a word's meaning is defined by the context in which it occurs. In general, methods to calculate embeddings can be divided into word embeddings and paragraph embeddings. Word embedding methods capture the semantics of a single word, while paragraph embeddings represent chunks of text as a single semantic vector. This article will mainly focus on distributed representations and ignore the choice of the right classifier algorithm.

At this point you should know what I understand as domain-specific text and what embeddings are. Before I tell you why it is important to distinguish between domain-specific and "normal" text, we need to grasp at least the fundamental theory behind word and paragraph embeddings (if you are already familiar with embedding techniques, you can skip the next chapter).

Word Embeddings

One of the pioneering methods is the skip-gram model by Mikolov et al. (2013). They used one of the simplest neural network architectures you can imagine. A one-hot encoded vector serves as input. The neural network uses a projection layer, which is a hidden layer without any activation function, and outputs a probability distribution that tells you how likely each word in your vocabulary is to appear in the context of your input word.

This means that if two words appear in very similar contexts, they have a similar meaning. The context of a word is determined by the window size, a parameter you can adjust before training. It acts like a filter that feeds the neural network with word pairs (McCormick visualized and explained the idea very well!). The network learns the statistics from the number of times each pairing shows up.
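To make the word-pair idea concrete, here is a minimal sketch of how (center, context) training pairs could be generated for a given window size; the example sentence and the window size are made up for illustration.

```python
# Minimal sketch: generating (center, context) training pairs for skip-gram.
# The sentence and window size are illustrative only.
def skipgram_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Look at the neighbours within the window on both sides of the center word.
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"], window_size=2))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]
```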

The Skip-Gram Architecture

When I first started to grasp the idea behind the model, my biggest issue was to understand the purpose and functionality of the projection layer. So let's take a more mathematical look to understand the fundamentals completely. The input is the word encoded as a one-hot vector, which has the size of our vocabulary. Let gw and gw' be the weight matrices of the neural net, which are initialized randomly, and let p be the number of neurons in the hidden layer.

The Skip-Gram Architecture in Detail

The idea of the projection layer gets pretty clear if we look at the multiplication between the input and the weight matrix. Due to the one-hot encoding, only the weights corresponding to our input word are selected; our hidden layer acts like a lookup table that holds the weights for every word in our vocabulary. From here on it's pretty straightforward. The hidden layer is multiplied by its weight matrix and fed into the output layer, represented by a softmax function, to obtain the probabilities of the context words.
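The lookup-table behaviour is easy to verify numerically. The following toy snippet (arbitrary vocabulary size and embedding dimension, nothing from the actual model) shows that multiplying a one-hot vector by the weight matrix just selects one row:

```python
import numpy as np

# Toy illustration of the projection layer: multiplying a one-hot vector by the
# weight matrix simply selects one row, i.e. the embedding of the input word.
vocab_size, p = 5, 3
W = np.random.rand(vocab_size, p)          # input-to-hidden weights (the embeddings)

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W                        # the matrix product ...
print(np.allclose(hidden, W[word_index]))   # ... is just a row lookup: True
```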

Since we are training a supervised model, we need a cost function to calculate the optimal weights (embeddings) in our network. The skip-gram architecture tries to maximize the average log probability of the context words given a certain input word, where the number of training words T is used for normalization. The context of a word wt is defined as follows:
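Written out in the notation of Mikolov et al. (2013), with window size c and T training words, the context window and the averaged objective look like this:

```latex
% Context window of size c around the center word w_t
C(w_t) = \{\, w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c} \,\}

% Skip-gram objective: average log probability of the context words
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log p\!\left(w_{t+j} \mid w_t\right)
```

The probability p(w_{t+j} | w_t) is exactly the softmax output described above.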

The "cost" of the loss function is calculated and backpropagated through the network. The network tells us, for every word in our vocabulary, the probability of being a context word given a specific center word. In other words, the probability distribution gives us the probabilities of which words are likely to appear in the context of a given input word.

But here comes the trick! This probability distribution will never be used as an embedding. The embeddings for each word are the rows of the weight matrix gw, which is adjusted through the process of backpropagation over time.

FastText

What you need to keep in mind is that the neural net can only produce embeddings for words it has seen. If you want your neural net to be able to infer unseen words, you need to retrain it! Also keep in mind that if a word appears only seldom, it won't have a semantically strong embedding, because less context is available. In addition, the surrounding words, i.e. the context, specify the semantics of the dense vector.

So you need to ask yourself two questions if you want to train or reuse an embedding model.

  1. Is the model trained on enough data to avoid out-of-vocabulary errors?
  2. Is the embedding model trained on data specific enough for your downstream task?

Regarding the first question, FastText seems to be a more robust model against out-of-vocabulary errors. Bojanowski et al. (2017) used the logic behind the skip-gram model and trained it on character n-grams. The final word embedding is calculated by combining its n-gram embeddings. This subword information makes it possible to find distributed representations for rare words in the vocabulary: since rare words can still be broken into character n-grams, they can share these n-grams with common words. It can even give vector representations for words which are not in the vocabulary at all, by splitting the word into its n-grams and combining these.
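Here is a minimal sketch using gensim's FastText implementation (assuming gensim ≥ 4; the toy corpus and parameters are made up) to show that an out-of-vocabulary word still gets a vector built from its character n-grams:

```python
from gensim.models import FastText

# Toy corpus, only to illustrate the subword idea (not the data from this post).
sentences = [
    ["neural", "networks", "learn", "representations"],
    ["word", "embeddings", "capture", "context"],
]

# min_n / max_n control the character n-gram lengths used as subwords.
model = FastText(sentences=sentences, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=6, epochs=10)

print("networking" in model.wv.key_to_index)  # False: never seen during training
vec = model.wv["networking"]                  # still works, built from shared n-grams
print(vec.shape)                              # (50,)
```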

Regarding the second question, whether you have enough data really depends on the problem you want to solve. If you only want to distinguish between two classes and you don't have to deal with very specific domain terminology, you may not need much data.

The Issue About Domain Specific Text

Domain-specific vocabulary is normally not used in our daily conversations, and when it is, it often carries a completely different meaning or context, which makes it really hard to handle. Using a non-domain-specific embedding model means that a word or term may never have been seen by the model, or that it is so seldom represented in the training data that the semantic value of its embedding is close to zero. It's also possible that the word we want to infer refers to a completely different semantic concept, which makes the embedding pretty useless.

Recently introduced models like ELMo, the Universal Sentence Encoder (USE) or BERT are able to produce high quality embeddings, which easily outperform simple word averaging or single word embeddings produced by Word2Vec or FastText. But it’s unlikely that you’ll have enough domain specific data to train these deep neural nets from scratch. However, it’s possible to use those models in the context of transfer learning.

The main idea behind transfer learning is that you can reuse a model which was developed for one task for another task. It's a common approach to retrain the model on your own annotated data to achieve fast and good results for your problem statement. In addition, the computational, time and data resources required are only a fraction of what would normally be needed to build your model from scratch.

At this point you may ask yourself why you should train your own embedding model at all. Companies like Google provide highly trained models (e.g. on TF-Hub) that you can use and retrain according to your own needs. Google probably has more know-how, data and computational resources than you have, right?

I can't say anything about BERT, but with the USE and ELMo I did not experience good results compared to training my own distributional models. Chen et al. (2018) showed similar results with biomedical text data and the USE. Those pretrained models probably achieve pretty good results on the kind of text we use in our normal conversations. But when it comes down to really domain-specific text data, they probably suffer from the same drawbacks mentioned before: words that never appeared in the training data, or that were learned in a completely different context, won't have the same semantic expressiveness as embeddings trained on a corpus that puts the right words in the right context.

I'm not saying those are bad models for your specific NLP downstream task! If you have enough domain-specific text data to retrain them, they'll probably outperform the methodology described here. But if we are honest with ourselves, it's hard enough to get annotated data like pictures or just some KPIs; extracting domain-specific text that is additionally labeled is another story.

Deep Averaging Models

Deep Averaging Networks (DAN) are a method to infer paragraph embeddings. In general, every paragraph embedding method is based on a composition function, which is a mathematical process for combining multiple word vectors into a single vector. These composition functions can be divided into two classes: unordered and syntactic functions. Syntactic functions take the word order into account to generate the paragraph embedding, while unordered functions treat the word vectors in a classic bag-of-words manner.

The idea of deep averaging networks was originally proposed by Iyyer et al. (2015) and later picked up in one of the two USE versions by Google. Based on existing word embeddings, they achieve quite impressive results compared to other, computationally heavier methods. The fundamental functionality of the neural network architecture is illustrated below:

Deep Averaging Network Architecture

The model works in three simple steps:

  1. First, all word vectors of a paragraph are averaged. The authors mention that the model can be improved by applying a dropout-inspired regularization technique: for each training instance, some of the word tokens are randomly dropped.
  2. Afterwards, the average of the word embeddings is passed through several hidden layers; it's up to you to choose the right number of them (more on that later).
  3. The last step is to perform a classification on the final layer's representation. Keep in mind that you'll work here with the labeled data you want to use in your succeeding classification task (a minimal sketch of these three steps follows below).
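The sketch below shows one way these three steps could look in code, assuming you already have per-token FastText vectors and integer class labels. The layer sizes, the optimizer and the helper function are illustrative assumptions, not the exact setup from the thesis; only the word dropout rate of 0.3 and the final softmax classification follow the description above.

```python
import numpy as np
import tensorflow as tf

def average_with_word_dropout(token_vectors, p_drop=0.3, training=True):
    """Step 1: average the word vectors of a paragraph, randomly dropping tokens."""
    token_vectors = np.asarray(token_vectors)
    if training:
        keep = np.random.rand(len(token_vectors)) > p_drop
        if keep.any():                      # never drop *all* tokens
            token_vectors = token_vectors[keep]
    return token_vectors.mean(axis=0)

# Steps 2 and 3: a few dense layers on top of the averaged embedding, then softmax.
embedding_dim, n_classes = 300, 3           # e.g. pattern mining / clustering / prediction
dan = tf.keras.Sequential([
    tf.keras.Input(shape=(embedding_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
dan.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])

# X: word-dropout averages of your paragraphs, y: integer class labels.
# dan.fit(X, y, epochs=10, batch_size=32)
```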

The Effect of DANs on Your Embeddings

Let's take a look at the picture below. It shows averaged FastText embeddings in comparison to DAN embeddings, trained on about 7,000 short paragraphs about machine learning. I trained those embeddings to decide whether a written problem statement can be solved by a machine learning algorithm. The discrimination was based on the classes Pattern Mining, Clustering and Prediction (Classification or Regression).

The right plot of the figure below uses a FastText model trained on those articles. I used example sentences like "I want to predict costs" and visualized them in a two-dimensional vector space after applying a PCA dimensionality reduction. The DAN was trained on top of the FastText embeddings with a dropout rate of 0.3 and two hidden layers.
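If you want to reproduce a plot like this, the snippet below sketches the FastText baseline: average the token vectors of each example sentence and project them to 2D with PCA. The example sentences are invented, and `model` is assumed to be a trained gensim FastText model like the one from the earlier snippet.

```python
import numpy as np
from sklearn.decomposition import PCA

# Example sentences (invented); in the experiment there was one set per class.
examples = [
    "i want to predict costs",
    "group customers by purchasing behaviour",
    "find frequent item combinations in baskets",
]

def sentence_vector(model, sentence):
    # Unordered composition: average the FastText vectors of the tokens.
    return np.mean([model.wv[token] for token in sentence.split()], axis=0)

vectors = np.stack([sentence_vector(model, s) for s in examples])
points_2d = PCA(n_components=2).fit_transform(vectors)
print(points_2d)  # 2D coordinates you could scatter-plot per class
```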

DAN vs FastText BOW-Averaging

Using some example sentences for each class and a DAN, we can see that embeddings falling into the same class have moved closer to each other. In general, the embeddings seem to be better separated in the vector space.

Another interesting observation is that the number of hidden layers seems to influence how strongly the embeddings are separated. The following picture illustrates ~7,000 text samples with averaged FastText embeddings. The Pattern Mining class is separated quite well from the other classes, but the Clustering and Prediction categories overlap, which can negatively influence a succeeding classification.

The next picture illustrates the FastText embeddings after they were fed into different DAN models. As you can see, the number of hidden layers highly influences the separation of those classes. I used six different DAN models with two up to seven hidden layers. Those models were trained for 10 epochs with a word dropout rate of 0.3.

DANs are dragging the embeddings apart

As you can see, even a small number of two hidden layers separates the distributed representations quite well in comparison to the FastText averages.

By using this methodology, you can tune your averaged word embeddings to be better separated in the vector space and thereby potentially gain more accuracy in succeeding downstream tasks.

The authors did something similar in the DAN paper using the concept of perturbation. In particular, they took a template sentence and replaced a word with words of increasingly negative polarity (cool, okay, underwhelming, the worst).

They just compared the 1-norm of the paragraph embeddings at each layer, which is calculated by adding up the absolute values of all entries in the vector.
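In symbols, for a layer's representation h (notation chosen here, not taken from the paper):

```latex
\lVert h \rVert_1 = \sum_{i} \lvert h_i \rvert
```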

They wanted to show how much the hidden-layer representations of the perturbed sentences differ from those of the original one. They claimed that the deeper the network gets, the more the differences between negative and positive sentences are amplified.

These research results were part of my master's thesis, which I wrote together with my friend Richard. If you are interested in the work we did, you can check out the GitHub page or the preceding paper of our master's thesis. Just leave a comment if you are interested in a follow-up post about a specific downstream task where we used this methodology and compared it to other embedding techniques! :)

References

Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146.

Chen, Q., Peng, Y. & Lu, Z. (2018). BioSentVec: creating sentence embeddings for biomedical texts. arXiv preprint arXiv:1801.09536.

Iyyer, M., Manjunatha, V., Boyd-Graber, J. & Daumé III, H. (2015). Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Long Papers), 1, 1681–1691.

Perone, C. S., Silveira, R. & Paula, T. S. (2018). Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259.
