Unsupervised NLP: How I Learned to Love the Data

ODSC - Open Data Science
13 min read · Apr 28, 2020

There has been vast progress in Natural Language Processing (NLP) in the past few years. The landscape has shifted dramatically: older techniques governed by hand-crafted rules and statistical models are quickly being outpaced by more robust machine learning and, now, deep learning-based methods. In this article, we’ll discuss the burgeoning and relatively nascent field of unsupervised learning: we will see how the vast majority of available text information, in the form of unlabelled text data, can be used to build useful analyses. In particular, we will comment on topic modeling, word vectors, and state-of-the-art language models. As with most unsupervised learning methods, these models typically act as a foundation for harder and more complex problem statements.

Topic Modeling

Traditionally, topic modeling has been performed via mathematical transformations such as Latent Dirichlet Allocation and Latent Semantic Indexing. Such methods are analogous to clustering algorithms in that the goal is to reduce the dimensionality of ingested text into underlying coherent “topics,” which are typically represented as some linear combination of words. The standard way of creating a topic model is to perform the following steps:

  • Tokenize the text
  • Transform the tokenized text in each document into a vector of (weighted) counts of words and create a document-word count matrix
  • Train the topic model (e.g., LDA) on the resultant matrix

Let’s briefly examine these basic steps to just set the stage, as many in-depth guides and tutorials are available online:

Tokenization

The raw text is split into “tokens,” which are effectively words, with the caveat that grammatical nuances in language such as contractions and abbreviations need to be addressed. A simple tokenizer would just break the raw text at each space; for example, a word tokenizer can split up the sentence “The cat sat on the mat” as follows:
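As a minimal sketch (real pipelines typically rely on a library tokenizer such as NLTK’s or spaCy’s, so treat this purely as an illustration), a simple Python tokenizer might look like:

    import re

    def simple_tokenize(text):
        # Lower-case and split on anything that is not a letter or digit
        return [tok for tok in re.split(r"\W+", text.lower()) if tok]

    print(simple_tokenize("The cat sat on the mat."))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']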

Transformation

Having tokenized the text, we often perform some data cleaning (e.g., stemming, lemmatizing, lower-casing, etc.), although for large enough corpora these steps become less important. The cleaned and tokenized text is then counted: we record how frequently each unique token appears in a selected input, such as a single document.

This is repeated for all documents, with a column for every unique word in the entire corpus, so that we end up with a table containing one row per document and one column per unique word.

Notice that since punctuation and articles are likely to appear frequently in all texts, it is common practice to down-weight them using methods such as Term Frequency-Inverse Document Frequency (tf-idf) weighting; for simplicity, we will ignore this nuance here.
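As a rough sketch of this transformation step, scikit-learn’s CountVectorizer (and, for the optional weighting, TfidfVectorizer) can build the document-word matrix for a toy two-document corpus; the corpus below is an illustrative assumption, and a recent scikit-learn version is assumed:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["The cat sat on the mat.",
            "The dog sat on the log."]

    # Document-word count matrix: one row per document, one column per unique word
    count_vec = CountVectorizer()
    X_counts = count_vec.fit_transform(docs)
    print(count_vec.get_feature_names_out())   # the vocabulary (column labels)
    print(X_counts.toarray())                  # raw counts per document

    # tf-idf weighting down-weights words that appear in most documents (e.g., 'the')
    tfidf_vec = TfidfVectorizer()
    X_tfidf = tfidf_vec.fit_transform(docs)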

Topic Modeling Algorithm

Traditionally, topic modeling has been performed via algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), whose purpose is to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. In some sense, these algorithms look for words that are used in the same context, since such words often have similar meanings, and the methods are analogous to clustering algorithms in that the goal is to reduce the dimensionality of text into underlying coherent “topics,” which are typically represented as some linear combination of words.

The more popular algorithm, LDA, is a generative statistical model which posits that each document is a mixture of a small number of topics and that each topic is characterized by a distribution over words. The goal of LDA is thus to learn a word-topic distribution and a topic-document distribution whose product approximates the observed word-document data:

[Figure: matrix factorization V ≈ W × H. Source: https://en.wikipedia.org/wiki/Non-negative_matrix_factorization#/media/File:NMF.png]

where V represents the tf-idf matrix with words along the vertical axis and documents along the horizontal axis, i.e., V = (words, documents); W represents the (words, topics) matrix; and H the (topics, documents) matrix, so that V ≈ W × H.

In this way, the matrix decomposition gives us a way to look up a topic and the weight associated with each word (a column of the W matrix), and also a means to determine the topics that make up each document (the columns of the H matrix). The important idea is that the topic model groups words that frequently co-occur into coherent topics; however, the number of topics must be set in advance. Typically, the number of topics is initialized to a sensible value through domain knowledge and is then optimized against metrics such as topic coherence or document perplexity.
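For illustration only, here is a minimal LDA sketch using scikit-learn’s LatentDirichletAllocation on a tiny made-up corpus (a recent scikit-learn version is assumed; gensim’s LdaModel is a common alternative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["The cat sat on the mat.",
            "The dog sat on the log.",
            "Stocks rallied as markets opened.",
            "Investors sold shares as markets fell."]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                  # document-word count matrix

    # n_components is the number of topics and must be chosen up front
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)            # (documents x topics) weights
    terms = vec.get_feature_names_out()

    # Print the highest-weighted words in each learned topic
    for k, topic in enumerate(lda.components_):  # components_ is (topics x words)
        top = topic.argsort()[::-1][:4]
        print(f"Topic {k}:", [terms[i] for i in top])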

Example Use-Cases

Topic modeling, like clustering in general, is nuanced in its use-cases, as it often underlies broader applications and document-handling or automation objectives. The direct goal of extracting topics is usually to quickly form a general high-level understanding of large text corpora. One can thus aggregate millions of social media entries, newspaper articles, product analytics, legal documents, financial records, feedback and review documents, etc., and relate them to other known business metrics to form trends over time.

Further downstream analyses, such as document classification (of which sentiment analysis is one example), synonym finding, or language understanding, can use topic models as an input building block in these broader or more task-specific pipelines.

Word Vectors

NLP tasks have made use of representations ranging from simple one-hot encoded vectors to more complex and informative embeddings such as Word2vec and GloVe. If a collection of word vectors encodes contextual information about how those words are used in natural language, it can be used in downstream tasks that depend on having semantic information about those words in a machine-readable format.

In the above case of a list of word tokens, a sentence could be turned into a vector, but that alone fails to indicate the meaning of the words used in that sentence, let alone how those words relate to words in other sentences. To address this problem, the representation of a word should carry its context with respect to other words. To capture this, word vectors can be created in a number of ways, from simple and uninformative to complex and descriptive.

The simplest way of turning a word into a vector is through one-hot encoding. Take a collection of words, and each word will be turned into a long vector, mostly filled with zeros, except for a single value. If there are ten words, each word will become a vector of length 10. The first word will have a 1 value as its first member, but the rest of the vector will be zeros. The second word will have only the second number in the vector be a 1. And so on. With a very large corpus with potentially thousands of words, the one-hot vectors will be very long and still have only a single 1 value. Nonetheless, each word has a distinct identifying word vector.
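A toy sketch of one-hot encoding over a hypothetical ten-word vocabulary (the vocabulary and its ordering are made up purely for illustration):

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat", "dog", "log", "ran", "fast", "slow"]
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        # A length-10 vector that is all zeros except for a single 1
        vec = np.zeros(len(vocab))
        vec[index[word]] = 1.0
        return vec

    print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]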

However, such a vector supplies extremely little information about the words themselves, while using a lot of memory with wasted space filled with zeros. A word vector that used its space to encode more contextual information would be superior. The primary way this is done in current NLP research is with embeddings.

Embeddings

One way of encoding the context of words is to create a way of counting how often certain words pair together. Consider this sentence again: “The cat sat on the mat.” In this example, the pairing can be achieved by creating a co-occurrence matrix with the value of each member of the matrix counting how often one word coincides with another, either just before or just after it. Larger distances between words can also be considered, but it is not necessary to explore that for now.

This simple example shows that a ‘cat’ is something that does something (‘sat’). Conversely, ‘the’ does not appear next to ‘sat’, hinting at a point of grammar (namely, articles do not go with verbs). With such a short sentence it is very difficult to know what a ‘cat’ is or what a cat does, let alone what it means for a cat to have ‘sat on’ something. However, as more sentences are ingested, more context is encoded into this simple counting matrix.
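A minimal sketch of such a co-occurrence count, using a window of one word on either side and the example sentence above:

    from collections import defaultdict

    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    window = 1   # count only the word immediately before or after

    cooc = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(word, tokens[j])] += 1

    print(cooc[("cat", "sat")])   # 1 -- 'cat' appears next to 'sat'
    print(cooc[("the", "sat")])   # 0 -- 'the' never neighbours 'sat'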

Instead of counting words in a corpus and turning the counts into a co-occurrence matrix, another strategy is to use each word in the corpus to predict the words around it. Looking through a corpus, one could generate counts for adjacent words and turn the frequencies into probabilities (cf. n-gram prediction with Kneser-Ney smoothing), but instead a technique that uses a simple neural network (NN) can be applied. There are two major architectures for this, but here we will focus on the skip-gram architecture, shown below.

[Figure: the architecture of the skip-gram model. Source: https://www.researchgate.net/figure/The-architecture-of-Skip-gram-model-20_fig1_322905432]

In skip-gram, you take a word and try to predict the words most likely to appear around it. This strategy can be turned into a relatively simple NN architecture that runs in the following basic manner. From the corpus, a word is taken in its one-hot encoded form as input. The output of the NN is the set of context words, as one-hot vectors, surrounding the input word. The number of context words, C, defines the window size, and in general, more context words will carry more information.

The NN is trained by feeding it a large corpus, and the embedding layers are adjusted to best predict the context words. This process creates weight matrices that densely carry contextual, and hence semantic, information from the selected corpus.
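As a hedged sketch, this kind of skip-gram training is exposed by the gensim library’s Word2Vec class (sg=1 selects skip-gram; gensim 4.x parameter names are assumed). The tiny corpus below is purely illustrative, since useful embeddings require millions of sentences:

    from gensim.models import Word2Vec

    # Each "sentence" is a list of tokens; a real corpus would contain millions
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "log"]]

    # sg=1 selects the skip-gram architecture; window sets the number of context words
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

    vec = model.wv["cat"]                        # the learned 50-dimensional embedding
    print(model.wv.most_similar("cat", topn=3))  # nearest words in the embedding space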

Use Cases & Applications

Some examples where word vectors can be directly used include synonym generation, auto-correct, and predictive text applications. Similar methods are used by Google and other search tools to perform fuzzy searches, and comparable internal search capabilities can be applied within organizations’ catalogs and databases. Further, since the embedding spaces are typically well-behaved, one can also perform arithmetic operations on the vectors. This means that embeddings capture not just similarities between words but also higher-level concepts, which allows for logical analogies. For example, Rome is to Italy as Beijing is to China: word embeddings are able to take such analogies and output plausible answers directly.
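For instance, a sketch of this analogy arithmetic using gensim’s downloader and a publicly hosted set of pre-trained GloVe vectors (the model name and its lower-cased vocabulary are assumptions here, and the download takes some time):

    import gensim.downloader as api

    # vec(Italy) - vec(Rome) + vec(Beijing) should land near vec(China)
    wv = api.load("glove-wiki-gigaword-100")
    result = wv.most_similar(positive=["italy", "beijing"], negative=["rome"], topn=1)
    print(result)   # expected to be close to 'china'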

Finally, almost all other state-of-the-art architectures now use some form of learnt embedding layer and language model as the first step in performing downstream NLP tasks. These downstream tasks include document classification, named entity recognition, question-answering systems, language generation, machine translation, and many more.

Language Models

Language modeling is the task of learning a probability distribution over sequences of words, and it typically boils down to building a model capable of predicting the next word, sentence, or paragraph in a given text. Note that the skip-gram models mentioned in the previous section are a simple type of language model, since such a model can be used to represent the probability of word sequences. The standard approach is to train a language model on large amounts of sample text in the language, which enables the model to learn the probability with which different words appear together in a given sentence.
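As a toy illustration of what “learning a probability distribution over sequences” means, a bigram model estimates P(word | previous word) from relative frequencies; this tiny, unsmoothed example is purely for intuition:

    from collections import Counter

    corpus = ["the cat sat on the mat", "the dog sat on the log"]

    # Count unigrams and adjacent word pairs (bigrams) over the toy corpus
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def next_word_prob(prev, word):
        # P(word | prev) estimated from relative frequencies (no smoothing)
        return bigrams[(prev, word)] / unigrams[prev]

    print(next_word_prob("the", "cat"))   # 0.25: 'the' occurs 4 times, 'the cat' once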

Recently, neural network-based language models (LMs) have shown better performance than classical statistical language models such as n-grams, Hidden Markov Models, and rule-based systems. Neural network-based models achieve this higher performance by:

  • Training on increasingly larger corpus sizes with only a linear increase in the number of parameters
  • Parameterizing word embeddings and using them as inputs to the models
  • Capturing contextual information at both the word and sentence level

NN-based language models are the backbone of the latest developments in natural language processing, an example of which is BERT, short for Bidirectional Encoder Representations from Transformers.

As the name suggests, the BERT architecture uses attention-based Transformers, which enable increased parallelization, potentially resulting in reduced training time for the same number of parameters. Thanks to the breakthroughs achieved with attention-based Transformers, the authors were able to train the BERT model on a large text corpus combining Wikipedia (2,500M words) and BookCorpus (800M words), achieving state-of-the-art results on various natural language processing tasks.

BERT, like other published works such as ELMo and ULMFiT, was trained on contextual representations of a text corpus rather than in the context-free manner of classic word embeddings. A contextual representation takes into account both the meaning and the order of words, allowing the model to learn more information during training. The BERT algorithm, however, differs from the aforementioned algorithms in its use of bidirectional context, which lets each word’s representation be informed by the words on both its left and its right.

In practice, this means BERT represents a word such as “bank” using both its left and right context, starting from the very bottom of the neural network.

BERT introduced two different objectives used in pre-training: a masked language model, which randomly masks 15% of the words in the input and trains the model to predict the masked words, and next-sentence prediction, which takes in a sentence pair and determines whether the second sentence actually follows the first or is a random sentence. The combination of these training objectives yields a solid understanding of words, while also enabling the model to learn longer-range word/phrase context that spans sentences. These features make BERT an appropriate choice for tasks such as question answering or sentence comparison.
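As a quick hedged example, the masked-language-model half of this pre-training can be probed through Hugging Face’s fill-mask pipeline (the transformers library and the bert-base-uncased checkpoint are assumptions here, not part of the original BERT release):

    from transformers import pipeline

    # The fill-mask pipeline exposes BERT's masked-language-model head
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    for pred in fill_mask("The man went to the [MASK] to withdraw some money."):
        print(pred["token_str"], round(pred["score"], 3))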

Example Use Case & Applications

A pre-trained BERT model can be further fine-tuned for a specific task such as general language understanding, text classification, sentiment analysis, Q&A, and so on. Fine-tuning is accomplished by swapping in the appropriate inputs and outputs for the given task and, potentially, allowing all the model parameters to be optimized end-to-end.

[Figure: fine-tuning BERT for downstream tasks. Source: Devlin et al., 2019]
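A minimal fine-tuning sketch using the Hugging Face transformers Trainer is shown below; the two-example sentiment dataset is a toy placeholder (a real task would use thousands of labelled examples), and the library itself is an assumption rather than the authors’ original code:

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)      # adds a fresh classification head

    # A toy labelled dataset; a real task would use thousands of examples
    texts, labels = ["great movie", "terrible plot"], [1, 0]
    enc = tokenizer(texts, truncation=True, padding=True)

    class ToyDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=3),
        train_dataset=ToyDataset(enc, labels),
    )
    trainer.train()   # all BERT weights plus the new head are optimized end-to-end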

For example, a publicly available dataset used for the question-answering task is the Stanford Question Answering Dataset 2.0 (SQuAD 2.0). SQuAD 2.0 is a reading comprehension dataset consisting of over 100,000 questions [which has since been adjusted], where only about half of the question/answer pairs contain answers to the posed questions. The goal of such a system is therefore not only to provide the correct answer when one is available, but also to refrain from answering when no viable answer is found. Using the SQuAD 2.0 dataset, the authors showed that the BERT model gave state-of-the-art performance close to that of the human annotators, with F1 scores of 83.1 and 89.5, respectively.

In addition to the end-to-end fine-tuning approach used in the example above, the BERT model can also be used as a feature extractor whose fixed outputs feed a separate, task-specific model. This is important for two reasons: 1) tasks that cannot easily be represented by a Transformer encoder architecture can still take advantage of pre-trained BERT models, which transform the inputs into a more separable space, and 2) the computational time needed to train the task-specific model is significantly reduced. For instance, fine-tuning a large BERT model may require over 300 million parameters to be optimized, whereas training an LSTM model whose inputs are features extracted from a pre-trained BERT model requires optimizing only roughly 4.5 million parameters.

In one such example, the features extracted from a pre-trained BERT model are used for Named Entity Recognition (NER), whose goal is to identify and categorize named entities by extracting the relevant information. CoNLL-2003 is a publicly available dataset often used for the NER task. The tokens in the CoNLL-2003 dataset were fed to the pre-trained BERT model, and the activations from multiple layers were extracted without any fine-tuning. These extracted embeddings were then used to train a 2-layer bi-directional LSTM, achieving results comparable to the fine-tuning approach, with F1 scores of 96.1 vs. 96.6, respectively.
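A hedged sketch of this feature-extraction step with the transformers library (the library, model name, and example sentence are all illustrative assumptions): BERT’s weights stay frozen, and the per-token activations would then feed a small downstream model such as the bi-LSTM tagger described above.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    bert.eval()   # BERT is used as a frozen feature extractor, not fine-tuned

    inputs = tokenizer("George Washington was born in Virginia .", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)

    # One contextual vector per token from the last layer (or a mix of the last few)
    token_features = outputs.last_hidden_state   # shape: (1, num_tokens, 768)
    # These fixed features would then be fed to a small model such as a bi-LSTM tagger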

Final Thoughts

As a quick summary: the reason we are here is that machine learning has become a core technology underlying many modern applications, and we use it every day, from Google search to every time we pick up a cell phone. This is especially true of natural language processing, which has made tremendous advances in the last few years. Today, enterprise development teams are looking to leverage these tools, powerful hardware, and predictive analytics to drive automation and efficiency and to augment professionals. Simple topic-modeling-based methods such as LDA were proposed in the early 2000s, word embeddings followed in the early 2010s, and more general language models built from LSTMs (not covered in this blog entry) and Transformers have emerged in the past few years. This remarkable progress has led even more complicated downstream use-cases, such as question-answering systems, machine translation, and text summarization, to start pushing toward and above human levels of accuracy. Coupled with effectively infinite compute power, natural language processing models will revolutionize the way we interact with the world in the coming years.

Michael Luk has more than 10 years of experience in developing and delivering hybrid product and service solutions to life sciences and healthcare clients. As the CTO of SFL Scientific, he focuses on managing and developing business operations for companies ranging from start-ups to multi-billion dollar enterprises. Michael has provided clients with innovative, practical solutions by improving operations and integrating technology through the development of novel data-driven systems. Michael Luk is an expert in machine learning and AI and has vast experience in time-series modeling. He studied theoretical physics at Imperial College London and mathematics at the University of Cambridge before completing his doctorate in particle physics at Brown University. SFL Scientific is a turn-key data science consulting firm that offers custom development and solutions in data engineering, machine learning, and predictive analytics. SFL uses specific domain knowledge, innovation, and the latest technical advances to solve complex and novel business problems.

Original post here.

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.
