A light introduction and roadmap to Natural Language Processing.
In the journey of a data science student, there are four huge bumps in the learning curve: being introduced to Deep Learning, Computer Vision, Reinforcement Learning, and probably the most theory-intensive application of Machine Learning: NLP. This article aims to smooth out your Natural Language Processing learning journey by giving you a roadmap and a light introduction to what to learn, and when, to become a practitioner with decent application skills.
Remember that your best guides are courses, research papers, Kaggle notebooks, and articles by actual researchers. But if you're lost in the hubbub of the data science world and the mammoth piles of learning material seem intimidating, this article is for you. I've tried to distill my knowledge of different NLP systems into a map of how to go about it, so I hope you enjoy this read!
1. What is NLP and how to go about it?
We know that the computer does not comprehend human language on its own: we use words, complicated phrases, and grammar, while it uses machine code, binary digits, and the high-level programming languages built on top of them. This is where natural language processing comes into play. It is, at its core, a set of techniques that let programs learn from human languages, and I will give you an overview of those techniques below.
Research in the field of NLP is still evolving, because spoken language has varied structure, is sometimes multilingual, and different languages follow different rules. Even so, we can see simple applications all around us, such as autocorrect, sentiment analysis of texts (the mood-detecting features in some apps), chatbots, and question-answering bots. The possibilities once computers start to understand human languages are limitless. So let us begin with Python:
NLTK and Scikit-Learn for simple NLP:
This is where your NLP journey should begin. NLTK (Natural Language Toolkit) is the most robust suite of Python libraries for making sense (disambiguation) of language data. The first step in any machine learning task is to clean the data, process it into a form fit for numerical conversion (since machine learning operates only on numerical data), and then fit it into a model as per your requirement: classification or regression.
- Suppose your data is a huge dataframe with a sentence in every row, which it often will be. Your first task is to tokenize the data. This means splitting sentences into their words and storing them in a 2-dimensional array. Tokenising comes in many flavours: paragraphs into sentences, documents into paragraphs, and so on; it essentially means making smaller units to operate on when the data is huge. You can do it manually using string manipulation, or use the tokenizers documented in nltk or keras, as in the sketch below.
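A minimal sketch using nltk's tokenizers (the sample sentence is just an illustration):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download("punkt")  # tokenizer models, needed once
text = "NLP is fun. It is also vast."
sentences = sent_tokenize(text)                 # ['NLP is fun.', 'It is also vast.']
tokens = [word_tokenize(s) for s in sentences]  # 2-D list: words per sentence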
- Cleaning the data is too general a phrase, and it differs for different purposes. As you delve into NLP you will find novel preprocessing approaches suited to your task, but the most generally applicable one is to remove stopwords: words, such as articles, that do not contribute to learning from the data. You can import stopwords from nltk.corpus.
from nltk.corpus import stopwords
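For instance, a quick filter (reusing text from the tokenization sketch above):
nltk.download("stopwords")  # the stopword lists, needed once
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]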
- Another technique when working with corpora of text data is to implement stemming to clean the data. This means stripping affixes from words to obtain their root form, so that variants are considered the same by your algorithm, which obviously saves time: for example, musically becomes music, happily becomes happy, etc. A sketch follows.
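A minimal sketch with NLTK's Porter stemmer (note that stemmer output can be crude, since it mostly just chops suffixes):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("musically"))  # 'music'
print([stemmer.stem(w) for w in ["running", "runs", "ran"]])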
- We now have more or less cleaned data. The next step is to create word embeddings. As the name suggests, these are nothing but words mapped into vectors of numbers, which we will call word vectors. There are many techniques to approach this, the simplest being CountVectorizer plus TF-IDF (read up on how this works), which together make up the Bag of Words approach. Some general code to show you the implementation:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# note that analyzer can be passed a custom text-cleaning function
fit_BOW = CountVectorizer(analyzer = Custom_text_preprocess_fn)
BOW_word_vectors = fit_BOW.fit_transform(data)

fit_Tfidf = TfidfTransformer()
wordvectors = fit_Tfidf.fit_transform(BOW_word_vectors)
- Your data is now a matrix of numbers, and you may use any machine learning algorithm you see fit; from here on it's a standard ML problem.
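For example, a hedged end-to-end sketch that feeds the vectors from above into a Naive Bayes classifier (labels is an assumed array of target classes for your own data):
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(wordvectors, labels, test_size=0.2)
clf = MultinomialNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out messages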
2. An Overview of Word-Embeddings for Deep NLP
So the intuition behind forming word embeddings is simple: algorithms learn from numbers, and it's easier to fit matrices than raw words into architectures. This requires tokenisation and cleaning as mentioned above, but there are many techniques for forming word vectors from the data you want to operate on. The first and foremost would be Bag of Words.
- A Bag of Words model is essentially a count of how many times a word occurs in a message (called term frequency), written in a vector side by side with the other word counts in that message. Along with this, we apply the inverse document frequency transform (TF-IDF), which weights the counts so that words appearing in many documents get lower weight (less importance), while rarer, more distinctive words get higher weight. However, this is not enough on its own, as it misses important information in the data (e.g., "lead" the verb and "lead" the metal are the same token and will be treated the same).
- The simplest way to check whether the word embeddings made by your algorithm actually make sense (i.e., are accurate) is to use them to solve word analogies. Simply put, you look for 4 words in a proportional relationship, for example king - queen = man - woman. Put into a general format, that is x - y = a - b. The goal is to test the embeddings: take 3 words in such a relationship and loop through the embeddings to find a fourth word that completes it. If that word is the right one, your embeddings are accurate! Since x - y = a - b, we can write (theoretically) a = x - y + b, where the variables are vectors. Now that the value of x - y + b is known, the vector in the embeddings closest to it is the answer, and in the given analogy, if a = "man", our algorithm works well. (See the gensim sketch below for this check in code.)
- In deep NLP, we use pre-trained word vectors such as Word2Vec and GloVe vectors to represent the words in our own data. This is accomplished using the Python Gensim library. The major difference between the two is how they are trained: Word2Vec learns by prediction, using the Continuous Bag of Words and Skip-Gram objectives, while GloVe is count-based, built on matrix factorisation of a word co-occurrence matrix (related to Singular Value Decomposition, SVD). Word2Vec is Google's baby and GloVe is Stanford's baby. That's all you need to know for a light introduction.
import gensim
from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
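A hedged sketch that downloads a small pre-trained GloVe model through gensim's downloader and runs the analogy check described above (the model name is one of gensim's standard pre-packaged sets):
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # pre-trained 100-d GloVe vectors
# x - y + b: king - man + woman should land near "queen"
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))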
- Another approach is to treat each word in your data as a category of its own and perform standard categorical encoding on the data, such as One Hot Encoding. This results in huge sparse matrices, as each vector is essentially a single "1" among all "0"s; to these we apply an embedding matrix E, loosely analogous to a convolution filter, to extract features from the data into a dense embedding vector.
from sklearn.preprocessing import OneHotEncoder
onh = OneHotEncoder()
wordvectors = onh.fit_transform(data)
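(Note that OneHotEncoder expects a 2-D array: one row per word.) To see what the matrix E does, here is a toy sketch; the sizes and the random E are purely illustrative, since in practice E is learned (e.g., as a Keras Embedding layer):
import numpy as np

vocab_size, embed_dim = 10000, 64
E = np.random.randn(embed_dim, vocab_size) * 0.01  # stands in for a learned embedding matrix
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0               # word number 42 in the vocabulary
word_vector = E @ one_hot       # simply selects column 42 of E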
3. Important algorithms to take note of
Markov and Hidden Markov Models
A few of the most popular language models out there are the bigram, trigram and N-gram models. They operate on the Markov assumption that to predict the next word, all we need to see is the previous word (or, for an N-gram, the previous N-1 words). For N=2 we call it a bigram; mathematically this means:
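P(w_n | w_1, ..., w_(n-1)) ≈ P(w_n | w_(n-1))
That is, the probability of the next word given the entire history collapses to its probability given just the word before it.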
The trigram and N-gram models just take a larger context of words into account to predict the occurrence of the next word, using the famous Bayes' theorem coupled with the Markov assumption. You should already be guessing that autocomplete is an application of such models. This is often implemented through neural networks, popularly called the neural bigram model; a toy count-based version is sketched below.
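A toy, count-based bigram model on a made-up corpus, just to make the idea concrete:
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    # P(next word | previous word), estimated from counts
    return bigram_counts[prev][nxt] / sum(bigram_counts[prev].values())

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" twice out of three times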
Some classic NLP problems are Part-of-Speech Tagging (i.e., recognising grammatical components in text) and Named Entity Recognition (i.e., finding proper nouns in text). A useful construct that helps in solving these problems, and that can even be trained without labels, is the Hidden Markov Model. You may have heard of Markov Chains in an undergrad math course; this is similar because it uses the Markov assumption, but it is in a league of its own: it adds hidden variables, called hidden states, that capture structure in the data, and it is trained and decoded with algorithms like the Viterbi algorithm, the Forward algorithm and Expectation Maximisation (additional reading suggested). These algorithms are intimidating to any beginner, so don't worry much about them yet.
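If you want to play with one, NLTK ships an HMM tagger; here is a hedged sketch trained on its bundled sample of the Penn Treebank:
import nltk
from nltk.corpus import treebank
from nltk.tag.hmm import HiddenMarkovModelTrainer

nltk.download("treebank")  # a small tagged sample, needed once
train_sents = treebank.tagged_sents()[:3000]
hmm_tagger = HiddenMarkovModelTrainer().train_supervised(train_sents)
print(hmm_tagger.tag("the cat sat on the mat".split()))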
CBOW and Skip-Gram:
Although word2vec can be implemented quite easily through gensim, it is essential to understand the underlying language model that makes it work. There are two types of word2vec algorithms, and in different situations they give different accuracies. Skip-Gram takes a single word as input and predicts its context; Continuous Bag of Words (CBOW) takes a list of context words as input and predicts the word in between. A sketch of both follows; for a fuller code implementation and a simple theoretical explanation, I feel this GeeksforGeeks article is sufficient:
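A minimal gensim sketch; the sg flag switches between the two objectives (the toy sentences are placeholders for a real corpus):
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-Gram
print(skipgram.wv["cat"][:5])  # first 5 dimensions of the learned vector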
Deep Neural Network Models:
The applications of these models include word generation, Q/A bots, and chatbots. I won't go too deep into them; here is just an overview of some popular architectures:
- 1D Convolution: One-dimensional convolutional networks work surprisingly well on sequence-like data. The network performs the convolution operation to extract important features from a sequence; you should already be well acquainted with this technique from computer vision, and using it here is an unexpected approach that actually works! (A sketch follows.)
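A hedged Keras sketch of a 1-D convolutional text classifier (the sizes and the binary-sentiment head are illustrative):
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),      # word embeddings
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # slide filters along the sequence
    layers.GlobalMaxPooling1D(),                           # keep the strongest response per filter
    layers.Dense(1, activation="sigmoid"),                 # e.g. binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])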
- Sequence to Sequence Model: This is an encoder-decoder structure, which uses stacked Recurrent Neural Networks such as LSTMs or GRUs to learn from sequences (compressing, in a sense encoding, the data) and another decoder RNN to learn from the encoded data. Sometimes we use something called teacher forcing to make the decoder learn from the ground truth at each step (otherwise it learns from its own predictions).
- Attention Models: This is a modification of the seq2seq model; it incorporates an attention layer between the encoder and decoder networks. At each decoding step, the attention layer scores each encoded vector against the decoder's current state (often with dot products, sometimes through a small learned network) and uses the resulting weights to take a weighted sum of the encoded sequence, so the decoder can focus on the most relevant inputs.
- Recursive Neural Tensor Networks: This technique is absolute genius if you think about it. Sentences are essentially trees, where internal nodes are phrases and leaves are words. We feed the tree in as a sequence (via tree traversal algorithms like preorder and postorder) to a custom-made recurrent network with shared weights, so we don't need a separate network built for each sentence (each tree): the same cell is applied recursively, node by node, reusing what it computed at the previous step.
- Memory Networks: These use deep learning techniques to learn from a context of data in order to answer a simple question about that context (for examples, search for the bAbI dataset). These models are best known as question-answering bots because of their superior comprehension skills, although they look less magical once you learn how they work.
4. Mapping of Word Embeddings into 2-D Space
This is the last but not least topic I would like to cover. As a refresher: word embeddings are vectors of numbers that represent words. But why would you want to plot them? The answer is exploratory data analysis (EDA). When you can actually visualise in 2-D space how your word embeddings cluster together, you can understand a lot about your data and design your models accordingly. EDA is key to building models better suited to your data, and for NLP the simplest way to do it is by plotting word embeddings. One more advantage of plotting word vectors is that you can measure the distance between them (commonly Euclidean or cosine distance), which is another way to check whether your embedding is accurate.
To accomplish this task you can use the t-SNE (t-Distributed Stochastic Neighbour Embedding) algorithm, a non-linear dimensionality reduction technique for high-dimensional data (and huge datasets). A sketch is below; the code implementation is also given pretty well in this article:
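A minimal sketch (word_vectors is assumed to be an (n_words, dim) array from one of the embedding methods above):
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
reduced = tsne.fit_transform(word_vectors)  # (n_words, 2) array, ready to scatter-plot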
Alternatively, you can use a simpler dimensionality reduction technique you may have come across in machine learning, called Principal Component Analysis (PCA). This is preferable for smaller datasets. What the algorithm really does is find orthogonal directions (the principal components) along which the data varies the most, and project the data points onto the top n_components of them (a parameter you can control). Implementing this is pretty easy in scikit-learn:
from sklearn.decomposition import PCA
from matplotlib import pyplot

pca = PCA(n_components = 2)
# taking only 2 dimensions to plot
reduced_vectors = pca.fit_transform(word_vectors)
xs, ys = reduced_vectors[:, 0], reduced_vectors[:, 1]

# making a scatter plot
pyplot.scatter(xs, ys)

# annotating the plot so we can make sense of which word is plotted
for j, word in enumerate(word_list):
    pyplot.annotate(word, xy = (xs[j], ys[j]))

pyplot.show()
These days, Google's BERT outperforms most of the algorithms mentioned here, but they are still pretty useful for simpler problems. In the next article I will explain BERT and implement some fine-tuning, so stay tuned. The goal of this article was to provide a kind of sensible syllabus for readers to look up and learn from, so I hope you enjoyed it. On that note, happy learning!