The Ultimate NLP Mind Map to Have

Overview of all the components of NLP Analysis in a nutshell

Seungjun (Josh) Kim
Geek Culture
Oct 31, 2022



Introduction

Natural Language Processing (NLP) is a gigantic field that encompasses many components, from cleaning, lemmatization, and stemming to deep learning models for text classification and State-of-the-Art (SOTA) pre-trained language models for advanced tasks such as summarization, text generation, and sentiment analysis. In this article, a comprehensive mind map of the different components of NLP is introduced for those who want a big-picture view of NLP and pointers on how to study each component further.

Exploratory Data Analysis (EDA)

EDA for text data may be a bit different from that for tabular data. Let us look at some common methods used for exploring text data.

Visualization of the top n-grams

Note that n-grams are sequences of n consecutive words or tokens in a sentence. We can visualize the top k most frequently appearing n-grams using bar graphs. The integer n is chosen by the user depending on how many consecutive words are meaningful in the text data at hand.
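As a rough illustration, here is a minimal sketch of extracting and plotting the top bigrams with scikit-learn and matplotlib; the `texts` list is a toy stand-in for your own corpus of document strings.

```python
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

# 'texts' stands in for your own list of raw document strings
texts = ["the cat sat on the mat", "the dog sat on the log"]

# Count all bigrams (n=2); change ngram_range for other values of n
vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = vectorizer.fit_transform(texts)

# Sum counts over all documents and keep the top k bigrams
freqs = counts.sum(axis=0).A1
top_k = sorted(zip(vectorizer.get_feature_names_out(), freqs),
               key=lambda x: x[1], reverse=True)[:10]

labels, values = zip(*top_k)
plt.barh(labels, values)
plt.xlabel("frequency")
plt.title("Top bigrams")
plt.show()
```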

Word Cloud

A word cloud is, as the name suggests, a cluster of words drawn at different sizes. The size denotes the importance of the word or token: the bigger the size, the more important it is. The importance metric can be specified in various ways, but usually tf-idf is used. There is a Python package called wordcloud that you can install; once installed, you can import and use it as in the sketch below.
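A minimal sketch, assuming the whole corpus has been joined into one string called `text`:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# 'text' is assumed to be the whole corpus joined into one string
text = "natural language processing makes raw text data useful for models"

# By default the cloud is sized by raw frequency; tf-idf weights can be
# supplied instead via generate_from_frequencies with precomputed scores
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```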

[Example word cloud image: source, the author]

Density or Distribution Visualizations of Meta Features

Meta features are hand-crafted features based on descriptive statistics of the text. They include things like average word length, number of punctuation marks, number of stop words, and so on. We can visualize this information using KDE density plots or box plots.
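As a rough sketch of such plots, assuming the documents live in a pandas DataFrame column called `text` and using word count as the example meta feature:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical DataFrame with one document per row
df = pd.DataFrame({"text": ["the cat sat on the mat", "a very short doc"]})

# Example meta feature: number of words per document
df["word_count"] = df["text"].str.split().str.len()

# KDE plot of the distribution of word counts
sns.kdeplot(df["word_count"])
plt.show()

# Box plot of the same feature
sns.boxplot(x=df["word_count"])
plt.show()
```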

Language Breakdowns

Oftentimes, the text data we collect may include multiple languages. It can be interesting to see a breakdown of the proportions of those languages in the corpus, as that may reveal an underlying pattern in the data we collected. A bar graph is enough to visualize this.
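One way to get such a breakdown is with a language detection library such as langdetect (not mentioned in the original text, so this is just one option); a minimal sketch, assuming `texts` is a list of document strings:

```python
from collections import Counter
from langdetect import detect
import matplotlib.pyplot as plt

# 'texts' stands in for your own list of raw document strings
texts = ["this is an english sentence", "ceci est une phrase française"]

# Detect the language of each document and count the occurrences
langs = Counter(detect(t) for t in texts)

plt.bar(langs.keys(), langs.values())
plt.ylabel("number of documents")
plt.title("Language breakdown")
plt.show()
```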

Pre-Processing

Cleaning

We may want to remove punctuation, special characters, hashtags, and stop words (often grabbed from the NLTK package using nltk.corpus.stopwords.words('english')). These are usually not informative and add little to the model's ability to discriminate between classes.

We may also want to expand contractions, acronyms, and informative usernames and hashtags (for data collected from internet platforms).

Some words may need to be grouped into one, especially when they only carry meaning as a group.

Lastly, typos and slang may need to be corrected.
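A minimal cleaning pass along these lines, using NLTK's stop word list; the regular expressions and the contraction handling here are simplified assumptions, not a complete pipeline:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = text.lower()
    # Expand a few contractions; a real pipeline would use a fuller mapping
    text = text.replace("can't", "cannot").replace("n't", " not")
    # Drop hashtags/mentions, then punctuation and special characters
    text = re.sub(r"[#@]\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Remove stop words
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(clean("I can't believe #NLP isn't easy!!!"))
```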

Stemming and Lemmatization

According to the Stanford NLP Group, the goal of stemming and lemmatization is to “reduce inflectional forms and sometimes derivationally related forms of a word to a common base form”. They differ in that stemming is the simpler approach: it usually just strips derivational affixes or word endings, while lemmatization goes further and retrieves the morphological base form of the word, called the “lemma”.

Both are implemented in the nltk.stem module of the NLTK package.
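A small sketch comparing the two with NLTK's PorterStemmer and WordNetLemmatizer (the word list and the verb part-of-speech tag are arbitrary choices for illustration):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "better"]:
    # Stemming chops off affixes; lemmatization maps to the dictionary form
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```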

Feature Extraction and Representation

Text data needs to be transformed into some numerical form before it can be used as input to models. Models cannot understand non-numerical data, which is also why we encode categorical data, for example.

Meta Features

Examples of some meta features, mentioned earlier under the EDA section, are the following:

  • number of words
  • unique word count
  • number of stop words
  • average character count in words
  • average word length
  • punctuation count
  • hashtag count
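A rough sketch of computing a few of these with pandas, assuming the documents sit in a DataFrame column called `text` and using NLTK's stop word list for the stop word count:

```python
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

df = pd.DataFrame({"text": ["The quick, brown fox!", "Another short #example here."]})

df["word_count"] = df["text"].str.split().str.len()
df["unique_word_count"] = df["text"].str.lower().str.split().apply(lambda ws: len(set(ws)))
df["stop_word_count"] = df["text"].str.lower().str.split().apply(
    lambda ws: sum(w in stop_words for w in ws))
df["punctuation_count"] = df["text"].apply(lambda t: sum(c in string.punctuation for c in t))
df["avg_word_length"] = df["text"].str.split().apply(lambda ws: sum(len(w) for w in ws) / len(ws))
df["hashtag_count"] = df["text"].str.count("#")

print(df)
```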

Count Vectors

Tokens can be represented as vectors using their number of occurrences within a document and across the corpus.

  • Word Count: The sklearn package has the CountVectorizer class that you can easily use to transform data into a count vector.
  • TF-IDF: Short for Term Frequency–Inverse Document Frequency, this considers not only the frequency of a word within a document (as the word count above does) but also offsets it by the number of documents in the corpus that contain the word, adjusting for the fact that some words appear more frequently in general. Take a look at this article for further information! A minimal sketch of both vectorizers follows this list.
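A minimal sketch of both, where `texts` is a toy stand-in for your own list of document strings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the cat sat on the mat", "the dog chased the cat"]

# Raw word counts per document
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(texts)

# TF-IDF weighted counts per document
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(texts)

print(count_vec.get_feature_names_out())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```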

Word Embeddings

Word embeddings represent tokens as vectors in a continuous vector space, so each token acquires both direction and magnitude. This makes it possible to reason about concepts like “similarity” and “difference” between words by calculating the distance between their vectors.

  • GloVe: Short for Global Vectors, it comes from Stanford's NLP Group and is trained on aggregated global word-word co-occurrence statistics from a corpus. It showcases interesting linear substructures of the word vectors. More information is available on their website.
  • Word2Vec: Devised by a team led by Tomáš Mikolov in 2013, this algorithm represents each word in vector form. As an extension, algorithms such as Doc2Vec also appeared on the NLP scene owing to the development of Word2Vec.
  • FastText: It is an extension of the Word2Vec algorithm that attempts to address its limitations. One such limitation of Word2Vec is that it cannot handle any words it has not encountered during training, which we refer to as “Out of Vocabulary (OOV)” words. FastText can be implemented using the gensim package.

The following two articles illustrate the key differences between Word2Vec and FastText in great detail, as well as how to implement them in code. Please check them out! (link 1, link 2)
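As a rough sketch, both can be trained with gensim on a toy tokenized corpus; the corpus and hyperparameters below are arbitrary assumptions for illustration:

```python
from gensim.models import Word2Vec, FastText

# Toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["language", "models", "process", "text"],
             ["text", "processing", "with", "embeddings"]]

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Word2Vec fails on words it never saw; FastText can still build a vector
# from the character n-grams of an out-of-vocabulary word
print(w2v.wv.most_similar("language", topn=3))
print(ft.wv["languages"][:5])  # OOV word, still gets a vector
```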

When it comes to visualizing word embeddings, we may want to use algorithms for visualizing high-dimensional data. These methods include t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP).
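A minimal t-SNE sketch with scikit-learn, using a random matrix as a stand-in for real embedding vectors (UMAP works similarly via the umap-learn package):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical embedding matrix: 20 words with 50-dimensional vectors
words = [f"word{i}" for i in range(20)]
vectors = np.random.rand(20, 50)

# Project the embeddings down to 2 dimensions (perplexity must be < n_samples)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```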

NLP Tasks

Now we are ready to dive into some NLP tasks. Some of the most common tasks are the following:

  • Text Classification
  • Topic Modelling
  • Sentiment Analysis
  • Clustering
  • Investigating Concordance
  • Query Answering Model
  • Summarization
  • Text Generation
  • Mask Filling (predicting which word should fill the blank or mask in a sentence)

and so much more.

I will look at only a handful of them!

Topic Modelling

Topic modelling is an NLP task that makes use of unsupervised machine learning algorithms to unearth underlying topic clusters in the text data at hand. The following two algorithms are the most commonly used ones for topic modelling:

  • Latent Dirichlet Allocation (LDA)
  • Non-negative Matrix Factorization (NMF)

LDA was first proposed in the context of population genetics in 2000 and was refined into a machine learning algorithm in 2003 by David Blei, Andrew Ng, and Michael Jordan. NMF is an algorithm that is used not only for topic modelling but for a wide array of applications such as recommendation systems.
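A minimal sketch of both with scikit-learn, assuming `texts` is a list of document strings and that two topics are enough for this toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

texts = ["cats and dogs are pets", "dogs chase cats",
         "stocks and bonds are investments", "investors buy stocks"]

# LDA is usually fit on raw counts
counts = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.fit_transform(texts))

# NMF is usually fit on tf-idf weights
tfidf = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(texts))

# Show the top words per topic for each model
for name, model, vec in [("LDA", lda, counts), ("NMF", nmf, tfidf)]:
    terms = vec.get_feature_names_out()
    for i, topic in enumerate(model.components_):
        top = [terms[j] for j in topic.argsort()[-3:]]
        print(name, "topic", i, ":", top)
```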

Take a look at this wonderful article that walks you through an example of how to use LDA and NMF to topic-model Quora questions.

Text Classification

This is a more traditional task that has been around for quite a while now. Say we need to classify documents of medical text into those that contain depression-related content and those that do not. Which algorithms would you use?

  • Support Vector Machines (SVM): Remember that this algorithm is sensitive to distance and hence requires standardizing the data before fitting the model. SVMs often perform well when you do not have many observations but have relatively high dimensionality. As the data scales, though, both computational efficiency and performance may suffer.
  • Naive Bayes (NB): This is a pretty standard classification algorithm that is worth using to check the baseline performance.
  • Logistic Regression: This is probably the first classification algorithm we learn in every textbook. It is always good to use it to see what the baseline looks like. It is also informative in that it is “explainable”: the user can interpret the coefficients and understand how each feature contributes to the outcome. A baseline sketch follows this list.
  • Recurrent Neural Network (RNN): This is a type of neural network in which connections between nodes can form a cycle, allowing output from some nodes to affect subsequent input to the same nodes. For this reason, it is widely used for sequential, non-tabular data such as text and time series, where earlier elements influence later ones.
  • Long Short-Term Memory (LSTM): Similarly to RNNs, this algorithm is effective for various kinds of unstructured, non-tabular data. It was developed to address the vanishing gradient problem in traditional RNNs.
  • Gated Recurrent Unit (GRU): This is a type of RNN that has been shown to perform well on various text data.
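As the baseline sketch mentioned above, here is a tf-idf plus logistic regression pipeline with scikit-learn; the `texts` and `labels` below are purely hypothetical toy data for the depression example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = depression-related, 0 = not
texts = ["feeling hopeless and sad all the time", "knee surgery recovery tips",
         "persistent low mood and fatigue", "seasonal flu vaccination schedule"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# Vectorize with tf-idf, then fit a logistic regression baseline
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```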

Sentiment Analysis

As the name suggests, this is a pretty straightforward task where you classify pieces of text into sentiment categories. The categories are often binary, but depending on the research question and the objective of the project, people tinker with them and allow a spectrum of sentiments to constitute the categories (e.g. neutral, a bit negative, very negative, a bit positive, and very positive).

There are three open source packages that allow us to perform sentiment analysis without having to build our own sentiment analysis model.

  • TextBlob
  • Vader Sentiment Analysis Tool
  • Stanford NLP Sentiment Analysis Tool
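A minimal sketch of the first two (both are installed separately, e.g. with `pip install textblob vaderSentiment`); the example sentence is arbitrary:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The service was surprisingly good, but the wait was painful."

# TextBlob: polarity in [-1, 1], subjectivity in [0, 1]
blob = TextBlob(text)
print("TextBlob polarity:", blob.sentiment.polarity)

# VADER: compound score in [-1, 1] plus pos/neu/neg proportions
vader = SentimentIntensityAnalyzer()
print("VADER scores:", vader.polarity_scores(text))
```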

Take a look at this article for more knowledge on these packages and their usages!

Keep in mind though that these packages should probably serve as baselines, since they often perform poorly on more specialized datasets (e.g. medical data). This is why we often need to build custom sentiment analysis models trained on the specific dataset we are working with.

Conclusion

In this article, I gave you a comprehensive overview of the entire flow of NLP analysis and what to keep in mind at each stage. I hope this post helps those who are starting out in NLP get a basic big picture of what the landscape looks like, even though many other topics are not covered here.

If you found this post helpful, consider supporting me by signing up on Medium via the following link : )

joshnjuny.medium.com

You will have access to so many useful and interesting articles and posts from not only me but also other authors!

About the Author

Data Scientist. 1st Year PhD student in Informatics at UC Irvine.

Former research area specialist at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan, working on statistical report generation, automated data quality review, building data pipelines, and data standardization & harmonization. Former Data Science Intern at Spotify Inc. (NYC).

He loves sports, working out, cooking good Asian food, watching K-dramas, making and performing music, and most importantly worshiping Jesus Christ, our Lord. Check out his website!
