NLP: Zero To Hero [Part 1: Introduction, BOW, TF-IDF & Word2Vec]

Prateek Gaurav
10 min read · Mar 23, 2023


Link to Part 2 of this article:
NLP: Zero To Hero [Part 2: Vanilla RNN, LSTM, GRU & Bi-Directional LSTM]
Link to Part 3 of this article:
NLP: Zero To Hero [Part 3: Transformer-Based Models & Conclusion]
Link to the Colab File:
https://github.com/PrateekCoder/NLP_Zero_To_Hero

Natural Language Processing (NLP) has become an integral part of various industries, including healthcare, finance, and e-commerce, to name a few. NLP is a subfield of artificial intelligence (AI) that deals with the interaction between computers and humans in natural language. With the rise of big data and the need to extract meaningful insights from unstructured data, NLP has gained a lot of attention in recent years.

In this series of articles, we will take a deep dive into NLP and cover everything from pre-processing tasks to advanced models like recurrent neural networks (RNNs), long short-term memory (LSTM), and transformer-based models, with the focus on practical implementation, evaluation, and comparison of the models. We will build a sentiment analysis model using different NLP techniques and evaluate their performance. Whether you are new to NLP or looking to enhance your skills, this series will provide you with a comprehensive understanding of NLP and equip you with the knowledge to apply it to real-world problems.

Before we dive into the various NLP techniques, it’s important to understand the initial steps involved in NLP tasks. The first step is to acquire the textual data, which can be sourced from APIs, web scraping, or databases, among other methods. Once we have the data, we need to perform text pre-processing to ensure that the data is clean and ready for analysis. In this article, we will cover various pre-processing techniques, including tokenization, stop-word removal, stemming, and lemmatization. Once the data is pre-processed, we will compare and contrast several vectorization methods, such as Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec. Finally, we will delve into several advanced NLP models, such as RNNs, LSTMs, GRUs, and transformers, and build a sentiment analysis model using these techniques. By the end of this series, you’ll have a solid understanding of NLP and the tools to create robust NLP models for real-world applications.

Let’s compare each model/technique with its predecessor, highlighting their advantages and limitations:

  1. Bag of Words (BoW) vs. TF-IDF: BoW simply counts the occurrences of words in a document, whereas TF-IDF assigns a weight to each word based on its importance in the document and the entire corpus. TF-IDF overcomes the limitation of BoW, which is assigning equal importance to all words. By giving higher weight to rare words, TF-IDF can better capture the meaning of a document. However, both methods ignore word order and context, resulting in a loss of semantic information.
  2. Word2Vec vs. BoW and TF-IDF: Word2Vec is a neural network-based technique that learns continuous word embeddings, capturing the semantic relationships between words. It overcomes the limitations of BoW and TF-IDF by preserving contextual information and representing words in a dense vector space. Word2Vec embeddings can capture semantic relationships like synonyms, antonyms, and analogies. However, Word2Vec has its limitations: it does not handle out-of-vocabulary (OOV) words well, and it cannot capture different meanings of a word based on context (polysemy).
  3. RNN (including LSTM and GRU) vs. Word2Vec: While Word2Vec learns word representations, RNNs (Recurrent Neural Networks) are used for modeling sequences of data, including text. RNNs can process input sequences of varying lengths, maintaining a hidden state that captures information from previous time steps. Compared to Word2Vec, RNNs can model the temporal dependencies in text, making them suitable for tasks like sentiment analysis or machine translation.
  4. LSTM and GRU vs. Vanilla RNN: Vanilla RNNs suffer from the vanishing gradient problem, which makes it difficult for them to learn long-range dependencies. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are RNN variants designed to overcome this limitation by incorporating gating mechanisms that allow them to capture long-term dependencies more effectively. Still, RNNs (including LSTMs and GRUs) can be computationally expensive, especially for long sequences.
  5. Bi-directional LSTM vs. LSTM: Bi-directional LSTMs are an extension of LSTMs that process the input sequence in both forward and backward directions. This allows the model to capture information from both past and future contexts, often leading to better performance in tasks like named entity recognition and sentiment analysis. However, bi-directional LSTMs are more computationally expensive than regular LSTMs due to the additional backward pass.
  6. Transformer vs. RNN (LSTM, GRU): Transformers are a type of neural network architecture that relies on self-attention mechanisms to model the dependencies between words in a sequence. Unlike RNNs, transformers can process input sequences in parallel, making them more computationally efficient, especially for long sequences. Transformers can also capture long-range dependencies more effectively, as they do not have the same constraints as RNNs regarding sequence length. However, transformers can be memory-intensive due to the self-attention mechanism, and they often require large amounts of training data to perform well.

By learning these models and techniques, you’ll gain a solid understanding of various NLP approaches and their strengths and weaknesses, which will enable you to choose the best method for a given task.

Step 01: Business Problem

For this article, our business problem is to classify the sentiments of IMDB reviews.

Step 02: Collecting the Data

We will be getting our data from Kaggle. The dataset I am using is in CSV format and has already been split into train, test, and validation sets. Alternatively, you can take a single dataset and split it yourself with train_test_split. Here is the link to the dataset:
https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format
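Loading the three CSV files is a one-liner each with pandas. A minimal sketch follows; the exact file names inside the Kaggle download are an assumption, so check them after extracting the archive.

```python
import pandas as pd

# File names are assumptions -- verify against the extracted Kaggle archive.
train_df = pd.read_csv("Train.csv")
valid_df = pd.read_csv("Valid.csv")
test_df = pd.read_csv("Test.csv")

print(train_df.shape, valid_df.shape, test_df.shape)
print(train_df.head())  # expect a review text column and a 0/1 sentiment label
```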

Step 03: Text Pre-Processing

Text pre-processing is an essential step in natural language processing (NLP) that involves cleaning and preparing textual data before it can be used for analysis or modeling. Text pre-processing typically involves several steps, which may vary depending on the specific task and dataset.

  1. Special characters removal: The first step in text pre-processing is typically to remove any special characters, punctuation, or other non-alphanumeric characters from the text. This step is important to ensure that the text is clean and standardized, which can make it easier to analyze and model.
  2. Lowercasing: After removing special characters, the next step is often to convert all the text to lowercase. This is done to ensure that words with different capitalization are treated as the same, which can improve the accuracy of downstream tasks like classification or clustering.
  3. Tokenization: Once the text has been cleaned and standardized, the next step is to split it into individual tokens or words. This is typically done using a tokenizer, which can split the text based on spaces, punctuation, or other delimiters.
  4. Stop words removal: Stop words are common words that do not carry much meaning in a sentence, such as “the”, “a”, or “an”. Removing stop words can reduce the dimensionality of the text data and improve the performance of downstream tasks like classification or clustering.
  5. Stemming or lemmatization: The final step in text pre-processing is typically to apply a stemming or lemmatization algorithm to reduce words to their base form. Stemming involves removing the suffixes from words to create a stem, while lemmatization involves mapping words to their root form based on their part of speech. The goal of stemming or lemmatization is to reduce the variability in the text data and improve the accuracy of downstream tasks.

When it comes to choosing between stemming and lemmatization, there are a few factors to consider. Stemming is typically faster and more computationally efficient than lemmatization, but it can sometimes produce less accurate results since it does not take into account the context of the word. Lemmatization, on the other hand, is slower and more computationally intensive, but it can produce more accurate results since it maps words to their dictionary form.
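To make the difference concrete, here is a minimal NLTK illustration (assuming the WordNet corpus can be downloaded) comparing a stemmer and a lemmatizer on a few words:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet corpus
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "caring", "mice"]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word)}")

# Stemming chops suffixes ("studies" -> "studi"), while lemmatization maps words
# to a dictionary form ("studies" -> "study", "mice" -> "mouse").
```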

Overall, the choice between stemming and lemmatization depends on the specific task and dataset, and there is no one-size-fits-all approach. It’s often a good idea to experiment with both techniques and evaluate their performance on your specific task before making a final decision.

The full code to perform text pre-processing on our IMDB dataset is in the linked Colab file; a condensed sketch of the same steps follows.
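The sketch below strings the five steps together with NLTK. The DataFrame and column names (`train_df`, `text`, `clean_text`) are placeholders, and the exact choices in the Colab file may differ.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ["stopwords", "punkt", "wordnet", "omw-1.4"]:
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-zA-Z\s]", " ", text)             # 1. remove special characters
    text = text.lower()                                  # 2. lowercase
    tokens = nltk.word_tokenize(text)                    # 3. tokenize
    tokens = [t for t in tokens if t not in stop_words]  # 4. drop stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # 5. lemmatize (or swap in a stemmer)
    return " ".join(tokens)

# Hypothetical DataFrame / column names -- adjust to the actual CSV schema.
train_df["clean_text"] = train_df["text"].apply(preprocess)
```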

Step 04: Vectorization or Feature Extraction

Vectorization, also known as feature extraction, is a critical step in natural language processing (NLP) that involves converting textual data into numerical vectors that can be used by machine learning algorithms. There are several popular methods for vectorizing textual data, including bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and word2vec.

BoW is a simple and widely used method for vectorizing text data, which involves creating a dictionary of all the unique words in a corpus and counting the number of times each word appears in a document. This results in a vector representation of the document, where each element represents the count of a particular word. BoW is often used in text classification tasks and can be a good starting point for vectorizing text data.
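A minimal BoW example with scikit-learn's CountVectorizer; the tiny corpus here is purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was great", "the movie was terrible", "great acting"]

bow = CountVectorizer()             # builds a vocabulary over the whole corpus
X = bow.fit_transform(corpus)       # sparse matrix: one row per document, one column per word

print(bow.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                  # raw counts of each word in each document
```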

TF-IDF is a more advanced method for vectorizing text data that considers the importance of each word in a document and across a corpus. TF-IDF involves computing a weight for each word in a document based on its frequency in the document and its rarity across the corpus. This can help to give more weight to words that are important for distinguishing between different documents and can improve the accuracy of downstream tasks like clustering or retrieval.
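A comparable TF-IDF sketch with scikit-learn's TfidfVectorizer; `train_texts` and `test_texts` are placeholder lists of pre-processed review strings, and the `max_features` cap is an arbitrary choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)   # 5000 is an arbitrary vocabulary cap
X_train = tfidf.fit_transform(train_texts)   # learn vocabulary and IDF weights on the training set
X_test = tfidf.transform(test_texts)         # reuse them on the test set -- never re-fit here
```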

Word2vec is a neural network-based method for vectorizing text data that learns dense, low-dimensional embeddings of words based on their co-occurrence patterns in a corpus. Word2vec can be trained using either the continuous bag-of-words (CBOW) or skip-gram architectures, which involve predicting the context words given a target word or vice versa. Word2vec can be used to capture semantic relationships between words, such as synonyms or analogies, and can be a powerful tool for tasks like sentiment analysis or information retrieval.
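Training a custom Word2Vec model is straightforward with gensim (the 4.x API is assumed below); `train_texts` is again a placeholder list of cleaned reviews:

```python
from gensim.models import Word2Vec

# Each document becomes a list of tokens.
sentences = [doc.split() for doc in train_texts]

w2v = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=2,      # ignore very rare words
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
)

# Nearest neighbours in the embedding space (the query word must be in the vocabulary).
print(w2v.wv.most_similar("good", topn=5))
```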

Overall, the choice of vectorization method depends on the specific task and dataset, and there is no one-size-fits-all approach. It’s often a good idea to experiment with multiple methods and evaluate their performance on your specific task before making a final decision.

Step 05: Building Models For Sentiment Analysis

SVM with Bag Of Words Vectorizer


Accuracy of SVM and BOW Vectorizer Model: 88.04%

The whole process, from vectorization to fitting and predicting, took about 50 minutes but gave a very good accuracy of 88%.
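For reference, a pipeline along these lines looks roughly as follows. The vectorizer settings, DataFrame/column names, and SVM variant are assumptions and may differ from the Colab file (a kernel SVC would explain the long runtime, while `LinearSVC` below is the faster choice for sparse text features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# train_df / test_df with "clean_text" and "label" columns are placeholders.
vectorizer = CountVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(train_df["clean_text"])
X_test = vectorizer.transform(test_df["clean_text"])

clf = LinearSVC()                     # linear SVM; swap in TfidfVectorizer above for the next model
clf.fit(X_train, train_df["label"])

preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(test_df["label"], preds))
```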

SVM with TF-IDF Vectorizer


Accuracy of SVM and TF-IDF Vectorizer Model: 90.04%

Using SVM and TF-IDF, the process again took over an hour for feature extraction, model fitting, and prediction, but the accuracy was really good at 90%.

SVM with Custom Word2Vec Vectorizer


Accuracy of SVM and Custom Word2Vec Vectorizer Model: 49.92%

Here I tried to build a custom Word2Vec model. Word2Vec is a neural network-based model, and training one from scratch requires a lot of data, compute resources, and experimentation with hyperparameters, which likely explains the near-chance accuracy here. If you want to improve this model, you can try a few things (a sketch of one common way to turn word vectors into document features for the SVM appears after this list):

  1. Increase the vector size: By increasing the vector size to a larger number (e.g., 300 or 500), you can capture more information about each word and potentially improve the quality of your embeddings.
  2. Use a pre-trained word2vec model: Instead of training your own word2vec model from scratch, you can use a pre-trained model that has been trained on a large corpus of text, such as Google’s pre-trained word2vec model or Stanford’s GloVe model. These pre-trained models have been shown to produce high-quality embeddings and can be fine-tuned for your specific task.
  3. Use a better classification model: SVM may not be the best choice for sentiment analysis, especially when working with high-dimensional embeddings. You can try using other classifiers such as logistic regression, random forests, or neural networks.
  4. Use a larger dataset: If your dataset is small, it may not be representative of the full range of sentiment expressions that you are interested in. Using a larger dataset can help improve the quality of your model.
  5. Preprocess your data: While using raw text data may work well for RNN models, preprocessing your data (e.g., removing stop words, stemming/lemmatizing, etc.) may still improve the quality of your word2vec embeddings and your sentiment analysis model.
  6. Use cross-validation: To get a better estimate of the true accuracy of your model, you can use k-fold cross-validation instead of just training on a single train-test split. This can help you better evaluate your model’s performance and tune your hyperparameters.
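For context, a common way to feed Word2Vec embeddings to an SVM is to average the word vectors of each review into a single document vector. A minimal sketch, reusing the `w2v` model and the placeholder `train_texts` list from the earlier sketches:

```python
import numpy as np

def document_vector(tokens, w2v_model):
    """Average the Word2Vec vectors of all in-vocabulary tokens in one review."""
    vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    if not vectors:                                 # no known words: fall back to zeros
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)

X_train = np.vstack([document_vector(doc.split(), w2v) for doc in train_texts])
```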

Or, you can use pre-trained embeddings like GloVe or the Google News 300 Word2Vec model. I am going to try again with the Google News Word2Vec model.
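Loading the pre-trained vectors is straightforward with gensim's downloader API (model name `word2vec-google-news-300`; roughly a 1.6 GB download the first time). The same averaging idea applies, and `train_texts` is again a placeholder:

```python
import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")   # returns 300-dimensional KeyedVectors

def document_vector(tokens):
    """Average the pre-trained vectors of all in-vocabulary tokens in one review."""
    vectors = [kv[t] for t in tokens if t in kv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(kv.vector_size)

X_train = np.vstack([document_vector(doc.split()) for doc in train_texts])
```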

SVM with Google News 300 Word2Vec Vectorizer


Accuracy of SVM and Google News 300 Word2Vec Vectorizer Model: 85.74%

As you can see, when we used a pre-trained Word2Vec model, our accuracy improved from ~50% to ~86%, which is a drastic improvement, but contrary to my expectation it is still not better than SVM with BoW or TF-IDF.

This basically means we should not always reach for the most complex model available; it is often better to start with simple models.

In the next parts of this article, I will build sentiment analysis models using recurrent neural networks (Vanilla RNN, LSTM, GRU, and Bi-Directional LSTM) and transformer-based models (DistilBERT and RoBERTa) and compare their performance.

Here is the link to NLP: Zero To Hero [Part 2: Vanilla RNN, LSTM, GRU & Bi-Directional LSTM].


Prateek Gaurav

Sr. DS Manager @ LGE | Ex - Amazon Data Scientist | Data Science Mentor | Boston University Graduate www.letsdatascience.com