Natural Language Processing Series Part 1: Fundamentals of NLP (Text Preprocessing, Tokenization, and Vectorization)

Rohollah
20 min read · Apr 14, 2024


Welcome to my Natural Language Processing (NLP) tutorial series. My goal is to provide a comprehensive, yet simple and concise course covering NLP topics from beginner to advanced levels. Let’s start with some introductory topics that are essential for the rest of the series.

Introduction to Natural Language Processing

In this story, I will discuss the following topics:

  • Introduction to NLP
  • Text tokenization
  • Stop word removal using NLTK and spaCy
  • Punctuation removal using NLTK and spaCy
  • Text vectorization techniques: BOW, TF-IDF, and Word2Vec
  • Stemming and lemmatizing text

What is NLP?

Natural Language Processing (NLP) helps computers understand human language to perform useful tasks.

Here are three examples:

  • Recommendation Systems: NLP suggests movies you might like based on your previous interests.
  • Machine Translation: It translates text from one language to another, like Google Translate.
  • Spam Detection: NLP helps email systems filter out unwanted spam messages.

What is a document?

In the context of Natural Language Processing (NLP), a document (doc) can be any piece of text, such as a text file, a PDF file, a single paragraph, a sentence, or even a single word.

Reading Sample Text File

For the next steps, we need some text to work with. Let’s use a sample text file to get started on the basics. In the code below, we read text from a text file and store it in a Python list named lines.

# Define the file path for the text document.
file_path = 'sample_text.txt'

# Open and read the file from 'file_path' in read mode.
with open(file_path, 'r') as file:
    lines = file.read()  # Store the content of the file in 'lines'. At this point, 'lines' is a single string.

# Split the content into individual lines using the newline character '\n' as the separator.
lines = lines.split('\n')
# Now 'lines' is a list of strings, one per line.

# Loop through each line, printing its index and content. We use 'enumerate' because we need both the index and the text.
for i, line in enumerate(lines):
    print(f'{i+1}:{line} \n')

Here is the content of the text file, as printed by the code above:

# 1:Natural language processing is a fascinating field of artificial intelligence. 

# 2:It involves analyzing and understanding human language.

# 3:This can help computers perform a variety of tasks.

# 4:For example, translating languages automatically is one major application.

# 5:Another is creating chatbots that can converse with humans.

# 6:Text summarization and sentiment analysis are also important.

# 7:These technologies are used in many applications today.

# 8:Such as automated customer support, content recommendations, and more.

# 9:Each application brings unique challenges and opportunities.

# 10:The potential for NLP to impact our everyday lives is immense.

What is a token?

After we split the text into words or letters, each one is called a “token.”

Until now, we have discussed the first topic, ‘Introduction to NLP.’ Let’s move on to the second section, tokenization.

Splitting text

There are three methods for splitting text:

  • split() function
  • word_tokenize
  • spacy_tokenizer

split() function

The first method is to use the ‘split()’ function, as shown in the code below. This code splits the first line of the text into words, and then into characters:

# Split the first line into words.
words = lines[0].split()
print(f'words of first line are \n:{words}')
# Convert the first line into a list of characters.
chars = list(lines[0])
print(f'The characters of first line are \n {chars}')

The result is shown below:

words of the first line are 
:['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence.']
Characters of the first line are
['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'a', 's', 'c', 'i', 'n', 'a', 't', 'i', 'n', 'g', ' ', 'f', 'i', 'e', 'l', 'd', ' ', 'o', 'f', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', '.']

word_tokenize

The second method for splitting text is to use the word_tokenize function from the NLTK library:

# Import nltk library.
import nltk
# Import the word_tokenize function from nltk.tokenize module.
from nltk.tokenize import word_tokenize
# Download the 'punkt' package, necessary for the word_tokenize function.
nltk.download('punkt')
# Tokenize the first line of the text into words using NLTK's word_tokenize.
nltk_words = word_tokenize(lines[0])
# Print the list of words from the first line.
print(f'The words of first line are : \n {nltk_words}')
# Convert the first line into a list of its characters.
chars = list(lines[0])
# Print the list of characters from the first line.
print(f'The characters of first line are \n {chars}')

The result is shown below:

The words of first line are : 
['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence', '.']
The characters of first line are:
['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'a', 's', 'c', 'i', 'n', 'a', 't', 'i', 'n', 'g', ' ', 'f', 'i', 'e', 'l', 'd', ' ', 'o', 'f', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', '.']

Notice the token ‘intelligence.’ at the end: unlike the split() function, NLTK’s tokenizer splits it into two tokens, ‘intelligence’ and the ‘.’ after it.

Comparing the NLTK Word Tokenizer and the `split()` Function

Consider the sentence: “Hello, world! How are you?” When using the split() function, the result is ['Hello,', 'world!', 'How', 'are', 'you?']. However, when using NLTK's word_tokenize, the result is ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']. This demonstrates how NLTK separates punctuation from words, unlike the split() function.
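Here is a minimal sketch of that comparison (assuming the punkt data downloaded earlier):

# Compare Python's split() with NLTK's word_tokenize on the same sentence.
from nltk.tokenize import word_tokenize
sentence = "Hello, world! How are you?"
# split() keeps punctuation attached to the neighboring word.
print(sentence.split())         # ['Hello,', 'world!', 'How', 'are', 'you?']
# word_tokenize separates punctuation into its own tokens.
print(word_tokenize(sentence))  # ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']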

spacy_tokenizer

The third method for splitting text is to use the spaCy library: we load a language model and use it as a tokenizer (stored here in spacy_tokenizer):

# Import the spaCy library.
import spacy
# Load the English language small model into spacy_tokenizer.
spacy_tokenizer = spacy.load('en_core_web_sm')
# Tokenize the first line of text using the loaded spaCy model.
spacy_words = spacy_tokenizer(lines[0])
# Extract the text of each tokenized word into a list.
spacy_tokens= [token.text for token in spacy_words]
# Print the list of words from the first line.
print(f'The words of first line are : \n {spacy_tokens}')
# Convert the first line into a list of its characters.
chars = list(lines[0])
# Print the list of characters from the first line.
print(f'The characters of first line are: \n {chars}')

The result is shown below:

The words of first line are : 
['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence', '.']
The characters of first line are:
['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'a', 's', 'c', 'i', 'n', 'a', 't', 'i', 'n', 'g', ' ', 'f', 'i', 'e', 'l', 'd', ' ', 'o', 'f', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', '.']

Until now, we have discussed the second topic, ‘tokenizing (splitting) text into tokens.’ Let’s move on to the third section.

Removing Stop Words from the Text

What are stop words?

Stop words are common words like “and,” “the,” and “is” that are often removed from text to help focus on the more important words.

Removing Stop Words using NLTK

# Import the stopwords list from the NLTK library.
from nltk.corpus import stopwords as nltk_stopwords
# Download the list of stopwords.
nltk.download('stopwords')
# Create a set of English stopwords for fast lookup.
nltk_stop_words = set(nltk_stopwords.words('english'))
# Filter out stopwords from the list of tokenized words.
filtered_nltk_tokens = [word for word in nltk_words if word.lower() not in nltk_stop_words]
# Print the words before and after removing stopwords to compare.
print(f'Words from first line before removing stopwords are : \n {words} \n and after removing stopwords with NLTK are: \n {filtered_nltk_tokens}')

The result is shown below:

Words from first line before removing stopwords are : 
['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence.']
and after removing stopwords with NLTK are:
['Natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence', '.']

As you can see, words (tokens) such as “is,” “a,” and “of” have been removed.

Removing Stop Words using spaCy

# Filter out stopwords from the spaCy tokenized words.
filtered_spacy_tokens = [token.text for token in spacy_words if not token.is_stop]
# Print the words before and after removing stopwords to show the effect of the filtering.
print(f'Words from first line before removing stop words are : \n {spacy_tokens} \n and after removing stopwords with spaCy: \n {filtered_spacy_tokens}')

The result is shown below:

Words from first line before removing stop words are : 
['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence', '.']
and after removing stopwords with spaCy:
['Natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence', '.']

What is the difference between NLTK and spaCy with respect to stop word removal?

In the example above, you might not see any difference because the text is simple. However, generally speaking, the differences between them are as follows:

NLTK provides a ready-made, language-specific list of stop words that you filter against yourself, which is simple and predictable. spaCy marks each token with an is_stop flag based on the stop word list of the loaded language model, and that list can be customized to your text’s needs, making it more adaptable for different situations.
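As a small sketch of how each list can be adjusted (the added and removed words here are only examples, not part of this tutorial’s pipeline):

# NLTK's stop word list is a plain Python set, so it can be extended or shrunk.
custom_stop_words = set(nltk_stopwords.words('english'))
custom_stop_words.add('example')    # also treat 'example' as a stop word
custom_stop_words.discard('not')    # keep 'not', e.g. for sentiment tasks

# spaCy's stop words live on the loaded language model and can be changed too.
spacy_tokenizer.Defaults.stop_words.add('example')
spacy_tokenizer.vocab['example'].is_stop = True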

Until now, we have discussed the third topic, ‘Removing Stop Words.’ Let’s move on to the fourth section.

Removing Punctuation

What are punctuation marks?
Punctuation marks are special symbols that help clarify how sentences should be read and understood. Examples include the period (.), comma (,), question mark (?), and exclamation mark (!). They organize and give meaning to the words in writing.

Removing Punctuation using Python's string Module

import string

# Create a list without punctuation from nltk_words.
tokens_no_punc = [word for word in nltk_words if word not in string.punctuation]

# Display the words before and after punctuation removal.
print(f'Tokens before removing punctuation are \n {nltk_words} \n and Tokens after removing punctuation are : \n {tokens_no_punc}')

Removing Punctuation using the spaCy Library

# Create a list of words without punctuation using spaCy.
tokens_no_punc = [token.text for token in spacy_words if not token.is_punct]

# Print the list of words without punctuation.
print("Tokens without punctuation:", tokens_no_punc)

Until now, we have discussed the fourth topic, ‘Removing Punctuation.’ Let’s move on to the fifth section.

Text Vectorization

What is vectorization?

Vectorization in NLP means turning words into numbers so that computers can work with them. For example, the sentence “I am a student” can be mapped to something like [1, 2, 3, 4], where each number represents a different word.
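As a tiny illustration of this idea (a hypothetical sketch, not a specific library method):

# Give each unique word an integer id, then replace the words with their ids.
sentence = "I am a student"
vocabulary = {word: idx + 1 for idx, word in enumerate(sentence.split())}
encoded = [vocabulary[word] for word in sentence.split()]
print(vocabulary)  # {'I': 1, 'am': 2, 'a': 3, 'student': 4}
print(encoded)     # [1, 2, 3, 4]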

Vectorizing text by BOW

What is BOW?

The Bag of Words (BOW) method counts how many times each word appears in a text and uses these counts to represent the content.

Let’s look at an example: Suppose we have these two sentences:

"The dog barks and the dog plays."
"The cat sleeps and the cat jumps, and also the cat eats better than the dog."

To convert these sentences to vectors using the BOW approach, we first need to create a list of unique words. Then, we assign each word in each sentence a number indicating the frequency of that word in the sentence.

The unique words are:

the, dog, barks, and, plays, cat, sleeps, jumps, also, eats, better, than

Now, for each sentence, we place under each unique word the number of times that word appears in the sentence. For example, the word ‘dog’ appears twice in the first sentence and once in the second sentence. Therefore, under the word ‘dog,’ we place the number 2 for the first sentence and the number 1 for the second sentence.

In the table below, you can see the final result.

Frequency Distribution of Words: Bag of Words Vectorization Example

Word       | the | dog | barks | and | plays | cat | sleeps | jumps | also | eats | better | than
Sentence 1 |  2  |  2  |   1   |  1  |   1   |  0  |   0    |   0   |  0   |  0   |   0    |  0
Sentence 2 |  4  |  1  |   0   |  2  |   0   |  3  |   1    |   1   |  1   |  1   |   1    |  1

Implementing BOW in Python

To implement Bag of Words (BOW), you can use the `CountVectorizer` from the `sklearn` library.

# Import the CountVectorizer class for text vectorization.
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer (no stop word removal here, so common words like 'and' and 'is' stay in the vocabulary).
vectorizer = CountVectorizer()
# Fit the vectorizer to the text data and transform it into a matrix.
X = vectorizer.fit_transform(lines)
# Convert the matrix to a dense NumPy array for easy manipulation.
bag_of_words = X.toarray()
# Retrieve the feature names (unique words) from the vectorizer.
feature_names = vectorizer.get_feature_names_out()
# Print all unique words identified by the vectorizer.
print(f"Feature names(All unique words) are: \n {feature_names}")
# Print the entire bag-of-words matrix.
print(f"\nBag-of-words matrix: \n {bag_of_words}")
# Create and print a dictionary for the first line with words and their respective counts.
print(f"\nBag-of-words for first line is : {dict(zip(feature_names, bag_of_words[0]))}")

The result is shown below:

Feature names(All unique words) are: 
['also' 'analysis' 'analyzing' 'and' 'another' 'application'
'applications' 'are' 'artificial' 'as' 'automated' 'automatically'
'brings' 'can' 'challenges' 'chatbots' 'computers' 'content' 'converse'
'creating' 'customer' 'each' 'everyday' 'example' 'fascinating' 'field'
'for' 'help' 'human' 'humans' 'immense' 'impact' 'important' 'in'
'intelligence' 'involves' 'is' 'it' 'language' 'languages' 'lives'
'major' 'many' 'more' 'natural' 'nlp' 'of' 'one' 'opportunities' 'our'
'perform' 'potential' 'processing' 'recommendations' 'sentiment' 'such'
'summarization' 'support' 'tasks' 'technologies' 'text' 'that' 'the'
'these' 'this' 'to' 'today' 'translating' 'understanding' 'unique' 'used'
'variety' 'with']

Bag-of-words matrix:
[[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0
1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0]
[0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1]
[1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0
0]
[0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0
1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
0]]

Bag-of-words for first line is : {'also': 0, 'analysis': 0, 'analyzing': 0, 'and': 0, 'another': 0, 'application': 0, 'applications': 0, 'are': 0, 'artificial': 1, 'as': 0, 'automated': 0, 'automatically': 0, 'brings': 0, 'can': 0, 'challenges': 0, 'chatbots': 0, 'computers': 0, 'content': 0, 'converse': 0, 'creating': 0, 'customer': 0, 'each': 0, 'everyday': 0, 'example': 0, 'fascinating': 1, 'field': 1, 'for': 0, 'help': 0, 'human': 0, 'humans': 0, 'immense': 0, 'impact': 0, 'important': 0, 'in': 0, 'intelligence': 1, 'involves': 0, 'is': 1, 'it': 0, 'language': 1, 'languages': 0, 'lives': 0, 'major': 0, 'many': 0, 'more': 0, 'natural': 1, 'nlp': 0, 'of': 1, 'one': 0, 'opportunities': 0, 'our': 0, 'perform': 0, 'potential': 0, 'processing': 1, 'recommendations': 0, 'sentiment': 0, 'such': 0, 'summarization': 0, 'support': 0, 'tasks': 0, 'technologies': 0, 'text': 0, 'that': 0, 'the': 0, 'these': 0, 'this': 0, 'to': 0, 'today': 0, 'translating': 0, 'understanding': 0, 'unique': 0, 'used': 0, 'variety': 0, 'with': 0}

Important note:

The CountVectorizer performs various preprocessing steps on the text, such as tokenizing, lowercasing, and removing punctuation, before converting it into a matrix of token counts. So it expects raw text strings as input (one string per document), not text that has already been tokenized into lists of words. Additionally, stop word removal can be enabled by setting the stop_words parameter; it was left off above, which is why words like ‘and’ and ‘is’ appear in the vocabulary.
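For example, one quick way to see the effect of that parameter (a small sketch reusing the lines list from above) is to compare the vocabulary size with and without it:

# Compare the vocabulary with and without built-in English stop word removal.
vocab_all = CountVectorizer().fit(lines).get_feature_names_out()
vocab_no_stop = CountVectorizer(stop_words='english').fit(lines).get_feature_names_out()
print(f'{len(vocab_all)} words without stop word removal, {len(vocab_no_stop)} with it')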

Text Vectorization Using TFIDF

What is TFIDF?

TFIDF is a complex concept that typically requires a full article to explain, but I’ll provide a brief overview using an example.
Suppose we have 100 documents. If the word “student” appears 5 times in document number 3, and is present in 13 documents in total (including document number 3), the TFIDF value for “student” in document 3 can be calculated as follows:

Term Frequency (TF):
Calculate how frequently the word appears in the document. Since “student” appears 5 times in document number 3, its term frequency is 5 divided by the total number of words in document 3, which is 500. This can be represented as :

TF = (number of times “student” appears in document 3) / (total number of words in document 3)

So the TF value equals 5 / 500 = 0.01.

Inverse Document Frequency (IDF):
This measures how common or rare the word is across all documents. It can be calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word “student”. In mathematical terms:

IDF = log(total number of documents / number of documents containing “student”)

So the IDF value equals log(100 / 13) ≈ 2.04 (using the natural logarithm).

Calculating TFIDF

Multiply TF and IDF to get the TFIDF value.

TFIDF = TF × IDF

So the TFIDF value of the word “student” in document 3 equals 0.01 × 2.04 ≈ 0.02.

This value indicates the importance of the word “student” specifically in document 3 relative to the whole corpus of documents.
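To double-check the arithmetic above, here is a small sketch using the plain textbook formulas (note that sklearn’s TfidfVectorizer uses a smoothed IDF and normalization, so its scores will not match these numbers exactly):

import math

count_in_doc = 5        # occurrences of "student" in document 3
words_in_doc = 500      # total words in document 3
total_docs = 100
docs_with_term = 13

tf = count_in_doc / words_in_doc             # 0.01
idf = math.log(total_docs / docs_with_term)  # ~2.04 (natural logarithm)
print(tf, idf, tf * idf)                     # TF-IDF is ~0.02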

Sparse and Dense Matrix

Before continuing to implement TF-IDF in Python, you need to understand the concepts of sparse and dense matrices.

A sparse matrix is mostly empty, with many zeros, meaning it only stores a few values. It’s efficient for saving space when dealing with large datasets where most elements are zeros. On the other hand, a dense matrix is fully populated, storing all values explicitly, including zeros. This makes dense matrices easier to work with for calculations, but they can take up much more space, especially if they contain many zeros.
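As a small, self-contained illustration (scipy’s sparse matrices are also what sklearn’s vectorizers return):

import numpy as np
from scipy.sparse import csr_matrix

# A dense matrix stores every value, including all the zeros.
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0]])

# A sparse matrix stores only the non-zero entries and their positions.
sparse = csr_matrix(dense)
print(sparse)            # shows only the stored (row, column) -> value entries
print(sparse.toarray())  # converts back to the full dense form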

Implementing TFIDF in Python

The code below uses `TfidfVectorizer` from `sklearn` to convert text data into a TF-IDF matrix, which reflects the importance of words in the documents. It then retrieves the unique words (features), converts the matrix to a dense array, and prints the unique words, the full matrix, and the TF-IDF scores for the first document.

Important note about TfidfVectorizer:

The TfidfVectorizer performs several preprocessing steps on the text, such as tokenizing, lowercasing, and removing punctuation, before vectorizing it. Therefore, it requires raw text strings as input (one string per document), not text that has already been tokenized into lists of words. As with CountVectorizer, stop word removal is optional and controlled by the stop_words parameter.

# Import the TfidfVectorizer class for text vectorization using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the TfidfVectorizer (again without stop word removal, so the vocabulary matches the output below).
tfidf_vectorizer = TfidfVectorizer()
# Fit the vectorizer to the text data and transform it into a TF-IDF matrix.
tfidf_matrix = tfidf_vectorizer.fit_transform(lines)
# Retrieve the feature names (unique words) from the vectorizer.
feature_names = tfidf_vectorizer.get_feature_names_out()
# Convert the sparse TF-IDF matrix to a dense array format.
dense_tfidf_matrix = tfidf_matrix.toarray()
# Print all unique words identified by the vectorizer.
print(f"Feature names: \n {feature_names}")
# Print the entire TF-IDF matrix.
print(f"\nTF-IDF matrix: \n {dense_tfidf_matrix}")
# Create and print a dictionary for the first line with words and their respective TF-IDF scores.
print(f"\nTF-IDF vector for first line : {dict(zip(feature_names, dense_tfidf_matrix[0]))}")

The result is displayed below. (For brevity, only a portion of the output is shown: the list of unique words and the TF-IDF vectors for the first two lines.)


Feature names:
['also' 'analysis' 'analyzing' 'and' 'another' 'application'
'applications' 'are' 'artificial' 'as' 'automated' 'automatically'
'brings' 'can' 'challenges' 'chatbots' 'computers' 'content' 'converse'
'creating' 'customer' 'each' 'everyday' 'example' 'fascinating' 'field'
'for' 'help' 'human' 'humans' 'immense' 'impact' 'important' 'in'
'intelligence' 'involves' 'is' 'it' 'language' 'languages' 'lives'
'major' 'many' 'more' 'natural' 'nlp' 'of' 'one' 'opportunities' 'our'
'perform' 'potential' 'processing' 'recommendations' 'sentiment' 'such'
'summarization' 'support' 'tasks' 'technologies' 'text' 'that' 'the'
'these' 'this' 'to' 'today' 'translating' 'understanding' 'unique' 'used'
'variety' 'with']
TF-IDF matrix:
[[0. 0. 0. 0. 0. 0.
0. 0. 0.35617798 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.35617798 0.35617798 0. 0. 0. 0.
0. 0. 0. 0. 0.35617798 0.
0.23551514 0. 0.30278382 0. 0. 0.
0. 0. 0.35617798 0. 0.30278382 0.
0. 0. 0. 0. 0.35617798 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. ]

[0. 0. 0.40291544 0.2664193 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.40291544 0.
0. 0. 0. 0. 0. 0.40291544
0. 0.40291544 0.34251494 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0.40291544 0. 0. 0.
0. ]

How is the shape of the TF-IDF matrix calculated?

The shape of the TF-IDF matrix is simple: it is the number of lines of text you have by the number of unique words across all those lines. So, if you have 10 lines and 73 unique words, the matrix will be 10 by 73. Each cell shows how important a word is in a specific line of text, compared to the whole set of lines. It’s like a grid that helps you see which words stand out in each line.
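You can confirm this directly on the matrix built above (a quick check; with this sample file it should print (10, 73)):

# (number of documents, number of unique words)
print(tfidf_matrix.shape)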

Text Vectorization Using Word2Vec

In a simple explanation, when Word2Vec converts words into vectors, words with similar meanings will have similar vectors. For instance, words like “apple,” “fruits,” and “oranges” will have vectors that are alike.
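A quick way to see this “similar words get similar vectors” behavior is to query a pre-trained model, sketched here with the same GloVe model that the code below loads (the exact neighbors and scores depend on the model):

# Load a small pre-trained GloVe model through gensim's downloader.
import gensim.downloader as api
glove = api.load("glove-twitter-25")
# Words whose vectors are closest to the vector for 'apple'.
print(glove.most_similar('apple', topn=5))
# Cosine similarity between two related words.
print(glove.similarity('apple', 'orange'))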

Important note about Word2Vec:

Word2Vec requires manually preprocessed and tokenized text as input, specifically a list of lists of tokens. It does not perform any preprocessing tasks like tokenization, lowercasing, or removing stopwords automatically.

Therefore, unlike TfidfVectorizer or CountVectorizer, which process raw text directly and handle basic text preprocessing internally, Word2Vec requires that these preprocessing steps be completed beforehand and expects the text in a tokenized, list-of-lists format.

The code below uses a pre-trained GloVe word embedding model, loaded through gensim, to convert each line of text into a vector.

# Import the gensim downloader API to access pre-trained models.
import gensim.downloader as api
# Load the pre-trained GloVe model trained on Twitter data with a 25-dimensional vector.
word2vec_model = api.load("glove-twitter-25")
# Import numpy for numerical operations.
import numpy as np
# Tokenize each line into words.
tokenized_lines = [word_tokenize(line) for line in lines]
# Remove stopwords from each tokenized line.
lines_without_stopwords = [[word for word in line if word.lower() not in nltk_stop_words] for line in tokenized_lines]
# Filter words to include only those present in the Word2Vec model's vocabulary.
lines_without_stopwords = [[word for word in line if word in word2vec_model] for line in lines_without_stopwords]
# Define a function to convert a line of text into a vector using the Word2Vec model.
def line_to_vector(line, model):
    if not line:
        return np.zeros(model.vector_size)  # Return a zero vector if the line is empty.
    vector = np.mean([model[word] for word in line], axis=0)  # Compute the mean vector for the line.
    return vector

# Apply the function to each line to get their vector representations.
vectorized_lines = [line_to_vector(line, word2vec_model) for line in lines_without_stopwords]

# Print the vector representation for each line.
print("Word2Vec vector representation of each line:")
for i, vec in enumerate(vectorized_lines):
    print(f"Line {i + 1}: {vec}")

The result is displayed below:

Word2Vec vector representation of each line:
Line 1: [ 0.13912429 -0.45609003 -0.52310145 0.12469415 0.73219126 0.15934701
0.8763452 -0.8681757 0.5430557 0.04922428 0.02454229 0.06156857
-2.8047001 0.37813643 0.62697 -0.04839572 0.28524214 0.5972243
0.5107514 -0.03612701 -0.06519858 -0.12287001 -0.44006997 -0.15197301
-0.351226 ]
Line 2: [ 0.13991089 0.04233001 -0.541205 0.5766915 0.4072998 0.182835
1.1932567 -0.8187633 0.11366666 0.12753665 -0.4631367 0.3158566
-2.9498332 0.27296102 0.37957215 0.38319 -0.11297 0.46653235
0.6017683 -0.40257168 0.04816033 -0.14876516 -1.064075 -0.11255068
-0.566025 ]

How is the shape of the Word2Vec output calculated?

When using the `word2vec` model, each word in a sentence is turned into a vector with 25 numbers (dimensions). If you have a sentence with several words, you might think that each word’s vector would be kept separately, so you’d end up with something like 7 words * 25 numbers per word, making 175 numbers per sentence. However, instead of keeping all these separate, the model averages all the word vectors in a sentence into one single vector with just 25 numbers. So, no matter how many words are in the sentence, each sentence is always represented by a vector of just 25 numbers. That’s why, if you have 10 sentences, you get a 10 * 25 array, not 10 * 7 * 25.
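You can verify this by stacking the line vectors produced by the code above (a quick check; with this sample file and model it should print (10, 25)):

import numpy as np

# One 25-dimensional vector per line.
print(np.array(vectorized_lines).shape)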

Section 6 (Final Section)

What are stemming and lemmatization?

Stemming:

Stemming reduces words to their base form by chopping off endings, which often produces truncated, non-dictionary forms. For example, “running” and “runs” both stem to “run.”

Lemmatization:

Lemmatization also reduces words to their base form, but it maps them to their correct dictionary form, ensuring that the result is a valid word. For instance, “better” (as an adjective) is lemmatized to “good.”

The following code demonstrates how to perform these tasks using the NLTK library:

# Import stemmer and lemmatizer from NLTK library.
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download the WordNet corpus, used by the lemmatizer.
nltk.download('wordnet')
# Initialize the Porter Stemmer.
stemmer = PorterStemmer()
# Initialize the WordNet Lemmatizer.
lemmatizer = WordNetLemmatizer()
# Apply stemming to each word in the filtered list.
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_nltk_tokens]
# Apply lemmatization to each word in the filtered list.
lemmatized_words_nltk = [lemmatizer.lemmatize(word) for word in filtered_nltk_tokens]
# Print original and stemmed words to compare.
print(f'Original words before stemming are : \n {filtered_nltk_tokens} \n Stemmed words with NLTK: \n {stemmed_words_nltk}')
# Print original and lemmatized words to compare.
print(f'Original words before lemmatizing are : \n {filtered_nltk_tokens} \n Lemmatized words with NLTK: \n {lemmatized_words_nltk}')

The result is displayed below:

Original words before stemming are : 
['Natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence', '.']
Stemmed words with NLTK:
['natur', 'languag', 'process', 'fascin', 'field', 'artifici', 'intellig', '.']
Original words before lemmatizing are :
['Natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence', '.']
Lemmatized words with NLTK:
['Natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence', '.']
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
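The lemmatized output looks unchanged because WordNetLemmatizer assumes every word is a noun unless you pass a part-of-speech tag. A small sketch of the difference:

# Without a POS tag, 'better' is treated as a noun and left as is.
print(lemmatizer.lemmatize('better'))            # better
# With the adjective tag, it is mapped to its dictionary form.
print(lemmatizer.lemmatize('better', pos='a'))   # good
print(lemmatizer.lemmatize('running', pos='v'))  # run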

And the code below demonstrates lemmatization using the spaCy library (spaCy provides lemmatization but not stemming):

# Extract the lemmatized form of each token from the spaCy tokenized words.
lemmatized_words_spacy = [token.lemma_ for token in spacy_words]
# Print the lemmatized words to demonstrate the effect of lemmatization.
print(f'Lemmatized words with spaCy: {lemmatized_words_spacy}')
