Text Analytics in Python — Text Preprocessing and Feature Vectorization

Changhyun Kim
8 min read · Jan 2, 2023


Text analytics is a Machine Learning technique used for extracting valuable information or patterns from text data. Text analytics differs from text analysis in that text analytics yields quantitative results, while text analysis returns qualitative results. Both play important roles across industries and academia, and this article will focus on text analytics in Python.

(Image source: https://www.displayr.com/how-to-set-up-your-text-analysis-in-displayr/)

Machine Learning based text analytics is generally performed in the following set of processes:

  1. Text preprocessing: Preparing raw text before it is featurized/vectorized. This may include lowercasing/uppercasing, eliminating stopwords, stemming, lemmatization, etc.
  2. Feature vectorization: Extracting features from preprocessed texts and vectorizing them.
  3. Building an ML model: Applying the vectorized data to an ML model. This includes training the model, making predictions, and evaluating the results.

In this article, we will focus on text preprocessing and feature vectorization.

Python provides various packages and libraries for natural language processing (NLP) and text analytics.

NLTK (Natural Language Toolkit): NLTK is one of the leading NLP packages in Python. It covers most parts of NLP, including sentence detection, tokenization, lemmatization, stemming, and so on.

Gensim: Gensim is a library specialized in topic modeling.

spaCy: spaCy is an open-source Python NLP library designed for production use, allowing developers to build real-world NLP applications.

This article will not cover every NLP library in Python, but we will use some of them, so let’s now go straight to the examples.

Text Preprocessing

We cannot convert text straight into features without preprocessing; preprocessing of raw text data is essential. It may include the following:

  • Cleansing
  • Tokenization
  • Filtering/Removing stopwords
  • Stemming
  • Lemmatization

Let’s cover these preprocessing steps using the NLTK package.

Text Tokenization

Text tokenization includes sentence tokenization and word tokenization. Sentence tokenization means extracting sentences from raw text data, and word tokenization means extracting words from it.

For sentence tokenization, text is generally split into sentences based on the period (.) or newline (\n) character. Let’s try this with sent_tokenize() provided by the NLTK package. First, let’s install and import the necessary libraries.

import nltk
from nltk import sent_tokenize
nltk.download('punkt')
sample_text = "A gallery of Lionel Messi celebrating Argentina's World Cup win has become the most-liked Instagram post ever. \
Hours after posting it, the footballer received more than 65 million likes - and the number is constantly rising. \
Argentina defeated France on penalties in Sunday's final in Qatar - their first World Cup triumph in 36 years."

print('-----sample text-----\n', sample_text, '\n')

sentences = sent_tokenize(text = sample_text)

print('-----tokenized sentences-----\n', sentences)
print(type(sentences), len(sentences))

After running sent_tokenize(), you can see that the sample text has been successfully tokenized into three sentences, which are returned as elements of a single list.

For word tokenization, NLTK provides the function word_tokenize(), which allows you to tokenize text into words. Let’s try it out with an example.

from nltk import word_tokenize

sentence = "A gallery of Lionel Messi celebrating Argentina's World Cup win has become the most-liked Instagram post ever."
words = word_tokenize(sentence)
print(words)

You can see that the single sentence is successfully tokenized into words, which are returned in a list.

Using both sent_tokenize() and word_tokenize(), let’s now create a function that tokenizes a document into sentences, and then each sentence into words.

def tokenize(document):
    sentences = sent_tokenize(document)
    words = [word_tokenize(sentence) for sentence in sentences]
    return words

print(tokenize(sample_text))

You can see that the sentences are tokenized into words, stored in per-sentence lists that are collected in one outer list. However, individual word tokens alone carry no information about word order. The n-gram language model is used to overcome this limitation by grouping n consecutive words into a single token. For example, for the sentence “The previous holder of the most-liked status was an egg.”, a 2-gram (bigram) model extracts every pair of consecutive words: (The, previous), (previous, holder), (holder, of), (of, the), (the, most-liked), and so on, as sketched below. We will talk more about the n-gram language model later.
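
To see what bigrams look like in practice, here is a minimal sketch using nltk.util.ngrams() on that sentence (note that word_tokenize() also keeps punctuation as tokens):

from nltk import word_tokenize
from nltk.util import ngrams

sentence = "The previous holder of the most-liked status was an egg."
tokens = word_tokenize(sentence)

bigrams = list(ngrams(tokens, 2))  # n=2 for bigrams; use n=3 for trigrams, etc.
print(bigrams)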

Removing Stopwords

Stopwords are words that are filtered out before analyzing text because they carry little meaning on their own. Stopword lists should be considered carefully since they differ from language to language. Some of the stopwords in English are is, the, a, and will, since they do not hold any significant meaning.

NLTK provides stopwords in many different languages, and you can download them by running nltk.download('stopwords')

"""Download stopwords from NLTK"""
import nltk
nltk.download('stopwords')

print('-----Examples of Stopwords in English-----\n', nltk.corpus.stopwords.words('english')[:10], '\n')
print(f"There are {len(nltk.corpus.stopwords.words('english'))} number of stopwords in English\n\n")

print('-----Examples of Stopwords in English-----\n', nltk.corpus.stopwords.words('spanish')[:10], '\n')
print(f"There are {len(nltk.corpus.stopwords.words('spanish'))} number of stopwords in Spanish")

As can be seen, NLTK provides 179 stopwords in English and 313 stopwords in Spanish.

Now let’s try to filter out stopwords from a sample text document.

text = "Its invasion in February managed to startle in every way. To those who thought Moscow was sane enough to not attempt such a massive and foolhardy undertaking. To those who felt the Russian military would waltz across a land of 40 million people and switch to clean-up operations within 10 days. And to those who felt they had the technical and intelligence prowess to do more than just randomly bombard civilian areas with ageing artillery; that the Kremlin’s military had evolved from the 90s levelling of Grozny in Chechnya."
stopwords = nltk.corpus.stopwords.words('english') #stopwords

all_tokens = []
words = tokenize(text) #tokenize() is the function we created previously

for sentence in words:
    for word in sentence:
        if word.lower() not in stopwords: # word.lower() because all stopwords are in lowercase
            all_tokens.append(word.lower())

print(all_tokens)

Keep in mind that all the stopwords provided by NLTK are provided in lowercase.

Stemming & Lemmatization

Stemming and lemmatization are word normalization techniques. The two are similar in that both reduce a word to its root form, but lemmatization is more elaborate than stemming in that it applies a morphological analysis to each word.

NLTK provides various stemmers such as the Porter, Lancaster, and Snowball stemmers. For lemmatization, NLTK provides WordNetLemmatizer. Let’s now compare stemming and lemmatization using these NLTK classes.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

print(stemmer.stem('printing'), stemmer.stem('printer'), stemmer.stem('printed'))
print(stemmer.stem('debating'), stemmer.stem('debates'), stemmer.stem('debated'))
print(stemmer.stem('happier'), stemmer.stem('happiest'))
print(stemmer.stem('earlier'), stemmer.stem('earliest'))

The root word of ‘debating’ should be ‘debate’, and that of ‘earlier’ should be ‘early’. However, you can see that these words are not stemmed correctly by LancasterStemmer: ‘deb’ was returned for ‘debating’ and ‘ear’ for ‘earlier’.
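
For reference, the other stemmers mentioned earlier can be tried the same way. Below is a quick sketch comparing the Porter and Snowball stemmers on the same words; they are generally less aggressive than Lancaster, but still cruder than lemmatization:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # Snowball requires a language argument

for word in ['debating', 'debates', 'debated', 'earlier', 'earliest']:
    print(word, '->', porter.stem(word), '/', snowball.stem(word))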

Now let’s try lemmatizing with WordNetLemmatizer.

First, import and download the necessary libraries and packages.

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')

lem = WordNetLemmatizer()

print(lem.lemmatize('debating', 'v'), lem.lemmatize('debating', 'n'))
print(lem.lemmatize('earliest', 'a'), lem.lemmatize('earlier', 'a'))

You can see that these words are successfully lemmatized to the root words. As you might have assumed, ‘v’ is for verbs, ‘n’ is for nouns and ‘a’ is for adjectives.

You can see that lemmatization returns more accurate results for root words compared to stemming.

Bag of Words — BOW

The Bag of Words (BOW) model is a way of representing text data when modeling text with machine learning algorithms. A BOW representation of text describes the occurrence of words within a document. It is called a bag of words because any information about the order of words is ignored in this model. In other words, the BOW model is only concerned with whether and how often words occur, not where in the document they occur.

One of the most significant advantages of the BOW model is that it is simple and easy to use. It can be used to create an initial draft model before building more sophisticated ones. At the same time, however, the BOW model has limitations:

  • Firstly, a BOW model does not capture semantic context well. Since it does not consider the order of words, much of the semantic context of the text is discarded.
  • Secondly, feature vectorization using a BOW model can produce a sparse matrix whose values are mostly zero. Applying a BOW model to a large text corpus yields many columns that contain mostly zeros.
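
Before moving on to Scikit-Learn's vectorizers, a minimal hand-rolled bag of words makes the idea concrete. This is just a sketch using collections.Counter; the library versions follow below:

from collections import Counter

docs = ['python is world', 'python is difficult']

# Count word occurrences per document; word order is ignored entirely.
bow = [Counter(doc.split()) for doc in docs]
print(bow)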

In general, there are two ways of feature vectorization with bag of words.

  1. Count Vectorization

This is a way of vectorizing text based on the count of each word in a document or collection of documents. In count vectorization, a word with a higher count is considered more significant. However, raw counts alone do not guarantee that the features of a document are well captured: certain words may appear repeatedly yet carry little significance.

  2. TF-IDF (Term Frequency — Inverse Document Frequency)

TF-IDF is a measure that quantifies the importance or relevance of a word in a document relative to a collection of documents. It is commonly expressed as

TF-IDF(i, d) = TF(i, d) × log(N / DF(i))

where TF(i, d) is the frequency of word i in document d, DF(i) is the number of documents that contain word i, and N is the total number of documents. In other words, the frequency of word i in each document, the number of documents containing word i, and the total number of documents are all considered in TF-IDF.
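
As a quick numeric illustration of the idf term (a sketch; note that Scikit-Learn's TfidfVectorizer uses a smoothed variant, so its values will differ slightly):

import math

N = 5   # total number of documents in the collection
df = 2  # number of documents that contain the word

idf = math.log(N / df)
print(round(idf, 3))  # log(5/2) is roughly 0.916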

Just reading through these concepts may leave them unclear, so let’s go straight to practice in Python.

For count vectorization, we are going to use CountVectorizer() from Scikit-Learn.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


text = ['hello world, welcome to the world of python', 'python is world', 'python is difficult', 'python is not difficult at all', 'i do not agree']
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
count_array = count_matrix.toarray()
count_df = pd.DataFrame(count_array, columns = cv.get_feature_names_out())  # get_feature_names() was removed in recent Scikit-Learn versions

count_df

The text in our example is a list of five elements, each of which is a sentence of lowercased words. Using CountVectorizer(), we created a data frame with five rows and one column for every word appearing in the texts. As you can see, the order of the words within each text document (sentence) is not preserved.
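
This also illustrates the sparsity issue mentioned earlier: count_matrix is stored as a SciPy sparse matrix, and we can check what fraction of its entries are zero (a quick sketch):

rows, cols = count_matrix.shape
sparsity = 1 - count_matrix.nnz / (rows * cols)  # nnz = number of non-zero entries
print(f'{sparsity:.0%} of the matrix entries are zero')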

For TF-IDF, we are also using TfidfVectorizer() from Scikit-Learn.

Let’s use the same sample text that we used for count vectorization.

from sklearn.feature_extraction.text import TfidfVectorizer


text = ['hello world, welcome to the world of python', 'python is world', 'python is difficult', 'python is not difficult at all', 'i do not agree']
tfidf = TfidfVectorizer()
tfidf_array = tfidf.fit_transform(text).toarray()
tfidf_df = pd.DataFrame(tfidf_array, columns = tfidf.get_feature_names_out())  # get_feature_names() was removed in recent Scikit-Learn versions
tfidf_df

Running the code above returns a TF-IDF value for each word in each text document.
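
To see why some words end up weighted more heavily than others, you can inspect the idf values the fitted vectorizer has learned; rarer words across the corpus receive higher idf and therefore higher TF-IDF weights (a sketch):

idf_weights = pd.Series(tfidf.idf_, index=tfidf.get_feature_names_out())
print(idf_weights.sort_values(ascending=False))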

In this article, we focused on text preprocessing and feature vectorization in Python. Upcoming articles will cover building ML models on text data and more text analytics content. Hope this helped!


Changhyun Kim

Korea Advanced Institute of Science and Technology — Business & Technology Management (Ph.D)