Exploring Feature Extraction Techniques for Natural Language Processing

Sahel Eskandar
12 min read · Apr 26, 2023

In natural language processing (NLP), feature extraction is a fundamental task that involves converting raw text data into a format that can be easily processed by machine learning algorithms. There are various techniques available for feature extraction in NLP, each with its own strengths and weaknesses. As a data scientist, it’s important to have a good understanding of the different feature extraction techniques available and their appropriate use cases.

In this article, I will explore several common techniques for feature extraction in NLP, including CountVectorizer, TF-IDF, word embeddings, bag of words, bag of n-grams, HashingVectorizer, Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Principal Component Analysis (PCA), t-SNE, and Part-of-Speech (POS) tagging.

Photo by Alexander Sinn on Unsplash

In an effort to provide a comprehensive comparison of various text feature extraction methods, I have compiled a chart comparing the main features and common use cases of some of the most popular techniques. By weighing the strengths and weaknesses of each approach, I hope to provide a useful resource for data scientists and researchers looking to choose the best technique for their specific task.

Comparison of Text Feature Extraction Techniques

I will provide an example Python code demonstrating how to implement each technique on a sample dataset, and discuss the advantages and limitations of each technique. By the end of this article, you will have a better understanding of the different feature extraction techniques available in NLP and be better equipped to choose the appropriate technique for your specific use case.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, PCA
from sklearn.manifold import TSNE
import numpy as np
import spacy

# Input text
text = "Natural Language Processing (NLP) is a subfield of computer science, " \
"artificial intelligence, and computational linguistics concerned with " \
"the interactions between computers and human (natural) languages. " \
"It focuses on how to program computers to process and analyze large " \
"amounts of natural language data."

# Tokenize the text
tokens = word_tokenize(text)
print(len(tokens), tokens[:20])

We choose a short sample text and create a list of tokens with NLTK's word_tokenize function.

50 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'computational', 'linguistics', 'concerned']

CountVectorizer

CountVectorizer is a class in Python's scikit-learn library that converts a collection of text documents into a matrix of token counts. It creates a bag-of-words representation of the corpus, where each document is represented as a vector of term frequencies. The class provides various options for preprocessing the text, such as tokenization, lowercasing, and stop-word removal (stemming or lemmatization is not built in, but can be supplied through a custom tokenizer). It also lets you cap the vocabulary size and set the minimum document frequency required for a term to be included in the vocabulary. The resulting matrix can be used as input to machine learning models for tasks such as text classification and clustering.

# CountVectorizer
count_vec = CountVectorizer()
X_count = count_vec.fit_transform([text])
print('CountVectorizer:')
print(count_vec.get_feature_names_out()[:10])
print(X_count.toarray()[0][:10])
CountVectorizer:
['amounts' 'analyze' 'and' 'artificial' 'between' 'computational'
'computer' 'computers' 'concerned' 'data']
[1 1 3 1 1 1 1 2 1 1]
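
The paragraph above mentions preprocessing options such as stop-word removal and vocabulary limits that the basic example does not use. Here is a minimal sketch of how those parameters might be set; the specific values are illustrative, not taken from the original example:

# CountVectorizer with some common preprocessing options (illustrative values)
count_vec_opts = CountVectorizer(
    lowercase=True,         # normalize case before counting
    stop_words='english',   # drop common English stop words
    max_features=20,        # cap the vocabulary size
    min_df=1                # minimum document frequency for a term
)
X_count_opts = count_vec_opts.fit_transform([text])
print(count_vec_opts.get_feature_names_out())
print(X_count_opts.toarray()[0])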

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that reflects the importance of a word in a document or corpus. It is calculated as the product of the term frequency (number of times a word appears in a document) and the inverse document frequency (logarithm of the total number of documents divided by the number of documents containing the word). The resulting TF-IDF vectors represent each document as a vector in a high-dimensional space where words that are more important in the document have higher weights. TF-IDF is a popular method for feature extraction in text classification tasks.

# TF-IDF
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform([text])
print('TF-IDF:')
print(tfidf_vec.get_feature_names_out()[:10])
print(X_tfidf.toarray()[0][:10])
TF-IDF:
['amounts' 'analyze' 'and' 'artificial' 'between' 'computational'
'computer' 'computers' 'concerned' 'data']
[0.12803688 0.12803688 0.38411064 0.12803688 0.12803688 0.12803688
0.12803688 0.25607376 0.12803688 0.12803688]
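
With only one document in the running example, every term receives the same IDF, so the weights above differ only by term frequency. A small sketch on a made-up multi-document corpus (the two extra sentences are invented purely for illustration) shows the inverse-document-frequency part at work:

# TF-IDF on a tiny, made-up corpus so that IDF actually varies across terms
toy_corpus = [
    text,
    "Computers process natural language data.",          # invented example sentence
    "Linguistics is the scientific study of language.",  # invented example sentence
]
toy_tfidf = TfidfVectorizer()
X_toy = toy_tfidf.fit_transform(toy_corpus)
# Terms appearing in many documents (e.g. 'language') get lower IDF
# than terms unique to a single document.
print(X_toy.shape)
print(dict(zip(toy_tfidf.get_feature_names_out()[:5], np.round(toy_tfidf.idf_[:5], 3))))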

Word Embeddings

Word embeddings represent each word as a vector in a high-dimensional space where similar words are close together and dissimilar words are far apart. Word embeddings are typically learned from large amounts of text data using unsupervised learning methods like Word2Vec and GloVe. The basic idea is to use a neural network to predict the context words of each target word in a large text corpus. The learned weights of the neural network are used as the word embeddings. Once the word embeddings are learned, they can be used to represent words in downstream natural language processing tasks like sentiment analysis and text classification.

# Word embeddings (using spaCy)
# Note: en_core_web_sm ships no static word vectors, so token.vector falls back
# to the model's 96-dimensional context-sensitive tensors.
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
embeddings = [token.vector for token in doc]
print('Word embeddings:')
print(len(embeddings), 'vectors, each of shape', embeddings[0].shape)
Word embeddings:
50 vectors, each of shape (96,)
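
The description above notes that embeddings are usually learned with methods like Word2Vec, whereas the spaCy example only loads pretrained vectors. A toy sketch of training embeddings from scratch with gensim (an extra dependency not used elsewhere in this article, and far too little text to learn anything meaningful) could look like this:

# Toy Word2Vec training with gensim (illustrative only -- real embeddings need much more text)
from gensim.models import Word2Vec

sentences = [[t.lower() for t in tokens]]        # a single tokenized "sentence"
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, seed=42)
print(w2v.wv['natural'].shape)                   # (50,)
print(w2v.wv.similarity('natural', 'language'))  # cosine similarity of the learned vectors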

Word Embeddings vs TF-IDF

Here, I used two different methods, word embeddings and TF-IDF, to compute the similarity between two words, “natural” and “language.” As mentioned before, word embeddings are dense vector representations of words that capture their semantic meaning, while TF-IDF is a numerical statistic that reflects the importance of a word in a document corpus. I used the cosine similarity metric to compare the vector representations of these two words in both methods and found that the similarity score using TF-IDF was higher than the one using word embeddings.

# Compare similarities using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Similarity between "natural" and "language"
# (indexing the spaCy vectors with NLTK token positions works here because
# both tokenizers happen to split this particular text the same way)
word1 = "natural"
word2 = "language"
embedding_sim = cosine_similarity(embeddings[tokens.index(word1)].reshape(1, -1),
                                  embeddings[tokens.index(word2)].reshape(1, -1))
tfidf_sim = cosine_similarity(X_tfidf[:, tfidf_vec.vocabulary_[word1]].reshape(1, -1),
                              X_tfidf[:, tfidf_vec.vocabulary_[word2]].reshape(1, -1))
print(f'Similarity between "{word1}" and "{word2}" using word embeddings:', embedding_sim[0][0])
print(f'Similarity between "{word1}" and "{word2}" using TF-IDF:', tfidf_sim[0][0])
Similarity between "natural" and "language" using word embeddings: 0.23813576
Similarity between "natural" and "language" using TF-IDF: 1.0

Word embeddings represent words as dense vectors in a high-dimensional space, where the distance between vectors represents the semantic similarity between words. The cosine similarity is a common way to measure the similarity between two vectors, and in this case, the similarity between the embeddings of “natural” and “language” is 0.23813576.

TF-IDF stands for term frequency-inverse document frequency and quantifies how important a word is to a document: the score is proportional to the word's frequency in the document and inversely proportional to its frequency across the corpus. Here the similarity between "natural" and "language" comes out as 1.0, but this is largely an artifact of the setup: with a single document, each word's TF-IDF "vector" collapses to a single number, so the cosine similarity between any two words that appear in the text is trivially 1.0. TF-IDF measures importance within documents rather than semantic relatedness between words, which is why word embeddings are the better tool for word-to-word similarity.

Bag of words

The bag-of-words approach involves breaking a piece of text down into individual words and then representing the text as a frequency distribution of those words. In other words, we create a "bag" of all the words in the text, without any regard for their order or context, and count how many times each word appears in that bag. This simple yet effective method lets us extract meaningful insights from large volumes of text data, such as identifying the most frequent words, analyzing sentiment, or even predicting future trends. While the bag-of-words method may seem rudimentary, it is an essential tool in any NLP toolkit and can be used for a wide variety of applications, from content classification and spam detection to sentiment analysis and chatbot development.

# Bag of words
bag_of_words = {word: tokens.count(word) for word in set(tokens)}
print('Bag of words:')
print(list(bag_of_words.items())[:10])
Bag of words:
[('amounts', 1), ('language', 1), ('(', 2), ('natural', 2), ('a', 1), ('concerned', 1), (')', 2), ('between', 1), ('program', 1), ('It', 1)]

Bag of n-grams

One of the fundamental tasks in NLP is to convert raw text data into a numerical form for machine learning models. While the bag-of-words method is a popular approach, it fails to capture the order and sequence of words in a text. This is where the bag-of-n-grams method comes in, where n-grams are contiguous sequences of n words. By including n-grams of different lengths, we can capture both local and global patterns in the text. However, as the length of the n-gram increases, the number of unique features also increases, leading to the curse of dimensionality. To counter this, techniques such as feature selection and dimensionality reduction can be used. Overall, the bag-of-n-grams method is a powerful tool for text representation, particularly for tasks such as sentiment analysis and text classification.

# Bag of n-grams
n = 2
ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
bag_of_ngrams = {ngram: ngrams.count(ngram) for ngram in set(ngrams)}
print('Bag of n-grams:')
print(list(bag_of_ngrams.items())[:10])
Bag of n-grams:
[(('amounts', 'of'), 1), (('subfield', 'of'), 1), (('It', 'focuses'), 1), (('computational', 'linguistics'), 1), (('computer', 'science'), 1), ((')', 'languages'), 1), (('language', 'data'), 1), (('Natural', 'Language'), 1), (('of', 'natural'), 1), (('NLP', ')'), 1)]
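
A bag of word bigrams can also be built with scikit-learn's CountVectorizer by setting ngram_range, which is usually more convenient than counting tuples by hand (its default tokenizer lowercases and drops punctuation, so the vocabulary will differ slightly from the manual tuples above). A minimal sketch:

# Bag of n-grams via CountVectorizer (word bigrams here)
bigram_vec = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vec.fit_transform([text])
print(bigram_vec.get_feature_names_out()[:5])
print(X_bigrams.toarray()[0][:5])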

HashingVectorizer

HashingVectorizer is another method for converting text into numerical representations that can be used for machine learning tasks. It is similar to CountVectorizer, but instead of building a vocabulary dictionary to map words to indexes, it uses a hashing function to directly convert words into numerical indices. This makes the process of vectorizing text much faster and requires less memory, but it comes at the cost of not being able to retrieve the original words from the indices.

Compared to CountVectorizer and TfidfVectorizer, HashingVectorizer has the advantage of being more memory-efficient and faster, which makes it useful for processing large amounts of text data. However, the downside is that it does not allow for the same level of control over the vocabulary as the other methods, which can lead to collisions where different words are mapped to the same index. This can result in a loss of information and lower accuracy in certain cases.

The common use case for HashingVectorizer is in scenarios where memory and processing speed are a concern, such as when working with large datasets or in real-time applications. It is often used in combination with online learning algorithms or streaming data pipelines, where the model needs to be updated on-the-fly and quickly adapted to new data. It is also commonly used in text classification, sentiment analysis, and topic modeling.

# HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
hash_vec = HashingVectorizer(n_features=100)
X_hash = hash_vec.fit_transform([text])
print('HashingVectorizer:')
print(X_hash.shape, X_hash.toarray()[0][:10])
HashingVectorizer:
(1, 100) [ 0. 0. -0.13483997 0. 0. 0.13483997
0. 0. 0. -0.13483997]

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative probabilistic model and is different from previous methods because it assumes that each document in a corpus is generated from a mixture of topics, and each topic is a probability distribution over words. LDA works by iteratively assigning words in each document to topics and adjusting the topic-word probabilities based on the resulting distribution of topics across documents. The end result of LDA is a set of topics, each represented by a distribution of words.

LDA is commonly used in applications like topic modeling, document clustering, and information retrieval. For example, a company that wants to understand the topics discussed in customer reviews could use LDA to identify the key themes in the reviews. Another example is a researcher who wants to analyze a large corpus of scientific papers to identify the main topics and trends in a field. Overall, LDA is a powerful tool for uncovering hidden patterns and structures in text data.

# Latent Dirichlet Allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, random_state=42)
# Note: scikit-learn's LDA is designed for raw term counts (e.g., X_count);
# fitting on TF-IDF weights, as done here, still runs but is less conventional.
X_lda = lda.fit_transform(X_tfidf)
print('LDA:')
print(X_lda)
LDA:
[[0.01600128 0.01600128 0.01600128 0.01600128 0.01600128 0.85598847
0.01600128 0.01600128 0.01600128 0.01600128]]
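
To see what the topics actually contain, the learned topic-word weights in lda.components_ can be mapped back to the vectorizer's vocabulary. A minimal sketch, using the TF-IDF feature names fitted above:

# Top words per LDA topic (a minimal sketch)
feature_names = tfidf_vec.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_[:3]):   # first three topics only
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f'Topic {topic_idx}: {top_words}')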

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) is a technique used for dimensionality reduction and feature extraction in text data. It factorizes a non-negative matrix into two smaller non-negative matrices whose product approximates the original: one matrix describes the extracted features (for text, topic-word weights) and the other contains the coefficients that express each document as a weighted combination of those features. Because both factors are non-negative, the method is well suited to inherently non-negative data such as term-frequency matrices.

Compared to other methods, NMF provides a more interpretable representation of the data, as the resulting features are represented as weighted combinations of the original features. Additionally, NMF can handle sparse data and can be used for unsupervised feature selection. One of the common use cases for NMF is in topic modeling, where it is used to extract latent topics from a corpus of text data. It has also been used for text clustering, collaborative filtering, and image processing.

# Non-negative Matrix Factorization (NMF)
nmf = NMF(n_components=10, random_state=42)
X_nmf = nmf.fit_transform(X_tfidf)
print('NMF:')
print(X_nmf)
NMF:
[[6.55551346e-01 4.28578416e-01 1.47212861e-02 2.40446075e-16
1.54113864e-01 1.05308938e-01 5.73261840e-02 1.65705288e-01
1.43820169e-01 2.48592727e-16]]

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional datasets into a smaller set of variables called principal components. PCA works by identifying the variables that contribute the most to the variance in the data and then projecting the data onto a lower-dimensional space while still retaining the maximum amount of variance. PCA is useful in many data analysis applications, including image and signal processing, finance, and biology.

Compared to previous methods, PCA is a more general technique that can be applied to any type of data, not just text. PCA can be used to reduce the dimensionality of the data without losing too much information, whereas other methods are more focused on extracting meaningful features from the text. PCA is also computationally efficient and can handle large datasets with millions of variables.

The common use case of PCA is to reduce the dimensionality of data in order to visualize it or to make it easier to analyze. For example, PCA can be used to reduce the dimensionality of image data to create a 2D or 3D visualization of the image, which can help in image recognition tasks. PCA can also be used in finance to reduce the dimensionality of a large set of financial variables to identify the most important factors that drive financial performance.
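
Since the article's running example has only a single document, PCA is most naturally demonstrated here on the spaCy token vectors computed earlier. A minimal sketch, assuming the embeddings list from the word-embeddings example is still available:

# PCA on the spaCy token vectors (a minimal sketch)
emb_matrix = np.array(embeddings)        # shape: (50, 96)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(emb_matrix)    # shape: (50, 2)
print('PCA:')
print(X_pca[:3])                         # first three tokens in 2-D
print('Explained variance ratio:', pca.explained_variance_ratio_)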

t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique that is commonly used for visualizing high-dimensional data in a low-dimensional space. It is especially useful for visualizing complex datasets such as those with many clusters, outliers, or nonlinear relationships.

t-SNE works by first constructing a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being chosen, while dissimilar objects have an extremely low probability of being chosen. It then constructs a similar probability distribution over pairs of low-dimensional objects and minimizes the divergence between the two distributions using gradient descent.

Compared to previous methods such as PCA or LDA, t-SNE is more effective at preserving local structure in the data, making it better suited for visualization tasks. However, t-SNE is also more computationally intensive and can be sensitive to the choice of hyperparameters, such as perplexity.

The common use case for t-SNE is exploratory data analysis, where it can be used to visualize high-dimensional data and identify patterns or clusters. It is also commonly used in machine learning for preprocessing high-dimensional data before feeding it into models such as neural networks or clustering algorithms.
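
As with PCA, a minimal sketch can run t-SNE on the spaCy token vectors from earlier; perplexity must be smaller than the number of samples, so a small value is used for these 50 tokens:

# t-SNE on the spaCy token vectors (a minimal sketch for visualization)
emb_matrix = np.array(embeddings)                          # shape: (50, 96)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
X_tsne = tsne.fit_transform(emb_matrix)                    # shape: (50, 2)
print('t-SNE:')
print(X_tsne[:3])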

Part-of-Speech (POS) Tagging

POS (Part-of-Speech) tagging is the process of assigning grammatical tags to each word in a text based on its definition and context. This is done by analyzing the structure of a sentence and identifying the function of each word in the sentence, such as noun, verb, adjective, adverb, preposition, conjunction, etc. POS tagging is commonly used in natural language processing (NLP) for various applications like text classification, sentiment analysis, machine translation, and speech recognition.

Compared to previous methods like Bag-of-Words and TF-IDF, which mainly focus on word frequency and occurrence, POS tagging provides more granular information about each word’s function in a sentence. This can help improve the accuracy of NLP models by taking into account the grammatical structure of a sentence. Furthermore, POS tagging is useful for disambiguating words that can have multiple meanings depending on their context. For example, the word “bank” can refer to a financial institution or the edge of a river, but POS tagging can help distinguish which meaning is intended based on the surrounding words in a sentence.

nltk.download('averaged_perceptron_tagger')

# Tokenize the text into sentences and words
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(sentence) for sentence in sentences]

# Perform POS tagging on the words
pos_tags = [nltk.pos_tag(sentence) for sentence in words]

# Print the POS tags
for sentence in pos_tags:
    print(sentence)
[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'NNS'), ('.', '.')]
[('It', 'PRP'), ('focuses', 'VBZ'), ('on', 'IN'), ('how', 'WRB'), ('to', 'TO'), ('program', 'NN'), ('computers', 'NNS'), ('to', 'TO'), ('process', 'VB'), ('and', 'CC'), ('analyze', 'VB'), ('large', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]

In this article, I explored various feature extraction methods such as Bag-of-Words, TF-IDF, and word embeddings. I also looked at dimensionality reduction techniques such as PCA and t-SNE, as well as topic modeling techniques such as LDA and NMF. Finally, I covered Part-of-Speech (POS) tagging and its use case. It was a pleasure to share my knowledge and experience with natural language processing and help others understand this exciting field. Please find the Python code in my repo.

👏 Don’t forget to give this article some claps and share it with your network to support my work! Feel free to follow my Medium profile for more insightful content on machine learning and data science. Thank you for being so supportive! 🚀
