NLP Pipeline 101 With Basic Code Example — Feature Extraction

Haitian Wei
Published in Voice Tech Podcast
Mar 27, 2019

Introduction

In the previous article, NLP Pipeline 101 With Basic Code Example — Text Processing, I talked about the first step of building an NLP pipeline. In this article I will focus on the next step: feature extraction.

source: Udacity

Feature Extraction

The feature extraction step produces feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you plan to use.

Bag of Words

The bag-of-words model is a simplifying representation used in NLP. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.

source: https://slideplayer.com/slide/7073400/

And luckily for us, there is a ready-to-use Python package for this model.

import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# one possible tokenize function (requires the nltk punkt, wordnet and
# stopwords data): lower-case, strip punctuation, drop stop words, lemmatize
def tokenize(text):
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)
            if tok not in stopwords.words("english")]

corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.",
          "It will change your view of the matrix.",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing ?"]
# initialize count vectorizer object with the tokenize function above
vect = CountVectorizer(tokenizer=tokenize)
# get counts of each token (word) in the text data
X = vect.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()
# view token vocabulary and counts
vect.vocabulary_
>>> {'first': 6,
'time': 20,
'see': 17,
'second': 16,
'renaissance': 15,
'may': 11,
'look': 9,
'boring': 3,
'least': 8,
'twice': 21,
'definitely': 5,
'watch': 24,
'part': 13,
'2': 0,
'change': 4,
'view': 22,
'matrix': 10,
'human': 7,
'people': 14,
'one': 12,
'started': 18,
'war': 23,
'ai': 1,
'bad': 2,
'thing': 19}
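Each row of the array returned by X.toarray() is one document, and each column holds the count of the token at that vocabulary index. A quick check (the exact shape assumes the tokenize function shown above):

print(X.toarray().shape)  # (5, 25): 5 documents, 25 vocabulary tokens
print(X.toarray()[0])     # word counts for the first document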


TF-IDF

TF-IDF is short for term frequency–inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus.

The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

And similar to bag of words, sklearn.feature_extraction.text provides a class for this. Below is sample code.

from sklearn.feature_extraction.text import TfidfTransformer
# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)
# convert sparse matrix to numpy array to view
tfidf.toarray()
>>>array([[ 0. , 0. , 0. , 0.36419547, 0. ,
0. , 0.36419547, 0. , 0. , 0.26745392,
0. , 0.36419547, 0. , 0. , 0. ,
0.36419547, 0.36419547, 0.36419547, 0. , 0. ,
0.36419547, 0. , 0. , 0. , 0. ],
[ 0.39105193, 0. , 0. , 0. , 0. ,
0.39105193, 0. , 0. , 0.39105193, 0.28717648,
0. , 0. , 0. , 0.39105193, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.39105193, 0. , 0. , 0.39105193],
[ 0. , 0. , 0. , 0. , 0.57735027,
0. , 0. , 0. , 0. , 0. ,
0.57735027, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.57735027, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.4472136 , 0. , 0. ,
0. , 0. , 0.4472136 , 0. , 0.4472136 ,
0. , 0. , 0. , 0.4472136 , 0. ,
0. , 0. , 0. , 0.4472136 , 0. ],
[ 0. , 0.57735027, 0.57735027, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.57735027,
0. , 0. , 0. , 0. , 0. ]])
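To make one of these numbers concrete: with smooth_idf=False, sklearn computes idf(t) = ln(n_docs / df(t)) + 1, multiplies it by the raw count from the count vectorizer, and then L2-normalizes each row. A minimal check of the first value, assuming the tokenization shown earlier:

import numpy as np

n_docs = 5
idf_first = np.log(n_docs / 1) + 1  # 'first' appears in 1 document
idf_look = np.log(n_docs / 2) + 1   # 'look' appears in 2 documents
# document 0 has 8 tokens, each with count 1: 7 with df=1 plus 'look' with df=2
row = np.array([idf_first] * 7 + [idf_look])
print(idf_first / np.linalg.norm(row))  # ~0.3642, matching the array above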

Moreover, we can use TfidfVectorizer.

TfidfVectorizer = CountVectorizer + TfidfTransformer

from sklearn.feature_extraction.text import TfidfVectorizer
# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()

word2vec

Word2vec is a group of related models that are used to produce word embeddings. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

source:https://medium.com/@jayeshbahire/introduction-to-word-vectors-ea1d4e4b84bf

Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

There are typically two model architectures: CBOW, which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.

After we get the word vectors, we can use them to extract features from a given document. One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max, as sketched below.
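As a rough sketch of this idea (assuming gensim 4.x is installed; the tiny training corpus and parameter values below are only illustrative):

import numpy as np
from gensim.models import Word2Vec

# toy training data: the tokenized documents from the corpus above
sentences = [doc.lower().split() for doc in corpus]

# train a small skip-gram model (sg=1)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

def document_vector(tokens, model):
    # aggregate word vectors with a coordinate-wise mean
    vectors = [model.wv[tok] for tok in tokens if tok in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(document_vector(sentences[0], model).shape)  # (50,)

The same helper could use a coordinate-wise min or max instead of the mean, per the aggregation options mentioned above.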

Glove

Like word2vec, GloVe (short for Global Vectors) is another commonly used word embedding method. It is based on global matrix factorization: matrix factorization methods from linear algebra are used to perform rank reduction on a large word co-occurrence matrix.
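Pre-trained GloVe vectors are distributed as plain-text files (for example glove.6B.50d.txt from the Stanford NLP group). A minimal sketch for loading one into a dictionary, assuming the file has already been downloaded:

import numpy as np

# each line is a word followed by its vector components, separated by spaces
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(embeddings["matrix"][:5])  # first five dimensions of the vector for 'matrix'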

And this is what the feature extraction part of the NLP pipeline does. In the next article, I will go through the model part.
