NLP Pipeline with Code

Archit Saxena
4 min read · Mar 1, 2022


In my previous article Natural Language Processing — Pipeline, I tried to give a basic understanding of the NLP pipeline, explaining each step theoretically with an example.

This article focuses on the coding part of those examples.

If you have not read the previous article yet, I would suggest going through it first, since this one is a direct continuation and will make more sense with that background.

We are all set now! Let us bring back the NLP pipeline that was discussed in my previous article-

PIPELINE OF NLP PROJECT

Typically, these are the different steps in NLP-

  • Collecting data
  • Segmentation
  • Tokenization
  • Stopword removal
  • POS Tagging
  • Lemmatization
  • Text vectorization
  • Model Training and Prediction

1. Collecting data

Let us suppose we have a CSV file with the sentences in one column. We will first load the data-

import pandas as pd
df = pd.read_csv('pipeline.csv')
df
The DataFrame

Now, we will extract the column we are interested in, i.e., ‘Sentence’ and merge all the sentences.

string = ' '.join(df['Sentence'])
print(string)

Our text is ready. Time to move on to the 2nd step.

2. Segmentation

We will use the Natural Language Toolkit (nltk), which is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.

The ‘sent_tokenize’ function segments our text into sentences-

import nltk
nltk.download('punkt')  # models used by the sentence tokenizer
from nltk.tokenize import sent_tokenize
tokens = sent_tokenize(string)
print(tokens)

3. Tokenization

To further break the sentence into words, we will use ‘word_tokenize’. We will consider the second sentence from now onwards-

from nltk.tokenize import word_tokenize
second_sentence = tokens[1]
word_tokens = word_tokenize(second_sentence)
print(word_tokens)

4. Stopword Removal

There are a few stopwords in our sentence. We can get rid of them by using ‘stopwords’ from nltk-

nltk.download('stopwords')  # the stop word lists
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
after_stopword = [w for w in word_tokens if w not in stop_words]
print(after_stopword)
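
One thing to keep in mind: the stop word list from nltk is all lower-case, so a token like ‘The’ at the start of a sentence would slip through the filter above. A common variation (not part of the original code) is to lower-case each token before checking it-

after_stopword = [w for w in word_tokens if w.lower() not in stop_words]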

5. POS Tagging

Although POS tagging will not serve any significant purpose here, we will do it anyway to understand how it works-

nltk.download('averaged_perceptron_tagger')
pos = nltk.pos_tag(after_stopword)
print(pos)
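
The output is a list of (word, tag) tuples using the Penn Treebank tag set, e.g. NN for a singular noun and JJ for an adjective. If a tag is unfamiliar, nltk can describe it (the tag passed here is just an example)-

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')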

6. Lemmatization

‘WordNetLemmatizer’ comes to the rescue for lemmatizing the sentence-

nltk.download('wordnet')  # WordNet data used by the lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lem = ''
for i in after_stopword:
    lem = lem + '\'' + lemmatizer.lemmatize(i, pos="a") + '\'' + ', '
print(lem.rstrip(', '))
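
Note that the code above passes pos="a", so every word is lemmatized as an adjective; the WordNet lemmatizer only changes a word when it can find it under the given part of speech. A quick, self-contained illustration (the example words are my own, not from our sentence)-

print(lemmatizer.lemmatize('better', pos='a'))   # 'good'    - reduced as an adjective
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'     - reduced as a verb
print(lemmatizer.lemmatize('running', pos='a'))  # 'running' - unchanged as an adjective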

7. Text vectorization

Here we will see the working of the Bag-of-words (BoW) model to convert text to vectors.

But before that, I performed all the above steps on the first sentence as well and discarded the full stop (.) at the end of each sentence. Then I created a list of those processed sentences. The list looks something like this-
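
The screenshot of that list is not reproduced here, but a corpus along the following lines (hypothetical sentences, chosen only to be consistent with the vocabulary and the first vector shown in the output below) gives an idea of its shape-

# Hypothetical preprocessed sentences - the real ones come from the CSV loaded above
sen_corpus = ['this first sentence',
              'this second sentence',
              'this one long sentence']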

Now we will use ‘CountVectorizer’ from the ‘sklearn’ library to build the vocabulary of all words and represent each sentence as a vector, with values at the positions where its words appear. The code for this is-

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sen_corpus)
print('All the unique words in our corpus-')
print(vectorizer.get_feature_names())  # on scikit-learn >= 1.0, prefer get_feature_names_out()
print('\nTransformed sentences now look like-')
print(X.toarray())

As we can see here, it has given us all the unique words, and each sentence is now transformed into a vector holding a one (1) at the positions of the words it contains and a zero (0) everywhere else. (CountVectorizer actually stores word counts; in our short sentences every word appears at most once, so the values are just ones and zeros.)

To have more clarity over it, let us look at the first sentence. This sentence contains the words ‘this’, ‘first’ and ‘sentence’. The vector we got for this is [1 0 0 0 1 1]. The 1s here represent the presence and the position of these 3 words. The first 1 is for ‘first’. The next three 0s are for ‘long’, ‘one’ and ‘second’ since these are not present. The last two 1s are for ‘sentence’ and ‘this’.
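
If the mapping between words and vector positions is ever unclear, the fitted vectorizer exposes it directly-

print(vectorizer.vocabulary_)  # e.g. {'this': 5, 'first': 0, 'sentence': 4, ...}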

By doing so, we get a uniform length for all the sentences.

Converting text to vectors is essential because it lets us leverage the power of linear algebra: we feed these vectors into our ML model, and the model then trains on them.
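
To round off the last step from the list above (model training and prediction), here is a minimal sketch. It assumes a set of labels y that our toy data does not actually have, so treat it purely as an illustration of how the BoW matrix X would be consumed-

from sklearn.linear_model import LogisticRegression

# Hypothetical labels, one per sentence in sen_corpus - purely illustrative
y = [0, 1, 1]

model = LogisticRegression()
model.fit(X, y)  # train on the BoW vectors
print(model.predict(vectorizer.transform(['a second sentence'])))  # predict for unseen text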

This completes the code for the NLP pipeline. We are now ready to take on (almost) any text data and preprocess it.

If you like it, please leave a 👏.

Feedback or suggestions are always welcome.
