Natural Language Processing — Pipeline

Archit Saxena
Published in Analytics Vidhya · 4 min read · May 13, 2020

Computers are great at handling structured data (spreadsheets, databases, etc.). But we humans communicate in words, which is unstructured data for a computer.

Do you ever wonder how computers understand human language?

How do Siri and her friends Alexa and Cortana comprehend what we say and respond?

The answer is Natural Language Processing.

Natural Language Processing

Natural Language Processing (NLP) is a sub-field of linguistics, computer science, information engineering and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

(Source: Wikipedia)


NLP deals with text data, whether it is structured, unstructured or semi-structured data.

If you want to start a project in NLP, there is a pipeline that you can follow.

PIPELINE OF NLP PROJECT

There are several steps you may choose to follow in an NLP project (some depend on your requirements)-

  • Collecting data
  • Segmentation
  • Tokenization
  • Stopword removal
  • POS Tagging
  • Lemmatization
  • Text vectorization
  • Model Training and Prediction

1. Collecting data

You could be provided with data, or you may have to generate/download it yourself. A data engineer might help you get the data you can work on.

Let us suppose we have got the following text as our data-

This is first sentence. This is second sentence and is longer than first one.

2. Segmentation

Once you get the data and extract the text from it, the first step is to segment/split the text into sentences. This is done because it is easier to work on sentences than on the whole text.

If we segment the above text, we will get this as output-

1. ‘This is first sentence.’,

2. ‘This is second sentence and is longer than first one.’

In its most basic form, segmentation looks for punctuation to split the text.
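As a minimal, dependency-free sketch of this idea (in practice a library such as NLTK handles abbreviations and other edge cases far better), segmentation can be done with a regular expression:

```python
import re

def segment(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "This is first sentence. This is second sentence and is longer than first one."
print(segment(text))
# → ['This is first sentence.',
#    'This is second sentence and is longer than first one.']
```

This rule would wrongly split on abbreviations like "Dr.", which is exactly why real segmenters are more sophisticated.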

3. Tokenization

After breaking the paragraph into sentences (segmentation step), we will further break the sentence into words. This process is called tokenization (or word tokenization).

If we tokenize the second sentence, we will get this as the output-

‘This’, ‘is’, ‘second’, ‘sentence’, ‘and’, ‘is’, ‘longer’, ‘than’, ‘first’, ‘one’, ‘.’

In the simplest case, we split the sentence wherever there is a space between words, treating punctuation as separate tokens.
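A minimal sketch of this split (libraries like NLTK's `word_tokenize` handle contractions and special cases better) can again use a regular expression that keeps punctuation as its own token:

```python
import re

def tokenize(sentence):
    # \w+ matches runs of word characters; [^\w\s] matches single
    # punctuation marks, so '.' becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("This is second sentence and is longer than first one."))
# → ['This', 'is', 'second', 'sentence', 'and', 'is',
#    'longer', 'than', 'first', 'one', '.']
```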

4. Stopword Removal

Stopwords are common words that add little to the meaning of a sentence. It is often better to identify and remove them from the text.

Some of the stopwords are-

“i”, “me”, “my”, “myself”, “we”, “our”, “ours”, “ourselves”, “you”, “your”, “yours”, “yourself”, “yourselves”, “he”, “him”, “no”, “nor”, “not”, “only”, “own”, “same”

So, our example sentence now becomes-

‘This’, ‘second’, ‘sentence’, ‘longer’, ‘first’, ‘one’, ‘.’

Here, the words ‘is’, ‘and’ and ‘than’ are removed.

NOTE that the selection of stopwords depends heavily on the requirements. Based on them, you may add words to or remove words from the stopword list.

For example, the default list of stopwords contains words like “no”, “not”, and “nor”. Removing these in a project like sentiment analysis is not advisable as it will change the whole sentiment.
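A dependency-free sketch of stopword removal, using a small hand-picked stopword set just for illustration (in practice you would start from a library list, such as NLTK's, and adjust it for your use case):

```python
# Illustrative stopword set; real lists contain well over a hundred words.
STOPWORDS = {"i", "me", "my", "we", "our", "you", "is", "and", "than",
             "he", "him", "no", "nor", "not", "only", "own", "same"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so capitalised stopwords are also caught.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ['This', 'is', 'second', 'sentence', 'and', 'is', 'longer',
          'than', 'first', 'one', '.']
print(remove_stopwords(tokens))
# → ['This', 'second', 'sentence', 'longer', 'first', 'one', '.']
```

For a sentiment-analysis project you would drop "no", "not", and "nor" from this set, as noted above.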

5. POS Tagging

Part-of-speech (POS) tagging, as the name suggests, is the process of tagging each word in a text with its corresponding part of speech (noun, verb, adjective, adverb, etc.).

For the same sentence, after performing POS tagging, we will get the following -

(‘This’, ‘DT’), (‘second’, ‘JJ’), (‘sentence’, ‘NN’), (‘longer’, ‘JJR’), (‘first’, ‘JJ’), (‘one’, ‘CD’), (‘.’, ‘.’)

Here’s a list of the tags and what they mean-

  • DT — determiner
  • JJ — adjective (e.g. ‘big’)
  • NN — noun
  • JJR — adjective, comparative (e.g. ‘bigger’)
  • CD — cardinal digit
  • . — punctuation
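In practice a trained tagger such as NLTK's `nltk.pos_tag` does this using context. As a dependency-free illustration of the output shape only, a toy lookup table reproduces the result for our sentence (a fixed dictionary is not a real tagger — real taggers disambiguate words by context):

```python
# Toy tag lookup, purely to illustrate the (word, tag) output shape.
TOY_TAGS = {"this": "DT", "second": "JJ", "sentence": "NN",
            "longer": "JJR", "first": "JJ", "one": "CD", ".": "."}

def toy_pos_tag(tokens):
    # Default to NN for unknown words, a common fallback heuristic.
    return [(t, TOY_TAGS.get(t.lower(), "NN")) for t in tokens]

print(toy_pos_tag(['This', 'second', 'sentence', 'longer', 'first', 'one', '.']))
# → [('This', 'DT'), ('second', 'JJ'), ('sentence', 'NN'),
#    ('longer', 'JJR'), ('first', 'JJ'), ('one', 'CD'), ('.', '.')]
```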

6. Lemmatization

Lemmatization is the process of reducing a word to its base form (lemma).

For example, the word ‘Caring’ will be converted to the word ‘Care’.

For our sentence, it works as-

‘This’, ‘second’, ‘sentence’, ‘long’, ‘first’, ‘one’, ‘.’

Here, the word ‘longer’ has been changed to ‘long’.
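Real lemmatizers (e.g. NLTK's `WordNetLemmatizer`) use a full vocabulary plus the word's POS tag. A toy dictionary-based sketch, just to show the idea, looks like this:

```python
# Toy lemma dictionary for illustration; a real lemmatizer covers
# the whole vocabulary and uses the POS tag to disambiguate.
TOY_LEMMAS = {"longer": "long", "caring": "care", "better": "good"}

def lemmatize(tokens):
    # Return the lemma if we know one, otherwise the word unchanged.
    return [TOY_LEMMAS.get(t.lower(), t) for t in tokens]

print(lemmatize(['This', 'second', 'sentence', 'longer', 'first', 'one', '.']))
# → ['This', 'second', 'sentence', 'long', 'first', 'one', '.']
```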

7. Text vectorization

We need to convert the text into numerical data (vectors) that can be fed to Machine Learning algorithms.

We can use different techniques/models to represent the words we get after the preprocessing. Some of them are-

  1. Bag-of-words (BoW)
  2. tf-idf
  3. word2vec

Each one has a different way to represent the words but eventually gives a vector for each sentence which we can further feed into our Machine/Deep Learning model.
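As a sketch of the simplest of these, Bag-of-words, here is a from-scratch version (in practice you would use scikit-learn's `CountVectorizer`, or `TfidfVectorizer` for tf-idf): each sentence becomes a vector of word counts over a shared vocabulary.

```python
from collections import Counter

def bag_of_words(sentences):
    # Shared vocabulary: every distinct word across all sentences, sorted
    # so each word has a stable position in the vector.
    vocab = sorted({word for tokens in sentences for word in tokens})
    vectors = []
    for tokens in sentences:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

sents = [["this", "is", "first", "sentence"],
         ["this", "is", "second", "sentence", "and", "is",
          "longer", "than", "first", "one"]]
vocab, vectors = bag_of_words(sents)
print(vocab)
# → ['and', 'first', 'is', 'longer', 'one', 'second', 'sentence', 'than', 'this']
print(vectors)
# → [[0, 1, 1, 0, 0, 0, 1, 0, 1], [1, 1, 2, 1, 1, 1, 1, 1, 1]]
```

Note how ‘is’ gets a count of 2 in the second sentence — BoW keeps frequencies but discards word order.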

8. Model training and prediction

After getting the vector, now is the time to build our model.

We split the data into a train set and a test set (the test set is typically 20–30% of the dataset).

The vectorized sentences are then fed, together with the expected outputs (actual labels), to our model so that it learns and is ready for any new sentence.

We then test it with our test set. Using the evaluation metrics (a topic for another day 😉), if the model is accurate enough, we go ahead with it.

If not, we can always retrain the model.

CONCLUSION

As you can see, the NLP pipeline involves multiple steps. The best part is that there are libraries for all the steps mentioned, which will ease your work.

But it is not always required to implement all the steps (or even in the same order). It totally depends on your use case and the dataset.

If you like it, please leave a 👏.

Feedback or suggestions are always welcome.
