NLP: Natural Language Processing and Spam Detection

TINU ROHITH D
Published in Analytics Vidhya
3 min read · May 26, 2020

Hi my fellow people, "Rhymes the rhyme": the tune of the ABCDEFG song is the same as the tune of Twinkle Twinkle Little Star! Here, let's look into the requirements for building up an NLP model.

What is NLP?

Natural Language Processing is a branch of AI that helps computers understand, interpret and manipulate human language. NLP helps developers organize and structure knowledge to perform tasks such as translation, summarization, named entity recognition, relationship extraction, speech recognition and topic segmentation.

Components of NLP:

  1. Morphological Analysis: breaks chunks of language input into sets of tokens corresponding to paragraphs, sentences and words, and decomposes words into their smallest meaningful units. Eg: "uneasy" breaks into "un" + "easy".
  2. Syntax Analysis: checks whether a sentence is well formed and breaks it up into a structure that shows the syntactic relationships between the words. Eg: "The school goes to the boy" is rejected as a sentence.
  3. Semantic Analysis: draws the exact or dictionary meaning from the text and shows how words are associated with each other. Eg: "hot coffee" is accepted because the two words combine into a meaningful phrase.
  4. Pragmatic Analysis: fits the actual objects/events that exist in a given context to the object references obtained during the semantic phase, and discovers the intent of the text. Eg: "Close the window" is interpreted as a request instead of an order. Another Eg: "Place the apple in the basket on the shelf" has two semantic interpretations, and the pragmatic analyzer chooses between them.

Let's cover the steps involved in NLP and build a model using a spam detection dataset:

Step 1: Read the data.
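As a minimal sketch of this step (the inline two-row sample below is made up and simply stands in for a real spam CSV with a label column and a message column; in practice you would pass a file path to pd.read_csv):

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real spam dataset file
raw = io.StringIO(
    "label,message\n"
    "ham,Are we still meeting today?\n"
    "spam,WIN a free prize! Text WIN to 80082\n"
)
df = pd.read_csv(raw)
print(df.shape)  # (2, 2)
```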

Step 2: Preprocessing the data.

  1. Convert all the characters to lowercase.
  2. Remove stopwords. Stopwords are the most common words in a language; we remove them from the text so the model performs better, focusing on words with real statistical signal.
  3. Stemming: extracts the base form of words by chopping off word endings. Common stemmers include SnowballStemmer, PorterStemmer and LancasterStemmer.
  4. Lemmatization: another way to extract the base form of words, taking into account whether a word is used as a noun or a verb. WordNetLemmatizer is commonly used.
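The preprocessing steps above can be sketched as a single helper. The function name and the use of scikit-learn's built-in English stopword list are my choices for a self-contained example, not from the article; NLTK's WordNetLemmatizer would be the drop-in lemmatization alternative to the stemmer used here:

```python
import re

from nltk.stem import PorterStemmer  # SnowballStemmer / LancasterStemmer also work
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


stemmer = PorterStemmer()

def preprocess(text):
    # 1. lowercase, then keep only letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 2. drop stopwords, 3. stem whatever remains
    return " ".join(
        stemmer.stem(w) for w in text.split() if w not in ENGLISH_STOP_WORDS
    )

print(preprocess("WIN a FREE prize now!!!"))  # win free prize
```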

Step 3: Word embedding or Word2Vec:

Word embedding is the mapping of words to real numbers: each word is assigned a unique index and a numeric weight, so text can be fed to a model. Commonly used vectorizers are CountVectorizer and TfidfVectorizer from the sklearn.feature_extraction.text package.
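A minimal sketch of this step with TfidfVectorizer on a toy corpus (the three example sentences below are made up for illustration; CountVectorizer has the same fit_transform interface):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "win free prize",        # spam-like
    "meet for lunch today",  # ham-like
    "free lunch today",
]

vectorizer = TfidfVectorizer()  # CountVectorizer works the same way
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # the learned word-to-index mapping
print(X.shape)                         # (3 documents, 7 unique words)
```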

Step 4: Creating labels and splitting the data for model building.

  1. Naive Bayes: Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.
  2. MultinomialNB: implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word-count vectors, although tf-idf vectors are also known to work well in practice).
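Putting this step together, here is a hedged sketch with a hypothetical mini-corpus (the six messages and the 1 = spam / 0 = ham encoding are my illustrative choices, not the article's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus; in practice use the preprocessed messages.
messages = [
    "win free prize now", "free entry claim prize",
    "see you at lunch", "meeting moved to friday",
    "claim your free cash", "call me when you land",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# Vectorize, split, and fit the classifier
X = TfidfVectorizer().fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42
)
clf = MultinomialNB().fit(X_train, y_train)
print(clf.predict(X_test))  # predicted spam/ham labels for the held-out messages
```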

Step 5: Checking the performance of the model

The accuracy of the model was 97%.
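Accuracy is simply the fraction of predictions that match the true labels. A toy illustration of the check (the label arrays below are made up, not the article's actual results):

```python
from sklearn.metrics import accuracy_score

# Toy true and predicted labels to illustrate the metric
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # one mistake out of ten

print(accuracy_score(y_true, y_pred))  # 0.9
```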

Note: I have not displayed the output of each step in this article because I want you to explore it in your own notebook and get a clear picture of what happens at each step.

Conclusion:

We have looked into a few of the initial requirements for building an NLP model. Thank you, my fellow people, for investing your dedicated time in reading this article. Appreciate it!
