Introduction to NLP

Rajat Pal · Published in Geek Culture · May 4, 2021 · 5 min read

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) can be defined as the branch of computer science that enables computers to understand and work with human language, so that even a person with no computer science background can make a computer work for them.

As we all know, since humans invented computers, the accuracy, precision, and speed of doing certain tasks have improved exponentially. Humans now solve complex problems in seconds that used to take hours when done manually. Computers have drastically changed our lives, but this change has largely been restricted to people with computer knowledge.

Since their invention, a lot of development has happened in this field to help people with little computer knowledge operate computers for certain tasks: we have built software that lets humans with limited knowledge of computers perform their tasks effectively.

Nowadays we can see a lot of things being done on computers by people without taking the help of experts:

  • Smart assistants (e.g. Siri, Alexa)
  • Sentiment analysis
  • Text prediction
  • Language translation

Understanding how computers do all this

As we know, computers only understand numbers; they don't understand words. But we humans deal mostly in words, so let's understand how these words are converted into useful numerical data that can be fed to computers.

When a human writes or speaks in their language, they use many words that have minimal or no effect on the message they want to convey.

In earlier days, letters and telegrams were used to send messages. A letter consists of many words, whereas a telegram uses only those words that help convey the message, without much concern for grammar. From this we understand that not every word in a text is important for passing on the message.

ML algorithms interpret the frequency of each word in a document as a feature. A corpus can contain N unique words (e.g. 10,000), so we end up with N features for the algorithm. But as the number of features grows, model performance tends to decrease, so to overcome this we can remove certain words that do not change the meaning of the documents, just as we used to do in telegrams.
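To make this concrete, here is a minimal sketch (with a made-up toy corpus of my own, not from any real dataset) of how the number of unique words becomes the number of features:

```python
# Toy illustration: each unique word in the corpus becomes one feature.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build the vocabulary: the set of unique words across all documents.
vocabulary = sorted({word for doc in corpus for word in doc.split()})
print(vocabulary)        # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(len(vocabulary))   # 7 -> a 7-dimensional feature vector per document
```

With a real corpus the vocabulary easily runs into tens of thousands of words, which is why trimming uninformative words matters.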

Text Processing

  • Removing punctuation: including punctuation in our machine learning model is usually a bad idea, since punctuation holds little significance when it comes to figuring out the topics or sentiments of texts, so we remove all the punctuation marks present in the text.
  • Removing stopwords: stopwords are very frequently occurring words that carry little significance for the message. Words such as ‘is’, ‘the’, ‘and’, etc. are some of the English stopwords. Removing them can improve model performance.
  • Tokenization: we split the text in the corpus into a collection of tokens (words or phrases). Tokenization in which each token consists of a single word is known as 1-gram (unigram) tokenization. Similarly, a tokenizer that splits the text so that each token consists of at most n words is known as an n-gram tokenizer.
[Image: how tokenization is done]
  • Stemming and lemmatization: both are used to reduce a word to its root form. For example, ‘going’, ‘gone’, and ‘goes’ all share the root ‘go’, so they carry essentially the same meaning, but our algorithm would treat them as three separate words, which increases the number of features in the dataset. In stemming we simply strip the suffix from a word, whereas lemmatization relies on linguistic analysis of the word: it needs detailed dictionaries that the algorithm can look through to link each word form back to its lemma. (Both are shown in the sketch after this list.)
[Image: difference between stemming and lemmatization]
  • Removing the most and least frequent words: words that occur in nearly every document do not help in classification or clustering, since they are present everywhere, while words that occur very rarely are also of little use, since they carry too little signal. (This filtering appears as the min_df/max_df parameters in the vectorization sketch later on.)
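Here is a minimal sketch of these preprocessing steps using NLTK; the library choice, the sample sentence, and the exact ordering of steps are my own assumptions for illustration, and the punkt, stopwords, and wordnet resources have to be downloaded once before use.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The cats are running quickly, and the dogs were running too!"

# 1. Remove punctuation.
no_punct = text.translate(str.maketrans("", "", string.punctuation))

# 2. Tokenize into 1-grams (single words).
tokens = word_tokenize(no_punct.lower())

# 3. Remove English stopwords.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 4a. Stemming: crude suffix stripping.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                   # e.g. 'running' -> 'run'

# 4b. Lemmatization: dictionary-based reduction to the lemma.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # e.g. 'running' -> 'run'

# An n-gram tokenizer (here bigrams) groups up to n consecutive words per token.
print(list(ngrams(tokens, 2)))
```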

Vectorization

In vectorization, we compute the value of each feature for each document and build a sparse matrix.

Count vectorization is one such technique, in which we simply count how often each feature (word) appears in each document.
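For example, here is a minimal sketch with scikit-learn's CountVectorizer (the toy corpus and parameter values are assumptions of mine; max_df and min_df also implement the frequency filtering mentioned in the previous section):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the log",
]

# max_df drops words that appear in too large a share of documents,
# min_df drops words that appear in too few.
vectorizer = CountVectorizer(max_df=0.9, min_df=1, ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())                     # raw counts per document
```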

With plain counts, words that occur more often get higher values, whereas rarely occurring words get low values, so the importance of rare words shrinks and the model can perform poorly. It may well be that these rare words are the ones that would best separate the classes or clusters. To overcome this problem, we use another vectorization technique known as TF-IDF vectorization.

TF-IDF vectorization rebalances the importance of words across documents: frequent but uninformative words are down-weighted and rarer, more distinctive words are boosted. This is achieved by multiplying the term frequency by the inverse document frequency.

Term Frequency (TF): calculated by dividing the number of times a word appears in a document by the total number of words in that document.

Inverse Document Frequency (IDF): calculated as the log of (the total number of documents divided by the number of documents that contain the word).
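A quick hand-worked sketch of these two formulas on a made-up two-document corpus:

```python
import math

docs = [
    "the cat sat on the mat".split(),      # document 1
    "the dog sat on the log".split(),      # document 2
]

word = "cat"
doc = docs[0]

# Term frequency: occurrences of the word / total words in the document.
tf = doc.count(word) / len(doc)             # 1 / 6 ≈ 0.167

# Inverse document frequency: log(total docs / docs containing the word).
docs_with_word = sum(1 for d in docs if word in d)
idf = math.log(len(docs) / docs_with_word)  # log(2 / 1) ≈ 0.693

print(tf * idf)                             # TF-IDF weight of 'cat' in document 1
```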

In this way, words are weighted by how informative they are for a document, rather than by raw count alone.
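The same idea is available off the shelf, for example in scikit-learn's TfidfVectorizer (shown here as an assumed choice; note that scikit-learn applies a smoothed variant of the plain IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the log",
]

# Fit the vocabulary and transform the corpus into a sparse TF-IDF matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))   # each row is a document, each column a word's weight
```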

By doing all these steps, we have successfully converted our text into a numerical form that can be fed to a machine learning model to get the desired output.

That’s all in this blog. Feel free to reach out to me if you think anything in the blog is wrong; I’d love to fix the mistakes. Till then, goodbye and have a nice day!
