Text Processing using spaCy | NLP Library

Named Entity Recognition (NER) is one of the most important steps, and often the starting step, in Information Retrieval. Information Retrieval is the technique of extracting important and useful information from unstructured raw text documents. NER works by locating the named entities present in unstructured text and classifying them into standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentages, and codes. spaCy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.

spaCy Installation and Basic Operations | NLP Text Processing Library | Part 1

spaCy lets you add arbitrary classes to the entity recognition system and update the model with new examples, in addition to the entities already defined in the model. …

Text Preprocessing steps using spaCy, the NLP library

spaCy is designed specifically for production use. It helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article, you will learn about Tokenization, Lemmatization, Stop Words and Phrase Matching operations using spaCy.

This is article 2 in the spaCy series. In my last article, I explained spaCy installation and basic operations. If you are new to this, I suggest starting from article 1 for a better understanding.

Article 1 — spaCy-installation-and-basic-operations-nlp-text-processing-library/


Tokenization is the first step in a text processing task. It is more than simply breaking the text into pieces such as words and punctuation marks, known as tokens. spaCy's tokenizer works intelligently: it internally decides whether a “.” is punctuation that should be separated into its own token, or part of an abbreviation like “U.S.” …
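
To make the abbreviation problem concrete, here is a minimal sketch in plain Python. It is not spaCy's tokenizer and does not use spaCy's API; the `ABBREVIATIONS` set and the `tokenize` function are hypothetical names invented for this illustration. It only shows the idea of splitting off punctuation while keeping a known abbreviation in one piece.

```python
import re

# Hypothetical, hand-picked abbreviations that must stay in one piece.
ABBREVIATIONS = {"U.S.", "U.K.", "e.g.", "i.e."}

# A token is a run of word characters, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping known abbreviations intact."""
    tokens = []
    for chunk in text.split():
        # Peel trailing punctuation off the chunk until it is a known
        # abbreviation or ends in a letter or digit.
        trailing = []
        while chunk and chunk not in ABBREVIATIONS and not chunk[-1].isalnum():
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_PATTERN.findall(chunk))
        # Re-attach the peeled punctuation in its original order.
        tokens.extend(reversed(trailing))
    return tokens


print(tokenize("We moved to the U.S. in 2019."))
```

In this sketch the final “.” of the sentence becomes its own token, while the periods inside “U.S.” stay attached. A real tokenizer such as spaCy's relies on much richer language-specific rules than a fixed set like this.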

Bayes’ theorem is an extension of conditional probability. Conditional probability gives us the probability of A given B, denoted by P(A|B). Bayes’ theorem says that if we know P(A|B), then we can determine P(B|A), provided that P(A) and P(B) are also known to us.

In this post, I concentrate on Bayes’ theorem, assuming you have a good understanding of conditional probability. If you want to revise the concepts, you may refer to my previous post on conditional probability with examples.

Formula derivation:

From conditional probability, we know that

  • P(A|B) = P(A and B)/P(B)
  • P(A and B) = P(B) * P(A|B) — — — —…
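
The two identities above can be checked numerically. The sketch below uses made-up probabilities (all three input values are assumptions chosen for illustration) and a hypothetical helper named `bayes`. It relies on the standard fact that the joint probability P(A and B) can also be written as P(A) * P(B|A).

```python
def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) from P(A|B), P(B) and P(A)."""
    # Product rule: P(A and B) = P(B) * P(A|B)
    p_a_and_b = p_b * p_a_given_b
    # The same joint probability equals P(A) * P(B|A),
    # so dividing by P(A) gives P(B|A).
    return p_a_and_b / p_a


# Hypothetical probabilities, chosen only for illustration.
p_a_given_b = 0.5   # P(A|B)
p_b = 0.4           # P(B)
p_a = 0.25          # P(A)

p_b_given_a = bayes(p_a_given_b, p_b, p_a)
print(f"P(B|A) = {p_b_given_a:.2f}")
```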

As the name suggests, conditional probability is the probability of an event under some given condition. Based on that condition, our sample space reduces to the outcomes that satisfy it.

For example, find the probability that a person subscribes to insurance given that they have taken a house loan. Here the sample space is restricted to the people who have taken a house loan.
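
Here is a small sketch of the insurance example in Python. The customer counts are invented for illustration and do not come from any real data set. It computes the conditional probability in two ways: from the formula P(A|B) = P(A and B)/P(B), and by counting inside the restricted sample space.

```python
# Hypothetical customer counts, made up for this illustration.
total_customers = 1000
took_house_loan = 200      # event B: person has taken a house loan
loan_and_insurance = 50    # event A and B: loan holders who also bought insurance

# View 1: probabilities over the full sample space, then the formula.
p_b = took_house_loan / total_customers
p_a_and_b = loan_and_insurance / total_customers
p_insurance_given_loan = p_a_and_b / p_b

# View 2: restrict the sample space to loan holders and count directly.
p_restricted = loan_and_insurance / took_house_loan

print(f"P(insurance | loan) via formula:          {p_insurance_given_loan:.2f}")
print(f"P(insurance | loan) via restricted space: {p_restricted:.2f}")
```

Both views give the same number, which is exactly what "the sample space is restricted" means in practice.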

To understand conditional probability, it helps to know probability basics such as mutually exclusive and independent events; joint, union and marginal probabilities; and probability vs. statistics. …

Step by Step Explanation of PCA using Python with Example

Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of large data sets. Reducing the number of components or features costs some accuracy, but it makes a large data set simpler and easier to explore and visualize. It also reduces the computational complexity of the model, which makes machine learning algorithms run faster. How much accuracy is sacrificed to get a less complex, lower-dimensional data set is always an open and debatable question. …
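
One common way to put a number on that trade-off is the share of total variance kept by the retained components. The sketch below is a minimal illustration in plain Python for the two-feature case only, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data points are a small illustrative set and every variable name is my own; for real work a numerical library would be the usual choice.

```python
import math

# Small illustrative 2-D data set with two correlated features.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)

# Step 1: center each feature on its mean.
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Step 3: eigenvalues of a symmetric 2x2 matrix in closed form.
trace = sxx + syy
det = sxx * syy - sxy ** 2
gap = math.sqrt((trace / 2) ** 2 - det)
lambda_1 = trace / 2 + gap   # variance along the first principal component
lambda_2 = trace / 2 - gap   # variance along the second principal component

# Step 4: unit eigenvector for lambda_1 (this form assumes sxy != 0).
vx, vy = sxy, lambda_1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Step 5: project the centered data onto the first component (2-D -> 1-D).
projected = [x * pc1[0] + y * pc1[1] for x, y in centered]

# Share of total variance that survives the reduction.
explained_ratio = lambda_1 / (lambda_1 + lambda_2)
print(f"Variance kept by one component: {explained_ratio:.1%}")
```

Whatever is not kept is the information given up in exchange for the simpler, one-dimensional representation.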

Logistic regression is one of the most widely used machine learning algorithms for classification problems. In its original form, it is used for binary classification problems, which have only two classes to predict. However, with a little extension, logistic regression can easily be used for multi-class classification problems. In this post, I explain binary classification. I also explain the reason behind maximizing the log-likelihood function.

To understand logistic regression, you need a good understanding of linear regression and its cost function, which is the sum of squared errors that the fit minimizes. I have explained this in detail in my earlier post, and I recommend refreshing linear regression before going deep into logistic regression. Assuming you have a good understanding of linear regression, let’s dive into logistic regression. However, one more question arises: why can’t we use linear regression for classification problems? …
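
As a rough illustration of what "maximizing the log-likelihood" means, here is a minimal sketch with one predictor, fitted by plain gradient ascent. The data (hours studied vs. pass/fail), the learning rate and the iteration count are all assumptions made up for this example, and the function names are my own. The sigmoid keeps every prediction between 0 and 1, which a straight line from linear regression cannot guarantee.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(weight, bias, xs, ys):
    """Sum of log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical data: hours studied and whether the student passed (1) or not (0).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

weight, bias = 0.0, 0.0
learning_rate = 0.05   # assumed step size for this toy example
initial_ll = log_likelihood(weight, bias, xs, ys)

# Gradient ascent: move the parameters in the direction that raises the
# log-likelihood. The gradient is the sum of (label - prediction) * input.
for _ in range(2000):
    errors = [y - sigmoid(weight * x + bias) for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs))
    grad_b = sum(errors)
    weight += learning_rate * grad_w
    bias += learning_rate * grad_b

final_ll = log_likelihood(weight, bias, xs, ys)
print(f"log-likelihood before: {initial_ll:.3f}, after: {final_ll:.3f}")
print(f"P(pass | 3 hours) = {sigmoid(weight * 3.0 + bias):.3f}")
```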

In R, stepAIC is one of the most commonly used search methods for feature selection. It keeps minimizing the AIC value to arrive at the final set of features. stepAIC does not necessarily improve model performance; rather, it is used to simplify the model without much impact on performance. AIC quantifies the amount of information lost due to this simplification. AIC stands for Akaike Information Criterion.

Given two models, we prefer the one with the lower AIC value. Hence we can say that AIC provides a means for model selection. …
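
To see how that comparison works, recall the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below is plain Python, not R's stepAIC; the parameter counts and log-likelihood values are hypothetical numbers chosen for illustration.

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical fits: the reduced model drops three features and
# loses only a little log-likelihood.
aic_full = aic(num_params=6, log_likelihood=-120.0)
aic_reduced = aic(num_params=3, log_likelihood=-121.5)

preferred = "reduced" if aic_reduced < aic_full else "full"
print(f"AIC full: {aic_full}, AIC reduced: {aic_reduced}, preferred: {preferred}")
```

The penalty term 2k is what lets a simpler model win even when its fit is slightly worse.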

There are several questions related to multicollinearity:

  • What is multicollinearity?
  • How is multicollinearity related to correlation?
  • What problems does multicollinearity cause?
  • What is the best way to detect multicollinearity in the model?
  • How can we handle or remove multicollinearity from the model?

We will try to answer each of these questions in this post, one by one.


Multicollinearity occurs in a multiple regression model, where we have more than one predictor variable. It exists when we can linearly predict one predictor variable (note: not the target variable) from the other predictor variables with a significant degree of accuracy. This means two or more predictor variables are highly correlated. …
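
A common way to quantify this is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below covers only the simplest case of two predictors, where that R² equals the squared Pearson correlation between them. The data values and helper names are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: x2 is almost a linear function of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r = pearson_r(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"correlation = {r:.4f}, VIF = {vif:.1f}")
```

A frequently quoted rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity, though the exact cutoff is a judgment call.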

Feature selection is a way to reduce the number of features and hence the computational complexity of the model. It is often very useful for overcoming overfitting. It helps us determine the smallest set of features needed to predict the response variable with high accuracy. The question to ask is: does adding new features necessarily increase model performance significantly? If not, why add features that only increase model complexity?

So now let's understand how we can select the important features out of all the features available in the given data set. …

In any business, some variables are easy to measure, like age, gender, income and education level, and some are difficult to measure, like the amount of loan to give, the number of days a patient will stay in the hospital, or the price of a house after 10 years. Regression is the technique that enables you to estimate difficult-to-measure variables with the help of easy-to-measure variables.

Recommended: What is Linear Regression? Part:1

Linear regression is one of the regression techniques and can be defined as follows:

“Linear Regression is a field of study which emphasizes the statistical relationship between two continuous variables known as Predictor and Response variables”.
(Note: when there is more than one predictor variable, it becomes multiple linear regression.)
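
For the single-predictor case, the relationship can be estimated with the usual least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below is a minimal plain-Python version; the function name and the data are assumptions made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: co-variation of predictor and response; Sxx: variation of predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: an easy-to-measure predictor and a response.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.2, 4.1, 6.1, 7.9, 10.2]

slope, intercept = fit_simple_linear_regression(predictor, response)
prediction = slope * 6.0 + intercept
print(f"response = {slope:.2f} * predictor + {intercept:.2f}")
print(f"predicted response at predictor = 6: {prediction:.2f}")
```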


Ashutosh Tripathi

Certified Data Scientist. Technical Content Creator. Follow me on instagram.com/ashutosh_ai/, linkedin.com/in/ashutoshtripathi1/, blog @ ashutoshtripathi.com
