NLP: Text Processing Via Stemming And Lemmatisation In Data Science Projects
How We Can Normalise And Reduce The Number Of Common Words Into A Single Word For Text Analytics
Reducing related word forms to a single base word is a useful text analytics technique. This article aims to explain the two key techniques for doing so, known as Stemming and Lemmatisation.
These techniques are widely used in SEO, tagging, search engines and indexing systems.
Data science projects can take advantage of them whenever text analytics is involved.
After extracting the text, we need to clean it. Once the text is clean, we can check whether related words can be reduced to a single word. Several techniques are available to help us achieve this goal.
The article below explains how to extract the text for the NLP project:
NLP: Python Data Extraction From Social Media, Emails, Documents, Web Pages, RSS & Images
Clear Overview Of Python Libraries & Techniques To Fetch Textual Data From All Of The Common Sources
The article below explains how to clean the extracted text:
NLP: Text Processing In Data Science Projects
Learn The Data Science Techniques To Process Text To Use For NLP Projects In Python
What Is Stemming?
Some words can be reduced to a single word. Stemming revolves around the concept of removing the last characters of a word until we get a common form that represents a number of words. Often this process ends up reducing many words into a single word. The final form is known as the stem. Because stemming only strips characters by rule, the stem is not guaranteed to be a dictionary word.
Stemming is about stripping suffixes
As an example, blogging, blogged and blogs can be reduced to the single word “blog”.
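To make the suffix-stripping idea concrete, here is a minimal sketch in plain Python. The suffix list and the `naive_stem` helper are hypothetical and deliberately tiny; real stemmers such as Porter or Snowball apply far more rules, so treat this only as an illustration of the principle.

```python
# Hypothetical, deliberately tiny rule list; real stemmers use many more rules.
SUFFIXES = ('ing', 'ed', 's')


def naive_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as 'is' survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # 'blogg' -> 'blog': drop one letter of a doubled final consonant.
            if len(stem) > 3 and stem[-1] == stem[-2] and stem[-1] not in 'aeioulsz':
                stem = stem[:-1]
            return stem
    return word


for word in ['blogging', 'blogged', 'blogs']:
    print(word, '->', naive_stem(word))
```

All three example words come out as "blog" here, but a rule list this small will mangle many other words, which is why established algorithms are preferred in practice.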
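There are a number of stemming algorithms, but the most common are Porter, Lancaster and Snowball.
Snowball, also known as Porter2, is a refinement of the original Porter algorithm and is generally regarded as the more accurate of the two. Lancaster is the most aggressive of the three.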
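One way to get a feel for the differences is to run the same words through all three stemmers side by side. The sketch below assumes NLTK is installed; the `stem_with_all` helper and the extra word "generously" are my own additions for illustration. Outputs are not listed here, because they vary by algorithm and are best inspected by running the code.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance per algorithm; Snowball needs the language name.
stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball': SnowballStemmer('english'),
}


def stem_with_all(word):
    """Return {algorithm name: stem} for a single word (hypothetical helper)."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in ['blogging', 'blogged', 'blogs', 'generously']:
    print(word, stem_with_all(word))
```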
from nltk.stem import SnowballStemmer

# Apply stemming to a list of words
stemmer = SnowballStemmer('english')  # the language argument is required
for word in ['blogging', 'blogged', 'blogs']:
    print(stemmer.stem(word))
# This will print blog, blog, blog (one per line)
What Is Lemmatisation?
Lemmatisation is the process of reducing a group of related words to a single word by mapping each one to its dictionary base form, known as the lemma. Unlike stemming, it relies on a vocabulary rather than on stripping characters.
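The difference from stemming is easiest to see with irregular forms, which no amount of suffix stripping can resolve. The sketch below uses a hypothetical, hand-written lemma table as a stand-in for a real dictionary; it is not how any particular library implements lemmatisation.

```python
# Hypothetical, hand-written lemma table; a real lemmatiser uses a full dictionary.
LEMMAS = {
    'mice': 'mouse',
    'better': 'good',
    'was': 'be',
    'blogs': 'blog',
}


def toy_lemmatise(word):
    """Return the dictionary base form if known, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)


for word in ['mice', 'better', 'was', 'blogs', 'table']:
    print(word, '->', toy_lemmatise(word))
```

Words missing from the table, such as "table", are returned unchanged, so the quality of a lemmatiser depends heavily on the coverage of its dictionary.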
The NLTK library in Python provides an interface to WordNet, a lexical database of English words. These words are linked together based on their semantic relationships, that is, on their meanings. We can utilise WordNet for lemmatisation.
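from nltk.stem import WordNetLemmatizer
# Assumes the WordNet data is installed, e.g. via nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('blogs'))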
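In the article below, I will demonstrate how Part Of Speech (PoS) tagging works. We can lemmatise sentences using PoS tags, which can yield better results because the lemma of a word can depend on whether it is used as a noun, a verb or an adjective.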
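As a rough sketch of how PoS tags can feed a lemmatiser: taggers commonly emit Penn Treebank tags such as "VBG" or "NNS", while WordNet expects a single letter ('n', 'v', 'a' or 'r'). The helper names `penn_to_wordnet`, `lemmatise_tagged` and `show_pos` below are hypothetical, and `show_pos` is only a stand-in that reveals which PoS letter each word would receive, so the example runs without any downloads.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet PoS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else is treated as a noun


def lemmatise_tagged(tagged_tokens, lemmatise):
    """Lemmatise (word, tag) pairs with any callable shaped like lemmatise(word, pos=...)."""
    return [lemmatise(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]


def show_pos(word, pos='n'):
    """Stand-in lemmatiser: returns the word with the PoS letter it was given."""
    return f"{word}/{pos}"


tagged = [('She', 'PRP'), ('is', 'VBZ'), ('blogging', 'VBG'),
          ('about', 'IN'), ('blogs', 'NNS')]
print(lemmatise_tagged(tagged, show_pos))
```

With NLTK and its data installed, `WordNetLemmatizer().lemmatize` can be passed in place of `show_pos`, since it accepts the same `pos` keyword argument.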
Reducing related word forms to a single base word is a useful text analytics technique. This article explained the two key techniques for doing so, known as Stemming and Lemmatisation. These techniques are widely used in SEO, tagging, search engines and indexing systems.
Hope it helps.