NLP: Text Processing In Data Science Projects
Learn The Data Science Techniques To Process Text To Use For NLP Projects In Python
Once we have gathered the text, the next stage is to clean and consolidate it. It is important to standardise the text and remove noise so that it can be analysed efficiently to derive meaningful insights.
It’s important to note that how we clean and process text depends heavily on the nature of the NLP project. For instance, numbers might be important for your project, in which case you would not remove them.
This article aims to explain the steps we can perform to clean the text for NLP projects.
Article Aim
I will explain the following key techniques:
- Convert Text To Lowercase
- Tokenise Paragraphs To Sentences
- Tokenise Sentences To Words
- Remove Numbers
- Remove Punctuation
- Remove Stop Words
- Remove Whitespaces
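To give a rough feel for how these steps chain together, here is a minimal sketch that uses only Python's standard library (`re` and `string`). The function name `clean_text` and the tiny `STOP_WORDS` set are placeholders made up for illustration, and the sentence splitter is deliberately naive; a real project would more likely rely on a proper tokeniser and a full stop word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop word set (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "to", "it"}

def clean_text(paragraph):
    """Return a list of sentences, each a list of cleaned word tokens."""
    # Convert text to lowercase.
    text = paragraph.lower()
    # Tokenise the paragraph into sentences (naive: split after . ! or ?).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        # Remove numbers.
        sentence = re.sub(r"\d+", "", sentence)
        # Remove punctuation.
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        # Tokenise the sentence into words; split() also drops extra whitespace.
        words = sentence.split()
        # Remove stop words.
        words = [w for w in words if w not in STOP_WORDS]
        if words:
            cleaned.append(words)
    return cleaned

print(clean_text("The cat sat on 2 mats.  It is   happy!"))
```

The order of the steps matters here: sentences are split before punctuation is stripped, because the sentence boundaries are detected from the punctuation itself.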
I will demonstrate how we can achieve this using the NLTK library in Python along with regular expressions. We can install the NLTK library…