Automated NLP Pre-Processing and EDA using Data-Purifier Library

Abhishek Manilal Gupta
3 min read · Oct 3, 2021

Automated NLP Preprocessing

Natural Language Processing (NLP) is a branch of Data Science that deals with text data. Alongside numerical data, text data is available to a great extent and is widely used to analyze and solve business problems. But before using the data for analysis or prediction, it must be processed.

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

To prepare text data for model building, we perform text preprocessing. It is the very first step of an NLP project. Some common preprocessing steps are:

  • Removing punctuation like . , ! $ ( ) * % @
  • Removing URLs
  • Removing Stop words
  • Lower casing
  • Tokenization
  • Stemming
  • Lemmatization and so on
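To make these steps concrete, here is a minimal sketch of doing a few of them by hand with Python's standard library. The stop-word list is a tiny illustrative subset (real pipelines use NLTK's or spaCy's full lists), and stemming/lemmatization are omitted since they need such a library:

```python
import re
import string

# Tiny illustrative stop-word list; real pipelines typically use
# NLTK's or spaCy's much larger lists.
STOP_WORDS = {"a", "an", "the", "is", "at", "on", "and", "to", "of"}

def preprocess(text):
    """Lower-case, strip URLs and punctuation, tokenize, drop stop words."""
    text = text.lower()                                    # lower casing
    text = re.sub(r"https?://\S+", "", text)               # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]      # remove stop words

print(preprocess("Check THIS out: https://example.com is the best site!"))
# → ['check', 'this', 'out', 'best', 'site']
```

Writing and maintaining such code for every project is exactly the repetition Data-Purifier aims to remove.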

So we have come up with an automated data-preprocessing library for NLP applications, named “Data-Purifier”. It can perform all of these preprocessing tasks automatically.

Data-Purifier

A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.

Installation

To use Data-purifier, it’s recommended to create a new environment:

conda create -n <your_env_name> python=3.6 anaconda
conda activate <your_env_name>

Install the required dependencies:

pip install data-purifier
python -m spacy download en_core_web_sm

Performing Automated NLP Preprocessing using Data-Purifier library

Example Code Snippet

https://gist.github.com/Elysian01/d0b7e8be4408fe2c0ecb6179217e8dec

Automated NLP Exploratory Data Analysis (EDA)

We can perform two types of exploratory data analysis:

  1. Basic EDA Analysis
  2. Word Analysis

Basic EDA Analysis

It checks for null rows and drops them (if any), then performs the following analyses row by row and returns a dataframe containing them:

  1. Word Count
  2. Character Count
  3. Average Word Length
  4. Stop Word Count
  5. Uppercase Word Count

Later, you can also observe the distribution of each of the above metrics by selecting the column from a drop-down list, and the system will automatically plot it.
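For intuition, the five per-row metrics above can be sketched in plain pandas. This is an illustrative reimplementation, not Data-Purifier's internals, and the column names and stop-word list are assumptions for the demo:

```python
import pandas as pd

STOP_WORDS = {"a", "an", "the", "is", "and", "to", "of"}  # demo subset

df = pd.DataFrame({"tweets": ["The API is GREAT", "to be or not to be"]})
words = df["tweets"].str.split()  # Series of token lists

df["word_count"] = words.str.len()
df["char_count"] = df["tweets"].str.len()
df["avg_word_length"] = words.apply(lambda ws: sum(len(w) for w in ws) / len(ws))
df["stop_word_count"] = words.apply(lambda ws: sum(w.lower() in STOP_WORDS for w in ws))
df["uppercase_word_count"] = words.apply(lambda ws: sum(w.isupper() for w in ws))

print(df[["word_count", "char_count", "avg_word_length", "stop_word_count"]])
```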

  • It can also perform sentiment analysis on the dataframe row by row, giving the polarity of each sentence (or row); later, you can also view the distribution of polarity.
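Polarity is a score in [-1, 1] where negative values mean negative sentiment. The idea can be illustrated with a toy lexicon-based scorer (the word lists are illustrative only; real libraries such as TextBlob use far richer lexicons, and this is not Data-Purifier's actual method):

```python
# Toy lexicon-based polarity scorer: -1 = fully negative, +1 = fully positive.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def polarity(text):
    """Average of +1 per positive word and -1 per negative word."""
    tokens = text.lower().split()
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens) if tokens else 0.0

print(polarity("I love this great library"))  # positive score
print(polarity("this is awful"))              # negative score
```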

Word Analysis

  • Finds the count of a specific word entered by the user in a textbox.
  • Plots a word cloud.
  • Performs unigram, bigram, and trigram analysis, returning a dataframe for each and showing its respective distribution plot.
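The word-count and n-gram parts of this analysis can be sketched with the standard library (Data-Purifier's own output format may differ):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox jumps over the lazy dog".split()

print(tokens.count("the"))  # count of a specific word → 2

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(unigram_counts.most_common(1))  # → [(('the',), 2)]
print(len(bigram_counts))             # 8 distinct bigrams
```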

Code Implementation

For automated EDA and automated data cleaning of an NLP dataset, load the dataset and pass the dataframe along with the target column containing the textual data.

import pandas as pd

nlp_df = pd.read_csv("./datasets/twitter16m.csv", header=None, encoding='latin-1')
nlp_df.columns = ["tweets", "sentiment"]

Basic Analysis

For basic EDA, pass basic as the analyse argument to the constructor:

from datapurifier import Nlpeda

eda = Nlpeda(nlp_df, "tweets", analyse="basic")
eda.df

Word Analysis

For word-based EDA, pass word as the analyse argument to the constructor:

eda = Nlpeda(nlp_df, "tweets", analyse="word")
eda.unigram_df # view the unigram dataframe

Check out the example Colab notebook for examples and implementation details.

Video Tutorial
