Automated NLP Pre-Processing and EDA using Data-Purifier Library
Automated NLP Pre-Processing
Natural Language Processing (NLP) is a branch of Data Science that deals with text data. Apart from numerical data, text data is available in abundance and is widely used to analyze and solve business problems. But before using the data for analysis or prediction, it must be processed.
Text preprocessing is traditionally an important step in natural language processing (NLP) tasks: it transforms text into a more digestible form so that machine learning algorithms can perform better. It is the very first step of an NLP project, performed to prepare the text data for model building. Some common preprocessing steps are:
- Removing punctuation marks like . , ! $ ( ) * % @
- Removing URLs
- Removing Stop words
- Lower casing
- Tokenization
- Stemming
- Lemmatization and so on
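As an illustration, the steps above can be sketched with plain Python. This is a minimal sketch: the stop-word list is a tiny sample, and stemming/lemmatization are omitted since they typically require a library such as NLTK or spaCy:

```python
import re

# Tiny sample stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "at", "on", "for", "and"}

def preprocess(text: str) -> list:
    text = text.lower()                        # lower casing
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    tokens = text.split()                      # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess("Check THIS out: https://example.com, it is GREAT!"))
# ['check', 'this', 'out', 'it', 'great']
```

Writing and maintaining such steps by hand for every project is repetitive, which is exactly what an automated library can take over.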
So we have come up with an automated data-preprocessing library for NLP applications, named “Data-Purifier”, which can perform all of these preprocessing tasks automatically.
Data-Purifier
A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.
Installation
To use Data-purifier, it’s recommended to create a new environment:
conda create -n <your_env_name> python=3.6 anaconda
conda activate <your_env_name> # ON WINDOWS
Install the required dependencies:
pip install data-purifier
python -m spacy download en_core_web_sm
Performing Automated NLP Preprocessing using Data-Purifier library
Example Code Snippet
A full worked example is given in the Code Implementation section below.
Automated NLP Exploratory Data Analysis (EDA)
We can perform two types of exploratory data analysis:
- Basic EDA Analysis
- Word Analysis
Basic EDA Analysis
It checks for null rows and drops them (if any), then performs the following analyses row by row and returns a dataframe containing them:
- Word Count
- Character Count
- Average Word Length
- Stop Word Count
- Uppercase Word Count
Later, you can also observe the distribution of each of the above metrics simply by selecting the column from the drop-down list, and the library will automatically plot it.
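The same statistics can be computed by hand with pandas. The snippet below is an illustrative re-implementation; the column names and the tiny stop-word list are assumptions for the example, not Nlpeda's actual internals:

```python
import pandas as pd

# Tiny sample stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "in", "of"}

df = pd.DataFrame({"tweets": ["The weather is GREAT today",
                              "NLP is a branch of AI"]})

words = df["tweets"].str.split()  # list of tokens per row
df["word_count"] = words.apply(len)
df["char_count"] = df["tweets"].str.len()
df["avg_word_length"] = words.apply(lambda ws: sum(len(w) for w in ws) / len(ws))
df["stopword_count"] = words.apply(lambda ws: sum(w.lower() in STOP_WORDS for w in ws))
df["uppercase_word_count"] = words.apply(lambda ws: sum(w.isupper() for w in ws))

print(df[["word_count", "char_count", "stopword_count", "uppercase_word_count"]])
```

Having the metrics as ordinary dataframe columns is what makes the drop-down plotting straightforward: each column can be passed directly to a histogram.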
- It can also perform sentiment analysis on the dataframe row by row, giving the polarity of each sentence (or row); later you can also view the distribution of polarity.
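The library's own sentiment method is not shown here; as a toy illustration only, polarity can be scored against a small word lexicon (real sentiment tools rely on far richer lexicons or trained models):

```python
# Toy lexicons, purely for illustration.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}

def polarity(sentence: str) -> float:
    """Score in [-1, 1]: (positive hits - negative hits) / total tokens."""
    tokens = sentence.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return score / len(tokens) if tokens else 0.0

print(polarity("I love this great library"))  # 0.4
print(polarity("this is bad"))                # roughly -0.33
```

Applying such a function row by row yields a polarity column whose distribution can then be plotted, which is the idea behind the feature described above.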
Word Analysis
- Finds the count of a specific word, entered by the user in the textbox.
- Plots a wordcloud plot.
- Performs unigram, bigram, and trigram analysis, returning a dataframe for each and showing its respective distribution plot.
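The unigram/bigram/trigram counting behind these dataframes can be sketched with `collections.Counter`. This is an illustrative re-implementation assuming simple whitespace tokenization, not the library's actual code:

```python
from collections import Counter

def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip n shifted views of the token list to form n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

tweets = ["data purifier is great", "data purifier automates eda"]
print(ngram_counts(tweets, 1).most_common(2))  # top unigrams
print(ngram_counts(tweets, 2).most_common(2))  # top bigrams
```

Sorting such counts and wrapping them in a dataframe gives exactly the kind of per-n-gram table and frequency plot the word analysis produces.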
Code Implementation
For automated EDA and automated data cleaning of an NLP dataset, load the dataset and pass the dataframe along with the target column containing the textual data.
import pandas as pd

nlp_df = pd.read_csv("./datasets/twitter16m.csv", header=None, encoding='latin-1')
nlp_df.columns = ["tweets", "sentiment"]
Basic Analysis
For basic EDA, pass analyse="basic" as an argument to the constructor:
from datapurifier import Nlpeda  # Nlpeda is provided by the data-purifier package

eda = Nlpeda(nlp_df, "tweets", analyse="basic")
eda.df  # dataframe containing the computed analyses
Word Analysis
For word-based EDA, pass analyse="word" as an argument to the constructor:
eda = Nlpeda(nlp_df, "tweets", analyse="word")
eda.unigram_df  # view the unigram dataframe
Check out the example Colab notebook for examples and implementation details.