Data Cleaning in Natural Language Processing

Anshuman Ranjan
Published in Analytics Vidhya · 6 min read · Mar 7, 2020

This post will go through the basics of NLP data processing. We will cover the most popular libraries used for data cleaning in the NLP space and provide code you can reuse in your own projects.

For this post I am going to use the Google News dataset (Download), which is a CSV file with two columns:

  • publish_date
  • headline_text

There are various applications for this data, such as:

  • Creating word2vec models (Google's word2vec model was trained on 100 billion words) [LINK]
  • Creating trading systems that classify good/bad sentiment

What all these applications have in common is data cleaning. Raw text, whether it forms a proper sentence or not, usually contains unwanted components such as nulls, HTML tags, links/URLs, emojis, and stopwords. In this blog I am going to go through the most commonly used data-processing steps for cleaning natural-language text.

Fixing null values

The most common and first thing you should check for is null values; there are various tools for doing that.
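Throughout the post I will assume the dataset has already been loaded into a pandas DataFrame named train. A minimal setup sketch (the file name news_headlines.csv is a placeholder; use whatever name your download has):

import pandas as pd

# Load the headlines CSV into a DataFrame (file name is a placeholder)
train = pd.read_csv("news_headlines.csv")  # columns: publish_date, headline_text
print(train.shape)
print(train.head())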

First we start with a basic overview of how many null values we are dealing with. This can be found by running:

train.isnull().sum()
(Note: the actual dataset used in this blog does not have any nulls; this is only for demonstration.)

The above gives you an overview of how many nulls you are dealing with in each column of the pandas DataFrame.

The next step is either to define a value to replace the nulls with, or to remove the null rows from the dataset. To remove them and create a new DataFrame of non-null values you can use the code below:

train[pd.notnull(train["headline_text"])]
This provides you with a DataFrame without null headline_text values.

You can also decide to replace the null values with something that makes sense for your algorithm. For instance, in my case I want every headline_text with no value to contain the text "IGNORE TEXT". I can do that using the .fillna() method in pandas:

train['headline_text'] = train['headline_text'].fillna("IGNORE TEXT")

Another popular tool in the pandas library is .dropna(), which is very useful for handling null/NaN/NaT values. It is very customizable through its arguments:

train.dropna(axis=0, how="any", thresh=None, subset=None, inplace=False).shape
  • axis=0/1 : 0 means drop rows and 1 means drop columns
  • how='all'/'any' : 'all' drops a row/column only if every value is null, 'any' drops it if even a single value is null
  • thresh : the minimum number of non-null values required to keep a row/column
  • subset : which labels along the other axis to check (e.g., which columns to consider when dropping rows)
Notice how two rows get dropped from the DataFrame by dropna() because they contain null values; the small example below illustrates how the arguments behave.
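To make the argument behaviour concrete, here is a small self-contained illustration on toy data (not the headlines dataset):

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "publish_date": [20200301, 20200302, np.nan],
    "headline_text": ["first headline", np.nan, np.nan],
})
print(demo.dropna(how="any").shape)  # (1, 2) - drops every row containing a null
print(demo.dropna(how="all").shape)  # (2, 2) - drops only the row where every value is null
print(demo.dropna(thresh=1).shape)   # (2, 2) - keeps rows with at least 1 non-null value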

Removing URLs

import re

def remove_URL(headline_text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', headline_text)

Use the above function to remove any URL (starting with https://, http://, or www.) from the text.
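As with the other cleaning functions in this post, you can apply it to the whole column with pandas (a minimal sketch, assuming the DataFrame is named train):

train['headline_text'] = train['headline_text'].apply(remove_URL)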

Notice how the URL is removed after using .apply().

Removing HTML tags

You can use the function below to remove HTML tags using a regex:

def remove_html(headline_text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', headline_text)

train['headline_text'] = train['headline_text'].apply(remove_html)
Notice how the HTML tags were removed from the text

Removing Pictures/Tags/Symbols/Emojis

When dealing with real-world free text, you will often find that it contains a lot of smileys, emojis, pictures, etc., depending on the platform your dataset comes from. These require a function that can filter out such special character sequences.

def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing & misc symbols
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', data)

train['headline_text'] = train['headline_text'].apply(remove_emojis)

Removing Punctuation

When cleaning English text, punctuation often occurs as part of free text and usually does not add value to your model. It can be removed from the dataset using the function below:

import string

def remove_punct(headline_text):
    table = str.maketrans('', '', string.punctuation)
    return headline_text.translate(table)

train['headline_text'] = train['headline_text'].apply(remove_punct)
Notice how () are removed from the text
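As a quick sanity check, you can also call the function on a single string (illustrative example, not from the dataset):

print(remove_punct("police probe (again) into market crash!"))
# -> police probe again into market crash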

Text Tokenizing

The next useful step in most use cases is to split the text into tokens. There are usually multiple options; below we use one of the most popular libraries, nltk, which stands for the Natural Language Toolkit.

import nltk.data

## Load language-specific .pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

## Different types of tokenizers
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import word_tokenize

## Sample initialization of a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

## Define normalization (pick one)
normalization = None
normalization = 'stemmer'
normalization = 'lemmatizer'

## Define vectorizer (pick one)
vectorizer = 'countvectorizer'
vectorizer = 'tfidfvectorizer'
Notice how the sentence gets converted to text tokens
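To actually tokenize the headlines and store the result, you can apply the tokenizer column-wise. A minimal sketch (storing the tokens in a new text column is an assumption, chosen to match the normalization code below):

# Tokenize each headline into a list of word tokens
train['text'] = train['headline_text'].apply(tokenizer.tokenize)
print(train['text'].head())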

Normalization

A common preprocessing step when cleaning social media text is normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words "gooood" and "gud" can be transformed to "good", their canonical form.

Ref - I

Example :

2moro,2mrrw,tomrw → tomorrow

b4 → before

In Python this can be done using the nltk library:

import nltk

def stem_tokens(tokens):
    stemmer = nltk.stem.PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

def lemmatize_tokens(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

def normalize_tokens(normalization):
    if normalization is not None:
        if normalization == 'stemmer':
            train['text'] = train['text'].apply(stem_tokens)
        elif normalization == 'lemmatizer':
            train['text'] = train['text'].apply(lemmatize_tokens)

normalize_tokens(normalization)
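To see the difference between the two options on a few tokens, here is a quick illustrative check (the WordNet lemmatizer needs the wordnet corpus, which you may have to download first):

# nltk.download('wordnet')  # uncomment if the corpus is not installed yet
sample = ['studies', 'running', 'better']
print(stem_tokens(sample))       # ['studi', 'run', 'better']  (Porter stemmer chops suffixes)
print(lemmatize_tokens(sample))  # ['study', 'running', 'better']  (noun lemmas by default)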

Stopwords

In English you usually need to remove all the unnecessary stopwords. The nltk library contains a list of stopwords that can be used to filter them out of a text. The list can be seen with the code below:

import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))

{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’}

def remove_stopwords(text):
    words = [w for w in text if w not in stopwords.words('english')]
    return words

train['headline_text'] = train['headline_text'].apply(remove_stopwords)

Example data input :

Input : ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 
'off', 'the', 'stop', 'words', 'filtration', '.']
Output : ['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
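One practical note: stopwords.words('english') rebuilds the list on every call, so for larger datasets it is faster to precompute it as a set once. A minimal variation on the function above (you may also need nltk.download('stopwords') the first time the corpus is used):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # build the lookup set once

def remove_stopwords_fast(tokens):
    # keep only tokens that are not stopwords (lower-cased for matching)
    return [w for w in tokens if w.lower() not in STOPWORDS]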

Vectorizing your text

The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.

Based on your use case, you will need to select the n-gram range to create the feature vector.
## Default CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

## CountVectorizer (with ngram_range=(2, 2))
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())
print(X2.toarray())

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Usage in real life :

# Vectorization function
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(vectorizer):
    if vectorizer == 'countvectorizer':
        print('countvectorizer')
        vectorizer = CountVectorizer()
        train_vectors = vectorizer.fit_transform(train['text'])
        test_vectors = vectorizer.transform(test['text'])
    elif vectorizer == 'tfidfvectorizer':
        print('tfidfvectorizer')
        vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
        train_vectors = vectorizer.fit_transform(train['text'])
        test_vectors = vectorizer.transform(test['text'])
    return train_vectors, test_vectors

train_vectors, test_vectors = vectorize(vectorizer)

Hope the above compilation helps a beginner take their first steps into NLP.
