Data Cleaning in Natural Language Processing
This post goes through the basics of NLP data processing. We will cover the most popular libraries used for data cleaning in the NLP space and provide code you can reuse in your own projects.
For this post I am going to use the Google News dataset (Download), a CSV file with two columns:
- publish_date
- headline_text
There are various applications for this data, such as:
- Creating word2vec models (Google trained its word2vec model on 100 billion words) [LINK]
- Creating trading systems that classify good/bad sentiment
What all these applications have in common is data cleaning. Raw text, whether from well-formed or badly formed sentences, is rarely usable as-is, because it contains many unwanted components: nulls, HTML tags, links/URLs, emojis, stopwords, etc. In this post I will go through the most common processing steps used to clean natural-language text.
Fixing null values
The first and most common thing you should check for is null values; there are various tools for doing that.
We start with a basic overview of how many null values we are dealing with, which you can get by running:
train.isnull().sum()
The above gives you an overview of how many nulls you are dealing with in each column of the pandas DataFrame.
The next step is either to replace the null values or to remove them. To remove them and create a new DataFrame of non-null rows, you can use the code below:
train = train[pd.notnull(train["headline_text"])]
You can also replace the null values with something that makes sense for your algorithm. For instance, in my case I want headline_text to read “IGNORE TEXT” wherever there is no value. I can do that using the .fillna() method in pandas:
train["headline_text"] = train["headline_text"].fillna("IGNORE TEXT")
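As a quick sanity check, here is the replacement on a toy two-row frame (the data is made up):

```python
import pandas as pd

# Toy frame standing in for the dataset; the values are made up.
train = pd.DataFrame({"headline_text": ["markets rally", None]})

# fillna returns a new Series, so assign it back to keep the result.
train["headline_text"] = train["headline_text"].fillna("IGNORE TEXT")
print(train["headline_text"].tolist())  # ['markets rally', 'IGNORE TEXT']
```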
Another popular tool in the pandas library is .dropna(), which is very useful for Null/NaN/NaT values. It is very customizable through its arguments:
train.dropna(axis=0, how="any", thresh=None, subset=None, inplace=False).shape
- axis=0/1: 0 drops rows, 1 drops columns
- how='all'/'any': 'all' drops only if every value is null, 'any' drops if even a single value is null
- thresh: keep only rows/columns with at least this many non-null values
- subset: labels along the other axis to consider, e.g. which columns to scan when dropping rows
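A minimal sketch of how these arguments interact, on a toy frame (the values are made up):

```python
import pandas as pd

# Toy frame: row 0 is complete, row 1 has one null, row 2 is all nulls.
df = pd.DataFrame({
    "publish_date": ["20230101", "20230102", None],
    "headline_text": ["markets rally", None, None],
})

print(df.dropna(how="any").shape)   # (1, 2) - any null drops the row
print(df.dropna(how="all").shape)   # (2, 2) - only the all-null row goes
print(df.dropna(thresh=2).shape)    # (1, 2) - keep rows with >= 2 non-nulls
print(df.dropna(subset=["headline_text"]).shape)  # (1, 2) - scan one column
```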
Removing URLs
import re

def remove_URL(headline_text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', headline_text)
Use the above function to remove any URL (starting with https://, http://, or www.) from the text.
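A quick check of the pattern on a made-up string (the function is repeated here so the snippet runs on its own):

```python
import re

def remove_URL(headline_text):
    # Matches http(s)://... or www.... up to the next whitespace.
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', headline_text)

print(remove_URL("full story at https://example.com/abc and www.example.org now"))
# 'full story at  and  now'
```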
Removing HTML tags
You can use the function below to remove HTML tags using a regex:
def remove_html(headline_text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', headline_text)

train['headline_text'] = train['headline_text'].apply(remove_html)
Removing Pictures/Tags/Symbols/Emojis
When dealing with real-world free text you will often find it contains lots of smileys, emojis, pictures, etc., depending on the platform your dataset comes from. These require a function that can filter out such special character sequences.
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing & misc symbols
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', data)

train['headline_text'] = train['headline_text'].apply(remove_emojis)
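To see the effect without the full range list, here is a compact sketch covering just a few of the ranges above:

```python
import re

# Compact variant: only a few of the ranges from the full function above.
emoj = re.compile("[\U0001F600-\U0001F64F\u2600-\u2B55\ufe0f]+", flags=re.UNICODE)

print(emoj.sub('', "good morning \u2600\ufe0f team \U0001F600"))
# 'good morning  team '
```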
Removing Punctuation
In English free text, punctuation occurs frequently but usually adds no value for your model. It can be removed from the dataset using the function below:
import string

def remove_punct(headline_text):
    table = str.maketrans('', '', string.punctuation)
    return headline_text.translate(table)

train['headline_text'] = train['headline_text'].apply(remove_punct)
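str.maketrans with three arguments builds a translation table that deletes every character in the third argument; a quick check on a made-up headline:

```python
import string

# Table that deletes all ASCII punctuation characters.
table = str.maketrans('', '', string.punctuation)

print("Markets rally, again - stocks up 2%!".translate(table))
# 'Markets rally again  stocks up 2'
```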
Text Tokenizing
The next useful step in most use cases is to split the sentence into tokens. There are multiple options; below we use one of the most popular libraries, NLTK, which stands for Natural Language Toolkit.
import nltk.data

## Load a language-specific .pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

## Different types of tokenizers
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import word_tokenize

## Sample initialization of a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

## Define normalization (pick one)
normalization = 'lemmatizer'  # or 'stemmer', or None

## Define vectorizer (pick one)
vectorizer = 'tfidfvectorizer'  # or 'countvectorizer'
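RegexpTokenizer(r'\w+') behaves like re.findall with the same pattern, which makes for an easy dependency-free sketch of what the tokenizer produces:

```python
import re

def tokenize(text):
    # Equivalent in spirit to nltk's RegexpTokenizer(r'\w+').tokenize(text):
    # keep runs of word characters, dropping punctuation and whitespace.
    return re.findall(r'\w+', text)

print(tokenize("Rain expected, stay home!"))
# ['Rain', 'expected', 'stay', 'home']
```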
Normalization
A common preprocessing step when cleaning social media text is normalization. Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form.
Example :
2moro,2mrrw,tomrw → tomorrow
b4 → before
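Slang like “b4” usually needs a lookup table rather than a stemmer. A minimal sketch, with a hand-built (hypothetical) mapping for the examples above:

```python
# Hand-built slang map for the examples above; a real one would be far larger.
SLANG_MAP = {"2moro": "tomorrow", "2mrrw": "tomorrow", "tomrw": "tomorrow", "b4": "before"}

def normalize_slang(tokens):
    # Replace a token if its lowercase form is in the map, else keep it.
    return [SLANG_MAP.get(t.lower(), t) for t in tokens]

print(normalize_slang(["c", "u", "2moro", "b4", "noon"]))
# ['c', 'u', 'tomorrow', 'before', 'noon']
```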
In Python, the related steps of stemming and lemmatization can be done using the NLTK library:
def stem_tokens(tokens):
    stemmer = nltk.stem.PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

def lemmatize_tokens(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

def normalize_tokens(normalization):
    if normalization is not None:
        if normalization == 'stemmer':
            train['text'] = train['text'].apply(stem_tokens)
        elif normalization == 'lemmatizer':
            train['text'] = train['text'].apply(lemmatize_tokens)

normalize_tokens(normalization)
Stopwords
In English you would usually want to remove all the unnecessary stopwords. The NLTK library contains a list of stopwords that can be used to filter them out of a text. The list can be printed with the code below:
import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))
{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’}
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = [w for w in text if w not in stop_words]
    return words

train['headline_text'] = train['headline_text'].apply(remove_stopwords)
Example data input :
Input : ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
'off', 'the', 'stop', 'words', 'filtration', '.']
Output : ['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
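Note that 'This' survives in the output above because the NLTK stopword list is all lowercase. A common fix is to lowercase each token before the membership test; sketched here with a hard-coded stopword subset so the snippet runs without NLTK downloads:

```python
# Small hard-coded subset of the NLTK English stopword list, for illustration.
STOP_WORDS = {"this", "is", "a", "off", "the"}

def remove_stopwords(tokens):
    # Lowercase only for the comparison; the kept tokens stay unchanged.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stopwords(["This", "is", "a", "sample", "sentence"]))
# ['sample', 'sentence']
```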
Vectorizing your text
The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
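Before turning to the sklearn API, here is a dependency-free sketch of what count vectorization produces: each document becomes a vector of token counts over a shared, fixed-order vocabulary.

```python
from collections import Counter

docs = ["this is the first document", "this document is the second document"]

# Shared vocabulary across all documents, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})

# One count vector per document, aligned to the vocabulary order.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)    # ['document', 'first', 'is', 'second', 'the', 'this']
print(vectors)  # [[1, 1, 1, 0, 1, 1], [2, 0, 1, 1, 1, 1]]
```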
## Default CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## CountVectorizer with ngram_range=(2, 2)
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage in real life :
# Vectorization in a function
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def vectorize(vectorizer):
    if vectorizer == 'countvectorizer':
        print('countvectorizer')
        vectorizer = CountVectorizer()
        train_vectors = vectorizer.fit_transform(train['text'])
        test_vectors = vectorizer.transform(test['text'])
    elif vectorizer == 'tfidfvectorizer':
        print('tfidfvectorizer')
        vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
        train_vectors = vectorizer.fit_transform(train['text'])
        test_vectors = vectorizer.transform(test['text'])
    return train_vectors, test_vectors

train_vectors, test_vectors = vectorize(vectorizer)
I hope this compilation helps beginners take their first steps into NLP.