Cleaning Text Data for Improved NLP Results
A little preprocessing of text data can go a long way toward improving your model's performance. Here's how to easily clean up your text data for better results.
The goal of this article will be to remove unnecessary words and items from a corpus of text data to help boost the performance of a machine learning model. The items that will be removed are:
- Numbers
- Capitalization
- Punctuation
- English Stop Words
Stop words are words that don’t carry much information and only give context to the surrounding words. These include words such as ‘of’, ‘the’, and ‘then’. It is often useful to remove these types of words when training models to reduce the amount of ‘noise’ your program is receiving.
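As a quick illustration, here is a minimal sketch of stop word filtering using sklearn’s built-in English stop word list (the same list the vectorizer later in this article relies on); the example sentence is made up for demonstration:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sentence = 'the mayor of the city announced the plan'
# keep only the words that do not appear in the built-in English stop word list
content_words = [word for word in sentence.split() if word not in ENGLISH_STOP_WORDS]
print(content_words)
>> ['mayor', 'city', 'announced', 'plan']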
The data we will be using is the Liar Dataset, a set of political statements collected and annotated by researchers at Cornell University.
The data set has been saved as a pandas Series of strings, where each string is a unique entry or ‘document’. Here is a sample of the data we will be working with:
print(text[0:2])
>> ['Austin is a city that has basically doubled in size every 25 years or so since it was founded.',
'During the recession, the consumer in his native perversity has begun to save. The savings rate is now 6.2 percent.']
First, we are going to replace each number with the character ‘N’. This way, the model does not treat every new number (of which there are potentially infinitely many) as a unique word for which it has no previous examples.
import re
# text is a pandas Series of documents
text = (text.apply(lambda x: re.sub(r'\d+', 'N', x))    # replace every run of digits with 'N'
            .apply(lambda x: re.sub(r'N,N', 'N', x))    # collapse thousands separators, e.g. '1,000' -> 'N'
            .apply(lambda x: re.sub(r'N\.N', 'N', x))   # collapse decimals, e.g. '6.2' -> 'N'
       )
print(text[0:2])
>> ['Austin is a city that has basically doubled in size every N years or so since it was founded.',
'During the recession, the consumer in his native perversity has begun to save. The savings rate is now N percent.']
Each apply call runs re.sub over every document in the corpus: the first replaces each run of digits with the letter N, and the next two collapse patterns such as ‘N,N’ (thousands separators) and ‘N.N’ (decimals) into a single ‘N’.
Next, we will use the sklearn.feature_extraction module to convert the documents into a format that our model can interpret. The method we will be using is the CountVectorizer. It converts each document into a row vector and maps each word to a unique column, so each entry in the vector is the number of times that word appears in the document. For illustration:
corpus = ['I have a green apple', 'I eat a red apple']
Would be represented as the following matrix:
   |'I'|'have'|'eat'|'a'|'green'|'red'|'apple'|
d1 | 1 |  1   |  0  | 1 |   1   |  0  |   1   |
d2 | 1 |  0   |  1  | 1 |   0   |  1  |   1   |
Each word has its own unique column and each document is its own unique row.
The advantage of using the vectorizer is that it creates a machine-readable array for training a model; the drawback is that it does not retain the order of words in a sentence.
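To see this in code, here is a small sketch of the toy corpus above passed through CountVectorizer. By default the vectorizer drops single-character tokens such as ‘I’ and ‘a’, so this sketch passes a custom token_pattern purely so the result lines up with the table; note that the columns come out in alphabetical order rather than the order shown above:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['I have a green apple', 'I eat a red apple']
# the custom token_pattern keeps one-letter words that the default pattern would drop
toy_vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
toy_matrix = toy_vect.fit_transform(corpus)
print(toy_vect.get_feature_names_out())  # get_feature_names() on older sklearn versions
print(toy_matrix.toarray())
>> ['a' 'apple' 'eat' 'green' 'have' 'i' 'red']
>> [[1 1 0 1 1 1 0]
>>  [1 1 1 0 0 1 1]]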
Using sklearn to convert our full corpus to a vector format:
from sklearn.feature_extraction.text import CountVectorizer

top_num_of_words = 5000
vect = CountVectorizer(lowercase=True,
                       max_features=top_num_of_words,
                       stop_words='english')
vect.fit(text)
text_vector = vect.transform(text)
We have now converted our corpus to a sparse array of shape [number_of_documents, top_num_of_words], where we have chosen to retain only the 5000 most frequently used words, not including English stop words. The complete list of stop words we have omitted is sklearn’s built-in English stop word list (sklearn.feature_extraction.text.ENGLISH_STOP_WORDS).
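As a quick sanity check (a small sketch assuming the vect and text_vector objects defined above), you can inspect the shape of the sparse array and a few of the retained vocabulary terms:
# shape is (number_of_documents, top_num_of_words)
print(text_vector.shape)
# a sample of the 5000 retained words (use vect.get_feature_names() on older sklearn versions)
print(vect.get_feature_names_out()[:10])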
For datasets that are not very large, it is a good strategy to keep only the top most common words. This helps prevent overfitting, since you avoid training a model on words that have been seen only a handful of times during training.
For more advanced feature extraction, try using TF-IDF vectors or word vector representations next.
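For example, sklearn’s TfidfVectorizer is a near drop-in replacement for CountVectorizer; a minimal sketch reusing the settings from above would look like this:
from sklearn.feature_extraction.text import TfidfVectorizer
# same interface as CountVectorizer, but each count is weighted down the more documents the word appears in
tfidf_vect = TfidfVectorizer(lowercase=True,
                             max_features=top_num_of_words,
                             stop_words='english')
text_tfidf = tfidf_vect.fit_transform(text)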
Thanks for the read. :)