Cleaning Data for Machine Learning

Ravish Chawla
Published in ML 2 Vec · 11 min read · Nov 9, 2017

One of the first things that most data engineers have to do before training a model is to clean their data. This is an extremely important step, and depending on the type of data you are using, it relies on different techniques to make the data “trainable”.

For this post, the dataset I will use is called Amazon’s Reviews dataset, available publicly on Kaggle. It is an entirely text dataset of item reviews that users have submitted on Amazon.com, with a corresponding star rating for each review. My current project is to train a Deep Learning Convolutional Network on the data, but before I started training, I spent a significant amount of time cleaning the data so that I could feed it into the network. Since the cleaning process was just as long as the training, it deserved its own post.

I will start off by cleaning the data from scratch, through different steps such as removing stop words and punctuation and expanding contractions, and end by vectorizing the data into word vectors that can be fed into a model.

Loading the Training and Test data

Let’s start by importing the most important libraries. We will primarily use Pandas and Numpy, some modules from NLTK, and re for regular expressions:

import pandas as pd;
import numpy as np;
import re;
import time;
from nltk.corpus import stopwords;
import json;
import sys;
import pickle;
import gensim;
import matplotlib.pyplot as plt;
import seaborn as sns;
from sklearn import preprocessing;
from IPython.display import clear_output;
#set the option for max_colwidth to better visualize dataframes
pd.set_option('display.max_colwidth', -1)

Download the data, load it into a Pandas Dataframe, and set the columns:

data_location = 'input/amazon_review_full_csv/';
data_train = pd.read_csv(data_location + 'train.csv', header=None);
data_test = pd.read_csv(data_location + 'test.csv', header=None);

data_train.columns = ['rating', 'subject', 'review'];
data_test.columns = ['rating', 'subject', 'review'];
print(len(data_train));

3000000

That is a lot of rows, so to make processing easier, we can use a smaller subset of rows to verify that the pipeline works without errors, and once it does, process the rest of the dataset.

To start off, we need to get a subset of random indices. The sampled data will be divided into the following ratio: 75% TRAIN, 20% DEV, 5% TEST.

np.random.seed(1024);

#We'll use ~6% of the dataset for now
total_samples = int(0.06 * len(data_train))
rand_indices = np.random.choice(len(data_train), total_samples, replace=False);
train_split_index = int(0.75 * total_samples);
dev_split_index = int(0.95 * total_samples);

data_sample_train = data_train.iloc[rand_indices[:train_split_index]];
data_sample_dev = data_train.iloc[rand_indices[train_split_index:dev_split_index]];
data_sample_test = data_train.iloc[rand_indices[dev_split_index:]];
sample_ratio = len(data_train) / len(data_sample_train);

print("Amount of data being trained on: " + str(100.0 / sample_ratio) + '%')

Amount of data being trained on: 4.5%

Processing Data Labels

The next step is cleaning. We will go through the different steps individually, and at the end combine them into a single function. Because we are only writing the functions to build the pipeline, I will use a small subset of the data sample to verify them.

Let’s use the first 100 rows in this case.

data_subsample = data_sample_train.iloc[0:100];

First step: We will start by removing all 3-Star (Neutral) ratings. These rows will not contribute any information to our training because we are only learning positive and negative sentiment. We can remove them with a simple filter:

data_sub_filtered = data_subsample[data_subsample.rating != 3];
rows_removed = len(data_subsample) - len(data_sub_filtered);
print('Removing ' + str(100.0 * rows_removed / len(data_subsample)) + '% of rows');

Removing 22.0% of rows

Second step: Binarize the labels. Instead of using the raw star rating as the label, we will use a binary label.

For my model, I decided on setting ratings with {1, 2} to 0, and {4, 5} to 1. You can also use a One-Hot binary system if you want to predict all 4 star ratings, but that is not completely necessary.

data_sub_filtered.loc[data_sub_filtered.rating <= 2, 'rating'] = 0;
data_sub_filtered.loc[data_sub_filtered.rating >= 4, 'rating'] = 1;
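If you instead wanted to predict all four remaining star classes, a minimal one-hot sketch (not part of the original pipeline, and applied in place of the binarization above) could look like this:

#Hypothetical alternative: one-hot encode the four remaining star ratings
#(1, 2, 4, 5) instead of collapsing them into a binary label.
rating_onehot = pd.get_dummies(data_sub_filtered.rating, prefix='rating');
data_sub_onehot = pd.concat([data_sub_filtered.drop('rating', axis=1), rating_onehot], axis=1);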

Third step: Remove all NaNs. Every dataset has some rows with NaN values. We cannot train on these rows, so it is better to remove them during the cleaning process with a one-liner using the dropna function in Pandas.

data_sub_filtered = data_sub_filtered.dropna();

Now that we have cleaned up the labels, let’s look at the reviews to see what we are working with:

The first 5 rows of the Dataframe before cleaning

We can see that the labels were binarized properly, but that there are still multiple issues with the dataset. For instance, there are multiple uses of ellipses, unnecessary symbols, and inconsistent spelling. Let’s tackle those issues next.

Cleaning the data

Fourth step: Now we’ll start cleaning the actual reviews. For the first review-cleaning step, we’ll remove all URLs in the dataset. URLs are hard to identify after symbols and punctuation have been removed, so they should be filtered out first.

I used a URL regex from the following post: https://stackoverflow.com/a/6883094/1843486

url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+';

for row in data_sub_filtered.index:
    data_sub_filtered.loc[row, 'review'] = re.sub(url_regex, '', data_sub_filtered.loc[row, 'review']);

Fifth step: Next, we’ll remove all symbols such as « ! % & * @ # », and fix inconsistent casing.

To do this, we can use a regular expression that only keeps alphanumerics and apostrophes. We can then iterate over the dataset and do a regex-replace on each review, and then lowercase them.

pattern_to_find = "[^a-zA-Z0-9' ]";
pattern_to_repl = "";

for row in data_sub_filtered.index:
    data_sub_filtered.loc[row, 'review'] = re.sub(pattern_to_find, pattern_to_repl, data_sub_filtered.loc[row, 'review']).lower();

Let’s look at the dataset now:

The first 5 rows after preliminary cleaning steps

We can see that there are no more parentheses, ellipses, punctuation, or uppercase letters. This step was important because now our model won’t treat the words « Apple » and « apple » differently, or « small… » and « small ».

Sixth step: Expand all contractions. This is an important step, because it removes the ambiguity between equivalent phrases such as « I’ll » and « I will ». Even in the example above, we can see that there are uses of contractions like « you’ll » and « it’s », along with the same words in different contexts such as « you », « will », « it », and « is ».

For this step, I used the following post for reference: https://stackoverflow.com/a/19794953/1843486

The first step was placing all contractions in a JSON dictionary, which can be loaded in from a file. The contractions map entries like:

ain’t → is not
can’t’ve → cannot have
’cause → because

contractions = json.load(open('contractions.json', 'r'));
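If you do not have the contractions.json file at hand, a small inline dictionary is enough to test the pipeline; this is only a sketch with a handful of entries taken from the examples above (the real file contains many more):

#Hypothetical fallback: a tiny in-memory contractions map for testing,
#mirroring the structure of contractions.json.
contractions = {
    "ain't": "is not",
    "can't've": "cannot have",
    "'cause": "because",
    "isn't": "is not",
    "it's": "it is",
    "you'll": "you will",
};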

Expanding contractions is a little tricky, and not completely perfect. There are two types of contractions we will be looking at:

  • Those of the form « isn’t », which can be mapped directly from the JSON dictionary to obtain « is not ».
  • Those of the form « apple’s », where the contraction is part of a proper noun that is not in the dictionary. I will simply map these contractions to the form « apple is ».

#Regex expression for identifying contractions:
apos_regex = "'[a-z]+|[a-z]+'[a-z]+|[a-z]+'";

#The following function will expand a given word (if it's a contraction) based on the conditions above:
def expand(word):
    if "'" in word:
        if word in contractions:
            return contractions[word];
        if word.endswith("'s"):
            return word[:-2] + " is";
        else:
            return word;
    else:
        return word;

#Now we can iterate over each row, expanding contractions wherever they appear.
for row in data_sub_filtered.index:
    data_sub_filtered.loc[row, 'review'] = ' '.join([expand(word) for word in data_sub_filtered.loc[row, 'review'].split()]);
The first 5 rows after expanding contractions

We can see that all contractions in the above 5 rows have been properly expanded, such as the « you’ll » to « you will » in the last review at the end.

Seventh step: Remove stopwords. Stopwords refer to words that are very common in a language, and don’t contribute a lot of useful information to a model. These include words like « if, then, also, but, for ».

The NLTK package has a list of common stopwords in the English language, which we’ll use here.

eng_stopwords = set(stopwords.words("english"));

for row in data_sub_filtered.index:
    text_revi = data_sub_filtered.loc[row, 'review'].split();
    data_sub_filtered.loc[row, 'review'] = ' '.join([word for word in text_revi if word not in eng_stopwords]);
The first 5 rows after removing stop words

Now we can see that the final review is quite different from what it was originally. Let’s compare the first one directly:

Original:
This is a lousy movie and does not follow the book. As an adaption and an individual movie, it doesn't capture the world of C. S. Lewis. The acting is terrible, (especially the casting of the white witch..... I thought she was supposed to be scary) and many of the shooting styles were stolen from lord of the rings. I can only hope that "Prince Caspian (coming out in a few years) will be an improvement. However if Andrew Adamson is still directing it probably will be just as bad.

Cleaned:

lousy movie follow book adaption individual movie capture world c lewis acting terrible especially casting white witch thought supposed scary many shooting styles stolen lord rings hope prince caspian coming years improvement however andrew adamson still directing probably bad

Looking through them, we see that there is no more punctuation, uppercasing, stop words, or contractions. The final review reads very awkwardly to us, but for a model this is a better input, because words and characters that don’t provide any useful information are omitted. There are other steps that we can take from here as well, such as:

  • Removing proper nouns such as « Prince caspian » or « Lord of the Rings »
  • Removing single letter words such as « a », « c », « y »
  • Removing numbers such as « 10 », « 52 », « 1000 »

You can implement these steps based on how much you want to clean your input.
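As a rough sketch of the last two of these (dropping single-letter words and standalone numbers), one extra pass could look like the following; removing proper nouns would need part-of-speech tagging and is not shown here:

#Hypothetical extra filters: drop single-letter tokens and purely numeric tokens.
def extra_filters(sentence):
    words = [word for word in sentence.split() if len(word) > 1 and not word.isdigit()];
    return ' '.join(words);

data_sub_filtered['review'] = data_sub_filtered['review'].apply(extra_filters);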

Now that all of the different processes have been completed, it is time to combine them all into a single function.

Combining everything

The process_sentence method will process a single sentence and return a version without URLs, symbols, contractions and stopwords.

def process_sentence(sentence):

    #Remove URLs first, since they are hard to identify once symbols are stripped
    nourls = re.sub(url_regex, '', sentence);

    #Remove special symbols and lowercase everything
    alphanum = re.sub(pattern_to_find, pattern_to_repl, nourls).lower();

    #Expand all contractions
    noapos = ' '.join([expand(word) for word in alphanum.split()]);

    #Remove stopwords
    bigwords = ' '.join([word for word in noapos.split() if word not in eng_stopwords]);

    return bigwords;
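As a quick sanity check, we can run the combined function on a made-up sentence (hypothetical input, just to exercise each step):

#Hypothetical example sentence covering URLs, symbols, casing, contractions, and stopwords.
example = "I can't believe it... check http://example.com before you buy, the CD isn't THAT great!";
print(process_sentence(example));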

The clean_data method will take in a dataframe, filter out neutral rows, binarize ratings, remove NaNs, and process each review.

def clean_data(dframe):
    start = time.time();

    #Remove neutral ratings
    dframe = dframe[dframe.rating != 3];

    #Binarize labels
    dframe.loc[dframe.rating <= 2, 'rating'] = 0;
    dframe.loc[dframe.rating >= 4, 'rating'] = 1;

    #Drop NaN rows
    dframe = dframe.dropna();

    for pos, row in enumerate(dframe.index):

        #Clean reviews
        dframe.loc[row, 'review'] = process_sentence(dframe.loc[row, 'review']);

        #Print progress over time, as well as an estimated ETA.
        if pos % 1000 == 0 and pos > 0:
            time_so_far = (time.time() - start)/60;
            time_eta = time_so_far * (len(dframe) / pos) - time_so_far;
            sys.stdout.write("\rCompleted " + str(pos) + " / " + str(len(dframe)) + " in " + str(time_so_far) + "m eta: " + str(time_eta) + 'm');

    print('\n');
    print('Total time taken: ' + str(time.time() - start) + 's');

    return dframe;

Now we can call the clean_data function on the rest of the dataset:

data_train_processed = clean_data(data_sample_train);
data_dev_processed = clean_data(data_sample_dev);
data_test_processed = clean_data(data_sample_test);

Vectorizing the data

Although we have a clean dataset, our model will require some type of word embeddings to properly train on it. As I described in my previous post on Word2Vec, Word2Vec is a shallow neural network model trained to learn word embeddings based on contextual meaning. We can use a similar model to obtain embeddings for this dataset.

In that post, I trained a model from scratch, but for this project, I will be using a pre-trained model provided by Google. Each word will be mapped to a vector, and the word vectors in a sentence will be concatenated to form a 2D matrix.

The model I will be using is a model pretrained on Google News. It is publicly available from Google, and can be downloaded from here.

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('./input/GoogleNews-vectors-negative300.bin', binary=True);

The Word2Vec model has already been trained. To obtain a feature for a word, we can index into the model and obtain a vector of fixed length, in this case 300 x 1.

For instance, for the word « git », we get:

git_vec = word2vec_model['git'];
print(git_vec.shape);
print(git_vec[0:5])
Out:(300,)
[ 0.06298828 -0.14941406 -0.3046875 0.17675781 0.0009346 ]

We can see that the shape is 300 x 1, and that the vector contains float values that encode the word’s context.

However, since not all words are in the vocabulary of the model, we will have to fill those vectors with placeholders. To do this, let’s maintain a dictionary of missing words, and when we encounter a word that is not in the vocab, add a randomly initialized vector to the dictionary, and reference that when we need it.

np.random.seed(11);
missing_words = {};
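The lookup logic described above can be sketched as a small helper (the same idea is inlined in sentence_to_vec further down, so this is just for illustration):

#Hypothetical helper: return the pretrained vector for a known word,
#otherwise a cached, randomly initialized placeholder of the same size (300).
def word_to_vec(word):
    if word in word2vec_model:
        return word2vec_model[word];
    if word not in missing_words:
        missing_words[word] = np.random.rand(300);
    return missing_words[word];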

The next thing we need to decide is how many words to featurize.

Some reviews are very long and some are very short, so we can set a cap: if a review has too many words we truncate it, and if it has too few we pad it. To figure out this limit, let’s look at the lengths of the reviews to see the distribution:

train_lens = [len(data_train_processed.loc[row, 'review'].split(' ')) for row in data_train_processed.index];

ax = sns.distplot(train_lens, bins=20);
ax.set(xlabel='Length of sentence');
ax.set_title('Length of sentences on the Training set')
plt.show();

Based on the distribution, a reasonable limit to set would be around 80 words.

max_word_limit = 80;
vec_size = 300;
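One quick way to sanity-check this cap (a supplementary check, not in the original post) is to look at a high percentile of the length distribution:

#If the 95th percentile is close to or below 80 words, most reviews fit within the cap.
print(np.percentile(train_lens, 95));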

Let’s define the vectorization function now. We will go through each sentence and, for each word position within the cap, look the word up in either the model or the dictionary we populate dynamically.

def sentence_to_vec(sentence):

    sentence_vecs = np.zeros((1, max_word_limit, vec_size));
    for pos, word in enumerate(sentence.split(' ')):
        if pos >= max_word_limit:
            break;

        if word in word2vec_model:
            sentence_vecs[0, pos] = word2vec_model[word];
        else:
            if word not in missing_words:
                missing_words[word] = np.random.rand(vec_size);
            sentence_vecs[0, pos] = missing_words[word];

    return sentence_vecs;

We can encapsulate the loop logic in a different function that iterates over each review and vectorizes it.

def dframe_to_vec(dframe):
    start = time.time();
    dframe_matrix = np.zeros((len(dframe), max_word_limit, vec_size));

    for pos, row in enumerate(dframe.index):
        dframe_matrix[pos] = sentence_to_vec(dframe.loc[row, 'review']);

        if pos % 1000 == 0 and pos > 0:
            time_so_far = (time.time() - start)/60;
            time_eta = time_so_far * (len(dframe) / pos) - time_so_far;
            sys.stdout.write("\rCompleted " + str(pos) + " / " + str(len(dframe)) + " in " + str(time_so_far) + "m eta: " + str(time_eta) + 'm');

    print('\n');
    print('Total time taken: ' + str(time.time() - start) + 's');

    return dframe_matrix;

And call the function on the data we cleaned earlier:

data_subsample_train_vectors = dframe_to_vec(data_sub_processed);

print(data_subsample_train_vectors.shape)

(78, 80, 300)

We obtain an 80 x 300 matrix for each review. Looking at the words that were detected as missing, some of them include:

Norah, 1495, isnt, tokobots, 60gb, severeley, amazoncom

We can see that the missing words are either proper nouns, contain numbers, are missing spaces, are misspelled, or are malformed URLs. These could be handled with some more filters, but since most of these words occur only once in the dataset, we can leave them as is.
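To back up that last point, a quick supplementary check is to count how often the missing words actually occur in the cleaned training sample:

from collections import Counter

#Count occurrences of each missing word across the cleaned training reviews.
missing_counts = Counter();
for row in data_train_processed.index:
    for word in data_train_processed.loc[row, 'review'].split():
        if word in missing_words:
            missing_counts[word] += 1;

print(missing_counts.most_common(10));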

The final vectorized input can now be fed into a Machine Learning model such as a Convolutional Neural Network for training.
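For reference, here is a minimal sketch of what such a network could look like in Keras, assuming the (80, 300) input shape produced above (this is a hypothetical placeholder, not the actual model from my project):

#Hypothetical sketch: a tiny 1D convolutional network over the (80, 300) review matrices.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(64, kernel_size=3, activation='relu', input_shape=(max_word_limit, vec_size)),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
]);
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']);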
