Sentiment analysis of reviews: Text Pre-processing

Anna Bianca Jones
Feb 7, 2018


This post will take you through the steps of cleaning textual data and preparing the target variable.

How Happy is London?

For my capstone project at GA I decided to create a model that predicts the sentiment of reviews in order to investigate happiness in London. I will come back to this question later; first, let's create our model!

Training Data

To make a model you first need training data. I was able to find labelled training data for sentiment evaluation of New York restaurant reviews on META-SHARE, a language data resource. It can be found here and can be freely downloaded as an XML file after signing up.

Let's first look at the data. I only extract the data that is necessary for my model: the first column shows the polarity of the review and the second column the text of the review.

import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET

xml_path = './NLP/ABSA15_RestaurantsTrain2/ABSA-15_Restaurants_Train_Final.xml'

def parse_data_2015(xml_path):
    container = []
    reviews = ET.parse(xml_path).getroot()

    for review in reviews:
        sentences = review.getchildren()[0].getchildren()
        for sentence in sentences:
            sentence_text = sentence.getchildren()[0].text

            try:
                opinions = sentence.getchildren()[1].getchildren()

                # one row per opinion, so every polarity label is captured
                for opinion in opinions:
                    polarity = opinion.attrib["polarity"]
                    target = opinion.attrib["target"]

                    row = {"sentence": sentence_text, "sentiment": polarity}
                    container.append(row)

            except IndexError:
                # the sentence has no Opinions element; keep the text unlabelled
                row = {"sentence": sentence_text}
                container.append(row)

    return pd.DataFrame(container)

ABSA_df = parse_data_2015(xml_path)
ABSA_df.head()

Before we start cleaning the text I wanted to ensure duplicates were dropped. Since we are using sentiment as our target (the variable we will be predicting), we cannot have any null values, so these are dropped as well, leaving us with 1,201 rows.

ABSA_df.isnull().sum()
print "Original:", ABSA_df.shape
ABSA_dd = ABSA_df.drop_duplicates()
dd = ABSA_dd.reset_index(drop=True)
print "Drop Dupicates:", dd.shape
dd_dn = dd.dropna()
df = dd_dn.reset_index(drop=True)
print "Drop Nulls:", df.shape

Dirty dirty text

Of all forms of data, text is the most unstructured, which means we have a lot of cleaning to do. These pre-processing steps help remove noise, reducing high-dimensional features to a lower-dimensional space so that we can obtain as much accurate information from the text as possible.

Pre-processing can consist of many steps depending on the data and the situation. To guide me through the cleaning, I used a blog post from Analytics Vidhya which shows the process of cleaning tweets. As we are dealing with reviews, some of the methods will not apply here.

To organise this process further, a blog post from KDnuggets splits it into the categories of tokenization, normalization and substitution.

Preprocessing: Tokenization

Tokenization is the process of splitting text into tokens before transforming it into vectors; it also makes it easier to filter out unnecessary tokens. For example, a document can be split into paragraphs, or sentences into words. In this case we are tokenising the reviews into words.

df.sentence[17]
from nltk.tokenize import word_tokenize
# requires the punkt tokenizer data: nltk.download('punkt')
tokens = word_tokenize(df.sentence[17])
print(tokens)

Stopwords

Stop words are the most commonly occurring words, which are not relevant in the context of the data and do not contribute any deeper meaning to the phrase; in this case they carry no sentiment. NLTK provides a list of stop words we can use for this.

from nltk.corpus import stopwords
# requires the stopwords corpus: nltk.download('stopwords')
stop_words = stopwords.words('english')
print [i for i in tokens if i not in stop_words]

Preprocessing: Normalization

Words which look different due to casing, or which are written another way but mean the same thing, need to be processed correctly. Normalisation ensures that such words are treated equally. For example, changing numbers to their word equivalents or converting all text to the same casing (a quick sketch of the number conversion follows the two examples below).

‘100’ → ‘one hundred’

‘Apple’ → ‘apple’
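I did not end up converting numbers to words for the reviews (standalone numbers are simply removed later), but as an illustration of the first example, the third-party num2words package (an assumption here, it is not used elsewhere in this project) can do the conversion:

# Illustrative only: num2words is a third-party package (pip install num2words)
# and is not part of the cleaning pipeline used in the rest of this post.
from num2words import num2words

print(num2words(100))  # one hundred
print(num2words(42))   # forty-two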

The following normalisation changes are made:

1. Casing the Characters

Converting all characters to the same case so that identical words are recognised as the same token. In this case we converted everything to lowercase.

df.sentence[24]
lower_case = df.sentence[24].lower()
lower_case

2. Negation handling

Apostrophes connecting words are used everywhere, especially in public reviews. To maintain a uniform structure, it is recommended that they be expanded into their standard lexical forms. The text then follows the rules of a context-free grammar, which helps avoid word-sense ambiguity.

There is an apostrophe dictionary compiled from user comments on the Analytics Vidhya blog, which you can download here. The dictionary is in lowercase, so the conversion follows on from the lower casing above.
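The full dictionary is not reproduced here, but to give an idea of what the appos variable in the snippet below holds, it maps contractions to their expanded forms; a small illustrative subset:

# A small illustrative subset of the apostrophe dictionary (appos);
# the full version linked above covers many more contractions.
appos = {
    "aren't": "are not",
    "can't": "cannot",
    "don't": "do not",
    "isn't": "is not",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}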

words = lower_case.split()
reformed = [appos[word] if word in appos else word for word in words]
reformed = " ".join(reformed)
reformed

3. Removing

Stand-alone punctuation, special characters and numerical tokens are removed, as they do not contribute to sentiment; this leaves only alphabetic tokens. This step works on the tokenized words, as they have already been split appropriately for us to filter.

tokens
words = [word for word in tokens if word.isalpha()]
words

4. Lemmatization

This process finds the base or dictionary form of a word, known as the lemma. It does this using a vocabulary (a dictionary of valid words) and morphological analysis (word structure and grammatical relations). This normalization is similar to stemming, but it takes the context of the word into account (a quick comparison follows the example below).

‘are’, ‘is’, ‘being’ → ‘be’
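To see the difference from stemming, here is a quick side-by-side comparison. It uses NLTK's stemmer and WordNet lemmatizer rather than the Gensim lemmatizer used below, and it assumes the WordNet data has been downloaded:

# Stemming vs lemmatization with NLTK
# (requires the WordNet data: import nltk; nltk.download('wordnet'))
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                 # studi  (suffix chopped off)
print(lemmatizer.lemmatize("studies"))         # study  (a real dictionary word)
print(lemmatizer.lemmatize("are", pos="v"))    # be
print(lemmatizer.lemmatize("being", pos="v"))  # be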

Gensim: Lemmatization

I used the lemmatize method from the Gensim package, which takes care of lower casing and the removal of numerics, stand-alone punctuation, special characters and stop words. It also labels words with part-of-speech (POS) tags, keeping nouns (NN), verbs (VB), adjectives (JJ) and adverbs (RB).

The function uses the English lemmatizer from the pattern library to extract the lemmas. Words are identified through word-category disambiguation, where both a word's definition and its context are taken into account to assign the specific POS tag.

Tip 1: This function is only available when the optional ‘pattern’ package is installed!

Tip 2: This function only applies to UTF-8 encoded tokens.

df.sentence[24]
from gensim.utils import lemmatize
lemm = lemmatize(df.sentence[24])
lemm
df.sentence[17]
lemmatize(df.sentence[17])
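The tokens come back in lemma/POS form, which is why the separating function later in this post splits on "/". A toy example (the exact output is illustrative and assumes the pattern package is installed):

from gensim.utils import lemmatize

tokens = lemmatize("The bats are hanging on their feet")
# each token has the form "lemma/POS", e.g. something like 'bat/NN' or 'hang/VB'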

Preprocessing: Substitution

This involves removing noise from text in its raw format. For example, if the text is scraped from the web it may contain HTML or XML wrappers or markup. These can be removed with regular expressions. Fortunately this does not apply to our reviews, as we were able to extract the exact review text from the XML file.
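If you did need to strip markup, a regular expression along these lines would do it; a minimal sketch, not needed for this dataset:

import re

raw = "<p>The food was <b>amazing</b>!</p>"
# remove anything that looks like an HTML/XML tag
no_tags = re.sub(r"<[^>]+>", "", raw)
print(no_tags)  # The food was amazing!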

Decoding

I found it difficult to pin down what decoding means in the NLP context; however, I came across a comment on a Stack Overflow question which helped me break this section down.

Text data can come in different encodings. For example, ASCII covers English letters, control characters, punctuation and numbers. When dealing with other languages that use non-Latin characters, another encoding may need to be applied.

Unicode covers the characters of all languages by assigning every character a unique number, and UTF-8 is the recommended encoding. It uses 1–4 bytes per character, covers the majority of scripts within its first two byte widths, and is memory efficient when dealing with mostly ASCII text; the quick check after the list below shows these widths in practice.

UTF-8 Character Bytes

  • 1 byte: Standard ASCII
  • 2 bytes: Arabic, Hebrew, most European scripts
  • 3 bytes: the rest of the Basic Multilingual Plane (BMP), e.g. most Asian scripts
  • 4 bytes: the remaining Unicode characters in the supplementary planes (e.g. emoji)
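A quick way to see these byte widths in practice (the characters are just example picks from each range; the u'' literals work in both Python 2 and 3):

# How many bytes UTF-8 needs for characters from each range
for ch in [u"a", u"\u05d0", u"\u20ac", u"\U0001f600"]:  # 'a', Hebrew alef, euro sign, emoji
    print(len(ch.encode("utf-8")))
# prints 1, 2, 3, 4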

As we could be working with text in mixed encodings, it is advised to decode everything from UTF-8 so it is all in the same format. The output shows a “u” before the string, indicating it is a Unicode string (this is Python 2 behaviour; in Python 3, strings are Unicode by default).

df.sentence[24].decode("utf-8-sig")

Defining a Cleaning Function

I created a cleaning function which is applied to the whole dataset. It includes decoding, lowercasing (so the negation dictionary can be applied), conversion with the negation dictionary, and lemmatization, which itself covers lower casing, tokenising, removal of special characters, stand-alone punctuation and stop words, and POS tagging. The second function separates the tags from the words.

import time
from tqdm import tqdm

def cleaning_function(tips):
    all_ = []
    for tip in tqdm(tips):
        time.sleep(0.0001)  # tiny pause so the tqdm progress bar renders smoothly

        # Decoding function
        decode = tip.decode("utf-8-sig")

        # Lowercasing before negation
        lower_case = decode.lower()

        # Replace apostrophes with words
        words = lower_case.split()
        split = [appos[word] if word in appos else word for word in words]
        reformed = " ".join(split)

        # Lemmatization (applied to the negation-handled text)
        lemm = lemmatize(reformed)
        all_.append(lemm)

    return all_

def separate_word_tag(df_lem_test):
    words = []
    types = []
    df = pd.DataFrame()
    for row in df_lem_test:
        sent = []
        type_ = []
        for word in row:
            # each token comes back as "lemma/POS", e.g. "food/NN"
            split = word.split('/')
            sent.append(split[0])
            type_.append(split[1])
        words.append(' '.join(word for word in sent))
        types.append(' '.join(word for word in type_))

    df['lem_words'] = words
    df['lem_tag'] = types
    return df

Cleaning Training Data

word_tag = cleaning_function(df.sentence)
lemm_df = separate_word_tag(word_tag)
# concat cleaned text with original
df_training = pd.concat([df, lemm_df], axis=1)
df_training['word_tags'] = word_tag
df_training.head()

Check for null and empty values

No null values were found, because the cleaning process produces empty strings rather than nulls. Both were checked, and the empty values were removed. Three reviews, whose text was just “10”, “LOL” and “Why?”, came out empty after cleaning. I felt they did not contribute to the sentiment analysis, so removing them seemed valid.

# reset index just to be safe
df_training = df_training.reset_index(drop=True)
#check null values
df_training.isnull().sum()
# empty values
df_training[df_training['lem_words']=='']
# drop these rows
print df_training.shape
df_training = df_training.drop([475, 648, 720])
df_training = df_training.reset_index(drop=True)
print df_training.shape

Cleaning Prediction Data

# load the data
fs = pd.read_csv('./foursquare/foursquare_csv/londonvenues.csv')

# use cleaning functions on the tips
word_tag_fs = cleaning_function(fs.tips)
lemm_fs = separate_word_tag(word_tag_fs)

# concat cleaned text with original
df_fs_predict = pd.concat([fs, lemm_fs], axis=1)
df_fs_predict['word_tags'] = word_tag_fs

# separate the long lat
lng = []
lat = []
for ll in df_fs_predict['ll']:
    lnglat = ll.split(',')
    lng.append(lnglat[0])
    lat.append(lnglat[1])

df_fs_predict['lng'] = lng
df_fs_predict['lat'] = lat

# drop the ll column
df_fs_predict = df_fs_predict.drop(['ll'], axis=1)
df_fs_predict.head()

# save clean foursquare to csv
df_fs_predict.to_csv('./foursquare/foursquare_csv/foursquare_clean.csv', header=True, index=False, encoding='UTF8')

In the next post I will continue by looking at the target variable, focusing on the train-test split, bootstrapping, cross-validation and some visualisations.

Thank you for reading How Happy is London? Part 1. Feedback is always welcome, just comment below!
