Applying Sentiment Analysis to E-commerce classification using Recurrent Neural Networks in Keras: Theory and Implementation

Published in

GradientCrescent

Mar 27, 2019

Introduction

I recently participated in the National Data Science Challenge organized by Shopee, an online marketplace similar to eBay that reported a GMV of over $1.6 bn in Q4 2017. The challenge was to use text sequences and raw images to automatically populate the different feature entries of a listing, in order to facilitate an intuitive and less invasive user experience. These feature entries could range from clothing patterns and sleeve lengths for fashion-related items, to screen size and warranty periods for mobile products. To tackle this challenge, my team decided upon the use of a recurrent neural network (RNN).

RNNs and their derivatives (e.g. LSTMs) have gained attention for their applications in natural language processing (NLP). Unlike traditional densely connected neural networks, RNNs are capable of inferring the structural intricacies present in language sequences, where the context and structure of a phrase can deliver an additional, or even opposite, meaning compared with the simple sum of its individual words. This makes RNNs well suited to sequence prediction given a seed word or phrase, or to sentiment analysis, which classifies the overall emotional response produced by a text.

This can seem a bit confusing, so let’s consider a quick example sentence:

“Although it is painful now, don’t give up, for tomorrow is a new day”

While the phrases “tomorrow is a new day” and “It is painful now” can be considered as neutral and negative in sentiment, respectively, the overall sentiment or emotion can be classified as positive due to the structural relationship between the phrases. The use of title sequences within the Shopee datasets led us to decide upon an RNN-based architecture — intuition led us to believe that the ordering of the word elements within each sequence may lend some additional information over traditional bag-of-words approaches in terms of feature predictions.

Given the appropriate data, RNNs can fashion new poetry of their own!

In this tutorial, we’ll evaluate how RNNs may be applied to a product feature mapping application, and how our experiences in this challenge taught us some life-long data science lessons, which I’m eager to share with you. As in our previous tutorials, we assume that you are familiar with the theory behind fully connected neural networks, and with Python and Keras in particular.

Theory

So how do RNNs work? To get a better understanding of the architecture, let’s look at how an RNN layer operates on a sequence. Figure 1 below displays an abstraction of an RNN layer featuring a gated recurrent unit, which can be summarized as a hidden layer capable of propagating information across a sequence. Suppose that we are feeding the network a series of inputs in a time-step-based sequence (X0, X1, ..., Xt). It then follows that:

Figure 1. Abstraction of an RNN architecture (source)

1. X0 is fed into the layer, producing a hypothesis H0 but also an activation value A0, which is stored in memory. (The prediction for X0 itself is influenced by a randomized initial activation value.)

2. X1 is fed into the layer together with A0 (retrieved from memory), in order to produce a hypothesis H1. An activation value A1 is also produced and stored in memory, which now contains information related to X0 and X1.

3. X2 is fed into the layer together with A1 (retrieved from memory), in order to produce a hypothesis H2. An activation value A2 is also produced and stored in memory, which now contains information related to X0, X1, and X2.

4. The process is repeated along the sequence until a pre-defined stop token is encountered (let’s assume a period exists after Xt).

Essentially, an RNN layer allows information from one end of the sequence to influence the prediction process at the other end. While the exact training process is beyond the scope of this tutorial, it is based on backpropagation through time, where the losses calculated at each step of the sequence prediction process are summed together into an overall loss. Training affects three separate sets of weights, namely:

- Between the input and the hidden layer (Wax)

- Between the hidden layer and the output layer (Way)

- Between the hidden layer’s activation functions across timesteps (Waa)

You may have noticed that our architecture is unidirectional, meaning that while the earlier elements of a sequence can influence later predictions, later elements cannot be used to influence earlier predictions. This can be addressed with the bidirectional variant of recurrent neural networks (BRNNs), which you can read about here.

While the RNN above is an example of a many-to-many predictive architecture, sentiment classifiers are usually based on many-to-one architectures, where a hypothesis is only generated at the end of the sequence. As the aforementioned activation value propagation process is still present during forward propagation, this final prediction contains information on the sequence as a whole.
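To make the forward pass above a little more concrete, here is a minimal NumPy sketch of a many-to-one pass through a plain (ungated) recurrent layer. It is not the GRU we used in the challenge; the layer sizes, the random weights, the bias terms and the zero initial activation are all assumptions chosen purely for illustration. The weight names follow the Wax, Waa and Way notation listed above.

```python
import numpy as np

def rnn_forward_many_to_one(xs, Wax, Waa, Way, ba, by):
    """Run a plain (ungated) RNN over xs and return the final hypothesis."""
    a = np.zeros(Waa.shape[0])            # initial activation (zeros in this sketch)
    for x in xs:                          # one iteration per timestep
        # The new activation mixes the current input with the stored activation
        a = np.tanh(Wax @ x + Waa @ a + ba)
    # Many-to-one: only the last activation is turned into a hypothesis
    logits = Way @ a + by
    return 1.0 / (1.0 + np.exp(-logits)), a

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5     # illustrative sizes only
Wax = rng.normal(scale=0.1, size=(n_hidden, n_in))
Waa = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Way = rng.normal(scale=0.1, size=(n_out, n_hidden))
ba, by = np.zeros(n_hidden), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))           # a sequence X0 ... X4
h_final, a_final = rnn_forward_many_to_one(xs, Wax, Waa, Way, ba, by)
print(h_final.shape)  # (3,)
```

Because every activation is fed into the next timestep, changing only X0 still changes the final hypothesis, which is the property a many-to-one sentiment classifier relies on.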

Implementation

While the Shopee datasets are private and proprietary, I’ve taken the liberty to create some mock data in the same style of the fashion dataset for illustrative purposes. Each listing has its own title, path to an image, and attribute categories corresponding to a feature specific to that class of data. We’ll be focusing on the attribute and title feature columns for our RNN. Note that as we can’t release the datasets for your use, you won’t be able to replicate these results. However, you may be inspired to try something similar with your own problems!

Mock data in the same style as the official fashion dataset. NaN here stands for “Not a Number”

You may wonder why we aren’t planning to use the image data. This was a conscious decision by the team, as we found that the raw images were inconsistent in terms of layout, lighting, or content: some images were simply stock photos, others were badly lit, and others had their contents partially cropped. As all of these images were submitted by thousands of individual sellers, noisy data is understandable, and it contrasts sharply with the clean, standardized datasets commonly found in deep learning tutorials. In light of this, we decided to focus on the title feature column (containing the raw text of each listing) and the attribute categories as our network inputs and target labels, respectively. Have no fear, we’ll be covering an image-based approach, and how to deal with noisy data, in a later tutorial.

All of our work was performed on Google’s Compute Engine, using the free sign-up credit granted upon registration. As we don’t have access to the original dataset, we’ll skip any data importing steps and move straight into preprocessing — for the length of this tutorial, assume that the mock data is an excellent imitation of the actual training and validation datasets.

Essentially, our approach boils down to treating each attribute feature value as an independent sentiment value. To execute this, feature values across all attribute categories had to be collected and converted into one-hot-encoded vectors. The initial step of preprocessing was to append the textual representation of every target label entry to the training data, using the provided dataset JSON dictionary. This can be done via the pandas merge function, which we repeat for all attribute categories:

import numpy as np
import pandas as pd

# For each attribute category, look up the readable name of every class ID.
# Reference rows whose ID is NaN are filtered out before merging.
fashion_trainval = pd.merge(fashion_trainval, fashion_ref.loc[~np.isnan(fashion_ref["Pattern"]),
                            ["Pattern", "Attribute"]],
                            on="Pattern", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Pattern_Attr"})
fashion_trainval = pd.merge(fashion_trainval, fashion_ref.loc[~np.isnan(fashion_ref["Collar Type"]),
                            ["Collar Type", "Attribute"]],
                            on="Collar Type", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Collar_Type_Attr"})
fashion_trainval = pd.merge(fashion_trainval, fashion_ref.loc[~np.isnan(fashion_ref["Fashion Trend"]),
                            ["Fashion Trend", "Attribute"]],
                            on="Fashion Trend", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Fashion_Trend_Attr"})
fashion_trainval = pd.merge(fashion_trainval, fashion_ref.loc[~np.isnan(fashion_ref["Clothing Material"]),
                            ["Clothing Material", "Attribute"]],
                            on="Clothing Material", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Clothing_Material_Attr"})
fashion_trainval = pd.merge(fashion_trainval, fashion_ref.loc[~np.isnan(fashion_ref["Sleeves"]),
                            ["Sleeves", "Attribute"]],
                            on="Sleeves", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Sleeves_Attr"})
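Since the real tables cannot be shared, here is a small self-contained sketch of the same left-join idea on made-up data. The two data frames, their contents and the ID values are hypothetical; only the column naming mirrors the listing above.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: each attribute column holds a numeric class ID (or NaN).
listings = pd.DataFrame({
    "title": ["floral maxi dress", "plain cotton shirt"],
    "Pattern": [1.0, np.nan],
})
# Hypothetical reference table mapping class IDs to readable names.
pattern_ref = pd.DataFrame({"Pattern": [0.0, 1.0], "Attribute": ["plain", "floral"]})

# A left join keeps every listing; rows without a matching ID get NaN.
merged = pd.merge(listings, pattern_ref, on="Pattern", how="left")
merged = merged.rename(columns={"Attribute": "Pattern_Attr"})
print(merged["Pattern_Attr"].tolist())  # ['floral', nan]
```

The second listing has no Pattern ID, so its Pattern_Attr stays empty. Filtering NaN IDs out of the reference table first matters because pandas matches null keys against each other during a merge, which could otherwise attach an unrelated attribute name to unlabeled rows.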

Next, we replaced all NaN values with a universal “default” sentiment class as a placeholder, and applied a function to collect all attribute entries of a row into a single tuple, which we will term the [Label] column.

fashion_trainval = fashion_trainval.fillna("default")
fashion_trainval["Label"] = fashion_trainval[["Pattern_Attr", "Collar_Type_Attr", "Fashion_Trend_Attr",
                                              "Clothing_Material_Attr", "Sleeves_Attr"]].apply(lambda x: tuple(x.values), axis=1)
fashion_trainval.head()

An example data row would now look something like this:

Next, we one-hot-encoded our feature target labels using scikit-learn’s MultiLabelBinarizer class. To be more specific, MultiLabelBinarizer converts all of our features (across all attribute categories) into a single binary array that indicates the presence of each feature for a given row of data.

from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(fashion_trainval['Label'])

A representation of a row of data possessing only two labels, say “floral” and “colorful”, may look something like this:
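As a rough illustration of the encoding, here is a minimal sketch using two hypothetical label tuples rather than the real data. The label names are made up; the class used is the same MultiLabelBinarizer as above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label tuples, one per listing.
labels = [("floral", "default"), ("colorful", "floral")]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)

print(list(mlb.classes_))   # ['colorful', 'default', 'floral']
print(encoded.tolist())     # [[0, 1, 1], [1, 0, 1]]
```

Each column corresponds to one entry of classes_, which the binarizer keeps in sorted order, and a 1 marks that the label is present in that row.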

With our labels ready, we cleaned up our title sequences by standardizing their format, and removing any special characters. In order to reduce the number of inflectional and derivationally related forms of the same base word, our sequences underwent lemmatization, which aims to associate words of the same meaning into a single token. This process acts as a form of normalization for the text sequences in our dataset.

import re

import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# The NLTK stop-word, tokenizer and WordNet data must be downloaded beforehand (nltk.download)
lemmatizer = WordNetLemmatizer()
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")
stop_words = set(stopwords.words("english"))

def cleanUpSentence(r, stop_words=None):
    # Lowercase, replace HTML line breaks, and drop non-alphanumeric characters
    r = r.lower().replace("<br />", " ")
    r = re.sub(strip_special_chars, "", r)
    if stop_words is not None:
        words = word_tokenize(r)
        filtered_sentence = []
        for w in words:
            # Lemmatize first, then discard stop words
            w = lemmatizer.lemmatize(w)
            if w not in stop_words:
                filtered_sentence.append(w)
        return " ".join(filtered_sentence)
    else:
        return r

totalX = []
totalY = np.array(fashion_trainval['Label'])
totalY = multilabel_binarizer.fit_transform(totalY)
for doc in fashion_trainval['title']:
    totalX.append(cleanUpSentence(doc, stop_words))
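To see what the clean-up does to a single title without downloading any NLTK data, here is a simplified standard-library sketch. It deliberately skips lemmatization, and both the tiny stop-word set and the example title are made up for illustration.

```python
import re

STRIP_SPECIAL = re.compile(r"[^a-z0-9 ]+")
# Tiny hand-picked stop-word set for illustration; the listing above uses NLTK's English list.
STOP_WORDS = {"the", "a", "for", "with", "and"}

def clean_title(title, stop_words=STOP_WORDS):
    """Lowercase, drop special characters, and remove stop words."""
    text = STRIP_SPECIAL.sub("", title.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_title("NEW!! Floral Dress for Women (Size: M) - 100% Cotton"))
# new floral dress women size m 100 cotton
```

Marketplace titles tend to be full of punctuation and promotional noise, so even this reduced version shrinks the vocabulary the tokenizer has to deal with.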

Finally, we ensured that our tokens fit within a maximum dictionary size of 50,000 words, while padding our sequences to ensure a uniform arbitrary sequence length of 150 elements. This was done using Keras’s internal methods.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

maxLength = 150
max_vocab_size = 50000
input_tokenizer = Tokenizer(max_vocab_size)
input_tokenizer.fit_on_texts(totalX)
input_vocab_size = len(input_tokenizer.word_index) + 1
print("input_vocab_size:", input_vocab_size)
totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))
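If you are curious what tokenizing and padding amount to, the following pure-Python sketch imitates the idea. It is not the Keras implementation; the function names and example titles are made up. It does follow the Keras defaults of reserving index 0 for padding and of padding and truncating at the front of the sequence.

```python
from collections import Counter

def fit_word_index(texts, max_vocab_size):
    """Map the most frequent words to integer IDs starting at 1 (0 is padding)."""
    counts = Counter(w for t in texts for w in t.split())
    most_common = counts.most_common(max_vocab_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def encode_and_pad(text, word_index, max_length):
    """Convert words to IDs, drop unknown words, then pad or truncate at the front."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-max_length:]                       # keep the last max_length tokens
    return [0] * (max_length - len(ids)) + ids    # pad with zeros at the front

texts = ["floral dress", "floral cotton shirt", "dress"]
word_index = fit_word_index(texts, max_vocab_size=50000)
print(encode_and_pad("floral dress", word_index, max_length=5))  # [0, 0, 0, 1, 2]
```

With a maximum length of 150 and titles that are usually far shorter, most of each padded row consists of leading zeros, which the embedding layer simply treats as another index.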

Now that all of our preprocessing has been done, let’s build our sequential network model. As we were building an RNN, we switched out our traditional layers for the aforementioned gated recurrent units, before feeding the results into a densely connected sigmoid layer designed to give probabilities across all of our feature sentiment categories. We defined our model’s optimizer and loss function as Adam and binary crossentropy, respectively. The use of binary crossentropy means that our predictions will be of the one-vs-all format. You could also utilize categorical crossentropy for a purer multiclass approach, but we felt that the high number of “sentiments” would make that approach less accurate.
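To illustrate what the one-vs-all setup means numerically, here is a small NumPy sketch of a sigmoid output scored with binary crossentropy. The logits and the target vector are hypothetical values, not outputs of our model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label log losses; each label is scored independently."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

logits = np.array([2.0, -1.0, 3.0, -4.0])   # hypothetical raw outputs for 4 labels
y_true = np.array([1.0, 0.0, 1.0, 0.0])     # multi-hot target: labels 0 and 2 are present

probs = sigmoid(logits)
print(probs.sum() > 1.0)                     # True: sigmoid outputs need not sum to 1
print(round(binary_crossentropy(y_true, probs), 4))
```

Because every label gets its own independent probability, several labels can be switched on for the same title, which is what we need when one listing carries a pattern, a sleeve type and a material at once.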

Finally, we trained our model across 10 epochs using a 10% validation set, primarily due to a lack of resources. Ideally, you’re going to want to train for more epochs, but even with the limited training time you’ll notice that our validation accuracies are still high. As done in our previous tutorials, we use Matplotlib to inspect the change in our training and validation accuracies and loss values.

from keras.models import Sequential, load_model
from keras.layers import Embedding, GRU, Dense
import matplotlib.pyplot as plt

# Assumption: y (never defined in the original listing) holds the label names
# learned by the binarizer, in the same order as the columns of totalY.
y = multilabel_binarizer.classes_

embedding_dim = 256
num_categories = len(y)

model = Sequential()
model.add(Embedding(input_vocab_size, embedding_dim, input_length=maxLength))
model.add(GRU(256, dropout=0.9, return_sequences=True))
model.add(GRU(256, dropout=0.9))
model.add(Dense(num_categories, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# The training call is missing from the original listing; this line is
# reconstructed from the description above (10 epochs, 10% validation split).
history = model.fit(totalX, totalY, epochs=10, validation_split=0.1)

model.save("fashion_text_model.h5")
model = load_model('fashion_text_model.h5')

# In a Jupyter notebook, run "%matplotlib inline" first to render the plots inline.
# Older Keras versions log accuracy under 'acc'; newer ones use 'accuracy'.
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 3. Accuracy and Loss values over 10 training epochs using the RNN model.

Our validation accuracy approaches 99%! Wow! You may be lulled into thinking that our model performs great, but this figure deserves caution. With binary crossentropy over a multi-hot target, accuracy is computed per label, and most entries in each label vector are zero, so correctly predicting the many absent labels inflates the score. We also suspected that our model was overfitting to the dataset, although it could be argued that the small difference between training and validation accuracies is simply due to the data not exhibiting significant variation.

As all of our feature probability values are returned in a single array (see below), it’s important to iterate through them. First, we set a cut-off of 50%, and then separate the resulting features into their original attribute categories. Let’s demonstrate a quick prediction using row 220 of the fashion dataset (which we sadly cannot show you!).
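A quick way to build intuition for how much sparse multi-hot labels can flatter an accuracy score is the following NumPy sketch. The sizes are illustrative assumptions and the data is synthetic; it does not reproduce our dataset or our model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_labels = 1000, 100                  # illustrative sizes, not the real dataset
y_true = np.zeros((n_rows, n_labels), dtype=int)
for row in y_true:                             # give every row 5 positive labels
    row[rng.choice(n_labels, size=5, replace=False)] = 1

y_pred = np.zeros_like(y_true)                 # a "model" that never predicts any label

binary_accuracy = (y_pred == y_true).mean()    # fraction of individual label slots that match
exact_match = (y_pred == y_true).all(axis=1).mean()  # fraction of rows that are fully correct
print(binary_accuracy, exact_match)            # 0.95 0.0
```

A model that predicts nothing at all still scores 95% per-label accuracy here while getting no row fully right, which is why a ranking metric such as MAP@K is a more honest yardstick for this task.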

# input_x_220 is the cleaned title of row 220 (not shown here)
textArray = np.array(pad_sequences(input_tokenizer.texts_to_sequences([input_x_220]), maxlen=maxLength))
predicted = model.predict(textArray)[0]
print(predicted)

# Print every class whose probability exceeds the 50% cut-off
for i, prob in enumerate(predicted):
    if prob > 0.5:
        print(y[i])

# Take the three most probable classes, then drop the placeholder class if present
predicted_top = y[sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)[:3]]
if 'default' in predicted_top:
    predicted_top = [i for i in predicted_top if i != 'default']
else:
    predicted_top = predicted_top[:2]
predicted_top

Your raw feature prediction array would look something like this:

While your final two predictions may look something like this (with default class removed):

['floral', 'dress']

To make a submission for the challenge, we had to extract the attributes from the raw prediction grid into their corresponding attribute categories. To summarize, we selected the top two predicted features for each category, or filled in a default placeholder if only one prediction had scored above the prediction boundary.
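The cut-off and top-k selection can be tried out on a toy probability vector. In the sketch below, both the class names and the probabilities are hypothetical stand-ins for the real label array and model output.

```python
import numpy as np

classes = np.array(["colorful", "default", "dress", "floral", "plain"])  # hypothetical labels
predicted = np.array([0.10, 0.85, 0.60, 0.92, 0.05])                      # hypothetical probabilities

above_cutoff = classes[predicted > 0.5].tolist()          # every label past the 50% cut-off
order = np.argsort(predicted)[::-1]                       # indices from most to least probable
top3 = classes[order[:3]].tolist()                        # three most probable labels
top2 = [c for c in top3 if c != "default"][:2]            # drop the placeholder, keep two

print(above_cutoff)  # ['default', 'dress', 'floral']
print(top2)          # ['floral', 'dress']
```

Taking three candidates before filtering guarantees that two real labels remain even when the placeholder class ranks among the top predictions.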

def process_prediction(data):
    # Rank every label from most to least probable
    predicted_ordered = y[sorted(range(len(data)), key=lambda i: data[i], reverse=True)]
    if 'default' in predicted_ordered:
        predicted_ordered = [i for i in predicted_ordered if i != 'default']
    # get attribute of each prediction (get_attribute is a helper not shown in this article)
    predicted_attribute = [get_attribute(temp_predict) for temp_predict in predicted_ordered]
    # keep top 2 for each attribute
    return_prediction = []
    for attr in set(predicted_attribute):
        temp_predicted_attr = [predicted_ordered[index]
                               for index, value in enumerate(predicted_attribute)
                               if value == attr]
        if len(temp_predicted_attr) > 1:
            temp_predicted_attr = temp_predicted_attr[:2]
        else:
            # only one candidate for this attribute: repeat it to fill both slots
            temp_predicted_attr = temp_predicted_attr + temp_predicted_attr
        return_prediction.append([attr] + temp_predicted_attr)
    return return_prediction

# predicted_new is presumably the array of model predictions for the rows being submitted;
# it is not defined anywhere in this article.
predicted_new_top2 = [process_prediction(predicted_item) for predicted_item in predicted_new]
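Since get_attribute is not shown, here is a self-contained sketch of the same grouping idea that you can actually run. The label-to-attribute dictionary, the labels and the probabilities are all hypothetical, and the function is a simplified stand-in rather than our submission code. Like the listing above, it repeats the single label when an attribute has only one candidate.

```python
from collections import defaultdict

# Hypothetical mapping from each label to its attribute category.
LABEL_TO_ATTRIBUTE = {
    "floral": "Pattern",
    "plain": "Pattern",
    "short sleeve": "Sleeves",
}

def top2_per_attribute(probabilities, labels, label_to_attribute):
    """Return {attribute: [best, second_best]}, repeating the best if it is alone."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    grouped = defaultdict(list)
    for _, label in ranked:
        if label == "default":              # the placeholder belongs to no attribute
            continue
        grouped[label_to_attribute[label]].append(label)
    return {attr: (found + found)[:2] for attr, found in grouped.items()}

labels = ["floral", "default", "plain", "short sleeve"]
probabilities = [0.9, 0.8, 0.3, 0.6]
result = top2_per_attribute(probabilities, labels, LABEL_TO_ATTRIBUTE)
print(result)
# {'Pattern': ['floral', 'plain'], 'Sleeves': ['short sleeve', 'short sleeve']}
```

Grouping with a dictionary walks the ranked list once, instead of rescanning it for every attribute.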

Our overall MAP@K value (K=2) was 0.45262, compared to the overall winner’s 0.46869. In fact, Shopee’s own reference solution utilized a multimodal approach, where CNN-based image prediction results supplemented heuristic algorithms and densely connected classifiers. Some of the issues and weaknesses with our approach include:

1. We assumed that word order in the titles carried intrinsic meaning, when in practice many titles were arranged in the style of a bag of words. While certain rules of grammar can help separate adjectives from nouns in our titles, that ordering is not guaranteed to be consistent.

2. Our model lacked sufficient hyperparameter and probability cut-off tuning, possibly negatively affecting accuracy.

3. We observed a strong tendency to overfit to our training dataset, which would require large amounts of regularization to be overcome.

4. Our dictionary may be too large, holding too many words that are under-represented in the dataset or simply filler.
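For readers unfamiliar with the MAP@K score reported above, here is a minimal sketch of one common way to compute it. It assumes a single true label per row, and the example labels and predictions are made up; the challenge’s official scoring code may differ in its details.

```python
def average_precision_at_k(actual, predicted, k=2):
    """AP@K with one true label: 1/rank of the first correct guess within k, else 0."""
    for rank, guess in enumerate(predicted[:k], start=1):
        if guess == actual:
            return 1.0 / rank
    return 0.0

def map_at_k(actuals, predictions, k=2):
    """Mean of the per-row AP@K scores."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)

actuals = ["floral", "plain", "striped"]
predictions = [["floral", "plain"], ["striped", "plain"], ["floral", "plain"]]
print(map_at_k(actuals, predictions))  # 0.5
```

The three rows score 1.0, 0.5 and 0.0 respectively. Under this definition, repeating the same label in both slots earns no extra credit, so a second distinct guess is never worse than a duplicate.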

In the end, this challenge was a great learning experience, and if nothing else, taught us four important data science lessons that I hope you’ll take with you!

Don’t overengineer your solutions!
Don’t neglect any available data!
Don’t put all of your eggs in a unimodal basket!
Don’t forget to allocate enough resources for training!

References

Andrew Ng, Recurrent Neural Networks

Colah, Understanding LSTMs

WildML, Recurrent Neural Networks

Written by Adrian Yijie Xu, PhD