Comment sentiment analysis

This article is a joint effort by Armin Behjati and Bahram Mohammadpour.

This project was done in March 2018.

Zoodfood comment crawler:

Crawling comments from Zoodfood.

First we open the page containing the restaurant list, then we parse the restaurant links extracted from it. There are 12 restaurants on each page. We define the function ‘get_restaurants_list’ to get the list of restaurants on a page, given the page number.
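A minimal sketch of the link-extraction step, using only the Python standard library. The listing URL and page markup are not shown in this article, so the `fetch_page` callable, the stand-in HTML and the `/restaurant/menu/` prefix filter (taken from the sample link format used later) are assumptions for illustration.

```python
from html.parser import HTMLParser


class RestaurantLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like restaurant menu links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: restaurant links start with this prefix.
        if href and href.startswith("/restaurant/menu/") and href not in self.links:
            self.links.append(href)


def get_restaurants_list(page_number, fetch_page):
    """Return the restaurant links found on one listing page.

    fetch_page is a hypothetical callable: page number -> HTML string.
    """
    parser = RestaurantLinkParser()
    parser.feed(fetch_page(page_number))
    return parser.links


# Usage with a stand-in page instead of a real HTTP request.
def fake_fetch(page_number):
    return (
        '<a href="/restaurant/menu/37j82x/kenzo/fereshteh">Kenzo</a>'
        '<a href="/about">About</a>'
    )


links = get_restaurants_list(1, fake_fetch)
```

Passing the fetcher in as an argument keeps the parsing logic testable without network access.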

The Zoodfood restaurant list has 288 pages, so we fetch the data for each page.
We can reduce the execution time by using threads.
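One way to parallelize the page requests is a thread pool, which suits I/O-bound work such as HTTP calls. This is only a sketch: `fetch_all_pages`, `demo_get_page_links` and the worker count of 16 are hypothetical, and the stand-in crawler does no real network I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(get_page_links, n_pages, max_workers=16):
    """Call get_page_links(page) for pages 1..n_pages concurrently.

    Returns one flat list of links, in page order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input pages.
        per_page = pool.map(get_page_links, range(1, n_pages + 1))
        return [link for page_links in per_page for link in page_links]


# Stand-in for the real per-page crawler (hypothetical).
def demo_get_page_links(page):
    return [f"/restaurant/menu/id{page}/name/area"]


all_links = fetch_all_pages(demo_get_page_links, n_pages=288)
```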

We write the restaurant links to a text file so we can use them later.

Reading the links from the text file:
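A small sketch of the round trip through a text file, one link per line. The file name `restaurant_links.txt` and the helper names are assumptions; a temporary directory is used so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_links(path, links):
    """Write one link per line."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")


def read_links(path):
    """Read the links back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    links_file = Path(tmp) / "restaurant_links.txt"  # hypothetical file name
    write_links(links_file, ["/restaurant/menu/37j82x/kenzo/fereshteh"])
    restaurant_links = read_links(links_file)
```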

Sample link: ‘/restaurant/menu/37j82x/kenzo/fereshteh’

We need the third segment of this URL, called the vendor, which is the ID of the restaurant in the Snappfood API.

The vendor name is all we need to collect the comments.

We store the vendor names in ‘restaurant_vendor_name’.
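Extracting the vendor from a link can be sketched as a simple split on `/`. The helper name `get_vendor_name` is hypothetical; the sample link is the one shown above.

```python
def get_vendor_name(link):
    """Return the third path segment of a restaurant link."""
    segments = link.strip("/").split("/")
    return segments[2]


sample_link = "/restaurant/menu/37j82x/kenzo/fereshteh"
restaurant_vendor_name = [get_vendor_name(sample_link)]
```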

URL of the comment API: {vendor_id}/{page}
Each API call returns 10 comments, and the next page returns the next 10.
If we request this URL:

we get something like this:


‘count’ is the total number of comments.
We need count/10 API calls, rounded up, to get all the comments.
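The number of calls has to be rounded up, because a partially filled last page still needs its own request. A short sketch (the names `pages_needed` and `COMMENTS_PER_PAGE` are illustrative):

```python
import math

COMMENTS_PER_PAGE = 10  # each API call returns 10 comments


def pages_needed(count, per_page=COMMENTS_PER_PAGE):
    """Number of API calls required to fetch `count` comments."""
    return math.ceil(count / per_page)
```

For example, 95 comments need 10 calls, while exactly 10 comments need only one.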

The ‘get_comments_for_restaurant_page’ function returns the comment data given the ‘vendor_name’ and the page number.

We need a function to get all the work done. To simplify the process we define the function below, but first we should connect to MongoDB.

Get all comments:
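The overall loop can be sketched as follows. Several things here are assumptions, because the API response is not reproduced in this article: the `comments` key, zero-based page numbering, and the `get_page` and `save` callables. With pymongo, `save` could be a collection's `insert_one` method; here a plain list stands in for the MongoDB collection.

```python
def get_all_comments(vendor_name, get_page, save):
    """Fetch every comment page for one vendor and pass each comment to save().

    get_page(vendor_name, page) stands in for get_comments_for_restaurant_page.
    It is assumed to return a dict with a 'count' field and a 'comments' list
    (the 'comments' key is hypothetical).
    """
    page = 0  # assumption: pages are numbered from 0
    fetched = 0
    while True:
        data = get_page(vendor_name, page)
        comments = data.get("comments", [])
        if not comments:
            break  # empty page: nothing more to fetch
        for comment in comments:
            save(comment)
        fetched += len(comments)
        if fetched >= data["count"]:
            break
        page += 1
    return fetched


# Stand-ins: a fake API holding 25 comments, and a list instead of MongoDB.
FAKE_COMMENTS = [{"text": f"comment {i}"} for i in range(25)]


def fake_get_page(vendor_name, page):
    start = page * 10
    return {"count": len(FAKE_COMMENTS), "comments": FAKE_COMMENTS[start:start + 10]}


stored = []
total = get_all_comments("37j82x", fake_get_page, stored.append)
```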

We need a text file containing all the comment texts.

Language Model:

We want to build a classifier that predicts the emotion behind restaurant comments.

All the comments are in Persian, and unfortunately there are no good pre-trained language models available for Persian, so we have to train our own. After training the Persian language model, so that it captures the structure of the language, we will try to recognize the emotion behind each comment.

Specifying the paths:
PATH_lang: path to the language model data
PATH: path to the labeled data for sentiment analysis

Specifying the train and validation paths for both the language model and the labeled data:
TRN: path to the labeled training data
VAL: path to the labeled validation data
TRN_lang: path to the language model training data
VAL_lang: path to the language model validation data

Making the train and validation folders that contain the language model training and validation files.
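A sketch of how such folders could be created from a list of comment lines. The folder names `train` and `valid`, the 10% validation fraction and the helper name are assumptions, not values taken from this project.

```python
import random
import tempfile
from pathlib import Path


def make_lm_folders(lines, root, val_fraction=0.1, seed=0):
    """Write shuffled lines into root/train/train.txt and root/valid/valid.txt."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    parts = {"valid": lines[:n_val], "train": lines[n_val:]}
    for name, part in parts.items():
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{name}.txt").write_text("\n".join(part) + "\n", encoding="utf-8")
    return {name: len(part) for name, part in parts.items()}


with tempfile.TemporaryDirectory() as tmp:
    sizes = make_lm_folders([f"comment {i}" for i in range(100)], tmp)
    created = sorted(p.name for p in Path(tmp).iterdir())
```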

Counting the words in the train and validation datasets:
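A rough word count can be obtained by splitting every line on whitespace. This sketch counts whitespace-separated words, not spaCy tokens, and `count_words` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def count_words(folder):
    """Count whitespace-separated words in every .txt file under folder."""
    total = 0
    for path in Path(folder).rglob("*.txt"):
        with open(path, encoding="utf-8") as f:
            total += sum(len(line.split()) for line in f)
    return total


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.txt").write_text("غذا عالی بود\nخیلی خوب\n", encoding="utf-8")
    n_words = count_words(tmp)
```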


Tokenizing the text:
Before we can work with the text, we must first convert it to an array of tokens (or words).
To tokenize the text we use the open-source tokenizer “spacy”, which supports Persian too. The Persian tokenizer is not great and needs some improvement, but it seemed sufficient for this project.

'سس اسپرینگ رول اشتباه ارسال شده بود'

Checking the tokenizer on one comment from the dataset.
Every token is separated by a “?” mark.
Everything seems to work fine.
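To make the idea of tokenization concrete without any dependency, here is a deliberately simplified stand-in. It is not spaCy: it only splits on whitespace and separates punctuation with a regular expression, whereas a real tokenizer handles many more cases. The name `simple_tokenize` is hypothetical.

```python
import re

# A run of word characters, or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and individual punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("سس اسپرینگ رول اشتباه ارسال شده بود")
with_punct = simple_tokenize("خوب بود.")
```

On the sample comment this yields seven tokens, one per word.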


We are using the fastai library, which works with torchtext, so we have to create a torchtext Field to pass to the LanguageModelData object.
Here we tell it how to preprocess the text: simply lowercase it and tokenize it with spacy.

Setting the batch size and the backprop-through-time (bptt) values.
*bptt is the number of time steps (layers of the unrolled network) that we backpropagate through.

After the ModelData object is built, it fills in the attributes of the TEXT object.

(491, 7955, 1, 2364421)

The first twelve elements of the mapping from integers to unique tokens:


numericalize is a method of TEXT that turns the tokens of a text into integers.

Variable containing:
[torch.cuda.LongTensor of size 12x1 (GPU 0)]
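The two mappings involved (integer to token, and token to integer) can be illustrated with a pure-Python stand-in. This is a simplified sketch of what a vocabulary does, not torchtext's implementation; the tiny corpus, the special tokens and the tie-breaking rule are assumptions.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, c in ordered if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to integer ids; unknown tokens map to the <unk> id."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [["غذا", "خوب", "بود"], ["غذا", "عالی", "بود"]]
itos, stoi = build_vocab(corpus)
ids = numericalize(["غذا", "سرد", "بود"], stoi)
```

The word "سرد" never appears in the tiny corpus, so it is mapped to the id of `<unk>`.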

The LanguageModelData object creates batches with 64 columns (our batch size) and sequence lengths of around 75 tokens (the bptt we defined).
Each batch also contains the same data as labels, shifted one word later in the text, since we are always trying to predict the next word. The labels are flattened into a 1D array.
Because we can’t shuffle the text files, torchtext randomly varies the bptt value a little for each batch to add an element of randomness.
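The batching scheme described above can be sketched in plain Python: the token stream is cut into `bs` parallel columns, and each batch takes up to `bptt` consecutive rows, with labels taken one row later and flattened. The function names are hypothetical and the random variation of bptt is left out for clarity.

```python
def batchify(ids, bs):
    """Arrange a flat token stream into rows of bs parallel columns."""
    col_len = len(ids) // bs  # tokens per column; the remainder is dropped
    return [[ids[col * col_len + row] for col in range(bs)] for row in range(col_len)]


def get_batch(data, start, bptt):
    """Return (x, y): y holds the same tokens shifted one step ahead, flattened."""
    seq_len = min(bptt, len(data) - 1 - start)
    x = data[start:start + seq_len]
    y = [tok for row in data[start + 1:start + 1 + seq_len] for tok in row]
    return x, y


stream = list(range(20))       # stand-in for numericalized text
data = batchify(stream, bs=2)  # 10 rows x 2 columns
x, y = get_batch(data, start=0, bptt=3)
```

Here `x` has 3 rows of 2 tokens and `y` has 6 entries, mirroring the 73x64 input and the 4672-element label tensor in the output shown in this section (73 × 64 = 4672).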

(Variable containing:
170 41 5 ... 2 308 247
7 98 2 ... 26 3 29
347 30 3 ... 42 21 279
... ⋱ ...
5 2712 2 ... 3 373 163
2 37 18 ... 74 3290 51
18 2449 451 ... 109 1155 112
[torch.cuda.LongTensor of size 73x64 (GPU 0)], Variable containing:

[torch.cuda.LongTensor of size 4672 (GPU 0)])

em_sz: size of the embedding vectors
nh: number of hidden activations in each layer
nl: number of layers

The optimizer function:
A momentum value of 0.7 works better here, so we are not using the default value of 0.9.

fastai uses a variant of the state-of-the-art AWD-LSTM language model developed by Stephen Merity, so that is the model we use too.

We train the model a little:

We save the embeddings for further use.

Let’s test the language model to see how it works:

['<unk>', 'بود', 'و', 'عالی', 'خوب', 'با', 'هم', 'که', 'به', 'خیلی']

It seems to be working fine!
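A list like the one above can be obtained by sorting the model's scores for the next token and mapping the best indices back through the vocabulary. The sketch below shows only that last step on made-up numbers; the model call itself is not shown, and `top_k_tokens`, `itos` and `scores` are hypothetical.

```python
def top_k_tokens(scores, itos, k=10):
    """Return the k tokens with the highest scores, best first."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [itos[i] for i in best]


itos = ["<unk>", "بود", "و", "عالی"]  # tiny made-up vocabulary
scores = [0.1, 0.5, 0.3, 0.05]        # made-up scores for the next token
predicted = top_k_tokens(scores, itos, k=2)
```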

کباب <unk> خیلی خوشمزه و خوب بود. فقط ای کاش <unk> که تو منو زده بودن با چیزی ک ارسال شد ...

Sentiment Analysis:

Here we define our fastai/torchtext dataset:

sequential=False tells torchtext that the field is not sequential and should not be tokenized, which suits a field that holds a single label.
splits is a torchtext method that creates the train, test, and validation sets.

Using fastai, we can create a ModelData object from the torchtext splits.

Fine-tuning a pretrained model gives us the opportunity to use differential learning rates, that is, a different learning rate for each group of layers.

Accuracy:

Training completed with final results shown above.


