Comment sentiment analysis

Armin Behjati
Published in AI Backyard
Aug 6, 2018

This article is a joint effort by Armin Behjati and Bahram Mohammadpour.

This project was done in March 2018

Zoodfood comment crawler:

Crawling comments from Zoodfood.

First we open the page containing the restaurant list and parse the restaurant links; there are 12 restaurants on each page. We define the function ‘get_restaurants_list’ to get the list of restaurants on a page, given the page number.
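Here is a minimal sketch of what ‘get_restaurants_list’ could look like, assuming requests and BeautifulSoup; the listing URL and the link selector are assumptions, not the original code:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.zoodfood.com'

def get_restaurants_list(page_number):
    """Return the restaurant links found on one page of the restaurant list."""
    # the listing URL and its page parameter are assumptions
    response = requests.get(f'{BASE_URL}/restaurants', params={'page': page_number})
    soup = BeautifulSoup(response.text, 'html.parser')
    # each page lists 12 restaurants whose links start with /restaurant/menu/
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('/restaurant/menu/')]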

The Zoodfood restaurant list has 288 pages, so we fetch the data for every page.
We can reduce the execution time by using threads.

We write the restaurant links to a text file so we can use them later.
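A sketch of the full crawl: fetch all 288 pages with a thread pool and write the links to a text file (the pool size and the file name are assumptions):

from concurrent.futures import ThreadPoolExecutor

NUM_PAGES = 288

# fetch every listing page in parallel; each call returns the links on that page
with ThreadPoolExecutor(max_workers=16) as pool:
    pages = pool.map(get_restaurants_list, range(1, NUM_PAGES + 1))

restaurant_links = [link for page in pages for link in page]

# save the links so we do not have to crawl the listing again
with open('restaurant_links.txt', 'w') as f:
    f.write('\n'.join(restaurant_links))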

Reading the links from the text file:

Sample link: ‘/restaurant/menu/37j82x/kenzo/fereshteh’

We need the third path segment of this URL, which is called the vendor name; it is the restaurant's ID in the SnappFood API.

We only need the vendor name to collect the comments.

We store the vendor names in ‘restaurant_vendor_name’.
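Reading the links back and extracting the vendor name (the third path segment) could look like this, reusing the file name assumed above:

with open('restaurant_links.txt') as f:
    restaurant_links = f.read().splitlines()

# '/restaurant/menu/37j82x/kenzo/fereshteh' -> ['', 'restaurant', 'menu', '37j82x', ...]
restaurant_vendor_name = [link.split('/')[3] for link in restaurant_links]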

URL of the comment API:

https://www.zoodfood.com/restaurant/comment/vendor/{vendor_id}/{page}
Each API call returns 10 comments, and the next page returns the next 10 comments.
If we request this URL: https://www.zoodfood.com/restaurant/comment/vendor/37j82x/0

we will get something like this:

{"status":true,"data":{"count":435,"pageSize":10,"comments":[…]}}

‘count’ shows the total number of comments, so we need roughly count/10 API calls (rounded up) to get them all.

The ‘get_comments_for_restaurant_page’ function returns the comment data given the ‘vendor_name’ and the page number.
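A sketch of ‘get_comments_for_restaurant_page’: it simply calls the comment API and returns the parsed JSON:

import requests

COMMENT_URL = 'https://www.zoodfood.com/restaurant/comment/vendor/{vendor_name}/{page}'

def get_comments_for_restaurant_page(vendor_name, page):
    """Fetch one page (10 comments) of the given restaurant's comments."""
    return requests.get(COMMENT_URL.format(vendor_name=vendor_name, page=page)).json()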

We need a function that ties all of this together. To simplify the process we define the function below, but first we have to connect to MongoDB.

Get all comments
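A sketch of the driver function, assuming a local MongoDB instance and a ‘zoodfood.comments’ collection (database and collection names are assumptions):

import math
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
comments_collection = client['zoodfood']['comments']

def get_all_comments_for_restaurant(vendor_name):
    """Fetch every comment page for one restaurant and store the comments in MongoDB."""
    count = get_comments_for_restaurant_page(vendor_name, 0)['data']['count']
    for page in range(math.ceil(count / 10)):
        data = get_comments_for_restaurant_page(vendor_name, page)['data']
        for comment in data['comments']:
            comment['vendor_name'] = vendor_name
            comments_collection.insert_one(comment)

for vendor_name in restaurant_vendor_name:
    get_all_comments_for_restaurant(vendor_name)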

Finally, we need a text file containing all of the comment texts.
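Dumping the raw comment texts from MongoDB into one file could look like this; the ‘commentText’ field name is an assumption about the API response:

with open('all_comments.txt', 'w') as f:
    for comment in comments_collection.find():
        text = comment.get('commentText', '')
        if text:
            f.write(text + '\n')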

Language Model:

We want to build a classifier that predicts the sentiment of the restaurant comments.

All the comments are in Persian, and unfortunately there are no good pre-trained language models available for Persian, so we have to train our own. After training the Persian language model, which learns the structure of the language, we will try to recognize the sentiment behind each comment.

Specifying the paths:
PATH_lang: language model data path
PATH: labeled sentiment analysis data path

Specifying the train and validation paths for both the language model and the labeled data:
TRN: training path for the labeled data
VAL: validation path for the labeled data
TRN_lang: language model training data path
VAL_lang: language model validation data path

Making train and validation folders containing the language model training and validation files.
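A sketch of the path setup and the split of the crawled comments into language-model train/validation files; the folder names and the roughly 70/30 split ratio are assumptions:

import os
import random

PATH_lang = 'data/zoodfood_lm/'   # language model data
PATH = 'data/zoodfood_clas/'      # labeled sentiment analysis data

TRN_lang, VAL_lang = 'train/', 'valid/'   # relative to PATH_lang
TRN, VAL = 'train/', 'valid/'             # relative to PATH

for folder in (PATH_lang + TRN_lang, PATH_lang + VAL_lang):
    os.makedirs(folder, exist_ok=True)

# split the crawled comments roughly 70/30 into train and validation files
with open('all_comments.txt') as f:
    comments = f.read().splitlines()
random.shuffle(comments)
cut = int(len(comments) * 0.7)
with open(PATH_lang + TRN_lang + 'train.txt', 'w') as f:
    f.write('\n'.join(comments[:cut]))
with open(PATH_lang + VAL_lang + 'valid.txt', 'w') as f:
    f.write('\n'.join(comments[cut:]))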

Counting the words in the train and validation datasets:
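The counts below can be produced with a quick pass over the two files (the original may simply have used the shell's wc -w; this is a Python equivalent):

def count_words(filename):
    # total number of whitespace-separated tokens in the file
    with open(filename) as f:
        return sum(len(line.split()) for line in f)

print(count_words(PATH_lang + TRN_lang + 'train.txt'))
print(count_words(PATH_lang + VAL_lang + 'valid.txt'))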

2120494
791984

Tokenizing the text:
Before we can work with the text we must first convert it into a list of tokens (words).
For tokenization we use the open-source tokenizer spaCy, which supports Persian as well. Although the Persian tokenizer is not great, it seemed sufficient for this project, though it could use some improvements.
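One way to set up the tokenizer, using spaCy's blank Persian pipeline (how exactly the original wired spaCy in is not shown, so treat this as a sketch):

import spacy

# spaCy ships basic Persian support; create a bare Persian tokenizer
nlp = spacy.blank('fa')

def spacy_tok(text):
    """Split a Persian string into a list of tokens."""
    return [token.text for token in nlp(text)]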

'سس اسپرینگ رول اشتباه ارسال شده بود'

Checking the tokenizer on one comment from the dataset.
Every token is separated with a “?” mark.
Everything seems to work fine.

'میتونم?بگم?هیچ?ارزونی?ای?بی?علت?نیست?.?کیفیت?پایین?ولی?با?توجه?به?مثل?هر?چقدر?پول?بدی?اش?میخوری?میشه?گفت?قابل?قبول?ولی?پیتزا?پدر?خوب?نارمک?تویه?همین?رنج?قیمت?کیفیت?بهتری?داره'

We are using the fastai library, which works with torchtext, so we have to make a torchtext Field to pass to the LanguageModelData object.
Here we tell it how to preprocess the text: simply lowercase it and tokenize with spaCy.
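With the old fastai/torchtext API the Field might be defined like this (the exact arguments are assumed):

from torchtext import data

# lowercase the text and tokenize it with the spaCy-based tokenizer defined above
TEXT = data.Field(lower=True, tokenize=spacy_tok)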

Setting the batch size and backprop-through-time (bptt) values.
*bptt is roughly how many tokens of the sequence we backpropagate through at a time.
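The concrete values are not shown in the post; judging by the batch shapes printed later, something close to this was used:

bs = 64    # batch size
bptt = 75  # backprop through time: roughly how many tokens per training chunk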

After building the ModelData object, it fills in the vocabulary attributes of the TEXT object.
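In the fastai 0.7 API this is LanguageModelData.from_text_files; a sketch, with min_freq=10 assumed as the cutoff for rare tokens:

from fastai.nlp import *  # old fastai 0.7 API

FILES = dict(train=TRN_lang, validation=VAL_lang, test=VAL_lang)
md = LanguageModelData.from_text_files(PATH_lang, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

# batches per epoch, vocabulary size, number of datasets, total tokens
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)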

(491, 7955, 1, 2364421)

The first twelve elements of the mapping from integers to unique tokens:

['<unk>',
'<pad>',
'بود',
'و',
'غذا',
'خوب',
'از',
'به',
'خیلی',
'هم',
'عالی',
'که']

numericalize is one of TEXT's attributes; it handles turning the tokens of a text into integers.
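For example, numericalizing the first twelve tokens of the training text (a sketch in the style of the fastai IMDB notebook):

# map the first 12 tokens of the training text to their integer ids
TEXT.numericalize([md.trn_ds[0].text[:12]])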

Variable containing:
170
7
347
0
537
17
134
454
1869
7
491
0
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

The LanguageModelData object creates batches with 64 columns (our batch size) and sequence lengths of around 75 tokens (the bptt we defined).
Each batch also contains the exact same data as labels, but shifted one word later in the text, since we are always trying to predict the next word. The labels are flattened into a 1-D array.
Because we can't shuffle the text, torchtext randomly varies the bptt value a little from batch to batch to add an element of randomness.

(Variable containing:
170 41 5 ... 2 308 247
7 98 2 ... 26 3 29
347 30 3 ... 42 21 279
... ⋱ ...
5 2712 2 ... 3 373 163
2 37 18 ... 74 3290 51
18 2449 451 ... 109 1155 112
[torch.cuda.LongTensor of size 73x64 (GPU 0)], Variable containing:
7
98
2

75
48
0
[torch.cuda.LongTensor of size 4672 (GPU 0)])

em_sz : the embedding vector size
nh : number of activations in each layer
nl : number of layers
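The values used are not shown; something in the spirit of the fastai IMDB notebook would be:

em_sz = 200  # embedding vector size
nh = 500     # number of activations per hidden layer
nl = 3       # number of LSTM layers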

The optimizer function:
A momentum value of 0.7 works best here, so we are not using the default value of 0.9.
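With Adam this means lowering the first beta; a sketch:

from functools import partial
import torch.optim as optim

# Adam with its first moment coefficient (the momentum term) lowered from 0.9 to 0.7
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))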

fastai uses a variant of the state-of-the-art AWD-LSTM language model developed by Stephen Merity, so that is the model we are using too.

We train the model for a while:
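The training code itself is not shown in the post; a sketch in the style of the fastai 0.7 API, where the dropout values, learning rate and cycle schedule are assumptions (the exact schedule that produced the 75 epochs below is unknown):

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)  # AR/TAR regularization from the AWD-LSTM paper
learner.clip = 0.3                                      # gradient clipping

learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)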

epoch      trn_loss   val_loss
0 6.263493 6.211119
1 5.307442 5.09662
2 5.006752 4.829635
3 4.864222 4.715991
4 4.826177 4.692709
5 4.64845 4.495593
6 4.478897 4.357899
7 4.363075 4.269142
8 4.283838 4.214092
9 4.214998 4.175522
10 4.162878 4.152645
11 4.130756 4.13245
12 4.104955 4.122128
13 4.100972 4.117318
14 4.087452 4.115951
15 4.19065 4.151838
16 4.134897 4.129659
17 4.121771 4.120149
18 4.071009 4.110334
19 4.039296 4.102594
20 4.021478 4.095991
21 3.978974 4.087306
22 3.95643 4.079997
23 3.939815 4.078523
24 3.937055 4.074559
25 3.900932 4.069588
26 3.850793 4.075124
27 3.855999 4.065688
28 3.80646 4.072929
29 3.809924 4.069959
30 3.80369 4.067943
31 3.812806 4.067925
32 3.807241 4.067443
33 3.768194 4.069095
34 3.761686 4.070869
35 3.954122 4.091923
36 3.935312 4.091307
37 3.938833 4.092941
38 3.918598 4.088365
39 3.897253 4.091039
40 3.876273 4.094545
41 3.86413 4.096863
42 3.863323 4.088464
43 3.841176 4.09819
44 3.870419 4.084446
45 3.831362 4.09162
46 3.804627 4.102284
47 3.808969 4.085278
48 3.805639 4.087205
49 3.777925 4.095012
50 3.751137 4.1057
51 3.761073 4.0914
52 3.71945 4.108792
53 3.705185 4.110999
54 3.69791 4.110427
55 3.691067 4.108355
56 3.672802 4.113387
57 3.68202 4.103521
58 3.649078 4.115421
59 3.676602 4.09705
60 3.630621 4.114202
61 3.613245 4.121602
62 3.615716 4.118817
63 3.629888 4.109546
64 3.601175 4.118355
65 3.578798 4.127409
66 3.569921 4.126852
67 3.564669 4.127878
68 3.569504 4.126769
69 3.558655 4.127853
70 3.573381 4.126194
71 3.559762 4.128838
72 3.572878 4.128811
73 3.578699 4.130381
74 3.545339 4.130868
[4.130867616921622]

We save the embeddings for further use.
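In the fastai 0.7 workflow this means saving the trained encoder and pickling the TEXT field so the classifier can reuse the same vocabulary (the file names here are made up):

import dill as pickle

learner.save_encoder('zoodfood_lm_enc')                  # embeddings + LSTM weights
pickle.dump(TEXT, open(f'{PATH_lang}TEXT.pkl', 'wb'))    # vocabulary and preprocessing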

Let’s test the language model to see how it works:
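One way to probe it: feed the model the beginning of a comment and look at its ten most likely next words (a sketch following the fastai IMDB notebook; the prompt string is made up):

import torch

m = learner.model
prompt = 'غذا خیلی'                 # made-up prompt: "the food [was] very ..."
t = TEXT.numericalize([spacy_tok(prompt)])

m[0].bs = 1   # predict with a batch size of 1
m.eval()
m.reset()
res, *_ = m(t)

# the ten most likely next tokens
next_ids = torch.topk(res[-1], 10)[1]
print([TEXT.vocab.itos[int(i)] for i in next_ids])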

['<unk>', 'بود', 'و', 'عالی', 'خوب', 'با', 'هم', 'که', 'به', 'خیلی']

It seems to be working fine!

کباب <unk> خیلی خوشمزه و خوب بود. فقط ای کاش <unk> که تو منو زده بودن با چیزی ک ارسال شد ...

Sentiment Analysis:

Here we define our fastai/torchtext dataset:

sequential=False tells torchtext that this field (the label) is a single value rather than a sequence, so it will not be tokenized.
splits is a torchtext method that creates the train, test, and validation sets.
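A sketch of the dataset definition, assuming the labeled comments sit in CSV files with a text column and a label column (the file layout is an assumption):

LABEL = data.Field(sequential=False)  # the label is a single class, not a token sequence

splits = data.TabularDataset.splits(
    path=PATH, train='train.csv', validation='valid.csv', format='csv',
    fields=[('text', TEXT), ('label', LABEL)])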

Using fastai we can create a ModelData object from the torchtext splits.
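In the fastai 0.7 API that is roughly:

md2 = TextData.from_splits(PATH, splits, bs)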

Fine-tuning a pretrained model gives us the opportunity to use differential learning rates.
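A sketch of the fine-tuning step: build the classifier on top of the saved encoder, train the head first, then unfreeze and train with lower learning rates for the earlier layers (all hyperparameters here are assumptions):

import numpy as np

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('zoodfood_lm_enc')   # reuse the language-model encoder saved earlier
m3.clip = 25.

# differential learning rates: earlier layer groups get smaller rates
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])

m3.freeze_to(-1)                     # train only the classifier head first
m3.fit(lrs / 2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)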

epoch      trn_loss   val_loss   accuracy
0 0.914599 1.026246 0.526914
1 0.892305 1.073445 0.526914
2 0.890372 1.029869 0.526914
3 0.895171 1.018384 0.526914
4 0.907669 1.034427 0.526914
epoch      trn_loss   val_loss   accuracy
0 0.726858 0.807458 0.658791
1 0.691951 0.799014 0.674747
2 0.672657 0.861043 0.665392
3 0.666511 0.858039 0.677368
4 0.659266 0.843153 0.678958
5 0.687026 0.764765 0.669744
6 0.6328 0.80844 0.670203
7 0.643107 0.885304 0.672158
8 0.625191 0.980532 0.674329
9 0.627007 0.818966 0.679019
10 0.662649 0.773294 0.658341
11 0.651018 0.732899 0.693246
12 0.640039 0.774128 0.681997
13 0.646645 0.818812 0.685979
14 0.629662 0.818131 0.687009
15 0.627583 1.013715 0.684953
16 0.620577 0.789786 0.691546
17 0.621732 0.755267 0.689668
18 0.616339 0.812259 0.69037
19 0.628711 0.816256 0.685763
20 0.631954 0.869854 0.663502
21 0.648067 0.792397 0.689935
22 0.621395 0.811201 0.687528
23 0.627996 0.75971 0.692931
24 0.61478 0.766722 0.694669
[0.7667221994412634, 0.6946690829240649]

Accuracy:

0.7319195114773396
epoch      trn_loss   val_loss   accuracy
0 0.605273 0.795732 0.697832
1 0.605745 0.737048 0.701651
2 0.61152 0.726605 0.705618
3 0.618649 0.898131 0.680639
4 0.607915 0.835565 0.684423
epoch      trn_loss   val_loss   accuracy
0 0.63425 0.712354 0.70744
1 0.625679 0.7335 0.696126
2 0.598707 0.762889 0.694836
3 0.618014 0.803139 0.694313
4 0.592371 0.827738 0.683457
5 0.593831 0.758763 0.684881
6 0.612648 0.803343 0.663893
7 0.613446 0.759131 0.694913
8 0.590957 0.866204 0.691297
9 0.59238 1.015298 0.688887
10 0.631985 0.876454 0.65384
11 0.610506 0.83315 0.689219
12 0.609516 0.753415 0.695011
13 0.612919 0.785118 0.691572
14 0.59885 0.770415 0.695097
15 0.59031 0.845301 0.677381
16 0.616533 0.770404 0.684773
17 0.60326 0.84417 0.691907
18 0.587279 0.800148 0.688964
19 0.611245 0.831045 0.689919
20 0.604469 0.737693 0.707982
21 0.594355 0.855759 0.676296
22 0.600001 0.890678 0.679592
23 0.578646 0.789512 0.691539
24 0.606092 0.871126 0.687559
[0.8711255853035156, 0.6875590290895354]
