Comment sentiment analysis

Armin Behjati
Published in AI Backyard
Aug 6, 2018

This article is a joint effort by Armin Behjati and Bahram Mohammadpour.

This project was done in March 2018

Zoodfood comment crawler:

Crawling comments from Zoodfood.

First we open the page containing the restaurant list and parse the restaurant links; there are 12 restaurants on each page. We define the function ‘get_restaurants_list’ to get the list of restaurants on a page, given the page number.
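Here is a minimal sketch of what ‘get_restaurants_list’ could look like, assuming requests and BeautifulSoup; the listing URL and the link selector are assumptions, not the original code:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.zoodfood.com'

def get_restaurants_list(page_number):
    """Return the restaurant links found on one page of the restaurant list."""
    # the listing URL and its page parameter are assumptions
    response = requests.get(f'{BASE_URL}/restaurants', params={'page': page_number})
    soup = BeautifulSoup(response.text, 'html.parser')
    # each page lists 12 restaurants whose links start with /restaurant/menu/
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('/restaurant/menu/')]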

The Zoodfood restaurant list has 288 pages, so we fetch the data for every page.
We can reduce the execution time by using threads.

We write the restaurant links to a text file so we can use them later.
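A sketch of the full crawl: fetch all 288 pages with a thread pool and write the links to a text file (the pool size and the file name are assumptions):

from concurrent.futures import ThreadPoolExecutor

NUM_PAGES = 288

# fetch every listing page in parallel; each call returns the links on that page
with ThreadPoolExecutor(max_workers=16) as pool:
    pages = pool.map(get_restaurants_list, range(1, NUM_PAGES + 1))

restaurant_links = [link for page in pages for link in page]

# save the links so we do not have to crawl the listing again
with open('restaurant_links.txt', 'w') as f:
    f.write('\n'.join(restaurant_links))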

Reading the links from the text file:

Sample link: ‘/restaurant/menu/37j82x/kenzo/fereshteh’

We need the third path segment of this URL, which is called the vendor name; it is the restaurant's ID in the SnappFood API.

We only need the vendor name to collect the comments.

We store the vendor names in ‘restaurant_vendor_name’.
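Reading the links back and extracting the vendor name (the third path segment) could look like this, reusing the file name assumed above:

with open('restaurant_links.txt') as f:
    restaurant_links = f.read().splitlines()

# '/restaurant/menu/37j82x/kenzo/fereshteh' -> ['', 'restaurant', 'menu', '37j82x', ...]
restaurant_vendor_name = [link.split('/')[3] for link in restaurant_links]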

URL of the comment API:

https://www.zoodfood.com/restaurant/comment/vendor/{vendor_id}/{page}
Each API call returns 10 comments, and the next page returns the next 10 comments.
If we request this URL: https://www.zoodfood.com/restaurant/comment/vendor/37j82x/0

we will get something like this:

{"status":true,"data":{"count":435,"pageSize":10,"comments":[…]}}

‘count’ shows the total number of comments, so we need roughly count/10 API calls (rounded up) to get them all.

The ‘get_comments_for_restaurant_page’ function returns the comment data given the ‘vendor_name’ and the page number.
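A sketch of ‘get_comments_for_restaurant_page’: it simply calls the comment API and returns the parsed JSON:

import requests

COMMENT_URL = 'https://www.zoodfood.com/restaurant/comment/vendor/{vendor_name}/{page}'

def get_comments_for_restaurant_page(vendor_name, page):
    """Fetch one page (10 comments) of the given restaurant's comments."""
    return requests.get(COMMENT_URL.format(vendor_name=vendor_name, page=page)).json()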

We need a function that ties all of this together. To simplify the process we define the function below, but first we have to connect to MongoDB.

Get all comments
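A sketch of the driver function, assuming a local MongoDB instance and a ‘zoodfood.comments’ collection (database and collection names are assumptions):

import math
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
comments_collection = client['zoodfood']['comments']

def get_all_comments_for_restaurant(vendor_name):
    """Fetch every comment page for one restaurant and store the comments in MongoDB."""
    count = get_comments_for_restaurant_page(vendor_name, 0)['data']['count']
    for page in range(math.ceil(count / 10)):
        data = get_comments_for_restaurant_page(vendor_name, page)['data']
        for comment in data['comments']:
            comment['vendor_name'] = vendor_name
            comments_collection.insert_one(comment)

for vendor_name in restaurant_vendor_name:
    get_all_comments_for_restaurant(vendor_name)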

Finally, we need a text file containing all of the comment texts.
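Dumping the raw comment texts from MongoDB into one file could look like this; the ‘commentText’ field name is an assumption about the API response:

with open('all_comments.txt', 'w') as f:
    for comment in comments_collection.find():
        text = comment.get('commentText', '')
        if text:
            f.write(text + '\n')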

Language Model:

We want to build a classifier that predicts the sentiment of the restaurant comments.

All the comments are in Persian, and unfortunately there are no good pre-trained language models available for Persian, so we have to train our own. After training the Persian language model, which learns the structure of the language, we will try to recognize the sentiment behind each comment.

Specifying the paths:
PATH_lang: language model data path
PATH: labeled sentiment analysis data path

Specifying the train and validation paths for both the language model and the labeled data:
TRN: training path for the labeled data
VAL: validation path for the labeled data
TRN_lang: language model training data path
VAL_lang: language model validation data path

Making train and validation folders containing the language model training and validation files.
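A sketch of the path setup and the split of the crawled comments into language-model train/validation files; the folder names and the roughly 70/30 split ratio are assumptions:

import os
import random

PATH_lang = 'data/zoodfood_lm/'   # language model data
PATH = 'data/zoodfood_clas/'      # labeled sentiment analysis data

TRN_lang, VAL_lang = 'train/', 'valid/'   # relative to PATH_lang
TRN, VAL = 'train/', 'valid/'             # relative to PATH

for folder in (PATH_lang + TRN_lang, PATH_lang + VAL_lang):
    os.makedirs(folder, exist_ok=True)

# split the crawled comments roughly 70/30 into train and validation files
with open('all_comments.txt') as f:
    comments = f.read().splitlines()
random.shuffle(comments)
cut = int(len(comments) * 0.7)
with open(PATH_lang + TRN_lang + 'train.txt', 'w') as f:
    f.write('\n'.join(comments[:cut]))
with open(PATH_lang + VAL_lang + 'valid.txt', 'w') as f:
    f.write('\n'.join(comments[cut:]))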

Counting the words in the train and validation datasets:
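The counts below can be produced with a quick pass over the two files (the original may simply have used the shell's wc -w; this is a Python equivalent):

def count_words(filename):
    # total number of whitespace-separated tokens in the file
    with open(filename) as f:
        return sum(len(line.split()) for line in f)

print(count_words(PATH_lang + TRN_lang + 'train.txt'))
print(count_words(PATH_lang + VAL_lang + 'valid.txt'))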

2120494
791984

Tokenizing the text:
Before we can work with the text we must first convert it into a list of tokens (words).
For tokenization we use the open-source tokenizer spaCy, which supports Persian as well. Although the Persian tokenizer is not great, it seemed sufficient for this project, though it could use some improvements.
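One way to set up the tokenizer, using spaCy's blank Persian pipeline (how exactly the original wired spaCy in is not shown, so treat this as a sketch):

import spacy

# spaCy ships basic Persian support; create a bare Persian tokenizer
nlp = spacy.blank('fa')

def spacy_tok(text):
    """Split a Persian string into a list of tokens."""
    return [token.text for token in nlp(text)]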

'سس اسپرینگ رول اشتباه ارسال شده بود'

Checking the tokenizer on one comment from the dataset.
Every token is separated with a “?” mark.
Everything seems to work fine.

'میتونم?بگم?هیچ?ارزونی?ای?بی?علت?نیست?.?کیفیت?پایین?ولی?با?توجه?به?مثل?هر?چقدر?پول?بدی?اش?میخوری?میشه?گفت?قابل?قبول?ولی?پیتزا?پدر?خوب?نارمک?تویه?همین?رنج?قیمت?کیفیت?بهتری?داره'

We are using the fastai library, which works with torchtext, so we have to make a torchtext Field to pass to the LanguageModelData object.
Here we tell it how to preprocess the text: simply lowercase it and tokenize with spaCy.
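With the old fastai/torchtext API the Field might be defined like this (the exact arguments are assumed):

from torchtext import data

# lowercase the text and tokenize it with the spaCy-based tokenizer defined above
TEXT = data.Field(lower=True, tokenize=spacy_tok)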

Setting the batch size and backprop-through-time (bptt) values.
*bptt is roughly how many tokens of the sequence we backpropagate through at a time.
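The concrete values are not shown in the post; judging by the batch shapes printed later, something close to this was used:

bs = 64    # batch size
bptt = 75  # backprop through time: roughly how many tokens per training chunk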

After building the ModelData object, it fills in the vocabulary attributes of the TEXT object.
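In the fastai 0.7 API this is LanguageModelData.from_text_files; a sketch, with min_freq=10 assumed as the cutoff for rare tokens:

from fastai.nlp import *  # old fastai 0.7 API

FILES = dict(train=TRN_lang, validation=VAL_lang, test=VAL_lang)
md = LanguageModelData.from_text_files(PATH_lang, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

# batches per epoch, vocabulary size, number of datasets, total tokens
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)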

(491, 7955, 1, 2364421)

The first twelve elements of the mapping from integers to unique tokens:

['<unk>',
'<pad>',
'بود',
'و',
'غذا',
'خوب',
'از',
'به',
'خیلی',
'هم',
'عالی',
'که']

numericalize is one of TEXT's attributes; it handles turning the tokens of a text into integers.
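For example, numericalizing the first twelve tokens of the training text (a sketch in the style of the fastai IMDB notebook):

# map the first 12 tokens of the training text to their integer ids
TEXT.numericalize([md.trn_ds[0].text[:12]])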

Variable containing:
170
7
347
0
537
17
134
454
1869
7
491
0
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

The LanguageModelData object creates batches with 64 columns (our batch size) and sequence lengths of around 75 tokens (the bptt we defined).
Each batch also contains the exact same data as labels, but shifted one word later in the text, since we are always trying to predict the next word. The labels are flattened into a 1-D array.
Because we can't shuffle the text, torchtext randomly varies the bptt value a little from batch to batch to add an element of randomness.

(Variable containing:
170 41 5 ... 2 308 247
7 98 2 ... 26 3 29
347 30 3 ... 42 21 279
... ⋱ ...
5 2712 2 ... 3 373 163
2 37 18 ... 74 3290 51
18 2449 451 ... 109 1155 112
[torch.cuda.LongTensor of size 73x64 (GPU 0)], Variable containing:
7
98
2

75
48
0
[torch.cuda.LongTensor of size 4672 (GPU 0)])

em_sz : the embedding vector size
nh : number of activations in each layer
nl : number of layers
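The values used are not shown; something in the spirit of the fastai IMDB notebook would be:

em_sz = 200  # embedding vector size
nh = 500     # number of activations per hidden layer
nl = 3       # number of LSTM layers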

The optimizer function:
A momentum value of 0.7 works best here, so we are not using the default value of 0.9.
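With Adam this means lowering the first beta; a sketch:

from functools import partial
import torch.optim as optim

# Adam with its first moment coefficient (the momentum term) lowered from 0.9 to 0.7
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))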

fastai uses a variant of the state-of-the-art AWD-LSTM language model developed by Stephen Merity, so that is the model we are using too.

We train the model for a while:
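The training code itself is not shown in the post; a sketch in the style of the fastai 0.7 API, where the dropout values, learning rate and cycle schedule are assumptions (the exact schedule that produced the 75 epochs below is unknown):

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)  # AR/TAR regularization from the AWD-LSTM paper
learner.clip = 0.3                                      # gradient clipping

learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)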

epoch      trn_loss   val_loss
0 6.263493 6.211119
1 5.307442 5.09662
2 5.006752 4.829635
3 4.864222 4.715991
4 4.826177 4.692709
5 4.64845 4.495593
6 4.478897 4.357899
7 4.363075 4.269142
8 4.283838 4.214092
9 4.214998 4.175522
10 4.162878 4.152645
11 4.130756 4.13245
12 4.104955 4.122128
13 4.100972 4.117318
14 4.087452 4.115951
15 4.19065 4.151838
16 4.134897 4.129659
17 4.121771 4.120149
18 4.071009 4.110334
19 4.039296 4.102594
20 4.021478 4.095991
21 3.978974 4.087306
22 3.95643 4.079997
23 3.939815 4.078523
24 3.937055 4.074559
25 3.900932 4.069588
26 3.850793 4.075124
27 3.855999 4.065688
28 3.80646 4.072929
29 3.809924 4.069959
30 3.80369 4.067943
31 3.812806 4.067925
32 3.807241 4.067443
33 3.768194 4.069095
34 3.761686 4.070869
35 3.954122 4.091923
36 3.935312 4.091307
37 3.938833 4.092941
38 3.918598 4.088365
39 3.897253 4.091039
40 3.876273 4.094545
41 3.86413 4.096863
42 3.863323 4.088464
43 3.841176 4.09819
44 3.870419 4.084446
45 3.831362 4.09162
46 3.804627 4.102284
47 3.808969 4.085278
48 3.805639 4.087205
49 3.777925 4.095012
50 3.751137 4.1057
51 3.761073 4.0914
52 3.71945 4.108792
53 3.705185 4.110999
54 3.69791 4.110427
55 3.691067 4.108355
56 3.672802 4.113387
57 3.68202 4.103521
58 3.649078 4.115421
59 3.676602 4.09705
60 3.630621 4.114202
61 3.613245 4.121602
62 3.615716 4.118817
63 3.629888 4.109546
64 3.601175 4.118355
65 3.578798 4.127409
66 3.569921 4.126852
67 3.564669 4.127878
68 3.569504 4.126769
69 3.558655 4.127853
70 3.573381 4.126194
71 3.559762 4.128838
72 3.572878 4.128811
73 3.578699 4.130381
74 3.545339 4.130868
[4.130867616921622]

We save the embeddings for further use.
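In the fastai 0.7 workflow this means saving the trained encoder and pickling the TEXT field so the classifier can reuse the same vocabulary (the file names here are made up):

import dill as pickle

learner.save_encoder('zoodfood_lm_enc')                  # embeddings + LSTM weights
pickle.dump(TEXT, open(f'{PATH_lang}TEXT.pkl', 'wb'))    # vocabulary and preprocessing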

Let’s test the language model to see how it works:
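One way to probe it: feed the model the beginning of a comment and look at its ten most likely next words (a sketch following the fastai IMDB notebook; the prompt string is made up):

import torch

m = learner.model
prompt = 'غذا خیلی'                 # made-up prompt: "the food [was] very ..."
t = TEXT.numericalize([spacy_tok(prompt)])

m[0].bs = 1   # predict with a batch size of 1
m.eval()
m.reset()
res, *_ = m(t)

# the ten most likely next tokens
next_ids = torch.topk(res[-1], 10)[1]
print([TEXT.vocab.itos[int(i)] for i in next_ids])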

['<unk>', 'بود', 'و', 'عالی', 'خوب', 'با', 'هم', 'که', 'به', 'خیلی']

It seems to be working fine!

کباب <unk> خیلی خوشمزه و خوب بود. فقط ای کاش <unk> که تو منو زده بودن با چیزی ک ارسال شد ...

Sentiment Analysis:

Here we define our fastai/torchtext dataset:

sequential=False tells torchtext that this field (the label) is a single value rather than a sequence, so it will not be tokenized.
splits is a torchtext method that creates the train, test, and validation sets.
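A sketch of the dataset definition, assuming the labeled comments sit in CSV files with a text column and a label column (the file layout is an assumption):

LABEL = data.Field(sequential=False)  # the label is a single class, not a token sequence

splits = data.TabularDataset.splits(
    path=PATH, train='train.csv', validation='valid.csv', format='csv',
    fields=[('text', TEXT), ('label', LABEL)])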

Using fastai we can create a ModelData object from the torchtext splits.
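In the fastai 0.7 API that is roughly:

md2 = TextData.from_splits(PATH, splits, bs)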

Fine-tuning a pretrained model gives us the opportunity to use differential learning rates.
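A sketch of the fine-tuning step: build the classifier on top of the saved encoder, train the head first, then unfreeze and train with lower learning rates for the earlier layers (all hyperparameters here are assumptions):

import numpy as np

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('zoodfood_lm_enc')   # reuse the language-model encoder saved earlier
m3.clip = 25.

# differential learning rates: earlier layer groups get smaller rates
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])

m3.freeze_to(-1)                     # train only the classifier head first
m3.fit(lrs / 2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)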

epoch      trn_loss   val_loss   accuracy
0 0.914599 1.026246 0.526914
1 0.892305 1.073445 0.526914
2 0.890372 1.029869 0.526914
3 0.895171 1.018384 0.526914
4 0.907669 1.034427 0.526914
epoch      trn_loss   val_loss   accuracy
0 0.726858 0.807458 0.658791
1 0.691951 0.799014 0.674747
2 0.672657 0.861043 0.665392
3 0.666511 0.858039 0.677368
4 0.659266 0.843153 0.678958
5 0.687026 0.764765 0.669744
6 0.6328 0.80844 0.670203
7 0.643107 0.885304 0.672158
8 0.625191 0.980532 0.674329
9 0.627007 0.818966 0.679019
10 0.662649 0.773294 0.658341
11 0.651018 0.732899 0.693246
12 0.640039 0.774128 0.681997
13 0.646645 0.818812 0.685979
14 0.629662 0.818131 0.687009
15 0.627583 1.013715 0.684953
16 0.620577 0.789786 0.691546
17 0.621732 0.755267 0.689668
18 0.616339 0.812259 0.69037
19 0.628711 0.816256 0.685763
20 0.631954 0.869854 0.663502
21 0.648067 0.792397 0.689935
22 0.621395 0.811201 0.687528
23 0.627996 0.75971 0.692931
24 0.61478 0.766722 0.694669
[0.7667221994412634, 0.6946690829240649]

Accuracy:

0.7319195114773396
epoch      trn_loss   val_loss   accuracy
0 0.605273 0.795732 0.697832
1 0.605745 0.737048 0.701651
2 0.61152 0.726605 0.705618
3 0.618649 0.898131 0.680639
4 0.607915 0.835565 0.684423
epoch      trn_loss   val_loss   accuracy
0 0.63425 0.712354 0.70744
1 0.625679 0.7335 0.696126
2 0.598707 0.762889 0.694836
3 0.618014 0.803139 0.694313
4 0.592371 0.827738 0.683457
5 0.593831 0.758763 0.684881
6 0.612648 0.803343 0.663893
7 0.613446 0.759131 0.694913
8 0.590957 0.866204 0.691297
9 0.59238 1.015298 0.688887
10 0.631985 0.876454 0.65384
11 0.610506 0.83315 0.689219
12 0.609516 0.753415 0.695011
13 0.612919 0.785118 0.691572
14 0.59885 0.770415 0.695097
15 0.59031 0.845301 0.677381
16 0.616533 0.770404 0.684773
17 0.60326 0.84417 0.691907
18 0.587279 0.800148 0.688964
19 0.611245 0.831045 0.689919
20 0.604469 0.737693 0.707982
21 0.594355 0.855759 0.676296
22 0.600001 0.890678 0.679592
23 0.578646 0.789512 0.691539
24 0.606092 0.871126 0.687559
[0.8711255853035156, 0.6875590290895354]
