Persian news Language model

Armin Behjati
Published in AI Backyard · 8 min read · Aug 23, 2018

This is a joint effort by Armin Behjati and Bahram Mohammadpour.

News Crawler

This news agency's website has an "archive" page where all published news articles are listed. On this page you can filter the news by category and publish date.
If you open the link below you can see the archive page:

http://www.entekhab.ir/fa/archive?service_id=18&sec_id=0&cat_id=0&rpp=100&from_date=1389/10/01&to_date=1397/01/30&p=1

There are three important parameters in the URL:
service_id : the news category
rpp : the number of news items per results page
p : the page number

We use bs4 (BeautifulSoup) to parse web pages and urllib3 to send HTTP requests.

Here we define a function to fetch an archive web page given a page number and a news category (service_id).
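A minimal sketch of what this fetch function might look like (the function name is illustrative; the fixed date range is taken from the example URL above):

```python
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

ARCHIVE_URL = ('http://www.entekhab.ir/fa/archive'
               '?service_id={service_id}&sec_id=0&cat_id=0&rpp=100'
               '&from_date=1389/10/01&to_date=1397/01/30&p={page}')

def fetch_archive_page(page, service_id):
    """Fetch one archive page and return it as a BeautifulSoup object."""
    url = ARCHIVE_URL.format(service_id=service_id, page=page)
    response = http.request('GET', url)
    return BeautifulSoup(response.data, 'html.parser')
```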

The function below parses the output of the previous function and returns a list of news links. It takes a page number and a news category (service_id) as inputs.
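A sketch, under the assumption that news links can be recognised by their "/fa/news/" path; the selector is a guess, not the authors' code:

```python
def get_news_links(page, service_id):
    """Return the news links listed on one archive page."""
    soup = fetch_archive_page(page, service_id)
    links = []
    for a in soup.select('a[href*="/fa/news/"]'):   # anchors pointing to news pages (assumed selector)
        href = a.get('href')
        if href.startswith('/'):                    # make relative links absolute
            href = 'http://www.entekhab.ir' + href
        links.append({'service_id': service_id, 'url': href})
    return links

# e.g. len(links), links[0]
# -> (100, {'service_id': 2, 'url': 'http://www.entekhab.ir/fa/news/389088'})
```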

(100, {'service_id': 2, 'url': 'http://www.entekhab.ir/fa/news/389088'})

The next thing to do is to fetch the data of each news page so that it can be stored in MongoDB; "get_data_of_page" does this job.
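A sketch of get_data_of_page, returning a document shaped like the MongoDB sample shown further down:

```python
def get_data_of_page(link):
    """Download one news page and return it as a document ready for MongoDB."""
    response = http.request('GET', link['url'])
    return {'url': link['url'],
            'content': response.data.decode('utf-8', errors='ignore'),
            'service_id': link['service_id']}
```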

The "get_all_pages_data" function takes a page number and a service_id as inputs (and a thread count as an optional input) and returns the data of every news link in the archive page whose category equals service_id.
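It could be a thin threaded wrapper around the previous function; a thread pool is one way to honour the optional thread count, though the authors' code may differ:

```python
from concurrent.futures import ThreadPoolExecutor

def get_all_pages_data(page, service_id, n_threads=8):
    """Fetch every news link listed on one archive page, using a small thread pool."""
    links = get_news_links(page, service_id)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(get_data_of_page, links))
```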

The next thing is to run a for loop over the above function and store the output in MongoDB.
Below is a sample of a document stored in MongoDB:

{'url': 'http://www.entekhab.ir/fa/news/389088',
 'content': '<html>…</html>',
 'service_id': 2}


We collect news in these four categories:

| category            | service_id |
| -------------------- | ---------- |
| politics             | 2          |
| economics            | 5          |
| havades (incidents)  | 10         |
| art and culture      | 18         |

We change the service_id and run this loop again to collect the news in the other categories.
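The driver loop might look roughly like this; the article runs it once per category and changes service_id by hand, which the sketch below folds into one outer loop (the database name, collection name and page count are assumptions):

```python
from pymongo import MongoClient

db = MongoClient()['news_db']          # local MongoDB; names are illustrative
n_pages = 100                          # archive pages per category (assumption)

for service_id in (2, 5, 10, 18):      # politics, economics, havades, art and culture
    for page in range(1, n_pages + 1):
        docs = get_all_pages_data(page, service_id)
        if docs:
            db.news.insert_many(docs)
```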

The next step is to parse the HTML data and store the news text along with its corresponding service_id,
so we define the news_parse function, which takes an HTML string as input and returns the text of the news article.
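A sketch of news_parse; the tag and class name in the selector are guesses and should be read off the site's actual markup:

```python
def news_parse(html):
    """Extract the news body text from a raw HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='news_body')   # hypothetical class name
    return body.get_text(separator=' ', strip=True) if body else ''
```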

We parse the data and grab the text and service_id of each news article.

1000 2000 3000 … 28000   (a progress counter printed for every 1,000 items parsed)

The last step in data preparation is storing the parsed data.
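A sketch of how the parsed articles could be collected into the 'news_out' DataFrame used in the next section (the db handle comes from the crawler sketch above; the file name is illustrative):

```python
import pandas as pd

rows = [{'text': news_parse(doc['content']), 'service_id': doc['service_id']}
        for doc in db.news.find()]

news_out = pd.DataFrame(rows)
news_out.to_csv('news_out.csv', index=False)   # illustrative file name
```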

News Language model

We are already familiar, from computer vision, with the idea of taking a pretrained network and adding some layers on top to make it do something different.
It's the simple idea of a backbone plus a custom head that lets us do almost anything we can think of!
Here we try to apply the same idea to NLP.

We first train a language model on the Persian news articles and then use this model to train a classifier that predicts the news categories.

Most of the code and models here are available thanks to fastai and Jeremy Howard.

We are going to use fastai.text here instead of torchtext, which we found very slow and confusing.

Setting the path of our data and a path to store the language model:
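A minimal sketch, with illustrative directory names:

```python
from pathlib import Path

PATH = Path('data/news/')          # where the crawled news lives
LM_PATH = Path('data/news_lm/')    # where the language model artefacts will be stored
LM_PATH.mkdir(parents=True, exist_ok=True)
```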

The news files from the crawler were stored in a Pandas dataframe named ‘news_out’.

As you can see there are some blank rows here that we need to get rid of.

(10171, 2)

In this part we standardize the DataFrame: each row has two columns, one for the news text and one for the news label.
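Something like the following, assuming news_out has the text and service_id columns from the earlier sketch (the blank rows mentioned above are dropped here too):

```python
news_out = news_out.dropna()
news_out = news_out[news_out['text'].str.strip() != '']   # drop empty texts as well

df = pd.DataFrame({'text': news_out['text'], 'label': news_out['service_id']},
                  columns=['text', 'label'])
df.columns.tolist()   # ['text', 'label']
```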


['text', 'label']

split_vals is a function we use to make the validation set.

We use 9,000 news articles for training and the rest for validation.
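split_vals is a tiny helper; the version below follows the fastai ML course function of the same name, and the shuffle before splitting is an assumption:

```python
def split_vals(a, n):
    """First n rows for training, the rest for validation."""
    return a[:n].copy(), a[n:].copy()

df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle first (assumption)
df_trn, df_val = split_vals(df, 9000)
df_trn.shape, df_val.shape   # ((9000, 2), (1171, 2))
```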

(1171, 2)

We don't need the news labels for the language model, so we set them all to zero.

We save the training and validation data for further use.
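Zeroing the labels and saving both splits might look like this; the file names follow the fastai imdb notebook layout and are assumptions:

```python
df_trn_lm = df_trn.copy();  df_trn_lm['label'] = 0
df_val_lm = df_val.copy();  df_val_lm['label'] = 0

df_trn_lm.to_csv(LM_PATH / 'train.csv', index=False)
df_val_lm.to_csv(LM_PATH / 'test.csv', index=False)
```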

'\xa0معاون بازاريابي فروش ايران خودرو با بيان اين كه اجراي روش هاي متنوع و جامع براي فروش محصولات يكي از مسيرهاي افزايش رضايتمندي مشتريان و فراهم كردن امكان خريد كالاي ايراني است تاكيد كرد: ايران خودرو در سال حمايت از كالاي ايراني برنامه هاي متنوعي را براي خريد محصولات به اجرا خواهد گذاشت. \xa0\xa0 \xa0\xa0به گزارش ايكوپرس، مصطفي خان كرمي با تشريح برنامه هاي فروش ايران خودرو در سال 97، اعمال تخفيف خريد كالاي ايراني را از جمله مشوق هاي خريد محصولات ايران خودرو دانست و اظهار كرد:\u200c با مطالعه و بررسي بازار و نيازسنجي از مشتريان برنامه هاي فروش را منطبق با خواست و سليقه آنان به اجرا مي گذاريم.وي افزود:\u200c طرح هاي فروش را متناسب با بودجه مشتريان در نظر مي گيريم و در اين راستا برنامه هاي فروش اقساطي و اعتباري را طرح ريزي مي كنيم تا همه گروه هاي درآمدي موفق به خريد محصولات ايراني ايران خودرو شوند.خان كرمي طرح هاي پيش فروش را از جمله طرح هاي فروش محصولات ايران خودرو نام برد و گفت: در پيش فروش هايی كه به صورت سرمايه گذاري پيشنهاد مي شود امكان مشاركت مشتريان در فرآيند توليد محصول ايجاد مي شود.وي درعين حال تصريح كرد: پرواضح است كه ايران خودرو نيز از مشاركت هاي مردمي در طرح هاي توليد بهره مند مي شود. با اين روش بازگشت سرمايه مشتريان تا موعد تحويل خودرو كاملا تضمين شده است.خان كرمي افزود:\u200c در طرح هاي پيش فروش عادي نيز امكان برنامه ريزي مالي بهتري را براي مشتريان فراهم مي كنيم.معاون بازاريابي و فروش ايران خودرو اضافه كرد: علاوه بر فروش هاي نقدي كه محصول با قيمت قطعي به دست مشتريان مي رسد، فروش اقساطي و اعتباري را نيز به اجرا مي گذاريم كه با اعمال شرايط منعطف، علاوه بر شناور بودن مبلغ پيش پرداخت، مشتريان قادر هستند زمان بازپرداخت اقساط را براساس شرايط مالي خود انتخاب كنند. | '

The next thing we need to do is to tokenize the text. Unfortunately we couldn't find any good Persian tokenizer, so we used the spaCy English tokenizer, which worked fine for the task!
We pass 'chunksize' when we read the CSV file. That means pandas does not return a DataFrame but an iterator that lets us walk through chunks of the DataFrame.
That is why we don't call tok_trn = get_texts(df_trn) directly; instead we call get_all, which loops through the DataFrame, but what it is really doing is looping through chunks of it, each chunk being a DataFrame representing a subset of the data.

Before the text itself, we add a "beginning of stream" (BOS) token, which we defined at the beginning.
So every text is going to start with 'xbos', because it's often useful for the model to know when a new text starts.

We tokenize the chunks with proc_all_mp ("process all, multiprocessing").

There is also a function called partition_by_cores which takes a list and splits it into sublists, one per CPU core on your machine.
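Putting these pieces together, the tokenization step probably looked close to the fastai imdb notebook it is adapted from. A sketch, assuming the CSVs saved above and a chunk size of 1,000 (which would match the chunk counters printed further down):

```python
from fastai.text import *   # brings Tokenizer, partition_by_cores, np, pd, …

BOS = 'xbos'   # "beginning of stream" token

def get_texts(df):
    """Tokenize one chunk of the dataframe; returns token lists and labels."""
    labels = df['label'].values.astype(np.int64)
    texts = f'\n{BOS} ' + df['text'].astype(str)        # prepend the BOS marker to every article
    tok = Tokenizer().proc_all_mp(partition_by_cores(list(texts.values)))
    return tok, list(labels)

def get_all(df_iter):
    """Loop over the chunks returned by read_csv(chunksize=...) and tokenize each one."""
    tok, labels = [], []
    for i, chunk in enumerate(df_iter):
        print(i)                                        # the chunk counter printed below
        tok_, labels_ = get_texts(chunk)
        tok += tok_
        labels += labels_
    return tok, labels

df_trn_iter = pd.read_csv(LM_PATH / 'train.csv', chunksize=1000)
df_val_iter = pd.read_csv(LM_PATH / 'test.csv', chunksize=1000)
tok_trn, trn_labels = get_all(df_trn_iter)
tok_val, val_labels = get_all(df_val_iter)
```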

We make a list of all the words that appear, in some order, and then replace every word with its index into that list. That list of all the tokens is what we call the vocabulary.

0 1 2 3 4 5 6 7 8   (chunk counter for the training set)
0 1                 (chunk counter for the validation set)
Here we just save the tokens.

Python's Counter class gives us a list of unique items and their counts. Here are the 25 most common tokens in the vocabulary.
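The count itself is something like:

```python
from collections import Counter

freq = Counter(tok for doc in tok_trn for tok in doc)
freq.most_common(25)
```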

[('و', 219034),
('در', 166423),
('به', 140454),
('از', 104388),
('که', 95511),
('این', 84752),
('را', 72305),
('با', 69094),
(':', 42844),
('است', 37288),
('برای', 29959),
('است.', 26304),
('می', 24125),
('آن', 22736),
('هم', 22181),
('یک', 19358),
('شده', 19201),
('ما', 17422),
('کرد', 17408),
('های', 17134),
('سال', 16955),
('بر', 16330),
('خود', 16176),
('گفت', 15925),
('کشور', 15450)]

We use most_common, pass in the max vocab size, and that sorts the tokens by frequency; if a token appears less often than a minimum frequency, we leave it out. That gives us itos (the same name torchtext uses, meaning integer-to-string), which is just the list of unique tokens in the vocab. We insert two more tokens: one for unknown words (_unk_) and one for padding (_pad_).

https://gist.github.com/bahrammp/507004d8d310057835bc23503467a769

stoi is a dictionary going in the opposite direction from itos (string-to-integer).

45303

Now we look up stoi for every token of every document.
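A sketch, assuming itos was built as in the gist above (token 0 is _unk_):

```python
import collections

# stoi: string-to-integer; unseen tokens fall back to 0, the _unk_ index.
stoi = collections.defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

trn_lm = np.array([[stoi[tok] for tok in doc] for doc in tok_trn])
val_lm = np.array([[stoi[tok] for tok in doc] for doc in tok_val])

vs = len(itos)
vs, len(trn_lm)   # (45303, 9000)
```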

Now our vocab size is 45,303 and the language model's training set has 9,000 documents in it.

(45303, 9000)

Language model

As before, we set our embedding size, the number of hidden units per layer, and the number of layers.
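For reference, these are the values used in the fastai imdb notebook; the ones used here may differ:

```python
em_sz, nh, nl = 400, 1150, 3   # embedding size, hidden units per LSTM layer, number of layers
```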

We take all our documents and just concatenate them back to back, and we are going to try to predict the next word after each sequence of words. Once we have a model data object, we can grab the model from it, which gives us a learner.

We grab a learner from the ModelData object.
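Assuming the earlier `from fastai.text import *`, building the ModelData object and the learner follows the fastai imdb notebook; the batch size, bptt and dropout values below are the notebook's defaults, not necessarily the ones used here:

```python
from functools import partial
import torch.optim as optim

wd = 1e-7        # weight decay
bptt = 70        # backprop-through-time length
bs = 52          # batch size
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
```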

We use the learning rate finder to find a good learning rate.
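In code (the chosen learning rate below is illustrative, read off the plot):

```python
learner.lr_find()
learner.sched.plot()   # pick a learning rate from the plot
lr = 1e-3              # illustrative value
lrs = lr
```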

Now we train the model for a few epochs.
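The fit call probably resembled the imdb notebook's; three cycles of 15 epochs would match the 45 epochs logged below, but the exact schedule is an assumption:

```python
learner.fit(lrs, 3, wds=wd, use_clr=(20, 10), cycle_len=15)   # 3 cycles × 15 epochs (assumption)
learner.save('lm_persian_news')                               # illustrative checkpoint name
```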

epoch      trn_loss   val_loss   accuracy                     
0 4.148626 5.03116 0.226647
1 4.250771 5.047802 0.224713
2 4.304603 5.040141 0.225326
3 4.255172 5.036219 0.225684
4 4.323801 5.016485 0.226707
5 4.151685 5.030256 0.227471
6 4.179784 5.024665 0.226944
7 4.124536 5.018256 0.228459
8 4.089547 5.018955 0.229405
9 4.095399 5.018171 0.229185
10 4.031234 5.021676 0.229044
11 4.111572 5.00962 0.230145
12 3.943506 5.016498 0.230516
13 3.922315 5.016849 0.230894
14 3.990867 5.018164 0.230943
15 4.186847 5.023207 0.228214
16 4.138712 5.062355 0.225501
17 4.278202 5.038208 0.226603
18 4.19066 5.041808 0.226961
19 4.083084 5.050022 0.227127
20 4.160091 5.035064 0.227734
21 4.056128 5.048086 0.22812
22 4.013799 5.045687 0.228869
23 3.940494 5.049366 0.228921
24 3.958642 5.04207 0.229198
25 3.943375 5.044687 0.229329
26 3.962328 5.038988 0.229918
27 3.842061 5.044684 0.229866
28 3.885955 5.042859 0.230633
29 3.807452 5.039023 0.231055
30 3.888656 5.08086 0.227317
31 4.159625 5.071354 0.226402
32 4.180851 5.059904 0.226543
33 4.008345 5.076501 0.227153
34 4.003859 5.071528 0.227382
35 4.187835 5.037947 0.228434
36 4.13018 5.037786 0.229056
37 3.899381 5.069072 0.228455
38 3.934846 5.065062 0.22942
39 3.876984 5.068136 0.229285
40 3.993476 5.054292 0.229019
41 4.055646 5.042522 0.23002
42 3.835831 5.06627 0.230388
43 3.781686 5.0633 0.229969
44 3.782634 5.061765 0.230488
78%|███████▊ | 1700/2188 [04:06<01:10, 6.90it/s, loss=3.71]

Generate text
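A rough sketch of how text like the sample below can be generated from the trained model, adapted from the fastai course notebooks; the seed string and sampling loop are illustrative, and the exact tensor plumbing may differ from the authors' code:

```python
m = learner.model
m[0].bs = 1          # generate with a batch size of 1
m.eval()             # turn off dropout
m.reset()            # reset the hidden state

seed = 'کارشناس ارشد حوزه انرژی'                 # illustrative seed phrase
idxs = np.array([stoi[w] for w in seed.split()])
res, *_ = m(V(T(idxs[:, None])))                 # feed the seed, shape (seq_len, 1)

words = seed.split()
for _ in range(60):
    top2 = to_np(res[-1].topk(2)[1]).flatten()   # two most likely next tokens
    n = int(top2[1] if top2[0] == 0 else top2[0])  # skip _unk_ (index 0)
    words.append(itos[n])
    res, *_ = m(V(T(np.array([[n]]))))           # feed the chosen token back in
print(' '.join(words))
```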

کارشناس ارشد حوزه انرژی و صنایع دستی و گردشگری و … در این باره گفت : این طرح در حال حاضر در حال انجام است و امیدواریم تا پایان سال جاری این پروژه به اتمام برسد.

The generated news looks fine! (Roughly: "A senior expert in the field of energy, handicrafts and tourism … said about this: this plan is currently under way and we hope the project will be completed by the end of the current year.")
