Persian news Language model

Armin Behjati
Published in AI Backyard · 8 min read · Aug 23, 2018

This is a joint effort by Armin Behjati and Bahram Mohammadpour.

News Crawler

This news agency's website has an "archive" page where all published news articles are listed. On this page you can filter the news by category and publish date.
If you open the link below you can see the archive page:

http://www.entekhab.ir/fa/archive?service_id=18&sec_id=0&cat_id=0&rpp=100&from_date=1389/10/01&to_date=1397/01/30&p=1

There are three important parameters in the URL:
service_id : the news category
rpp : the number of news items per results page
p : the page number

We use bs4 (BeautifulSoup) to parse web pages and urllib3 to send HTTP requests.

Here we define a function to fetch an archive web page given a page number and a news category (service_id).
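A minimal sketch of what this fetch function might look like (the function name is illustrative; the fixed date range is taken from the example URL above):

```python
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

ARCHIVE_URL = ('http://www.entekhab.ir/fa/archive'
               '?service_id={service_id}&sec_id=0&cat_id=0&rpp=100'
               '&from_date=1389/10/01&to_date=1397/01/30&p={page}')

def fetch_archive_page(page, service_id):
    """Fetch one archive page and return it as a BeautifulSoup object."""
    url = ARCHIVE_URL.format(service_id=service_id, page=page)
    response = http.request('GET', url)
    return BeautifulSoup(response.data, 'html.parser')
```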

The function below parses the output of the previous function and returns a list of news links. It takes a page number and a news category (service_id) as inputs.
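A sketch, under the assumption that news links can be recognised by their "/fa/news/" path; the selector is a guess, not the authors' code:

```python
def get_news_links(page, service_id):
    """Return the news links listed on one archive page."""
    soup = fetch_archive_page(page, service_id)
    links = []
    for a in soup.select('a[href*="/fa/news/"]'):   # anchors pointing to news pages (assumed selector)
        href = a.get('href')
        if href.startswith('/'):                    # make relative links absolute
            href = 'http://www.entekhab.ir' + href
        links.append({'service_id': service_id, 'url': href})
    return links

# e.g. len(links), links[0]
# -> (100, {'service_id': 2, 'url': 'http://www.entekhab.ir/fa/news/389088'})
```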

(100, {'service_id': 2, 'url': 'http://www.entekhab.ir/fa/news/389088'})

The next thing to do is to fetch the data of each news page so that it can be stored in MongoDB; "get_data_of_page" does this job.
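A sketch of get_data_of_page, returning a document shaped like the MongoDB sample shown further down:

```python
def get_data_of_page(link):
    """Download one news page and return it as a document ready for MongoDB."""
    response = http.request('GET', link['url'])
    return {'url': link['url'],
            'content': response.data.decode('utf-8', errors='ignore'),
            'service_id': link['service_id']}
```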

The "get_all_pages_data" function takes a page number and a service_id as inputs (and a thread count as an optional input) and returns the data of every news link in the archive page whose category equals service_id.
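It could be a thin threaded wrapper around the previous function; a thread pool is one way to honour the optional thread count, though the authors' code may differ:

```python
from concurrent.futures import ThreadPoolExecutor

def get_all_pages_data(page, service_id, n_threads=8):
    """Fetch every news link listed on one archive page, using a small thread pool."""
    links = get_news_links(page, service_id)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(get_data_of_page, links))
```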

The next thing is to run a for loop over the above function and store the output in MongoDB.
Below is a sample of a document stored in MongoDB:

{'url': 'http://www.entekhab.ir/fa/news/389088',
 'content': '<html>…</html>',
 'service_id': 2}


We collect news in these four categories:

| category            | service_id |
| -------------------- | ---------- |
| politics             | 2          |
| economics            | 5          |
| havades (incidents)  | 10         |
| art and culture      | 18         |

We change the service_id and run this loop again to collect the news in the other categories.
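The driver loop might look roughly like this; the article runs it once per category and changes service_id by hand, which the sketch below folds into one outer loop (the database name, collection name and page count are assumptions):

```python
from pymongo import MongoClient

db = MongoClient()['news_db']          # local MongoDB; names are illustrative
n_pages = 100                          # archive pages per category (assumption)

for service_id in (2, 5, 10, 18):      # politics, economics, havades, art and culture
    for page in range(1, n_pages + 1):
        docs = get_all_pages_data(page, service_id)
        if docs:
            db.news.insert_many(docs)
```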

The next step is to parse the HTML data and store the news text along with its corresponding service_id,
so we define the news_parse function, which takes an HTML string as input and returns the text of the news article.
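A sketch of news_parse; the tag and class name in the selector are guesses and should be read off the site's actual markup:

```python
def news_parse(html):
    """Extract the news body text from a raw HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='news_body')   # hypothetical class name
    return body.get_text(separator=' ', strip=True) if body else ''
```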

We parse the data and grab the text and service_id of each news article.

1000 2000 3000 … 28000   (a progress counter printed for every 1,000 items parsed)

The last step in data preparation is storing the parsed data.
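A sketch of how the parsed articles could be collected into the 'news_out' DataFrame used in the next section (the db handle comes from the crawler sketch above; the file name is illustrative):

```python
import pandas as pd

rows = [{'text': news_parse(doc['content']), 'service_id': doc['service_id']}
        for doc in db.news.find()]

news_out = pd.DataFrame(rows)
news_out.to_csv('news_out.csv', index=False)   # illustrative file name
```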

News Language model

We are already familiar, from computer vision, with the idea of taking a pretrained network and adding some layers on top to make it do something different.
It's the simple idea of a backbone plus a custom head that lets us do almost anything we can think of!
Here we try to apply the same idea to NLP.

We first train a language model on the Persian news articles and then use this model to train a classifier that predicts the news categories.

Most of the code and models here are available thanks to fastai and Jeremy Howard.

We are going to use fastai.text here instead of torchtext, which we found very slow and confusing.

Setting the path of our data and a path to store the language model:
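A minimal sketch, with illustrative directory names:

```python
from pathlib import Path

PATH = Path('data/news/')          # where the crawled news lives
LM_PATH = Path('data/news_lm/')    # where the language model artefacts will be stored
LM_PATH.mkdir(parents=True, exist_ok=True)
```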

The news files from the crawler were stored in a Pandas dataframe named ‘news_out’.

As you can see there are some blank rows here that we need to get rid of.

(10171, 2)

In this part we standardize the DataFrame: each row has two columns, one for the news text and one for the news label.
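Something like the following, assuming news_out has the text and service_id columns from the earlier sketch (the blank rows mentioned above are dropped here too):

```python
news_out = news_out.dropna()
news_out = news_out[news_out['text'].str.strip() != '']   # drop empty texts as well

df = pd.DataFrame({'text': news_out['text'], 'label': news_out['service_id']},
                  columns=['text', 'label'])
df.columns.tolist()   # ['text', 'label']
```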


['text', 'label']

split_vals is a function we use to make the validation set.

We use 9,000 news articles for training and the rest for validation.
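split_vals is a tiny helper; the version below follows the fastai ML course function of the same name, and the shuffle before splitting is an assumption:

```python
def split_vals(a, n):
    """First n rows for training, the rest for validation."""
    return a[:n].copy(), a[n:].copy()

df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle first (assumption)
df_trn, df_val = split_vals(df, 9000)
df_trn.shape, df_val.shape   # ((9000, 2), (1171, 2))
```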

(1171, 2)

We don't need the news labels for the language model, so we set them all to zero.

We save the training and validation data for further use.
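Zeroing the labels and saving both splits might look like this; the file names follow the fastai imdb notebook layout and are assumptions:

```python
df_trn_lm = df_trn.copy();  df_trn_lm['label'] = 0
df_val_lm = df_val.copy();  df_val_lm['label'] = 0

df_trn_lm.to_csv(LM_PATH / 'train.csv', index=False)
df_val_lm.to_csv(LM_PATH / 'test.csv', index=False)
```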

'\xa0معاون بازاريابي فروش ايران خودرو با بيان اين كه اجراي روش هاي متنوع و جامع براي فروش محصولات يكي از مسيرهاي افزايش رضايتمندي مشتريان و فراهم كردن امكان خريد كالاي ايراني است تاكيد كرد: ايران خودرو در سال حمايت از كالاي ايراني برنامه هاي متنوعي را براي خريد محصولات به اجرا خواهد گذاشت. \xa0\xa0 \xa0\xa0به گزارش ايكوپرس، مصطفي خان كرمي با تشريح برنامه هاي فروش ايران خودرو در سال 97، اعمال تخفيف خريد كالاي ايراني را از جمله مشوق هاي خريد محصولات ايران خودرو دانست و اظهار كرد:\u200c با مطالعه و بررسي بازار و نيازسنجي از مشتريان برنامه هاي فروش را منطبق با خواست و سليقه آنان به اجرا مي گذاريم.وي افزود:\u200c طرح هاي فروش را متناسب با بودجه مشتريان در نظر مي گيريم و در اين راستا برنامه هاي فروش اقساطي و اعتباري را طرح ريزي مي كنيم تا همه گروه هاي درآمدي موفق به خريد محصولات ايراني ايران خودرو شوند.خان كرمي طرح هاي پيش فروش را از جمله طرح هاي فروش محصولات ايران خودرو نام برد و گفت: در پيش فروش هايی كه به صورت سرمايه گذاري پيشنهاد مي شود امكان مشاركت مشتريان در فرآيند توليد محصول ايجاد مي شود.وي درعين حال تصريح كرد: پرواضح است كه ايران خودرو نيز از مشاركت هاي مردمي در طرح هاي توليد بهره مند مي شود. با اين روش بازگشت سرمايه مشتريان تا موعد تحويل خودرو كاملا تضمين شده است.خان كرمي افزود:\u200c در طرح هاي پيش فروش عادي نيز امكان برنامه ريزي مالي بهتري را براي مشتريان فراهم مي كنيم.معاون بازاريابي و فروش ايران خودرو اضافه كرد: علاوه بر فروش هاي نقدي كه محصول با قيمت قطعي به دست مشتريان مي رسد، فروش اقساطي و اعتباري را نيز به اجرا مي گذاريم كه با اعمال شرايط منعطف، علاوه بر شناور بودن مبلغ پيش پرداخت، مشتريان قادر هستند زمان بازپرداخت اقساط را براساس شرايط مالي خود انتخاب كنند. | '

The next thing we need to do is to tokenize the text. Unfortunately we couldn't find any good Persian tokenizer, so we used the spaCy English tokenizer, which worked fine for the task!
We pass 'chunksize' when we read the CSV file. That means pandas does not return a DataFrame but an iterator that lets us walk through chunks of the DataFrame.
That is why we don't call tok_trn = get_texts(df_trn) directly; instead we call get_all, which loops through the DataFrame, but what it is really doing is looping through chunks of it, each chunk being a DataFrame representing a subset of the data.

Before the text itself, we add a "beginning of stream" (BOS) token, which we defined at the beginning.
So every text is going to start with 'xbos', because it's often useful for the model to know when a new text starts.

We tokenize the chunks with proc_all_mp ("process all, multiprocessing").

There is also a function called partition_by_cores which takes a list and splits it into sublists, one per CPU core on your machine.
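Putting these pieces together, the tokenization step probably looked close to the fastai imdb notebook it is adapted from. A sketch, assuming the CSVs saved above and a chunk size of 1,000 (which would match the chunk counters printed further down):

```python
from fastai.text import *   # brings Tokenizer, partition_by_cores, np, pd, …

BOS = 'xbos'   # "beginning of stream" token

def get_texts(df):
    """Tokenize one chunk of the dataframe; returns token lists and labels."""
    labels = df['label'].values.astype(np.int64)
    texts = f'\n{BOS} ' + df['text'].astype(str)        # prepend the BOS marker to every article
    tok = Tokenizer().proc_all_mp(partition_by_cores(list(texts.values)))
    return tok, list(labels)

def get_all(df_iter):
    """Loop over the chunks returned by read_csv(chunksize=...) and tokenize each one."""
    tok, labels = [], []
    for i, chunk in enumerate(df_iter):
        print(i)                                        # the chunk counter printed below
        tok_, labels_ = get_texts(chunk)
        tok += tok_
        labels += labels_
    return tok, labels

df_trn_iter = pd.read_csv(LM_PATH / 'train.csv', chunksize=1000)
df_val_iter = pd.read_csv(LM_PATH / 'test.csv', chunksize=1000)
tok_trn, trn_labels = get_all(df_trn_iter)
tok_val, val_labels = get_all(df_val_iter)
```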

We make a list of all the words that appear, in some order, and then replace every word with its index into that list. That list of all the tokens is what we call the vocabulary.

0 1 2 3 4 5 6 7 8   (chunk counter for the training set)
0 1                 (chunk counter for the validation set)
Here we just save the tokens.

Python's Counter class gives us a list of unique items and their counts. Here are the 25 most common tokens in the vocabulary.
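The count itself is something like:

```python
from collections import Counter

freq = Counter(tok for doc in tok_trn for tok in doc)
freq.most_common(25)
```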

[('و', 219034),
('در', 166423),
('به', 140454),
('از', 104388),
('که', 95511),
('این', 84752),
('را', 72305),
('با', 69094),
(':', 42844),
('است', 37288),
('برای', 29959),
('است.', 26304),
('می', 24125),
('آن', 22736),
('هم', 22181),
('یک', 19358),
('شده', 19201),
('ما', 17422),
('کرد', 17408),
('های', 17134),
('سال', 16955),
('بر', 16330),
('خود', 16176),
('گفت', 15925),
('کشور', 15450)]

We use most_common, pass in the max vocab size, and that sorts the tokens by frequency; if a token appears less often than a minimum frequency, we leave it out. That gives us itos (the same name torchtext uses, meaning integer-to-string), which is just the list of unique tokens in the vocab. We insert two more tokens: one for unknown words (_unk_) and one for padding (_pad_).

https://gist.github.com/bahrammp/507004d8d310057835bc23503467a769

stoi is a dictionary going in the opposite direction from itos (string-to-integer).

45303

Now we look up stoi for every token of every document.
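A sketch, assuming itos was built as in the gist above (token 0 is _unk_):

```python
import collections

# stoi: string-to-integer; unseen tokens fall back to 0, the _unk_ index.
stoi = collections.defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

trn_lm = np.array([[stoi[tok] for tok in doc] for doc in tok_trn])
val_lm = np.array([[stoi[tok] for tok in doc] for doc in tok_val])

vs = len(itos)
vs, len(trn_lm)   # (45303, 9000)
```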

Now our vocab size is 45,303 and the language model's training set has 9,000 documents in it.

(45303, 9000)

Language model

As before, we set our embedding size, the number of hidden units per layer, and the number of layers.
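For reference, these are the values used in the fastai imdb notebook; the ones used here may differ:

```python
em_sz, nh, nl = 400, 1150, 3   # embedding size, hidden units per LSTM layer, number of layers
```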

We take all our documents and just concatenate them back to back, and we are going to try to predict the next word after each sequence of words. Once we have a model data object, we can grab the model from it, which gives us a learner.

We grab a learner from the ModelData object.
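Assuming the earlier `from fastai.text import *`, building the ModelData object and the learner follows the fastai imdb notebook; the batch size, bptt and dropout values below are the notebook's defaults, not necessarily the ones used here:

```python
from functools import partial
import torch.optim as optim

wd = 1e-7        # weight decay
bptt = 70        # backprop-through-time length
bs = 52          # batch size
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
```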

We use the learning rate finder to find a good learning rate.
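In code (the chosen learning rate below is illustrative, read off the plot):

```python
learner.lr_find()
learner.sched.plot()   # pick a learning rate from the plot
lr = 1e-3              # illustrative value
lrs = lr
```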

Now we train the model for a few epochs.
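The fit call probably resembled the imdb notebook's; three cycles of 15 epochs would match the 45 epochs logged below, but the exact schedule is an assumption:

```python
learner.fit(lrs, 3, wds=wd, use_clr=(20, 10), cycle_len=15)   # 3 cycles × 15 epochs (assumption)
learner.save('lm_persian_news')                               # illustrative checkpoint name
```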

epoch      trn_loss   val_loss   accuracy                     
0 4.148626 5.03116 0.226647
1 4.250771 5.047802 0.224713
2 4.304603 5.040141 0.225326
3 4.255172 5.036219 0.225684
4 4.323801 5.016485 0.226707
5 4.151685 5.030256 0.227471
6 4.179784 5.024665 0.226944
7 4.124536 5.018256 0.228459
8 4.089547 5.018955 0.229405
9 4.095399 5.018171 0.229185
10 4.031234 5.021676 0.229044
11 4.111572 5.00962 0.230145
12 3.943506 5.016498 0.230516
13 3.922315 5.016849 0.230894
14 3.990867 5.018164 0.230943
15 4.186847 5.023207 0.228214
16 4.138712 5.062355 0.225501
17 4.278202 5.038208 0.226603
18 4.19066 5.041808 0.226961
19 4.083084 5.050022 0.227127
20 4.160091 5.035064 0.227734
21 4.056128 5.048086 0.22812
22 4.013799 5.045687 0.228869
23 3.940494 5.049366 0.228921
24 3.958642 5.04207 0.229198
25 3.943375 5.044687 0.229329
26 3.962328 5.038988 0.229918
27 3.842061 5.044684 0.229866
28 3.885955 5.042859 0.230633
29 3.807452 5.039023 0.231055
30 3.888656 5.08086 0.227317
31 4.159625 5.071354 0.226402
32 4.180851 5.059904 0.226543
33 4.008345 5.076501 0.227153
34 4.003859 5.071528 0.227382
35 4.187835 5.037947 0.228434
36 4.13018 5.037786 0.229056
37 3.899381 5.069072 0.228455
38 3.934846 5.065062 0.22942
39 3.876984 5.068136 0.229285
40 3.993476 5.054292 0.229019
41 4.055646 5.042522 0.23002
42 3.835831 5.06627 0.230388
43 3.781686 5.0633 0.229969
44 3.782634 5.061765 0.230488
78%|███████▊ | 1700/2188 [04:06<01:10, 6.90it/s, loss=3.71]

Generate text
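A rough sketch of how text like the sample below can be generated from the trained model, adapted from the fastai course notebooks; the seed string and sampling loop are illustrative, and the exact tensor plumbing may differ from the authors' code:

```python
m = learner.model
m[0].bs = 1          # generate with a batch size of 1
m.eval()             # turn off dropout
m.reset()            # reset the hidden state

seed = 'کارشناس ارشد حوزه انرژی'                 # illustrative seed phrase
idxs = np.array([stoi[w] for w in seed.split()])
res, *_ = m(V(T(idxs[:, None])))                 # feed the seed, shape (seq_len, 1)

words = seed.split()
for _ in range(60):
    top2 = to_np(res[-1].topk(2)[1]).flatten()   # two most likely next tokens
    n = int(top2[1] if top2[0] == 0 else top2[0])  # skip _unk_ (index 0)
    words.append(itos[n])
    res, *_ = m(V(T(np.array([[n]]))))           # feed the chosen token back in
print(' '.join(words))
```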

کارشناس ارشد حوزه انرژی و صنایع دستی و گردشگری و … در این باره گفت : این طرح در حال حاضر در حال انجام است و امیدواریم تا پایان سال جاری این پروژه به اتمام برسد.

The generated news looks fine! (Roughly: "A senior expert in the field of energy, handicrafts and tourism … said about this: this plan is currently under way and we hope the project will be completed by the end of the current year.")
