Persian to English translation

This is a joint effort by Armin Behjati and Bahram Mohammadpour.

Armin Behjati
AI Backyard
7 min read · Aug 23, 2018


Here we build a sequence-to-sequence model for Persian-to-English translation. We will be using TEP, an English-Persian parallel corpus extracted from movie subtitles and aligned at the sentence level. It has about 4 million tokens on each side, a total of 612,086 bilingual sentence pairs, and an average sentence length of 7.8 words.

We are using the fastai library to make things faster and easier.
Thanks also to Hiromi Suenaga for her explanations and notes.

We are just going to save them for later use.

We can easily split them apart into separate lists of English and Persian sentences.
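
A minimal sketch of that step, assuming the sentence pairs were saved as a pickled list of (Persian, English) tuples (the file name pairs.pkl is ours, not from the original notebook):

import pickle

# Load the sentence pairs we saved earlier (assumed format: list of (fa, en) tuples).
pairs = pickle.load(open('pairs.pkl', 'rb'))

# Split the aligned pairs into two parallel lists.
fa_qs, en_qs = zip(*pairs)
fa_qs, en_qs = list(fa_qs), list(en_qs)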

Here we tokenize the sentences. Again we are using the spaCy tokenizer, and since no good Persian tokenizer was available, we decided to try the English tokenizer on the Persian side too, and it worked fine!
proc_all_mp processes every sentence across multiple processes.
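
Roughly, the tokenization cell looks like this, using the Tokenizer class and the partition_by_cores helper from the old fastai 0.7 text module (en_qs and fa_qs are the sentence lists from the previous step):

from fastai.text import *  # fastai 0.7

# Tokenize both sides with the spaCy-based fastai Tokenizer, spread over all CPU cores.
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_qs))
fa_tok = Tokenizer.proc_all_mp(partition_by_cores(fa_qs))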

Here is an example of a sentence after tokenization:

Now we have to save the tokens.
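
A couple of pickle dumps are enough here (the file names are placeholders):

import pickle

# Save the tokenized sentences so we don't have to re-tokenize every time.
pickle.dump(en_tok, open('en_tok.pkl', 'wb'))
pickle.dump(fa_tok, open('fa_tok.pkl', 'wb'))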

Here are the 25 most frequent words:

We are going to limit the vocabulary to 50,000 words to keep things simple. We insert a few extra tokens for beginning of stream (_bos_), padding (_pad_), end of stream (_eos_), and unknown (_unk_).
For tokens we haven't seen before, we will return 3, the index of the unknown token.
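
This is roughly what the numericalization step looks like: a defaultdict falling back to 3 (_unk_) handles unseen tokens, and every sentence gets an _eos_ (2) appended. The helper name toks2ids and the file names are ours:

import pickle
import numpy as np
from collections import Counter, defaultdict

def toks2ids(tok, pre):
    # Keep the 50,000 most frequent words.
    freq = Counter(p for o in tok for p in o)
    itos = [o for o, c in freq.most_common(50000)]
    # Special tokens: 0=_bos_, 1=_pad_, 2=_eos_, 3=_unk_
    itos.insert(0, '_bos_')
    itos.insert(1, '_pad_')
    itos.insert(2, '_eos_')
    itos.insert(3, '_unk_')
    # Unseen words map to 3 (_unk_).
    stoi = defaultdict(lambda: 3, {v: k for k, v in enumerate(itos)})
    # Turn each sentence into a list of IDs, appending _eos_ at the end (ragged object array).
    ids = np.array([[stoi[o] for o in p] + [2] for p in tok], dtype=object)
    np.save(f'{pre}_ids.npy', ids)
    pickle.dump(itos, open(f'{pre}_itos.pkl', 'wb'))
    return ids, itos, stoi

en_ids, en_itos, en_stoi = toks2ids(en_tok, 'en')
fa_ids, fa_itos, fa_stoi = toks2ids(fa_tok, 'fa')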

Now we have a list of IDs for both Persian and English, and functions to map words to IDs and back.

Word Vectors

We are going to use pre-trained word vectors here, but I think that transfer learning and fine-tuning pre-trained language models would work better, which is something we are going to experiment with more!
Here we use fastText word vectors, which have a dimension of 300.

We are going to convert them into a standard Python dictionary mapping each word to its vector.
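
A sketch of that conversion using the fastText Python bindings (the import name varies between fastText and fasttext depending on how it was installed, and get_vecs and the wiki.* file names are assumptions):

import pickle
import fastText as ft

def get_vecs(lang, ft_model):
    # Build a plain dict: word -> 300-d numpy vector.
    vecd = {w: ft_model.get_word_vector(w) for w in ft_model.get_words()}
    pickle.dump(vecd, open(f'wiki.{lang}.pkl', 'wb'))
    return vecd

en_vecd = get_vecs('en', ft.load_model('wiki.en.bin'))
fa_vecd = get_vecs('fa', ft.load_model('wiki.fa.bin'))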

ModelData

We are going to pick a maximum length for our sentences.
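
Rather than the absolute maximum, a high percentile of the length distribution keeps a few very long outliers from blowing up the sequence length (the exact percentile and variable names are our choice):

import numpy as np

# Cap the output sequence length at the 99th percentile of sentence lengths.
enlen_cap = int(np.percentile([len(o) for o in en_ids], 99))
falen_cap = int(np.percentile([len(o) for o in fa_ids], 99))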

Here we define our Dataset, which requires two things: a length (__len__) and an indexer (__getitem__).
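
A minimal PyTorch Dataset that does exactly that, assuming x and y are the arrays of ID lists (Persian source, English target):

import numpy as np
from torch.utils.data import Dataset

class Seq2SeqDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        # Return one (source, target) pair of ID arrays.
        return np.array(self.x[idx]), np.array(self.y[idx])
    def __len__(self):
        return len(self.x)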

We make training and validation sets using a list of random numbers, as in the sketch below.

Now we create our Datasets.
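
Putting the last two steps together, a sketch of the split and the Dataset construction (the 90/10 split ratio and the seed are arbitrary choices of ours):

import numpy as np

np.random.seed(42)
# Keep ~90% of the pairs for training, the rest for validation.
trn_keep = np.random.rand(len(en_ids)) > 0.1
en_trn, fa_trn = en_ids[trn_keep], fa_ids[trn_keep]
en_val, fa_val = en_ids[~trn_keep], fa_ids[~trn_keep]

# Persian is the input, English is the target.
trn_ds = Seq2SeqDataset(fa_trn, en_trn)
val_ds = Seq2SeqDataset(fa_val, en_val)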

We have to declare our padding index because the DataLoader is going to pad all the sentences in a batch so that they have equal lengths. As before, we use a sampler to save some memory by putting sentences of similar length together.
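
Roughly, with the old fastai 0.7 DataLoader (which accepts pad_idx and can transpose batches) and its SortishSampler/SortSampler; the batch size is a guess and PATH is assumed to be the working directory defined earlier:

bs = 125  # batch size: tune to your GPU memory

# Training batches are sorted-ish by length, validation batches fully sorted.
trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))

# pad_idx=1 matches _pad_; transpose gives (sequence length, batch size) tensors.
trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)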

The architecture has an encoder that takes a sequence of tokens and returns a final hidden state, a single vector, which we feed into the decoder RNN.

We have to make an embedding of size vocab_size by 300 (the fastText vector size).
We multiply the pre-trained vectors by 3 because they had a standard deviation of about 0.3 and the standard for embedding weights is about 1.
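
A sketch of building that embedding from the fastText dictionary; words missing from fastText keep their random initialization, and padding_idx=1 matches _pad_:

import torch
import torch.nn as nn

def create_emb(vecs, itos, em_sz=300):
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)
    wgts = emb.weight.data
    for i, w in enumerate(itos):
        try:
            # Scale by 3 to bring the std from ~0.3 up to ~1.
            wgts[i] = torch.from_numpy(vecs[w] * 3)
        except KeyError:
            pass  # word not in fastText: keep the random init
    return emb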

We are going to step through the output sequence, put the current input through the embedding, then through the RNN, dropout, and a linear layer, and append the output to a list, which gets stacked into a single tensor and returned (see the sketch after this walkthrough).
The input to the embedding is the previous word that we translated.
outp.data.max looks in its tensor to find out which word has the highest probability. max in PyTorch returns two things: the first is the max probability itself and the second is the index of that max in the array. So we want the second item, which is the index of the most likely word.
dec_inp contains that word's index into the vocabulary.
By adding bidirectional=True to our encoder, we get a bidirectional model.
We also feed in the actual correct word early in training so that the model has less difficulty learning; this is teacher forcing, controlled by the pr_force probability.
We also used attention here. The way it works is that we create a little neural network inside the model and train it end to end.
w2h = self.l2(h[-1])
u = F.tanh(w1e + w2h)
a = F.softmax(u @ self.V, 0)
We take the last layer's hidden state and stick it into a linear layer, then into a nonlinear activation, and then do a matrix multiply. Now, rather than just taking the final encoder output, we keep the whole tensor of encoder outputs, which we weight by the attention weights this little network produces.
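
Putting the decoder loop, teacher forcing, and the attention lines above together, here is a condensed sketch of the forward pass, not the exact class: layer and helper names (gru_enc, emb_dec, out, l3, W1, V, out_sl, pr_force, initHidden) are assumed to match the model described here.

import random
import torch
import torch.nn.functional as F

def forward(self, inp, y=None, ret_attn=False):
    sl, bs = inp.size()
    h = self.initHidden(bs)
    # Encoder: embed the source sentence and run it through the encoder GRU.
    enc_out, h = self.gru_enc(self.emb_enc_drop(self.emb_enc(inp)), h)
    h = self.out_enc(h)

    dec_inp = inp.new_zeros(bs, dtype=torch.long)  # start with _bos_ (index 0)
    res, attns = [], []
    w1e = self.W1(enc_out)   # precompute the encoder half of the attention net
    for i in range(self.out_sl):
        # Attention: score every encoder output against the current decoder state.
        w2h = self.l2(h[-1])
        u = F.tanh(w1e + w2h)
        a = F.softmax(u @ self.V, 0)
        attns.append(a)
        Xa = (a.unsqueeze(2) * enc_out).sum(0)   # weighted sum of encoder outputs
        # Decoder step: embed the previous word and mix in the attention context.
        emb = self.emb_dec(dec_inp)
        outp, h = self.gru_dec(self.l3(torch.cat([emb, Xa], 1)).unsqueeze(0), h)
        outp = self.out(self.out_drop(outp[0]))
        res.append(outp)
        # Next input is the most likely word... unless teacher forcing kicks in.
        dec_inp = outp.data.max(1)[1]
        if (dec_inp == 1).all(): break            # every sentence produced padding: stop
        if (y is not None) and (random.random() < self.pr_force):
            if i >= len(y): break
            dec_inp = y[i]
    res = torch.stack(res)
    if ret_attn: res = res, torch.stack(attns)
    return res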

If the generated sequence is shorter than the target sequence, we need to add some padding. PyTorch's padding function requires a tuple of 6 values to pad a rank-3 tensor (sequence length by batch size by number of words in the vocab); each pair represents padding before and after that dimension.
F.cross_entropy expects a rank-2 tensor, but we have sequence length by batch size, so we just flatten it out. That is what view(-1, …) does.
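
A sketch of that loss function; F.pad lists padding amounts innermost dimension first, so (0, 0, 0, 0, 0, extra) pads only the sequence-length dimension at the end:

import torch.nn.functional as F

def seq2seq_loss(input, target):
    sl, bs = target.size()
    sl_in, bs_in, nc = input.size()
    # If the model stopped early, pad the predictions up to the target length.
    if sl > sl_in:
        input = F.pad(input, (0, 0, 0, 0, 0, sl - sl_in))
    input = input[:sl]
    # Flatten (sequence length, batch size) so cross_entropy sees a rank-2 tensor.
    return F.cross_entropy(input.view(-1, nc), target.view(-1))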

to_gpu will not put it on the GPU if you do not have one.
We could just call Learner to turn that into a learner, but RNN_Learner adds save_encoder and load_encoder, which can be handy sometimes.
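
Roughly, with the old fastai 0.7 API; the Seq2SeqRNN constructor arguments (fastText dicts, vocab lists, embedding sizes, hidden size, output length) and the Adam betas are assumptions:

from functools import partial
import torch.optim as optim

nh = 256  # hidden size (a guess)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
rnn = Seq2SeqRNN(fa_vecd, fa_itos, 300, en_vecd, en_itos, 300, nh, enlen_cap)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss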

As usual, we use the learning rate finder to find a good learning rate.

We train the model.
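
With fastai 0.7 that is just the following; the learning rate and cycle settings here are typical values, not necessarily the ones we used:

learn.lr_find()
learn.sched.plot()

lr = 3e-3
learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10))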

Test

Let's test our translations.
We can also visualize the attention weights and see where the model is attending at each time step.
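
A sketch of the evaluation cell: grab a validation batch, take the argmax at each time step, map IDs back to words, and plot a few of the stored attention vectors. V() and to_np are old fastai 0.7 helpers, and ret_attn matches the flag in the forward sketch above:

import matplotlib.pyplot as plt

x, y = next(iter(val_dl))
probs, attns = learn.model(V(x), ret_attn=True)
preds = to_np(probs.max(2)[1])
x, y = to_np(x), to_np(y)

for i in range(5):
    print(' '.join(fa_itos[o] for o in x[:, i] if o != 1))      # source (Persian)
    print(' '.join(en_itos[o] for o in y[:, i] if o != 1))      # reference (English)
    print(' '.join(en_itos[o] for o in preds[:, i] if o != 1))  # model output
    print()

# Attention over the source tokens at the first few decoding steps, for one sentence.
attn = to_np(attns[..., 0])
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for i, ax in enumerate(axes.flat):
    ax.plot(attn[i])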
