Persian to English translation

This is a joint effort by Armin Behjati and Bahram Mohammadpour.

Armin Behjati
AI Backyard
7 min read · Aug 23, 2018


Here we build a sequence-to-sequence model for Persian-to-English translation. We will be using TEP, an English-Persian parallel corpus extracted from movie subtitles and aligned at the sentence level. It has about 4 million tokens on each side, a total of 612,086 bilingual sentence pairs, and an average sentence length of 7.8 words.

We are using the fastai library to make things faster and easier.
Thanks also to Hiromi Suenaga for her explanations and notes.

We are just going to save them for later use.

We can easily split them apart into separate lists of English and Persian sentences.
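
A minimal sketch of that step, assuming the sentence pairs were saved as a pickled list of (Persian, English) tuples (the file name pairs.pkl is ours, not from the original notebook):

import pickle

# Load the sentence pairs we saved earlier (assumed format: list of (fa, en) tuples).
pairs = pickle.load(open('pairs.pkl', 'rb'))

# Split the aligned pairs into two parallel lists.
fa_qs, en_qs = zip(*pairs)
fa_qs, en_qs = list(fa_qs), list(en_qs)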

Here we tokenize the sentences. Again we are using the spaCy tokenizer, and since no good Persian tokenizer was available, we decided to try the English tokenizer on the Persian side too, and it worked fine!
proc_all_mp processes every sentence across multiple processes.
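
Roughly, the tokenization cell looks like this, using the Tokenizer class and the partition_by_cores helper from the old fastai 0.7 text module (en_qs and fa_qs are the sentence lists from the previous step):

from fastai.text import *  # fastai 0.7

# Tokenize both sides with the spaCy-based fastai Tokenizer, spread over all CPU cores.
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_qs))
fa_tok = Tokenizer.proc_all_mp(partition_by_cores(fa_qs))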

Here is an example of a sentence after tokenization:

Now we have to save the tokens.
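
A couple of pickle dumps are enough here (the file names are placeholders):

import pickle

# Save the tokenized sentences so we don't have to re-tokenize every time.
pickle.dump(en_tok, open('en_tok.pkl', 'wb'))
pickle.dump(fa_tok, open('fa_tok.pkl', 'wb'))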

Here are the 25 most frequent words:

We are going to limit the vocabulary to 50,000 words to keep things simple. We insert a few extra tokens for beginning of stream (_bos_), padding (_pad_), end of stream (_eos_), and unknown (_unk_).
For tokens we haven't seen before, we will return 3, the index of the unknown token.
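
This is roughly what the numericalization step looks like: a defaultdict falling back to 3 (_unk_) handles unseen tokens, and every sentence gets an _eos_ (2) appended. The helper name toks2ids and the file names are ours:

import pickle
import numpy as np
from collections import Counter, defaultdict

def toks2ids(tok, pre):
    # Keep the 50,000 most frequent words.
    freq = Counter(p for o in tok for p in o)
    itos = [o for o, c in freq.most_common(50000)]
    # Special tokens: 0=_bos_, 1=_pad_, 2=_eos_, 3=_unk_
    itos.insert(0, '_bos_')
    itos.insert(1, '_pad_')
    itos.insert(2, '_eos_')
    itos.insert(3, '_unk_')
    # Unseen words map to 3 (_unk_).
    stoi = defaultdict(lambda: 3, {v: k for k, v in enumerate(itos)})
    # Turn each sentence into a list of IDs, appending _eos_ at the end (ragged object array).
    ids = np.array([[stoi[o] for o in p] + [2] for p in tok], dtype=object)
    np.save(f'{pre}_ids.npy', ids)
    pickle.dump(itos, open(f'{pre}_itos.pkl', 'wb'))
    return ids, itos, stoi

en_ids, en_itos, en_stoi = toks2ids(en_tok, 'en')
fa_ids, fa_itos, fa_stoi = toks2ids(fa_tok, 'fa')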

Now we have a list of IDs for both Persian and English, and functions to map words to IDs and back.

Word Vectors

We are going to use pre-trained word vectors here, but I think that transfer learning and fine-tuning pre-trained language models would work better, which is something we are going to experiment with more!
Here we use fastText word vectors, which have a dimension of 300.

We are going to convert them into a standard Python dictionary mapping each word to its vector.
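
A sketch of that conversion using the fastText Python bindings (the import name varies between fastText and fasttext depending on how it was installed, and get_vecs and the wiki.* file names are assumptions):

import pickle
import fastText as ft

def get_vecs(lang, ft_model):
    # Build a plain dict: word -> 300-d numpy vector.
    vecd = {w: ft_model.get_word_vector(w) for w in ft_model.get_words()}
    pickle.dump(vecd, open(f'wiki.{lang}.pkl', 'wb'))
    return vecd

en_vecd = get_vecs('en', ft.load_model('wiki.en.bin'))
fa_vecd = get_vecs('fa', ft.load_model('wiki.fa.bin'))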

ModelData

We are going to pick a maximum length for our sentences.
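
Rather than the absolute maximum, a high percentile of the length distribution keeps a few very long outliers from blowing up the sequence length (the exact percentile and variable names are our choice):

import numpy as np

# Cap the output sequence length at the 99th percentile of sentence lengths.
enlen_cap = int(np.percentile([len(o) for o in en_ids], 99))
falen_cap = int(np.percentile([len(o) for o in fa_ids], 99))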

Here we define our Dataset, which requires two things: a length (__len__) and an indexer (__getitem__).
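
A minimal PyTorch Dataset that does exactly that, assuming x and y are the arrays of ID lists (Persian source, English target):

import numpy as np
from torch.utils.data import Dataset

class Seq2SeqDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        # Return one (source, target) pair of ID arrays.
        return np.array(self.x[idx]), np.array(self.y[idx])
    def __len__(self):
        return len(self.x)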

We make training and validation sets using a list of random numbers, as in the sketch below.

Now we create our Datasets.
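
Putting the last two steps together, a sketch of the split and the Dataset construction (the 90/10 split ratio and the seed are arbitrary choices of ours):

import numpy as np

np.random.seed(42)
# Keep ~90% of the pairs for training, the rest for validation.
trn_keep = np.random.rand(len(en_ids)) > 0.1
en_trn, fa_trn = en_ids[trn_keep], fa_ids[trn_keep]
en_val, fa_val = en_ids[~trn_keep], fa_ids[~trn_keep]

# Persian is the input, English is the target.
trn_ds = Seq2SeqDataset(fa_trn, en_trn)
val_ds = Seq2SeqDataset(fa_val, en_val)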

We have to declare our padding index because the DataLoader is going to pad all the sentences in a batch so that they have equal lengths. As before, we use a sampler to save some memory by putting sentences of similar length together.
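
Roughly, with the old fastai 0.7 DataLoader (which accepts pad_idx and can transpose batches) and its SortishSampler/SortSampler; the batch size is a guess and PATH is assumed to be the working directory defined earlier:

bs = 125  # batch size: tune to your GPU memory

# Training batches are sorted-ish by length, validation batches fully sorted.
trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))

# pad_idx=1 matches _pad_; transpose gives (sequence length, batch size) tensors.
trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)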

The architecture has an encoder that takes a sequence of tokens and returns a final hidden state, a single vector, which we feed into the decoder RNN.

We have to make an embedding of size vocab_size by 300 (the fastText vector size).
We multiply the pre-trained vectors by 3 because they had a standard deviation of about 0.3 and the standard for embedding weights is about 1.
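
A sketch of building that embedding from the fastText dictionary; words missing from fastText keep their random initialization, and padding_idx=1 matches _pad_:

import torch
import torch.nn as nn

def create_emb(vecs, itos, em_sz=300):
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)
    wgts = emb.weight.data
    for i, w in enumerate(itos):
        try:
            # Scale by 3 to bring the std from ~0.3 up to ~1.
            wgts[i] = torch.from_numpy(vecs[w] * 3)
        except KeyError:
            pass  # word not in fastText: keep the random init
    return emb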

We are going to step through the output sequence, put the current input through the embedding, then through the RNN, dropout, and a linear layer, and append the output to a list, which gets stacked into a single tensor and returned (see the sketch after this walkthrough).
The input to the embedding is the previous word that we translated.
outp.data.max looks in its tensor to find out which word has the highest probability. max in PyTorch returns two things: the first is the max probability itself and the second is the index of that max in the array. So we want the second item, which is the index of the most likely word.
dec_inp contains that word's index into the vocabulary.
By adding bidirectional=True to our encoder, we get a bidirectional model.
We also feed in the actual correct word early in training so that the model has less difficulty learning; this is teacher forcing, controlled by the pr_force probability.
We also used attention here. The way it works is that we create a little neural network inside the model and train it end to end.
w2h = self.l2(h[-1])
u = F.tanh(w1e + w2h)
a = F.softmax(u @ self.V, 0)
We take the last layer's hidden state and stick it into a linear layer, then into a nonlinear activation, and then do a matrix multiply. Now, rather than just taking the final encoder output, we keep the whole tensor of encoder outputs, which we weight by the attention weights this little network produces.
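
Putting the decoder loop, teacher forcing, and the attention lines above together, here is a condensed sketch of the forward pass, not the exact class: layer and helper names (gru_enc, emb_dec, out, l3, W1, V, out_sl, pr_force, initHidden) are assumed to match the model described here.

import random
import torch
import torch.nn.functional as F

def forward(self, inp, y=None, ret_attn=False):
    sl, bs = inp.size()
    h = self.initHidden(bs)
    # Encoder: embed the source sentence and run it through the encoder GRU.
    enc_out, h = self.gru_enc(self.emb_enc_drop(self.emb_enc(inp)), h)
    h = self.out_enc(h)

    dec_inp = inp.new_zeros(bs, dtype=torch.long)  # start with _bos_ (index 0)
    res, attns = [], []
    w1e = self.W1(enc_out)   # precompute the encoder half of the attention net
    for i in range(self.out_sl):
        # Attention: score every encoder output against the current decoder state.
        w2h = self.l2(h[-1])
        u = F.tanh(w1e + w2h)
        a = F.softmax(u @ self.V, 0)
        attns.append(a)
        Xa = (a.unsqueeze(2) * enc_out).sum(0)   # weighted sum of encoder outputs
        # Decoder step: embed the previous word and mix in the attention context.
        emb = self.emb_dec(dec_inp)
        outp, h = self.gru_dec(self.l3(torch.cat([emb, Xa], 1)).unsqueeze(0), h)
        outp = self.out(self.out_drop(outp[0]))
        res.append(outp)
        # Next input is the most likely word... unless teacher forcing kicks in.
        dec_inp = outp.data.max(1)[1]
        if (dec_inp == 1).all(): break            # every sentence produced padding: stop
        if (y is not None) and (random.random() < self.pr_force):
            if i >= len(y): break
            dec_inp = y[i]
    res = torch.stack(res)
    if ret_attn: res = res, torch.stack(attns)
    return res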

If the generated sequence is shorter than the target sequence, we need to add some padding. PyTorch's padding function requires a tuple of 6 values to pad a rank-3 tensor (sequence length by batch size by number of words in the vocab); each pair represents padding before and after that dimension.
F.cross_entropy expects a rank-2 tensor, but we have sequence length by batch size, so we just flatten it out. That is what view(-1, …) does.
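
A sketch of that loss function; F.pad lists padding amounts innermost dimension first, so (0, 0, 0, 0, 0, extra) pads only the sequence-length dimension at the end:

import torch.nn.functional as F

def seq2seq_loss(input, target):
    sl, bs = target.size()
    sl_in, bs_in, nc = input.size()
    # If the model stopped early, pad the predictions up to the target length.
    if sl > sl_in:
        input = F.pad(input, (0, 0, 0, 0, 0, sl - sl_in))
    input = input[:sl]
    # Flatten (sequence length, batch size) so cross_entropy sees a rank-2 tensor.
    return F.cross_entropy(input.view(-1, nc), target.view(-1))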

to_gpu will not put it on the GPU if you do not have one.
We could just call Learner to turn that into a learner, but RNN_Learner adds save_encoder and load_encoder, which can be handy sometimes.
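
Roughly, with the old fastai 0.7 API; the Seq2SeqRNN constructor arguments (fastText dicts, vocab lists, embedding sizes, hidden size, output length) and the Adam betas are assumptions:

from functools import partial
import torch.optim as optim

nh = 256  # hidden size (a guess)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
rnn = Seq2SeqRNN(fa_vecd, fa_itos, 300, en_vecd, en_itos, 300, nh, enlen_cap)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss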

As usual, we use the learning rate finder to find a good learning rate.

We train the model.
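
With fastai 0.7 that is just the following; the learning rate and cycle settings here are typical values, not necessarily the ones we used:

learn.lr_find()
learn.sched.plot()

lr = 3e-3
learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10))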

Test

Let's test our translations.
We can also visualize the attention weights and see where the model is attending at each time step.
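
A sketch of the evaluation cell: grab a validation batch, take the argmax at each time step, map IDs back to words, and plot a few of the stored attention vectors. V() and to_np are old fastai 0.7 helpers, and ret_attn matches the flag in the forward sketch above:

import matplotlib.pyplot as plt

x, y = next(iter(val_dl))
probs, attns = learn.model(V(x), ret_attn=True)
preds = to_np(probs.max(2)[1])
x, y = to_np(x), to_np(y)

for i in range(5):
    print(' '.join(fa_itos[o] for o in x[:, i] if o != 1))      # source (Persian)
    print(' '.join(en_itos[o] for o in y[:, i] if o != 1))      # reference (English)
    print(' '.join(en_itos[o] for o in preds[:, i] if o != 1))  # model output
    print()

# Attention over the source tokens at the first few decoding steps, for one sentence.
attn = to_np(attns[..., 0])
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for i, ax in enumerate(axes.flat):
    ax.plot(attn[i])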
