Natural Language Processing — Neural Machine Translator

Sophie Zhao
Published in The Startup
Jul 2, 2020

Natural Language Processing (NLP) is a combination of computer science, information engineering, and artificial intelligence (AI). While people use words to communicate, computers operate in the language of numbers. Yet those numbers can act as a bridge between the diverse languages of our world: using NLP, we can create a translation system that leads us toward open and effective communication. This emerging ability of computers to understand and analyze human language is what we call Natural Language Processing.

In order for the computer to understand the words we use in our languages, we break paragraphs and sentences into units of language. These units are converted into numbers and then back into words, this time in the target language, completing the translation process. There are a number of concepts and algorithms involved in this process, which we will go through in this story:

  • Preprocessing: Tokenization and UNK Replacement
  • Word Embeddings: Vectors and Dimension Reduction
  • Sequence to Sequence: Encoding and Decoding

These topics cover the general idea of translation. However, in order to perform this process, we made use of a recent approach to machine translation: Neural Machine Translation. This approach is loosely inspired by the neurons in our brains and how we learn as humans. It uses the following concepts:

  • Datasets: Training, Validation, Testing
  • Neural Networks: Recurrent Neural Networks
  • Long Short Term Memory and Attention
  • Parallel Corpus
  • Forward and Backward Propagation
  • Loss Function
  • Teacher Forcing

With so many concepts to cover, let's dive in!

PREPROCESSING

Preprocessing occurs at the start of machine translation, converting raw text data into a form that the machine can understand. There are two major parts to this process: tokenization and UNK replacement. Preprocessing may also include converting the data to lowercase so that the machine treats uppercase and lowercase versions of a word as the same word.

Tokenization

Tokenization is the breaking down of larger data, in the form of paragraphs and sentences, into "tokens" or units of language. These units are often individual words and punctuation marks. We also introduce two extra tokens that mark the Start of a Sentence (SOS) and the End of a Sentence (EOS).
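To make this concrete, here is a minimal tokenization sketch in plain Python (our actual pipeline may have differed); it lowercases a sentence, separates words from punctuation, and wraps the tokens in SOS and EOS markers:

```python
import re

SOS, EOS = "<SOS>", "<EOS>"

def tokenize(sentence):
    # Lowercase, then split into word tokens and individual punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return [SOS] + tokens + [EOS]

print(tokenize("The cat ate the lasagna."))
# ['<SOS>', 'the', 'cat', 'ate', 'the', 'lasagna', '.', '<EOS>']
```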

UNK Replacement

When training the machine on datasets, some words do not appear enough times for the program to learn their meanings. When the program does not recognize or understand a word, it replaces the word with UNK, short for unknown, then continues with the translation. These gaps in the sentences may instead be filled in later during post-processing.
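A simple sketch of how rare words can be replaced, assuming a frequency threshold (the cutoff value here is illustrative, not the one our project used):

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=2):
    # Keep only words that appear at least `min_count` times in training data.
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_unk(tokens, vocab):
    # Any word the model has not learned well enough becomes <UNK>.
    return [tok if tok in vocab else "<UNK>" for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)                      # {'the', 'sat'} with min_count=2
print(replace_unk(["the", "cat", "ran"], vocab))
# ['the', '<UNK>', '<UNK>']
```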

WORD EMBEDDINGS

After preprocessing, machine translation uses word embeddings. These are vector representations of text. Basically, each word is assigned a series of numbers, also known as a vector, with some chosen number of dimensions. These dimensions help describe the meaning of the words in a sentence.
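As an illustration, an embedding is essentially a lookup table of trainable vectors, one per word. The sketch below uses PyTorch's nn.Embedding purely as an example; the article does not prescribe a specific library, and the dimension of 8 is an arbitrary choice.

```python
import torch
import torch.nn as nn

vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2, "man": 3, "boy": 4}

# Each word index maps to a trainable vector of 8 numbers (dimensions).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

word_ids = torch.tensor([vocab["man"], vocab["boy"]])
vectors = embedding(word_ids)    # shape: (2, 8), one vector per word
print(vectors.shape)
```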

For example, imagine that you have just witnessed a robbery and the police ask you to describe the robber. You may recall the culprit’s appearance: height, body type, hair color, skin color, etc.

The more descriptive you are, the better understanding the police will have of who they are looking for. However, there is a limit to how descriptive you should be. You would not try to describe the length of every piece of hair on the robber's head; such details are simply irrelevant. It is the same with dimensions: a word's meaning becomes clearer with more dimensions, but past a certain point the extra information becomes unnecessary.

In fact, at times we may want to plot words and their meanings on a graph. This allows us to visualize the vectors and see which words have similar or dissimilar meanings; words with more similar meanings are plotted closer together. An example might be "man" and "boy": they will likely have similar numerical values, allowing the computer to understand that they are related. If our vectors have more than two dimensions, we first have to go through Dimension Reduction to narrow them down to two dimensions: an x value and a y value.
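One common way to reduce many dimensions down to two for plotting is Principal Component Analysis (PCA). The sketch below uses scikit-learn and random stand-in vectors purely for illustration; the article does not say which reduction method was used.

```python
import numpy as np
from sklearn.decomposition import PCA

words = ["man", "boy", "king", "queen"]
vectors = np.random.rand(4, 50)          # stand-in for real 50-dimensional embeddings

# Project each 50-dimensional vector down to an (x, y) point for plotting.
points_2d = PCA(n_components=2).fit_transform(vectors)
for word, (x, y) in zip(words, points_2d):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```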

Vectors are so important to word embeddings because their values let the language change while the meaning stays the same. The translator focuses on those numerical values rather than on strings or dictionary definitions. Vectors become even more interesting when we analyze them to find relationships between words. We can even apply mathematical operations to them, including addition and subtraction, which allows us to identify similarities and differences among words and sentences.

For example, from the well-known relation King - Man + Woman ≈ Queen, we understand that the meaning of "queen" resembles that of "king" and "woman" more than it resembles "man".
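The sketch below shows this arithmetic on toy, made-up vectors; with real trained embeddings the idea is the same, with the result landing nearest to "queen".

```python
import numpy as np

# Made-up 3-dimensional vectors, purely for illustration.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

target = emb["king"] - emb["man"] + emb["woman"]

def cosine(a, b):
    # Cosine similarity: 1 means the vectors point in the same direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)   # 'queen' with these toy vectors
```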

SEQUENCE TO SEQUENCE

The system that uses vectors to translate a sentence or sequence of words is called sequence to sequence.

It is made up of an encoder and a decoder. Encoding allows the program to convert the original sentence into a vector for comprehension. This is much like describing the culprit's appearance to a sketch artist. Then, decoding turns the values in the vector back into human language, this time in the language we want to translate into.
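A condensed sketch of such an encoder-decoder, assuming PyTorch and LSTM layers (a simplification of a real NMT model, with made-up vocabulary sizes):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoding: compress the source sentence into a final hidden state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoding: unroll the target sentence starting from that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)          # scores over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
scores = model(torch.randint(0, 1000, (1, 5)), torch.randint(0, 1200, (1, 6)))
print(scores.shape)   # torch.Size([1, 6, 1200])
```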

Neural Machine Translation

Next, we will go over Neural Machine Translation (NMT) and its tactics. As mentioned previously, it utilizes the concept of neural networks to translate sentences by training on large amounts of data.

Datasets

In order for the AI to learn the languages, we provide it with different datasets.

The data it first learns from is the training dataset, which makes up the majority of the data given to the program. Next, we use validation data to provide an unbiased evaluation of the model; this second set checks whether the model is working correctly. After tuning the model with validation data, the testing data provides the final evaluation of the program.
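A simple sketch of such a split; the 80/10/10 ratio below is an assumption for illustration, not the exact split our project used.

```python
import random

def split_dataset(pairs, train_frac=0.8, val_frac=0.1, seed=0):
    # Shuffle once so the three sets are drawn from the same distribution.
    pairs = pairs[:]
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    train = pairs[:n_train]                   # used for learning
    val = pairs[n_train:n_train + n_val]      # used for tuning and checking
    test = pairs[n_train + n_val:]            # held out for the final evaluation
    return train, val, test
```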

Neural Networks

We know that NMT relies on Neural Networks (NN), but what are NNs? They are networks of connected nodes, inspired by the neurons in our brains. The input that goes into the network is processed and analyzed by the network's hidden layers to extract certain features, which are then used to produce an output.

Recurrent Neural Networks

More specifically, we made use of a type of NN called a Recurrent Neural Network (RNN), which is often used for processing text. The network learns in a sequential manner, one word at a time: it keeps what it has already read in a stored memory and uses it to process each new input.
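A minimal sketch of this word-by-word processing, assuming PyTorch's basic RNN cell; the hidden state acts as the memory carried from one word to the next.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
hidden = torch.zeros(1, 16)                  # empty "memory" before reading anything

sentence = torch.randn(5, 1, 8)              # 5 word vectors, batch size of 1
for word_vec in sentence:
    hidden = cell(word_vec, hidden)          # memory is updated one word at a time
print(hidden.shape)                          # torch.Size([1, 16])
```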

Long Short Term Memory

Narrowing it down even further, Long Short Term Memory (LSTM) is a type of RNN. Using learned gates, the network decides what to remember, what to forget, and when to do each. For example, we might give the program a short story to read. Instead of trying to store every little detail, it remembers selected parts of the story. There are two main advantages to this. First, it draws out only the most important information, since the smallest details are often irrelevant when we try to learn something. Second, focusing only on the big picture limits the mistakes the program might make when trying to retain too much information.

Attention

So how does the program know what to remember and what to forget? We help the NN pay attention to the most important features by assigning only certain values to vectors; this ties into the number of dimensions we include in a vector. Moreover, we can assign "weights" to words based on their importance in a sentence. For example, in a sentence like "I love eating cheese sandwiches", the content words "cheese" and "sandwiches" would carry more weight than the surrounding function words.

Weights let the network know what is and isn't important, so it understands sentences better. If it then processes a following sentence, "I love eating them", it will try to figure out what "them" refers to. Since "cheese" and "sandwiches" were weighted more heavily, the program is more likely to understand that they are the antecedents of "them". Attention is especially important in longer sentences, where earlier values are more likely to be forgotten, so we want to make sure the most important features are remembered.
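A toy sketch of attention weighting with made-up scores: a softmax turns the scores into weights that sum to one, and heavily weighted words such as "cheese" and "sandwiches" dominate the resulting context vector.

```python
import torch

source_words = ["i", "love", "eating", "cheese", "sandwiches"]
word_vectors = torch.randn(5, 8)                     # one 8-dimensional vector per word
scores = torch.tensor([0.1, 0.3, 0.5, 2.0, 1.8])     # made-up: content words score highest

weights = torch.softmax(scores, dim=0)               # attention weights sum to 1
context = weights @ word_vectors                     # weighted sum of word vectors, shape (8,)

for word, w in zip(source_words, weights):
    print(f"{word}: {w.item():.2f}")
```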

Parallel Corpus

There are a few other tactics we can use to help the program learn. A parallel corpus is one of the simplest.

We feed the machine a series of sentences or phrases in one language and their perfect translation in another language. The machine can then use this data as a reference.
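A tiny illustration of what a parallel corpus looks like; these example English-German pairs are ours, not drawn from the project's dataset.

```python
# Aligned sentence pairs: source language on the left, target language on the right.
parallel_corpus = [
    ("the cat ate the lasagna", "die Katze hat die Lasagne gegessen"),
    ("father has a dog",        "Vater hat einen Hund"),
    ("i love eating cheese",    "ich esse gern Käse"),
]

for english, german in parallel_corpus:
    print(f"{english}  ->  {german}")
```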

Forward and Backward Propagation

Forward Propagation and Backward Propagation are two more processes involved in a machine's learning. We can imagine this learning process as one very similar to ours. Imagine a simple sketch of a chair.

What if we tried to draw this chair ourselves? We would study the picture, trying to commit the chair's appearance to memory, and then attempt the drawing without looking at it. The result is an approximation of what the chair looks like. For a machine, producing that approximation from what it has learned is forward propagation. To improve our drawing, we pull out the picture of the chair again and compare it with what we drew. Knowing what we missed or could do better, our next drawing will most likely be more accurate. This is backward propagation: the program compares its current output with its target, measures the amount of "loss" between them, and adjusts itself with the goal of limiting that loss.
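A hedged sketch of one training step, assuming PyTorch (the tiny linear model here is a stand-in for the real translator): the forward pass produces a prediction, the loss compares it with the target, and the backward pass computes how to adjust the weights.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)                        # stand-in for the translation model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(4, 10)
targets = torch.tensor([0, 2, 1, 2])

prediction = model(inputs)                      # forward propagation
loss = loss_fn(prediction, targets)             # how far off are we?
loss.backward()                                 # backward propagation: compute gradients
optimizer.step()                                # nudge the weights to reduce the loss
optimizer.zero_grad()                           # clear gradients for the next step
```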

Loss Function

To measure the amount of loss, we use what is called a loss function. To illustrate, take the sample sentence "Father has a dog". The translator may return "A father has a wolf". Only three of its five words actually appear in the original sentence, so we have a loss of two.

PPL (perplexity) is a common unit for measuring loss. Obviously, the less loss we have, the better the translation, and the PPL value decreases as the program trains itself over time. This seems simple enough; however, our greatest challenge is minimizing this loss without overcomplicating the system. Sometimes we make the mistake of overfitting, meaning we used too many dimensions in an attempt to make the model match the training data perfectly. This can cause us to reject other data simply because it is not a 100% match. The program is best trained with the right balance, which is what we try to find in this process.
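A small sketch relating loss and PPL, assuming a cross-entropy loss as is typical in NMT (the article does not state the exact loss used): perplexity is the exponential of the average per-word loss, so a falling loss means a falling PPL.

```python
import torch
import torch.nn as nn

vocab_size = 1000
logits = torch.randn(6, vocab_size)            # model scores for 6 target words
targets = torch.randint(0, vocab_size, (6,))   # the correct word indices

loss = nn.CrossEntropyLoss()(logits, targets)  # average loss per word
ppl = torch.exp(loss)                          # perplexity
print(f"loss={loss.item():.2f}  PPL={ppl.item():.2f}")
```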

Teacher Forcing

Another method of helping our program is called Teacher Forcing. One of our main problems is that the words of a sentence rely on the context established by the words before them, so a single error can propagate through the rest of the translation. The teacher forcing algorithm solves this problem by feeding in a reference word from a translated "answer", ensuring that the machine learns with the proper context. For example, our sentence might be "the cat ate the lasagna". Our program may start translating like the following:

  1. The
  2. The car
  3. The car ???

The word "ate" doesn't seem to make much sense after "car", which confuses our program. This is precisely where teacher forcing comes into play. We simply feed in the word "cat" as the answer, allowing our program to continue its translation. The program can then continue its training while it finds other ways to learn the difference between "car" and "cat".
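A minimal sketch of the teacher forcing decision during training; the ratio parameter and helper name are illustrative assumptions, not part of the project's code.

```python
import random

def next_input(correct_prev_word, model_prev_guess, teacher_forcing_ratio=0.5):
    # With some probability, "force" the reference word as the next decoder input
    # instead of the model's own (possibly wrong) previous guess.
    if random.random() < teacher_forcing_ratio:
        return correct_prev_word      # e.g. "cat", even if the model guessed "car"
    return model_prev_guess

print(next_input("cat", "car", teacher_forcing_ratio=1.0))   # always 'cat'
```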

Summary

Our goal is to create a functional translator in order to bridge language barriers, and Natural Language Processing allows us to do this. The process involves concepts like preprocessing, word embeddings, and sequence-to-sequence encoding and decoding. To improve our program's learning, we used an approach called Neural Machine Translation, based on neural networks, which let us implement strategies including long short term memory, attention, parallel corpora, forward and backward propagation, and teacher forcing. These concepts help us reduce the amount of loss, which can be measured through the loss function.

Final Product

The strategies covered in this story are only the basics behind a high-level, multilayer translator.

How our translator converts the English sentence on the left into German

By importing datasets containing both words and their vectors, the code was able to make good use of them: it can translate multiple sentences with zero loss. We created a semi-functional language translator that has high potential and works well for shorter, simpler sentences. It learns from its mistakes and corrects translations it has already produced. We consider this good progress, as we had only trained the program for a day to obtain these results.

Future

A Google Translate-level translator takes much more time to create. To keep improving, we would import even more datasets for the AI to learn from. In our case, the data came mostly from news articles, limiting the amount of conversational vocabulary the machine would understand. We could also check our data for fundamental errors. Furthermore, our translator focuses less on common words when deciphering meaning, again limiting its understanding of everyday language. Even in a short amount of time, we were introduced to a variety of methods and saw the potential of such a machine.

Translation programs create many opportunities for communication, from tourism to business. With improved programs, spreading knowledge and information around the globe may no longer need to be as expensive or time consuming. Needless to say, Natural Language Processing is an expansive field that has the ability to contribute much to our modern society.

ABOUT

Program: Invent the Future AI Scholars Program hosted by Simon Fraser University and supported by AI4ALL

Date: July 15th — July 26th, 2019

Team Members: Linda Bian, Olivia Chan, Skylar Hildebrand, Ines Khouider, Elizabeth Wong, Sophie Zhao

Mentors: Carolyn Chen (TA), Nishant Kambhatla, Pooya Moradi
