Natural Language Processing — Neural Machine Translator

Sophie Zhao
Jul 2, 2020 · 11 min read

Natural Language Processing (NLP) sits at the intersection of computer science, information engineering, and artificial intelligence (AI): it is the emerging ability of computers to understand and analyze human language. While people use words to communicate, computers operate with the language of numbers. Yet these numbers can act as a bridge between the diverse languages in our world. Using NLP, we can create a translation system that leads us towards open and effective communication.

In order for the computer to understand the words we use in our languages, we break up paragraphs and sentences into units of language. These units are converted into numbers and then back into words, this time in a different language, which completes the translation process. A number of concepts and algorithms are involved in this process, and we will go through them in this story:

  • Preprocessing: Tokenization and UNK Replacement
  • Word Embeddings: Vectors and Dimension Reduction
  • Sequence to Sequence: Encoding and Decoding

These topics cover the general idea of translation. To carry out this process, however, we used a more recent approach to machine translation: Neural Machine Translation. This approach is loosely inspired by the neurons in our brains and by how we learn as humans. It uses the following concepts:

  • Datasets: Training, Validation, Testing
  • Neural Networks: Recurrent Neural Networks
  • Long Short Term Memory and Attention
  • Parallel Corpus
  • Forward and Backward Propagation
  • Loss Function
  • Teacher Forcing

With a number of concepts to talk about, without further ado, let’s dive in!


Preprocessing

Preprocessing occurs at the start of machine translation and converts raw text into a form that the machine can work with. There are two major parts to this process: tokenization and UNK replacement. Preprocessing may also include converting the data to lowercase so that the machine treats capitalized and lowercase forms of a word (“The” and “the”) as the same word.



Tokenization

Tokenization is the breaking down of larger pieces of data, such as paragraphs and sentences, into “tokens”, or units of language. These units are often individual words and punctuation marks. We also introduce two extra tokens that mark the Start of a Sentence (SOS) and the End of a Sentence (EOS).
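To make this concrete, here is a minimal sketch of a tokenizer in Python. The regular expression and the `<SOS>`/`<EOS>` spellings are our own assumptions for illustration; real translation toolkits use more careful rules.

```python
import re

# Assumed spellings for the two special tokens.
SOS, EOS = "<SOS>", "<EOS>"

def tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return [SOS] + tokens + [EOS]

print(tokenize("The cat ate the lasagna."))
# ['<SOS>', 'the', 'cat', 'ate', 'the', 'lasagna', '.', '<EOS>']
```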

UNK Replacement

When we train the machine, some words do not appear often enough in the datasets for the program to learn their meanings. When the program does not recognize a word, it replaces the word with UNK (for “unknown”) and continues with the translation. These gaps in the sentence may be filled in later during post-processing.
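One simple way to picture UNK replacement is a frequency cutoff: any word seen fewer than some number of times in training is treated as unknown. The sketch below assumes a cutoff of two (`min_count=2`) and a `<UNK>` spelling, both chosen only for illustration.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word token

def build_vocab(tokenized_sentences, min_count=2):
    """Keep only tokens that appear at least min_count times in the training data."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

def replace_unknown(tokens, vocab):
    """Swap every out-of-vocabulary token for the UNK placeholder."""
    return [tok if tok in vocab else UNK for tok in tokens]

training = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
vocab = build_vocab(training)
print(replace_unknown(["the", "dog", "sat"], vocab))  # ['the', '<UNK>', 'sat']
```

Here “dog” appears only once in the toy training data, so it falls below the cutoff and is replaced.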


Word Embeddings

After preprocessing, machine translation uses word embeddings, which are vector representations of text. Each word is assigned a series of numbers, also known as a vector, with a chosen number of dimensions. These dimensions help describe the meaning of the word.


For example, imagine that you have just witnessed a robbery and the police ask you to describe the robber. You may recall the culprit’s appearance: height, body type, hair color, skin color, etc.

The more descriptive you are, the better understanding the police will have of who they are looking for. However, there is a limit to how descriptive you should be. You should not try to describe the length of every piece of hair on the robber’s head. Such details are simply irrelevant. The same is true of dimensions: a word’s meaning becomes clearer with more dimensions, but past a certain point the extra information is unnecessary.

In fact, at times we may want to plot words and their meanings on a graph. This allows us to visualize the vectors and see which words have similar or dissimilar meanings: words with more similar meanings are plotted closer together. An example might be “man” and “boy”; they will likely have similar numerical values, which lets the computer understand that they are related. If the vectors have more than two dimensions, we first have to apply Dimension Reduction to bring them down to two dimensions: an x value and a y value.
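As a rough illustration of “closer together”, the sketch below measures the straight-line distance between words on a two-dimensional plot. The coordinates are made up for this example; a real model learns its values from data.

```python
import math

# Hypothetical 2-D embeddings (x, y); the numbers are invented for illustration.
points = {
    "man": (1.0, 2.0),
    "boy": (1.2, 1.7),
    "banana": (6.0, -3.0),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two words on the plot."""
    return math.dist(points[a], points[b])

# Related words sit closer together than unrelated ones.
print(distance("man", "boy") < distance("man", "banana"))  # True
```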


Vectors are so important when it comes to word embeddings because the values allow the language to change while their meanings remain the same. The translator will focus on those numerical values rather than string values or definitions. Vectors become more interesting when we analyze them to find relationships between words. We can even use mathematical operations on them including addition and subtraction. This allows us to identify similarities and differences among words and sentences.


For example, in the equation King - Man + Woman = Queen, we see that the meaning of Queen resembles that of King and Woman more than that of Man: removing the “man” part of King and adding Woman lands us close to Queen.
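The arithmetic can be tried out on toy vectors. In the sketch below, the two dimensions (roughly “royalty” and “femaleness”) and all of the values are assumptions made up for illustration; learned embeddings have many more dimensions and no such tidy labels.

```python
import math

# Hypothetical 2-D embeddings: dimension 0 ~ "royalty", dimension 1 ~ "femaleness".
vectors = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
    "prince": [0.8, 0.05],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```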


Sequence to Sequence

The system that uses vectors to translate a sentence, or sequence of words, is called sequence to sequence.


It is made up of an encoder and decoder. Encoding allows the program to convert the original sentence into a vector for comprehension. This is much like describing the culprit’s appearance to a sketch artist. Then, decoding turns the values in the vector back into human language, but this time into the new language we wanted to translate them into.

Neural Machine Translation


Next, we will go over Neural Machine Translation (NMT) and its tactics. As mentioned previously, it uses Neural Networks to translate sentences after learning from large amounts of training data.


Datasets

In order for the AI to learn the languages, we provide it with different datasets.


The data the program first uses to learn is the training dataset, which makes up the majority of the data we have. Next, we use validation data to provide an unbiased evaluation of the model while we tune it; this second set checks whether the model is working correctly. Once the model has been tuned with the validation data, the testing data provides the final evaluation of the program.
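A split like this might be sketched as follows. The 80/10/10 proportions, the fixed random seed, and the placeholder sentence pairs are all assumptions for illustration, not the exact setup we used.

```python
import random

def split_dataset(pairs, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle sentence pairs and split them into training, validation, and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains is held out for testing
    return train, valid, test

# Placeholder (source, target) sentence pairs.
pairs = [(f"english {i}", f"german {i}") for i in range(100)]
train, valid, test = split_dataset(pairs)
print(len(train), len(valid), len(test))  # 80 10 10
```

Shuffling before splitting keeps all three sets representative of the same kind of text, and no pair appears in more than one set.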

Neural Networks

We know that NMT relies on Neural Networks (NNs), but what are NNs? They are networks of connected nodes inspired by the neurons in our brains. The input that goes into the network is processed and analyzed by the network’s hidden layers to extract certain features. What is extracted is then used to produce an output.


Recurrent Neural Networks

More specifically, we made use of a type of NN called a Recurrent Neural Network (RNN), which is often used for sequential data such as text. The network processes its input sequentially, one word at a time. It keeps a stored memory of what it has already seen and uses that memory to process each new input.
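The idea of a stored memory can be sketched with a single number standing in for the hidden state. The weights below (`w_x`, `w_h`, `b`) are arbitrary values picked for illustration; a real RNN uses vectors and matrices and learns its weights during training.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs):
    """Feed the inputs in one at a time, carrying the memory forward."""
    h = 0.0  # empty memory before the first word
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# The last input is the same (1.0), but the histories differ,
# so the final memory differs too.
print(run_rnn([1.0, 1.0])[-1] != run_rnn([-1.0, 1.0])[-1])  # True
```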

Long Short Term Memory

Narrowing it down even further, Long Short Term Memory (LSTM) is a type of RNN. Using a set of gates, the network learns what to remember, what to forget, and when to do either. For example, we might give the program a short story to read. Instead of trying to store every little detail, it will remember selected parts of the story. There are two main advantages to this system. First, it draws out only the most important information, since the smallest details are often irrelevant when we try to learn something. Second, focusing only on the big picture limits the mistakes that the program might make when trying to retain too much information.
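For readers who want to see the gates, here is a sketch of one step of the standard LSTM update, shrunk down to single numbers. The parameter dictionary `p` and its values are hypothetical; in practice each gate has learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step on scalar values; p holds hypothetical (w_x, w_h, b) per gate."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much old memory to keep (0 to 1)
    i = sigmoid(pre("input"))         # how much new information to store
    o = sigmoid(pre("output"))        # how much of the memory to reveal
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (what gets passed on)
    return h, c

# Biases chosen so the cell keeps its old memory and ignores the new input.
keep = {"forget": (0, 0, 10), "input": (0, 0, -10),
        "output": (0, 0, 10), "candidate": (1, 0, 0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=keep)
print(round(c, 3))  # 0.7
```

Flipping the forget bias to a large negative number would instead wipe the old memory, which is the “what to forget” half of the story.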


Attention

So how does the program know what to remember and what to forget? We help the NN pay attention to the most important features by assigning only certain values to vectors. This ties into the number of dimensions we include in the vector. Moreover, we can assign “weights” to certain words based on their importance in a sentence. The following sentence shows one way in which the words of a sentence might be weighted:

Image for post

Weights let the network know what is and isn’t important, so it understands sentences better. Suppose it then processes a next sentence that says “I love eating them”. It will try to figure out what “them” refers to. Since “cheese” and “sandwiches” are weighted more heavily, the program is more likely to understand that those are the antecedents of “them”. Attention is especially important in longer sentences, where earlier values are more likely to be forgotten. Thus we want to make sure the most important features are remembered.
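One common way to turn importance scores into weights is the softmax function, which makes all the weights positive and sum to 1. The example sentence and the scores below are invented for illustration; an attention mechanism computes its scores from the vectors themselves.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for working out what "them" refers to.
words = ["i", "made", "cheese", "sandwiches", "today"]
scores = [0.1, 0.3, 2.0, 2.2, 0.2]
weights = softmax(scores)

# The two most heavily weighted words are the likely antecedents.
top_two = sorted(zip(weights, words), reverse=True)[:2]
print([w for _, w in top_two])  # ['sandwiches', 'cheese']
```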

Parallel Corpus

There are a few other tactics we can use to help the program learn. Using a Parallel Corpus is one of the simplest.


We feed the machine a series of sentences or phrases in one language and their perfect translation in another language. The machine can then use this data as a reference.

Forward and Backward Propagation

Forward Propagation and Backward Propagation are two more processes involved in a machine’s learning. We can imagine this learning process as one that is very similar to ours. Let’s take a look at the sketch of a chair.

Image for post

What if we tried to draw this picture ourselves? We would look at the picture and try to remember how a chair looks. Next, we may try to draw the chair without seeing it. Our product will be an approximation of what the chair looks like. For a machine, this is Forward Propagation: the input is passed through the network, using what it has learned so far, to produce an output. In order to improve our drawing, we pull out the picture of the chair again and compare it with our drawing. Knowing what we are missing or could do better, our future drawings will most likely be more accurate. This is Backward Propagation: the program compares its current output with its target side by side, determines the amount of “loss”, and adjusts itself to reduce that loss.
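The draw, compare, and improve cycle can be sketched with the smallest possible “network”: a single weight `w` in the model y = w * x. The data, learning rate, and number of rounds below are assumptions chosen so the example is easy to follow.

```python
def train(xs, ys, lr=0.1, epochs=50):
    """Fit y = w * x by repeating forward pass, loss, backward pass, update."""
    w = 0.0
    losses = []
    for _ in range(epochs):
        # Forward propagation: make predictions with the current weight.
        preds = [w * x for x in xs]
        # Loss: mean squared difference between prediction and target.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Backward propagation: gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Update: nudge w in the direction that reduces the loss.
        w -= lr * grad
        losses.append(loss)
    return w, losses

# The targets follow y = 2x, so training should move w towards 2.
w, losses = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 3), losses[0] > losses[-1])  # 2.0 True
```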

Loss Function

To measure the amount of loss, we use what is called the Loss Function. To illustrate this, take the sample sentence “Father has a dog”. The translator may return “A father has a wolf”. Only 3 of the 5 words in this output can be matched to a word in the original sentence (the extra “A” and “wolf” cannot). Thus, in this simplified illustration, we have a loss of two.
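The counting in this example can be written out as a few lines of Python. This is only a sketch of the simplified word-matching illustration above, not the loss function an actual neural translator trains with; the function name `word_mismatch` is our own.

```python
from collections import Counter

def word_mismatch(reference, output):
    """Count output words that have no matching word left in the reference."""
    remaining = Counter(reference.lower().split())
    misses = 0
    for word in output.lower().split():
        if remaining[word] > 0:
            remaining[word] -= 1  # each reference word can be matched only once
        else:
            misses += 1
    return misses

print(word_mismatch("Father has a dog", "A father has a wolf"))  # 2
```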

Image for post

As seen on the left, perplexity (PPL) is one measure of loss. The less loss we have, the better our translation. We see that the PPL value decreases as the program trains over time. This seems simple enough. However, our greatest challenge may be minimizing this loss without overcomplicating the system. Sometimes we make the mistake of overfitting, meaning that we used too many dimensions in an attempt to make the model match the training data perfectly. Such a model may then handle other data poorly simply because it is not a 100% match with what the model has seen. The program is best trained with the right balance, which is what we try to find in our process.
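Assuming PPL here refers to perplexity as it is usually defined, it can be computed from the probability the model gave to each correct word: take the average negative log-probability and exponentiate it. The probabilities below are invented to mimic an early and a late point in training.

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities assigned to the correct words.
early = [0.1, 0.2, 0.1, 0.25]   # unsure model, early in training
late = [0.8, 0.9, 0.7, 0.95]    # more confident model, later in training
print(perplexity(early) > perplexity(late))  # True
```

A perplexity of 1 would mean the model gave every correct word a probability of 1, so lower is better.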

Teacher Forcing

Another method of helping our program is called Teacher Forcing. One of our main problems is that each word of a sentence relies on the context established by the previous words in the sentence. Therefore, one error can propagate through the rest of the translation. Teacher forcing addresses this problem during training by feeding in the correct word from a translated “answer” as the reference, instead of the program’s own guess. We ensure in this way that the machine is learning with the proper context. For example, our sentence might be “the cat ate the lasagna”. Our program may translate it like the following:

  1. The
  2. The car
  3. The car ???

The word “ate” doesn’t seem to make much sense after “car”, which confuses our program. This is precisely where teacher forcing comes into play. We simply feed in the word “cat” as the answer, allowing our program to continue its translation. The program is now able to continue its training while it finds other ways to learn the difference between “car” and “cat”.
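The difference teacher forcing makes can be sketched with a toy “model” that is just a lookup table from the previous word to the next word. The table, including its one deliberate mistake, is hypothetical; a real decoder predicts from learned vectors.

```python
# Hypothetical "model": predicts the next word from the previous word only.
# It has learned one mistake: after "the" it predicts "car" instead of "cat".
NEXT_WORD = {"<SOS>": "the", "the": "car", "cat": "ate"}

def decode(reference, teacher_forcing):
    """Predict each word of the reference; choose what to feed back at each step."""
    prev, predictions = "<SOS>", []
    for target in reference:
        pred = NEXT_WORD.get(prev, "<UNK>")
        predictions.append(pred)
        # Teacher forcing feeds the correct word; otherwise feed the model's own guess.
        prev = target if teacher_forcing else pred
    return predictions

reference = ["the", "cat", "ate"]
print(decode(reference, teacher_forcing=False))  # ['the', 'car', '<UNK>']
print(decode(reference, teacher_forcing=True))   # ['the', 'car', 'ate']
```

Without teacher forcing the mistake “car” derails the next step as well; with it, the model still gets “car” wrong but carries on from the correct word “cat”.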


Our goal is to create a functional translator in order to bridge language barriers. Natural Language Processing allows us to do this. This process involves concepts like preprocessing, word embeddings, and sequence to sequence encoding and decoding. To improve the learning procedure for our program, we used an approach called Neural Machine Translation, which is based on neural networks. This let us implement strategies that include long short term memory, attention, parallel corpus, forward and backward propagation, and teacher forcing. These concepts help us reduce the amount of loss, as measured by the loss function.

Final Product

The strategies covered in this story are only the basics of a high-level, multilayer translator.

Figure: How our translator converts the English sentence on the left into German

By importing datasets with both words and vectors, the code could make good use of these sets. It is able to translate multiple sentences with zero loss. We were able to create a semi-functional language translator that has high potential and works well for shorter, simpler sentences. It learns from its mistakes and corrects texts that it has already translated. We consider this good progress, as we had only trained the program for a day to obtain these results.


A Google Translate level translator takes much more time to create. To get closer, we would import even more datasets for the AI to learn from. In our case, the data came mostly from news articles, which limits the amount of conversational vocabulary that the machine understands. To improve the machine, we can also check our data for fundamental errors. Furthermore, our translator focuses less on common words when deciphering meaning, which again limits its understanding of everyday language. In a short amount of time, we were introduced to a variety of methods and saw the potential of such a machine.


Translation programs create many opportunities for communication, for tourists and businesses alike. With improved programs, spreading knowledge and information around the globe may no longer need to be as expensive or time consuming. Needless to say, Natural Language Processing is an expansive field that can contribute much to our modern society.


Program: Invent the Future AI Scholars Program hosted by Simon Fraser University and supported by AI4ALL

Date: July 15th to July 26th, 2019

Team Members: Linda Bian, Olivia Chan, Skylar Hildebrand, Ines Khouider, Elizabeth Wong, Sophie Zhao

Mentors: Carolyn Chen (TA), Nishant Kambhatla, Pooya Moradi

The Startup
