Programming a Machine Translator

Alyssa Gould
Feb 5, 2020 · 10 min read

Accurate sentence translation is something that has been in demand for centuries.

To be able to instantaneously understand the language of another culture is the key to so many opportunities for everyone all around the world. Some of the earliest known large-scale translations were of religious texts from Hebrew to Greek, made because of the prevalence of the Greek language in the 3rd century BC.

However, as many of you may know, technologies such as Google Translate have made the translation process way more efficient and accessible over time. Despite its occasional inaccuracies, Google Translate is still one of the most popular translation sites available.

Instead of the standard word-for-word translation methods, artificial intelligence automates the process of language translation and makes it way easier for translators. Artificial intelligence is a branch of computer science that uses various methods to train programs to produce an output based on training data. In this tutorial, I will show you how I trained my program on a data set so that it can translate Spanish sentences to English in Google Colab in 10 steps!

10 Step Tutorial to Program a Translator!

1. Import TensorFlow

TensorFlow is an open source machine learning platform that can be used to train models on data sets to predict certain outcomes. It comes preinstalled in Google Colab, so only importing is necessary. The platform gives you commands made specifically for training a program on a data set. This means that the program will process the data and learn to predict likely outcomes based on common pairs. For example, if the words “red” and “rojo” commonly show up in matching sentences in the dataset, the program will learn over time to treat them as translations of each other.
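
To get a feel for what "common pairs" means, here is a toy sketch in plain Python. It is only an illustration: the sentence pairs are made up, and the real model learns these associations through a neural network rather than by counting.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: (English, Spanish) sentence pairs.
pairs = [
    ("the red car", "el coche rojo"),
    ("a red house", "una casa roja"),
    ("the red book", "el libro rojo"),
    ("the blue car", "el coche azul"),
]

# Count how often each English word appears in the same pair as each Spanish word.
cooccurrence = Counter()
for english, spanish in pairs:
    for en_word, es_word in product(english.split(), spanish.split()):
        cooccurrence[(en_word, es_word)] += 1

print(cooccurrence[("red", "rojo")])  # pairs that contain both "red" and "rojo"
print(cooccurrence[("red", "azul")])  # pairs that contain both "red" and "azul"
```

Words that keep showing up together, like "red" and "rojo", get a higher count than words that never meet, like "red" and "azul".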

2. Download Data

This is the data that you will train your program on. The program will take the data you input and learn the common translation pairs. These are the examples that your program practices on to improve! Just as a ballerina needs to practice specific jumps and bends to get better at them, the program needs examples to practice with. The data set is the jumps and bends that you show the ballerina.

Figure: Example of what is in the data set
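
The tutorial code later splits each line on a tab and expects English first, then Spanish. Here is a small sketch of reading data in that layout. The three sample lines are made up for illustration and are not taken from the actual file.

```python
# Hypothetical sample in the layout the tutorial code expects:
# one pair per line, English first, then a tab, then Spanish.
sample = "Go.\tVe.\nI am cold.\tTengo frío.\nWhere are you?\t¿Dónde estás?"

# Split the text into lines, then split each line into its two halves.
pairs = [line.split("\t") for line in sample.strip().split("\n")]
english, spanish = zip(*pairs)

for en, es in pairs:
    print(f"{en!r} -> {es!r}")
```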

3. Tokenize

Tokenizing is the concept of taking sentences and breaking them apart into pieces, so that the program can better process them. Is it really logical to train on whole sentences of data and translate whole sentences of data? No!! It is so much simpler to break sentences down into separate parts, then translate those parts! Each part (here, each word or punctuation mark) is then given a number, because the model works with numbers rather than text. It’s similar to how your body doesn’t digest a whole apple at once: it digests the pieces that the apple was broken down into. Or, connecting to the ballerina analogy, the pieces would be the jumps, bends and scenes that make up an entire performance. You can’t just expect a ballerina to learn an entire performance start to finish!

import re
import unicodedata

# Converts a Unicode string to plain ASCII by dropping accents.
# Assumed helper: it is called below, but the post does not show its definition.
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # separating punctuation from words with spaces
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.strip()
    # adding a start and an end token to the sentence, so that the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w

# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = open(path, encoding='UTF-8').read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
    return zip(*word_pairs)

en, sp = create_dataset(path_to_file, None)

# find the length of the longest sequence
def max_length(tensor):
    return max(len(t) for t in tensor)

# tokenize the text
def tokenize(lang):
    # filters='' keeps '<start>' and '<end>' intact as tokens
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    # pad with zeros at the end so every sequence has the same length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
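
To see what tokenizing produces, here is a minimal plain-Python sketch of the two ideas in the code above: giving every word a number, and padding shorter sentences with zeros. The helper names `build_word_index` and `to_padded_sequences` are made up for this example. One difference to keep in mind: the Keras Tokenizer numbers words by how frequent they are, while this sketch simply numbers them in order of first appearance.

```python
def build_word_index(sentences):
    """Assign each distinct word an integer ID, starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def to_padded_sequences(sentences, word_index):
    """Replace words with their IDs and right-pad with zeros to a common length."""
    sequences = [[word_index[w] for w in s.split(" ")] for s in sentences]
    longest = max(len(seq) for seq in sequences)
    return [seq + [0] * (longest - len(seq)) for seq in sequences]

sentences = ["<start> hace frio . <end>", "<start> hace mucho frio aqui . <end>"]
word_index = build_word_index(sentences)
padded = to_padded_sequences(sentences, word_index)

print(word_index)
print(padded)
```

The shorter sentence ends in zeros, which is how the model can later tell real words apart from padding.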

4. Prepare the Data

Here you will take the data and prepare it to be processed by the program. This is where you determine how many example sentence pairs you want to train your program with (num_examples in the code below). This is where I experimented with different data set sizes to get more accurate translations. They say practice makes perfect, right? The more practice, the better you are. Same with machine learning models: in general, the more data there is to train with, the more accurate the translation will be!

from sklearn.model_selection import train_test_split

num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))

print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
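
The numbers behind the split and the batching are easy to check by hand. This is a small plain-Python sketch of the same arithmetic (it shuffles index numbers instead of real sentences, and the fixed random seed is only there to make the sketch repeatable):

```python
import random

num_examples = 30000
batch_size = 64

indices = list(range(num_examples))
random.Random(0).shuffle(indices)   # fixed seed so the sketch is repeatable

split = num_examples * 80 // 100    # 80% for training, 20% for validation
train_idx, val_idx = indices[:split], indices[split:]

# Integer division mirrors drop_remainder=True: any final partial batch is dropped.
steps_per_epoch = len(train_idx) // batch_size

print(len(train_idx), len(val_idx), steps_per_epoch)
```

With 30,000 examples this gives 24,000 for training, 6,000 for validation, and 375 batches per epoch.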

5. Encoder

An encoder is a network that takes the input sentence and outputs a feature vector/tensor for each word. This is also where the preparation for the heat maps comes in. A heat map is what we will use to represent attention: it shows which input words the model focused on while producing each output word. This helps us understand which specific words were connected to the translation of which other words. I will show some examples in the outcome!

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# Sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

# Bahdanau Attention for the Encoder
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # Performing addition to calculate the score
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
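
The last two steps of the attention layer (softmax, then a weighted sum) can be sketched in plain Python. The numbers below are made up: in the real layer the scores come out of the Dense layers, and the vectors come from the encoder.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_outputs):
    """Weighted sum of the encoder output vectors, one weight per input word."""
    size = len(encoder_outputs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, encoder_outputs)) for i in range(size)]

# Hypothetical numbers: three input words, each encoded as a 2-dimensional vector.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.5, 0.5]  # stand-ins for the scores the Dense layers would produce

weights = softmax(scores)
context = context_vector(weights, encoder_outputs)
print(weights)
print(context)
```

The first word has the highest score, so it gets the largest weight and contributes the most to the context vector. These weights are exactly what the heat maps later display.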

6. Decoder

The decoder is a network that takes the feature vectors from the encoder and produces the closest match to the intended output, one word at a time. This is the part of the program that uses the information from the encoder and processes it to produce an outcome, aka the translation. At each step, the attention layer works out which input words to focus on, and the decoder finds the best match for those words.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # Used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size, vocab)
        x = self.fc(output)
        return x, state, attention_weights

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

7. Define the Optimizer and the Loss Function

The loss function computes the difference between the words the model predicted and the actual words in the target translation. The optimizer then adjusts both the encoder and the decoder to lower this loss. This is where the program tries to improve its accuracy. Padding positions (the zeros added to make every sentence the same length) are masked out so they don’t add to the loss. The goal is to have the lowest possible loss, which means there is a higher chance of accuracy! It would be like counting the number of times a ballerina makes a mistake in a movement, then using that to try and get better.

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # mask is False wherever the target is padding (0)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
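
Here is the same masking idea as a plain-Python sketch. It is a simplification: `masked_loss` is a made-up helper, and it takes probabilities directly, while the TensorFlow code above works on raw scores (logits) for a whole batch at once.

```python
import math

def masked_loss(target_ids, predicted_probs):
    """Average cross-entropy over a sequence, with padding positions (ID 0) contributing zero."""
    losses = []
    for target, probs in zip(target_ids, predicted_probs):
        token_loss = -math.log(probs[target])  # small when the right word got a high probability
        mask = 0.0 if target == 0 else 1.0     # padding does not count
        losses.append(token_loss * mask)
    # Like tf.reduce_mean above, this divides by every position, padded or not.
    return sum(losses) / len(losses)

# Hypothetical example: a 3-word vocabulary (IDs 0, 1, 2) and a target padded with two zeros.
target_ids = [2, 1, 0, 0]
predicted_probs = [
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.3, 0.3, 0.4],
]
loss = masked_loss(target_ids, predicted_probs)
print(round(loss, 4))
```

Only the first two positions add to the loss. Whatever the model predicts at the padded positions makes no difference.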

8. Checkpoints

Checkpoints save the state of the model and the optimizer during training, so that you can restore the trained model later (or pick training back up) without starting over. This is like marking where each scene of a performance ends, so that rehearsal can resume from there.

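import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
# track the optimizer, encoder and decoder so all three can be saved and restored together
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)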
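
The idea of checkpointing (save the state every so often, restore the most recent save later) can be sketched without TensorFlow. Everything here is made up for illustration: the helper names, the JSON file format, and the two "weights" standing in for a real model. `tf.train.Checkpoint` stores its data in its own format.

```python
import json
import os
import tempfile

def save_checkpoint(directory, epoch, state):
    """Write the training state for one epoch to its own file."""
    path = os.path.join(directory, f"ckpt-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def load_latest_checkpoint(directory):
    """Return the most recently saved checkpoint, or None if there is none."""
    names = sorted(n for n in os.listdir(directory) if n.startswith("ckpt-"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1])) as f:
        return json.load(f)

checkpoint_dir = tempfile.mkdtemp()
weights = [0.0, 0.0]                      # hypothetical stand-in for model weights
for epoch in range(1, 7):
    weights = [w + 0.1 for w in weights]  # pretend training step
    if epoch % 2 == 0:                    # save every 2 epochs
        save_checkpoint(checkpoint_dir, epoch, weights)

restored = load_latest_checkpoint(checkpoint_dir)
print(restored["epoch"])
```

After six pretend epochs there are three saved files, and restoring picks up the state from epoch 6.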

9. Training

This is the process where the program reads through batches of data and begins encoding and decoding. This is how the program improves upon itself to try and minimize the loss. This is the rehearsal for the final show. You’ve finished preparing for practice, and now you let the machine learn from the paired sentences. (This is supervised learning, because every Spanish sentence comes with its correct English translation.) Practice, practice, practice, and we’ll see at the final show how far it has come through the training practice!

import time

EPOCHS = 10  # assumed value; the post does not say how many epochs were run

def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss

# epochs
for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
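
The "teacher forcing" mentioned in the code comments means that during training the decoder is always handed the correct previous word, not its own guess. A tiny plain-Python sketch shows which word goes in and which word is expected out at each step (the vocabulary and the helper name `teacher_forcing_steps` are made up for this example):

```python
def teacher_forcing_steps(target_ids):
    """Pair each decoder input with the word the decoder is expected to predict next."""
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]

# Hypothetical target sentence: <start> it is cold <end>
vocab = {1: "<start>", 2: "it", 3: "is", 4: "cold", 5: "<end>"}
target = [1, 2, 3, 4, 5]

for fed_in, expected in teacher_forcing_steps(target):
    print(f"decoder sees {vocab[fed_in]!r} and should predict {vocab[expected]!r}")
```

Even if the model guesses wrong at one step, the next step still starts from the correct word, which keeps early training from going off the rails.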

10. Translation

Once the model is trained, the program is ready to take Spanish sentences as input and output English translations! This is the final test to see how well the program learnt from your programming and the data. You can finally see how well the program can translate some example Spanish sentences to English. Or how well our ballerina has practiced her new tricks!

import numpy as np
import matplotlib.pyplot as plt

def evaluate(sentence):
    # one row per output word, one column per input word
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, sentence, attention_plot

# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    fontdict = {'fontsize': 14}
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)
    plt.show()

# function for translations
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))

# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

translate(u'hace mucho frio aqui.')
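
The loop inside evaluate is a "greedy" decoder: at every step it takes the single highest-scoring word and feeds it back in as the next input. Here is a plain-Python sketch of just that loop. The score table is made up and stands in for the trained decoder, which in reality also uses its hidden state and the attention over the input sentence.

```python
def greedy_decode(next_word_scores, max_length):
    """Repeatedly pick the highest-scoring next word until '<end>' or max_length words."""
    result = []
    current = "<start>"
    for _ in range(max_length):
        scores = next_word_scores[current]
        predicted = max(scores, key=scores.get)  # plays the role of tf.argmax
        result.append(predicted)
        if predicted == "<end>":
            break
        current = predicted                      # feed the prediction back in
    return " ".join(result)

# Hypothetical scores standing in for the decoder's output at each step.
toy_scores = {
    "<start>": {"it": 0.9, "is": 0.1},
    "it": {"is": 0.8, "cold": 0.2},
    "is": {"very": 0.6, "cold": 0.4},
    "very": {"cold": 0.95, "<end>": 0.05},
    "cold": {"here": 0.7, "<end>": 0.3},
    "here": {"<end>": 0.6, ".": 0.4},
}
print(greedy_decode(toy_scores, max_length=10))
```

Unlike training, there is no correct answer to lean on here: each word depends on the model's own previous guess, so one early mistake can change the rest of the sentence.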


In the end, my final loss was 0.0725! This is a major improvement over the first round, which came in at a whopping 4.7595!

I practiced some translations with bigger and smaller data sets to test the improvement in accuracy! The first translation, with the 30,000 sentence-pair data set, was rather inaccurate, while the 100,000 sentence-pair data set produced an almost perfect translation!

Figure: The heatmap on the left is from the 100,000 sentence-pair data set, while the heatmap on the right is from the 30,000 sentence-pair data set.

These heatmaps represent attention. This means that the “warmer” the color, the more attention was placed on that word when translating. For example, the Spanish word “estás” means “you are”, so the attention for this word is focused more on the English translation “are”.

This is similar to how translators like Google use AI to make a more efficient translation system for all languages. This branch of AI, called Natural Language Processing, is actually used in a number of different ways to help us better understand languages and the history of our culture. For example, it is being used to analyze ancient texts and translate them to better understand ancient civilizations! This is so valuable in helping us understand where we come from and in predicting how our world could turn out.

I hope to get more in touch with using artificial intelligence to analyze languages in the future! Next I plan to work on a program that can recognize human speech.

Here’s a link to my git

Here is where I got some of my resources:

Thank you!

Hi! My name is Alyssa Gould. I’m super passionate about languages and using Natural Language Processing to help learn about languages!

Feel free to contact me at anytime at and add me on LinkedIn!
