Programming a Machine Translator

Alyssa Ida
Published in Voice Tech Podcast
10 min read · Feb 5, 2020


Accurate sentence translation is something that has been in demand for centuries.

Being able to understand another culture’s language instantly opens up opportunities for people all around the world. Some of the earliest known translations were of religious texts from Hebrew into Greek, made because Greek was so widely spoken in the 3rd century BC.

As many of you know, technologies such as Google Translate have made translation far more efficient and accessible over time. Despite its occasional inaccuracies, Google Translate is still one of the most popular translation sites available.

Instead of relying on standard word-for-word translation methods, artificial intelligence automates the translation process and makes translators’ work much easier. Artificial intelligence is a branch of computer science; the part used here, machine learning, trains programs to produce outputs based on example data. In this tutorial, I will show you how I trained a program on a data set to translate Spanish sentences into English in Google Colab in 10 steps!

10 Step Tutorial to Program a Translator!

1. Import TensorFlow

TensorFlow is an open-source machine learning platform that can be used to train models on data sets so that they can predict outcomes. It comes preinstalled in Google Colab, so you only need to import it. It provides the functions used to build a model and train it on a data set: the model processes the data and learns to predict likely outputs from the patterns it has seen. For example, if the words “red” and “rojo” often appear in matching sentence pairs in the data set, the model will learn over time to treat them as translations of each other.

# Import TensorFlow
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    # %tensorflow_version only exists in Colab
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time

2. Download Data

This is the data that you will train your program on. The program takes the data you give it and learns the common translation pairs. Think of preparing for a ballet show: a ballerina needs to practice specific jumps and bends to get better at them, and the sentence pairs in the data set are the jumps and bends you show her.

Example of what is in the data set
# Downloading the data
path_to_zip = tf.keras.utils.get_file(
    '', origin='',
    extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
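To give a feel for the file format, here is a minimal sketch of reading it. It assumes each line holds an English sentence and its Spanish translation separated by a tab, which is what the `[ENGLISH, SPANISH]` comment in the next step suggests. The two sample lines and the `read_pairs` helper are made up for illustration and are not taken from the real file.

```python
import io

# Two made-up lines standing in for the real file (illustrative only).
sample_file = io.StringIO("Hello.\tHola.\nThank you.\tGracias.\n")


def read_pairs(handle, num_examples=None):
    """Return (english, spanish) tuples from a tab-separated text handle."""
    lines = handle.read().strip().split("\n")
    # Keep only the first two columns in case a line carries extra fields.
    return [tuple(line.split("\t")[:2]) for line in lines[:num_examples]]


pairs = read_pairs(sample_file)
for english, spanish in pairs:
    print(english, "->", spanish)
```

Passing a number as `num_examples` keeps only that many lines, which is the same trick the tutorial uses later to limit the size of the training set.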

3. Tokenize

Tokenizing means taking sentences and breaking them apart into smaller pieces (here, words and punctuation marks) so that the program can process them more easily. Is it really practical to treat each whole sentence as a single unit to learn and translate? No! It is much simpler to break sentences down into parts and have the model work with those parts. It is similar to how your body doesn’t digest a whole apple; it digests the pieces the apple has been broken down into. Or, in the ballerina analogy, the tokens are the jumps, bends, and scenes that make up an entire performance. You can’t expect a ballerina to learn an entire performance from start to finish in one go!

# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # separating punctuation from words with spaces
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.rstrip().strip()
    # adding a start and an end token to the sentence,
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w


# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]
                  for l in lines[:num_examples]]
    return zip(*word_pairs)


en, sp = create_dataset(path_to_file, None)


# find the maximum sentence length
def max_length(tensor):
    return max(len(t) for t in tensor)


# tokenize the text
def tokenize(lang):
    # filters='' keeps punctuation and the <start>/<end> tokens intact
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    # pad with zeros at the end so every sentence has the same length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post')
    return tensor, lang_tokenizer


def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
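To see what the tokenizer and the padding step produce, here is a plain-Python sketch of the same idea: give every word a number, then pad shorter sentences with zeros. The helpers `build_vocab` and `to_padded_ids` are my own illustrative names, not Keras functions, and they number words in order of first appearance, whereas the Keras `Tokenizer` orders them by frequency.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer id, reserving 0 for padding."""
    word_index = {}
    for sentence in sentences:
        for token in sentence.split(" "):
            if token not in word_index:
                word_index[token] = len(word_index) + 1
    return word_index


def to_padded_ids(sentences, word_index):
    """Convert sentences to id lists and right-pad with 0 to equal length."""
    ids = [[word_index[t] for t in s.split(" ")] for s in sentences]
    longest = max(len(seq) for seq in ids)
    return [seq + [0] * (longest - len(seq)) for seq in ids]


# Sentences as they look after preprocess_sentence has cleaned them.
sentences = ["<start> hola . <end>", "<start> hace frio aqui . <end>"]
word_index = build_vocab(sentences)
padded = to_padded_ids(sentences, word_index)

print(word_index)
print(padded)  # [[1, 2, 3, 4, 0, 0], [1, 5, 6, 7, 3, 4]]
```

The zeros at the end of the first sentence are padding, which is why the loss function later in the tutorial skips every position where the target id is 0.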

4. Prepare the Data

Here you take the data and prepare it to be fed through the program. This is where you decide how many examples (sentence pairs) to train your program with. I experimented with different data set sizes here to get more accurate translations. They say practice makes perfect, right? The more practice, the better you are. It is the same with machine learning models: in general, the more data there is to train with, the more accurate the translation will be!

# Set the size of the data set
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))


def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))


print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
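The numbers behind the split and the batching can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping, assuming the file really supplies all 30,000 requested pairs; it does not call `train_test_split` itself.

```python
num_examples = 30000
BATCH_SIZE = 64

# An 80-20 split holds out one fifth of the pairs for validation.
n_val = num_examples // 5
n_train = num_examples - n_val

# With drop_remainder=True, only full batches are used in each epoch.
steps_per_epoch = n_train // BATCH_SIZE
leftover = n_train % BATCH_SIZE

print(n_train, n_val, steps_per_epoch, leftover)  # 24000 6000 375 0
```

So with these settings each epoch runs 375 batches of 64 sentence pairs, and no training pairs are dropped.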

5. Encoder

An encoder is a network that takes the input sentence and outputs a feature representation of it: one vector per input word, plus a final hidden state. This step also defines the attention layer, which is where the preparation for the heat maps comes in. A heat map is what we will use to visualize attention: it shows which input words the model focused on while producing each output word. This helps us understand which specific words were connected to the translation of which other words. I will show some examples in the outcome!

# Encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))


encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# Sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))


# Bahdanau Attention for the Encoder
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # Performing addition to calculate the score
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights


attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
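To make the attention calculation less mysterious, here is a tiny plain-Python sketch of the same additive scoring for a single sentence: score each encoder vector against the decoder state, turn the scores into weights with a softmax, and take the weighted sum. The matrices and vectors are hand-picked toy numbers of my own, and the bias terms that the `Dense` layers carry are left out for brevity.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(query, values, W1, W2, v):
    """score_i = v . tanh(W1 @ values[i] + W2 @ query), then softmax."""
    projected_query = matvec(W2, query)
    scores = []
    for value in values:
        projected_value = matvec(W1, value)
        hidden = [math.tanh(a + b) for a, b in zip(projected_value, projected_query)]
        scores.append(sum(w * h for w, h in zip(v, hidden)))
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder vectors.
    context = [sum(w * value[d] for w, value in zip(weights, values))
               for d in range(len(values[0]))]
    return context, weights


# Toy numbers (hypothetical): three encoder vectors and one decoder state.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

context, weights = additive_attention(query, values, W1, W2, v)
print(weights)
print(context)
```

The weights are exactly what the heat maps plot later: one number per input word, showing how much that word contributes to the current output word.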

6. Decoder

The decoder is a network that takes the output of the encoder and produces the closest match to the intended output, one word at a time. This is the part of the program that uses the information from the encoder and the attention layer and processes it to produce an outcome, i.e. the translation. At each step it works out which parts of the encoder output to pay attention to and predicts the best-matching next word.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # Used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size, vocab)
        x = self.fc(output)
        return x, state, attention_weights


decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))


7. Define the Optimizer and the Loss Function

The loss function measures the difference between the words the model predicts and the words in the actual translation, ignoring padded positions. The optimizer trains both the encoder and the decoder to lower this loss. This is where the program tries to improve its accuracy. The goal is to get the loss as low as possible, since a lower loss generally means more accurate translations! It would be like counting the number of times a ballerina makes a mistake in a movement, then using that to try and get better.

# Define the optimizer and the loss function
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')


def loss_function(real, pred):
    # positions where the target id is 0 are padding and must not count
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
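Here is a plain-Python sketch of what that masked loss computes for one time step of a small batch. The function names and the numbers are my own illustration. It mirrors the code above by zeroing the loss wherever the target is padding (id 0) and then averaging over the whole batch.

```python
import math


def cross_entropy_from_logits(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_id]


def masked_loss(targets, batch_logits, pad_id=0):
    """Mean loss over the batch, with padded positions contributing zero."""
    losses = []
    for target_id, logits in zip(targets, batch_logits):
        if target_id == pad_id:
            losses.append(0.0)  # masked out: padding carries no signal
        else:
            losses.append(cross_entropy_from_logits(logits, target_id))
    return sum(losses) / len(losses)


# Toy batch of two sentences at one time step (vocabulary of 3 ids).
targets = [2, 0]  # the second sentence has already ended, so it is padding
batch_logits = [[0.0, 0.0, 0.0], [5.0, -1.0, 2.0]]

loss = masked_loss(targets, batch_logits)
print(round(loss, 4))  # 0.5493
```

The first sentence's scores are all equal, so its loss is log(3); the second is masked to zero, and the mean of the two is log(3) / 2.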

8. Checkpoints

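Checkpoints save the state of the model (the optimizer, the encoder, and the decoder) during training, so that it can be restored later without training from scratch. They are like marking the different scenes of a performance, so you can pick a rehearsal back up from the last scene you finished.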

# Checkpoints
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

9. Training

This is the process where the program reads through batches of the data and begins encoding and decoding. This is how the program improves itself to try to minimize the loss. This is the rehearsal for the final show. You’ve finished preparing for practice, and now you let the machine learn from the paired sentences through supervised learning: every Spanish sentence comes with its English translation as the answer. Practice, practice, practice, and we’ll see at the final show how far it has come!

def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss


# epochs
EPOCHS = 10  # assumed value; the number of epochs is not given in the original code

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
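The "teacher forcing" in the training step can be pictured with a few lines of plain Python. At every step the decoder is handed the correct previous word from the target sentence, not its own guess, and is asked to predict the next one. The helper name and the sample sentence below are just for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the token it should predict next."""
    return list(zip(target_tokens[:-1], target_tokens[1:]))


target = ["<start>", "it", "is", "cold", "<end>"]
pairs = teacher_forcing_pairs(target)

for decoder_input, expected in pairs:
    print(decoder_input, "->", expected)
```

This matches the loop in `train_step`: the first input is `<start>`, and after each prediction the next input is the true word at position `t` rather than whatever the model predicted.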

10. Translation

Once the model is trained, the program is ready to take Spanish sentences as input and output English translations! This is the final test to see how well the program learnt from your programming and the data. You can finally see how well the program translates some example Spanish sentences into English. Or how well our ballerina has practiced her new tricks!

def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, sentence, attention_plot


# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    fontdict = {'fontsize': 14}
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)
    # one tick per word, so that every label lines up with its row or column
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()


# function for translations
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))


# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

translate(u'hace mucho frio aqui.')
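The loop inside `evaluate` is a greedy decoder: at each step it picks the single most likely next word and feeds it back in, stopping at `<end>` or at the maximum length. Here is a minimal sketch of that loop, with a made-up lookup table standing in for the trained decoder. The real decoder also uses its hidden state and the attention over the input sentence, which this toy ignores.

```python
def greedy_decode(next_token_scores, max_length, start="<start>", end="<end>"):
    """Repeatedly pick the highest-scoring next token until <end> appears."""
    tokens = []
    current = start
    for _ in range(max_length):
        scores = next_token_scores(current)
        current = max(scores, key=scores.get)  # the argmax step
        tokens.append(current)
        if current == end:
            break
    return tokens


# Hypothetical scores standing in for a trained model's predictions.
TOY_SCORES = {
    "<start>": {"it": 0.9, "cold": 0.1},
    "it": {"is": 0.8, "<end>": 0.2},
    "is": {"cold": 0.7, "it": 0.3},
    "cold": {"<end>": 0.95, "is": 0.05},
}


def toy_next_token_scores(token):
    return TOY_SCORES[token]


result = greedy_decode(toy_next_token_scores, max_length=10)
print(" ".join(result))  # it is cold <end>
```

Greedy decoding is simple and fast, but it commits to each word as soon as it is chosen, so one early mistake can carry through the rest of the translation.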


In the end, my final loss was 0.0725! This is a major improvement on the whopping 4.7595 from the first round!

I tried some translations with bigger and smaller data sets to test the improvement in accuracy! The first translation, with the 30,000-example data set, was rather inaccurate, while the 100,000-example data set produced an almost perfect translation!

The heatmap on the left is from the 100,000-example data set, while the heatmap on the right is from the 30,000-example data set.

These heatmaps represent attention: the “warmer” the color, the more attention the model placed on that input word when producing the output word. For example, the Spanish word “estás” means “you are”, so the attention for this word is focused mostly on the English word “are”.

This is similar to how translators like Google Translate use AI to make translation more efficient across languages. This branch of AI, called Natural Language Processing, is used in a number of different ways to help us better understand languages and the history of our cultures. For example, it is being used to analyze and translate ancient texts to better understand ancient civilizations! This is valuable in helping us understand where we come from and in thinking about how our world could turn out.

I hope to get more in touch with using artificial intelligence to analyze languages in the future! Next I plan to work on a program that can recognize human speech.

Here’s a link to my git

Here is where I got some of my resources:

Thank you!

Hi! My name is Alyssa Gould. I’m super passionate about languages and using Natural Language Processing to help learn about languages!

Feel free to contact me at any time and add me on LinkedIn!
