Language Recognition Using Deep Neural Networks

Tom Ham
Coinmonks
13 min read · Aug 9, 2018


I find it fascinating how we are able to tell, just by looking, that ‘gebracht’ is most likely a German word and ‘reconstituer’ looks like a French word. We are able to spot certain patterns in words that give us clues as to which language a word belongs to. For example, ‘eux’ at the end of a word is an indication that the word is French, whereas words ending in ‘o’ tend to be of Spanish descent (e.g. renacuajo, basurero). The aim of this project is to teach a computer to do the exact same thing: to recognise which language a given word comes from.

Due to the complexity of this task, a deep neural network is likely to be the most accurate technique. This project was undertaken in Python, and I used the Keras package to handle the neural network side of things. Note that Keras uses the TensorFlow backend, so we are able to generate some nice TensorBoard visualisations.

The plan is to create a large labelled word library from scratch, containing words from five different languages, each tagged with its language. This will be done by scraping a load of Wikipedia articles. We will then convert these words into binary vectors so that they can be fed into the neural network. The network will be trained on 85% of the words and validated on the remaining 15%. We will then (hopefully!) have a trained network which is able to accurately predict which of the five languages a given word is from.

Let’s dive in. The first thing I did was pick the five languages to use for this project. I chose English, German, Czech, Swedish and French, as I thought these languages had enough distinct patterns for the network to pick up on. The next thing I did was find around 15 Wikipedia articles for each language, making sure to vary the topics so that a diverse range of words was picked up. Below is the first file, ‘config.py’, which is essentially the configuration for this project.
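Roughly, it looks like the sketch below; the article titles here are just placeholders rather than the exact ones I scraped.

```python
# config.py -- configuration for the whole project

# maximum word length the scraper will keep (and hence the longest word the network accepts)
max_letters = 12

# Wikipedia language tags mapped to lists of article titles to scrape in that language
# (the titles below are placeholders)
language_tags = {
    'en': ['Physics', 'London', 'Football', 'Music'],
    'cs': ['Praha', 'Fyzika', 'Hudba'],
    'de': ['Physik', 'Berlin', 'Musik'],
    'sv': ['Fysik', 'Stockholm', 'Musik'],
    'fr': ['Physique', 'Paris', 'Musique'],
}
```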

We see it only contains two variables: ‘max_letters’ and ‘language_tags’. max_letters is the maximum length of word that the scraper will pick up, and hence the maximum length of word that can be fed into the neural network. ‘language_tags’ is a dictionary whose keys are the Wikipedia tags of the five languages I am using, and whose values are lists of names of Wikipedia articles in that language. It is important that all of this can easily be changed: if I wanted the network to use ten languages, I would simply add more entries to the dictionary, and if I wanted the network to handle 15-letter words, I would just update max_letters.

We now move on to the next file, ‘functions.py’. This project needs quite a few fiddly functions, not only to scrape the words and add them to a large word library, but also to turn those words into vectors. I have therefore created a separate file to house these functions and keep the code organised. The first function is called generate_dictionary and it is shown below.
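A rough sketch of it is below; the exact structure of my version may differ slightly.

```python
# functions.py -- generate_dictionary builds the word list for one language
import wikipedia
from unidecode import unidecode

from config import language_tags


def generate_dictionary(tag, max_word_length):
    """Return every usable word found in the Wikipedia articles for one language."""
    words = []
    wikipedia.set_lang(tag)                      # point the wikipedia package at this language
    for article in language_tags[tag]:
        page = wikipedia.page(article)           # fetch the article
        page_content = unidecode(page.content)   # raw text content, converted to ASCII
        words += process(page_content, max_word_length)
    return words
```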

The aim of this function is to create a long list of all the words found in the provided Wikipedia articles for a given language. Its parameters, tag and max_word_length, are fairly self-explanatory: they are the language tag and the maximum number of letters that words in this list are allowed to have. Note that for the scraping I will be using the wikipedia Python package. We set the language of the Wikipedia articles to the desired one, then iterate through the articles of that language, pulling each article’s raw text content as one long string. Since we want the text to be entirely ASCII, we run it through the unidecode function. You may then notice a peculiar function called process. This is the next function in the file, shown below.
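Roughly, process looks like this (the exact regular expression I used may differ):

```python
import re


def process(page_content, max_word_length):
    """Strip everything except letters and return the words that are short enough."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', page_content)  # remove numbers, links, punctuation
    words = letters_only.lower().split()                    # lowercase, then split into words
    return [word for word in words if len(word) <= max_word_length]
```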

So process takes two parameters: page_content (the Wikipedia article content as a string) and the maximum word length. We begin by using regular expressions to extract only the words, removing any numbers, links, punctuation and so on from the article. It is important to remove capital letters, so we call lower on this text. The split function is then used to turn the remaining text into a list, so we now have a list of every word in the article. However, we must remember that we cannot include words more than max_letters in length, so we filter those out and return the fully processed list. generate_dictionary (the function above this one) then returns this list, and we are done.

The next important function in ‘functions.py’ is convert_dic_to_vector. This function takes a list of words (like the one generated using the above function), and returns a list of vectors representing each word. The way I am representing a letter in this vector is as follows:

a=10000000000000000000000000
b=01000000000000000000000000
c=00100000000000000000000000
…
z=00000000000000000000000001

We then string these together to form a word. Since all the vectors must be the same length, we fill the unused letter slots (max_letters minus the number of letters used) with zeros. For example, the word ‘hello’ would be represented as:

00000001000000000000000000
00001000000000000000000000
00000000000100000000000000
00000000000100000000000000
00000000000000100000000000
00000000000000000000000000
00000000000000000000000000
00000000000000000000000000
00000000000000000000000000
00000000000000000000000000
00000000000000000000000000
00000000000000000000000000

The first 5 letter slots are populated with h, e, l, l, and o, and the remaining 7 slots are filled with zeros. In practice it would be one continuous string of digits; the line breaks above are just for visual clarity. This means each word vector will be a 26*12 = 312 digit long string of ones and zeros. We now need a function which converts a list of words into a list of vectors in this form.
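A sketch of convert_dic_to_vector is below (the variable names are approximate):

```python
def convert_dic_to_vector(words, max_word_length):
    """Turn each word into a flat string of ones and zeros, 26 digits per letter."""
    vectors = []
    for word in words:
        vec = ''
        for letter in word:
            number = ord(letter) - ord('a') + 1              # a -> 1, b -> 2, ..., z -> 26
            vec += '0' * (number - 1) + '1' + '0' * (26 - number)
        vec += '0' * 26 * (max_word_length - len(word))      # pad the unused letter slots
        vectors.append(vec)
    return vectors
```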

The function takes a list of words and a maximum word length. For each word in the list, we create an empty string called ‘vec’. Then, for each letter in the word, we convert the letter into a number (a being 1, b being 2 and so on) and build its vector by adding zeros up until this number, then a one, then zeros for the remaining spaces. This is repeated for each letter in the word, appending to the ‘vec’ string each time. Once this is done, we fill the unused letters out of the 12 with zeros, append the finished ‘vec’ string to a new list, and repeat for all the other words. The function then returns this list of vectors.

The fourth and final function in ‘functions.py’ is called create_output_vector. This function creates the output vector for a given language (shock, I know!). Below is what I mean by ‘output vector’.

English = 10000
Czech = 01000
German = 00100
Swedish = 00010
French = 00001

These will be used to label the data, i.e. the vector for the word ‘hello’ shown above will be labelled with the output vector corresponding to English (10000), whereas the word ‘typhuvud’ will be labelled with the Swedish output vector (00010), since it is a Swedish word. The function is shown below.
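It is a one-liner; roughly like this (whether it returns a string or a list is a detail, here I use a string):

```python
def create_output_vector(tag_index, number_of_languages):
    """One-hot label for a language, e.g. index 0 of 5 languages -> '10000'."""
    return '0' * tag_index + '1' + '0' * (number_of_languages - tag_index - 1)
```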

The parameter tag_index is the index of the language in the language_tags dictionary (from the config file), and the number_of_languages parameter (obviously) takes the number of languages being used. This one line function then simply creates an output vector as I have described above, and returns it.

We are now ready to actually start gathering some data. We are moving on to a new file, so the first step is to import the relevant modules.
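Something along these lines:

```python
import numpy as np
import pandas as pd

import functions
from config import max_letters, language_tags
```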

We import pandas and numpy along with the functions file and the two variables from config. We are now ready to create some data and label it. Before I show the code, I will explain the methodology. We will have two lists, one containing the word vectors and one containing the output vectors. The nth element in the word vector list will correspond to the nth element in the output vector list, and we will then store all of this in a big array to be used later to train the network. I will also create a pandas dataframe and save it as a CSV file so we can open it in Excel and check that the process has produced the expected results (always good to check as you go along).
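The gathering loop looks roughly like this:

```python
word_data = []      # the 312-character vectorised words
language_data = []  # the 5-character output vectors
master_dic = []     # the plain ASCII words themselves

count = 0
for tag in language_tags:
    # every word found in this language's articles
    words = functions.generate_dictionary(tag, max_letters)
    master_dic += words

    # vectorise the words and label each one with this language's output vector
    vectors = functions.convert_dic_to_vector(words, max_letters)
    for vec in vectors:
        word_data.append(vec)
        language_data.append(functions.create_output_vector(count, len(language_tags)))
    count += 1
```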

We start by creating three empty lists: word_data will store the 312-character vectorised words, language_data will store the 5-character language vectors, and master_dic will store the actual ASCII representation of the words. We then set up a counter (count) and begin the loop which gathers the data. We iterate over each language tag, first generating the list of all words found in that language’s articles and adding each of these words to master_dic. Then we convert each of the words into vectors and add these vectors to word_data. Finally, we create the output vectors and add them to language_data.

We have now populated our three lists with the relevant information. Next we create an array holding all of this information. It will be populated with inner arrays, one per word, each structured as follows: the ASCII word itself, then the five digits of the output vector, then the 312 digits of the input vector.

So each inner array will contain 1 + 5 + 312 = 318 elements. The code for this is shown below.
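Roughly (the file names here are my choice):

```python
arr = []
for i in range(len(master_dic)):
    inner = [master_dic[i]]           # element 0: the ASCII word
    inner += list(language_data[i])   # elements 1-5: the output vector, digit by digit
    inner += list(word_data[i])       # elements 6-317: the input vector, digit by digit
    arr.append(inner)

data = np.array(arr)                  # numpy stores everything as strings here
np.save('data.npy', data)             # the file we will load later to train the network
pd.DataFrame(arr).to_csv('data.csv', index=False)  # for a quick sanity check in Excel
```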

We create our master array, an empty list called ‘arr’. Then we build an inner array for each word: we start by appending the ASCII representation of the word, then append the output vector digit by digit, and then do the same for the input vector. This inner array is added to the master array, and the process repeats for each word. The data has now been fully processed and must be saved. We convert it to a numpy array and save it to a .npy file (this is the file we will open later when we train the network). We also create a pandas dataframe and save that, in order to check the data is in the format we want it to be. Below is a snippet from this CSV file.

The first column shows the ASCII representation of the word, columns 2–6 are the output vectors (in this case all representing English), and after those come the first few columns of the input vectors (which extend much further to the right). Just by eyeballing a few letters we can see the data is in the exact format we want it to be, so it is time to start setting up and training the neural network.

As usual, we begin by importing necessary modules.
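Roughly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
```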

We will use Keras to do most of the work, but scikit-learn’s train_test_split function will be helpful in splitting our data into training data and testing data.
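Loading and splitting the data looks something like this, following the 318-element layout described earlier:

```python
data = np.load('data.npy')

labels = data[:, 1:6].astype(float)   # the 5-digit output vectors
inputs = data[:, 6:].astype(float)    # the 312-digit word vectors

x_train, x_test, y_train, y_test = train_test_split(inputs, labels, test_size=0.15)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```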

We then load our data and split it into two arrays, inputs and labels. inputs contains the 312-element arrays representing the words, and labels contains the 5-element arrays representing the language tags. We split these arrays into training and testing data, with 15% of the data reserved for testing, and then print the shapes of the arrays to show how much data we are working with.

We see we have over 410,000 training examples and over 73,000 testing examples. Next, we set up the neural network parameters and shape.
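The model definition, as a sketch:

```python
model = Sequential()
model.add(Dense(200, activation='sigmoid', input_dim=312))  # hidden layer 1
model.add(Dense(150, activation='sigmoid'))                 # hidden layer 2
model.add(Dense(100, activation='sigmoid'))                 # hidden layer 3
model.add(Dense(100, activation='sigmoid'))                 # hidden layer 4
model.add(Dense(5, activation='softmax'))                   # one output per language

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```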

We will be using a feed-forward network with four hidden layers of 200, 150, 100 and 100 nodes respectively. I chose this many hidden layers because of the complexity of the task; with fewer layers, the accuracy was not at a level I found acceptable. Each layer uses a sigmoid activation function except the final layer, which uses softmax. Softmax lets us assign a probability to each language, since the values of the output layer sum to one. We will use the Adam optimiser (an adaptive-learning-rate gradient descent optimiser) and the binary cross-entropy loss function. The only metric we will worry about for this project is the accuracy of the model. Next we set up the callbacks and fit the model.
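Roughly as follows (note that in newer versions of Keras the monitored metric is called ‘val_accuracy’ rather than ‘val_acc’):

```python
checkpoint = ModelCheckpoint('weights.hdf5', monitor='val_acc',
                             save_best_only=True, verbose=1)
tensorboard = TensorBoard(log_dir='logs')

model.fit(x_train, y_train,
          batch_size=1000, epochs=200,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint, tensorboard])
```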

The two callbacks we will be using are ModelCheckpoint and TensorBoard. ModelCheckpoint saves the weights and biases of the model whenever the model’s validation accuracy improves, so that we can use the model later, and TensorBoard lets us see visualisations of the model’s progress after it has been trained. We set up the ModelCheckpoint callback to save the weights to the file ‘weights.hdf5’, using validation accuracy as the success metric; it only saves the weights when the validation accuracy improves. The TensorBoard callback saves its files to a folder called ‘logs’. We then train the network with a batch size of 1000 for 200 epochs, validating against the testing data and using the two callbacks described above.

After a few hours, the model is fully trained. Below is the TensorBoard visualisation of the validation accuracy over the course of training.

[TensorBoard plot: validation accuracy against epochs]

We can see that the validation accuracy reached a healthy 95.43% after the 200 epochs of training. For a task as complex as language prediction, this was a very pleasing figure. I could potentially have squeezed an extra percent out of the model by training for a few hundred more epochs, but that runs the risk of overfitting, so roughly 95.4% accuracy is a healthy compromise. We can now have some fun with the model and throw some of our own words at it to see if it is actually able to predict their language.

We begin a new file in which we load the trained network and feed it our own words. I start by importing the modules and setting up the network exactly the same way as in the previous file.
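In sketch form:

```python
# ...same imports and model definition as in the training file, then:
model.load_weights('weights.hdf5')   # load the already-trained weights instead of training
```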

The only difference between this file and the previous one is that, instead of training, we load the weights of the already trained model from the file ‘weights.hdf5’. We will now create a loop that lets you input your own word and, for each of the five languages, tells you the percentage likelihood that the word is from that language.
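A rough sketch of that loop (the prompt text and output formatting are my own choices):

```python
while True:
    word = input('Enter a word: ')

    # only accept words the network can handle
    if len(word) > max_letters:
        print('Please enter a word with at most {} letters.'.format(max_letters))
        continue

    # vectorise the single word, then copy its ones into a numpy array of zeros
    vec = functions.convert_dic_to_vector([word.lower()], max_letters)[0]
    x = np.zeros((1, 26 * max_letters))
    for i, digit in enumerate(vec):
        if digit == '1':
            x[0, i] = 1

    # predict and print a percentage for each language
    prediction = model.predict(x)[0]
    for i, tag in enumerate(language_tags):
        print('{}: {:.2f}%'.format(tag, prediction[i] * 100))
```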

This whole block of code sits inside a while True loop, so that the user can throw words at the network indefinitely. We first check that the word the user has entered is no longer than max_letters, and if so, we add the lowercase version of that word to an empty list. We then use the convert_dic_to_vector function on this one-word list, create a numpy array of zeros, and set an element to one wherever a one appears in the vectorised word. This gives us a numpy array representing the word, which we pass to the model’s predict method, and finally we print, for each language, the probability that the word belongs to that language. Let’s test out the network! I will start by throwing a few random words from each language at it; the results are shown below.

We see that the network confidently and correctly predicts the language of every word it has not seen before. Let’s try some words that could belong to two languages: for example, ‘pain’ could be either French or English, and ‘adresse’ is both German and French.

We can see that for ‘adresse’, the network’s highest predictions are German and French, the two languages the word comes from. We see the same thing for ‘pain’, where English and French are the highest scorers. So not only is the network confident about words that belong to one language, it can also recognise when a word could belong to two different languages. If you would like to have a play with this yourself, here is the download link to a zip folder containing the relevant Python files and the hdf5 file of the weights. To use it (assuming Python and pip are already installed and added to the PATH environment variable), run the following command in the terminal:
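```
pip install keras
```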

This installs the keras package. Then, unzip the folder you just downloaded, open a command window in this folder, and run the main Python script (the filename below is a placeholder; use the actual script name from the folder):
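```
python main.py   # placeholder name; run whichever script in the folder contains the prediction loop
```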

and you should be able to throw words at the network and see what it comes out with!

To summarise, this project was my first time working with neural networks and I am very happy with the results. The high accuracy was pleasing and it was fun to see how the network dealt with a variety of words.
