Classifying text with Neural Networks and mimir in JavaScript

Aug 12, 2015 · 6 min read

I have been working a lot with Computer Vision (CV) in the last few years: inevitably, at some stage, the topic of Machine Learning will crop up when working with CV. I had the opportunity to learn about the most popular concepts and models: amongst those, I was particularly intrigued by Artificial Neural Networks (ANN). There are plenty of resources on the web that illustrate the ANN model, so I’m not going to explain it in this article, rather I will show how to use ANNs in JavaScript for text classification purposes.

brain

brain.js is an exceptionally good library, probably one of the best discoveries I have made on GitHub, and it is extremely easy to use. brain is a Neural Network library written in JavaScript and it suited my purpose perfectly. brain is also the reason I dropped my own implementation of ANNs in JavaScript: I felt I was wasting my time trying to match its quality.

mimir

mimir is a micro-module I wrote that uses the Bag-Of-Words model to represent text in a vector form. mimir also performs tf-idf (term frequency — inverse document frequency) analysis to weigh the importance of a word within the context of a set of texts.
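To make the tf-idf idea concrete, here is a standalone sketch in plain JavaScript (this is not mimir's implementation or API, just the underlying arithmetic): tf is the relative frequency of a term within a document, and idf down-weighs terms that appear in many documents of the corpus.

```javascript
// Standalone tf-idf sketch (not mimir's actual API).
function tokenize(text) {
    return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Term frequency: how often the term occurs in the document,
// relative to the document's length.
function tf(term, doc) {
    var words = tokenize(doc);
    var count = words.filter(function (w) { return w === term; }).length;
    return count / words.length;
}

// Inverse document frequency: log of (number of documents /
// number of documents containing the term). Assumes the term
// occurs in at least one document.
function idf(term, docs) {
    var matches = docs.filter(function (d) {
        return tokenize(d).indexOf(term) !== -1;
    }).length;
    return Math.log(docs.length / matches);
}

function tfidf(term, doc, docs) {
    return tf(term, doc) * idf(term, docs);
}

var docs = [
    'John likes to watch movies. Mary likes movies too.',
    'John also likes to watch football games.'
];
// "movies" appears only in the first document, so it gets a positive weight there;
// "likes" appears in both documents, so its idf (and hence its tf-idf) is zero.
console.log(tfidf('movies', docs[0], docs));
console.log(tfidf('likes', docs[0], docs));
```

This is why tf-idf is useful for weighing words: terms shared by every document in the set carry no discriminating information and score zero.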

Bag-Of-Words (BOW)

What is the BOW model? The example on Wikipedia is actually quite clear: given two documents

John likes to watch movies. Mary likes movies too.

and

John also likes to watch football games.

you can construct a dictionary consisting of all the words included in either document:

{
    "John": 1,
    "likes": 2,
    "to": 3,
    "watch": 4,
    "movies": 5,
    "also": 6,
    "football": 7,
    "games": 8,
    "Mary": 9,
    "too": 10
}

so each document can be represented in an array form, each element representing the number of occurrences of a word in that document, in our case:

[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

Pretty simple. Let’s move on.
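mimir does this for us in the rest of the article, but the idea is simple enough to sketch by hand (this is not mimir's implementation, and the dictionary ordering here may differ slightly from the Wikipedia example above, since words are added in order of first appearance):

```javascript
// Hand-rolled Bag-Of-Words sketch.
function tokenize(text) {
    return text.match(/[A-Za-z']+/g) || [];
}

// Build a dictionary of every distinct word across all documents,
// in order of first appearance.
function buildDict(docs) {
    var dict = [];
    docs.forEach(function (doc) {
        tokenize(doc).forEach(function (word) {
            if (dict.indexOf(word) === -1) {
                dict.push(word);
            }
        });
    });
    return dict;
}

// Represent one document as an array of word counts over the dictionary.
function bow(doc, dict) {
    var words = tokenize(doc);
    return dict.map(function (word) {
        return words.filter(function (w) { return w === word; }).length;
    });
}

var docs = [
    'John likes to watch movies. Mary likes movies too.',
    'John also likes to watch football games.'
];
var dict = buildDict(docs);
// dict is [John, likes, to, watch, movies, Mary, too, also, football, games]
console.log(bow(docs[0], dict)); // [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
console.log(bow(docs[1], dict)); // [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```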

History, Music or Programming?

Those of you who are familiar with ANNs will probably have figured out where I’m going with this.

Given a set of documents, it is possible to extract a dictionary of words and represent each document in the set with an array.

The size of this array will be the size of the input layer of the ANN, and the number of classes will be the size of the output layer. The size of the hidden layer can vary: there are several rules of thumb on the web, such as the mean of the input and output sizes, half the size of the input, and a few more. The beauty of brain is that it calculates everything automatically from the data fed to the ANN. This is very useful because I can forget about computing the size of the dictionary to specify the input layer size, for example.

Take the following set of texts (3 texts per class: History, Programming and Music):

[
    // history
    "The end of the Viking-era in Norway is marked by the Battle of Stiklestad in 1030",
    "The end of the Viking Age is traditionally marked in England by the failed invasion attempted by the Norwegian king Harald III",
    "The earliest date given for a Viking raid is 787 AD when, according to the Anglo-Saxon Chronicle, a group of men from Norway sailed to the Isle of Portland in Dorset",
    // programming
    "A programming language is a formal constructed language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs to control the behavior of a machine or to express algorithms.",
    "Thousands of different programming languages have been created, mainly in the computer field, and many more still are being created every year.",
    "The description of a programming language is usually split into the two components of syntax (form) and semantics (meaning). Some languages are defined by a specification document (for example, the C programming language is specified by an ISO Standard), while other languages (such as Perl) have a dominant implementation that is treated as a reference",
    // music
    "Classical music is art music produced or rooted in the traditions of Western music (both liturgical and secular)",
    "European music is largely distinguished from many other non-European and popular musical forms by its system of staff notation, in use since about the 16th century",
    "classical music has been noted for its development of highly sophisticated forms of instrumental music."
]

We want to extract a dictionary containing all the terms present in the texts, and represent each text as a JavaScript array. To do that we need mimir, and we need brain for the Neural Network:

var mimir = require('mimir'),
    brain = require('brain');

Note that I work in Node.js, but you can reproduce this in the browser without a problem.

Then we create a couple of utility functions:

function vec_result(res, num_classes) {
    var i = 0,
        vec = [];
    for (i; i < num_classes; i += 1) {
        vec.push(0);
    }
    vec[res] = 1;
    return vec;
}

function maxarg(array) {
    return array.indexOf(Math.max.apply(Math, array));
}

The first is a small utility that creates an array representing the result of a classification (more on this later); the second returns the index of the maximum value of an array, which is useful for outputting the name of the class (again, more on this later).
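To see concretely what these two helpers do, here they are again in action (same definitions as above, with example calls):

```javascript
// Build a one-hot vector: all zeros except a 1 at the class index.
function vec_result(res, num_classes) {
    var i = 0,
        vec = [];
    for (i; i < num_classes; i += 1) {
        vec.push(0);
    }
    vec[res] = 1;
    return vec;
}

// Index of the largest value in an array.
function maxarg(array) {
    return array.indexOf(Math.max.apply(Math, array));
}

console.log(vec_result(1, 3));           // [0, 1, 0] -- class 1 out of 3
console.log(maxarg([0.71, 0.09, 0.13])); // 0 -- the highest score is at index 0
```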

Now to the fleshy bit of the code, first we create a map of classes:

var ANN_Classes = {
        HISTORY: 0,
        PROGRAMMING: 1,
        MUSIC: 2
    },
    classes_array = Object.keys(ANN_Classes),

The texts we outlined before are stored in a variable texts, so we can obtain the dictionary for those texts by calling

dict = mimir.dict(texts),

and then we create the training data by associating each text to the appropriate classification:

traindata = [
    [mimir.bow(texts[0], dict), ANN_Classes.HISTORY],
    [mimir.bow(texts[1], dict), ANN_Classes.HISTORY],
    [mimir.bow(texts[2], dict), ANN_Classes.HISTORY],
    [mimir.bow(texts[3], dict), ANN_Classes.PROGRAMMING],
    [mimir.bow(texts[4], dict), ANN_Classes.PROGRAMMING],
    [mimir.bow(texts[5], dict), ANN_Classes.PROGRAMMING],
    [mimir.bow(texts[6], dict), ANN_Classes.MUSIC],
    [mimir.bow(texts[7], dict), ANN_Classes.MUSIC],
    [mimir.bow(texts[8], dict), ANN_Classes.MUSIC]
],

after this, we create the texts to be classified:

test_history = 'The beginning of the Viking Age in the British Isles is, however, often given as 793.',
test_music = 'Baroque music is a style of Western art music composed from approximately 1600 to 1750',
test_bow_history = mimir.bow(test_history, dict),
test_bow_music = mimir.bow(test_music, dict);

Now we are ready to feed our mini-dataset to train the ANN:

var net = new brain.NeuralNetwork(),
    ann_train = traindata.map(function (pair) {
        return {
            input: pair[0],
            output: vec_result(pair[1], 3)
        };
    });

net.train(ann_train);

And lastly we test the trained network:

var predict = net.run(test_bow_history);
console.log(predict);
console.log(classes_array[maxarg(predict)]);
console.log(classes_array[maxarg(net.run(test_bow_music))]);

And here’s the output:

[ 0.7126992212077395, 0.09320228147016081, 0.130837623591308 ]
HISTORY
MUSIC

That first array is an example of the output produced by brain: each class gets a score between 0 and 1 (which we can read as a confidence, even though the scores don't sum exactly to 1), and the highest value indicates the most probable class. Our maxarg function calculates the index of the highest value, which we use to retrieve the name of the class from the classes_array array.

The last two lines are human-friendly versions of the classification operated by the ANN.

Note

Naturally, the example is ultra-simplified, with toy-sized training data and samples that were easy to classify. Nonetheless, the ANN rated the historical text at 0.71, versus 0.09 for programming and 0.13 for music, so as proof-of-concepts go we can definitely call ourselves satisfied.

Conclusion

BOW allows us to represent text as an array, which opens up a world of possibilities. For example, we could apply the same concepts illustrated here with an SVM classifier instead of an ANN, or with both, or with even more classifiers, and then combine their outputs to increase our level of confidence in the final classification.

Complete example code available at: classify-text

Written by

Joe Minichino

Distributed Systems Sorcerer, AI and Heavy Metal. Labs Engineer at Teamwork.com
