Learning Machine Learning: Toy Search Engine on NodeJS

Sergei
4 min read · Mar 17, 2018


Machine learning is a modern trend and a promising direction in IT. A few days ago I started learning it too, trying to make my PC think :).

The first thing I realized is that it’s not magic :).

There are machine learning tools like TensorFlow, Keras, and PyTorch, but for me as a noob it was difficult to make sense of their plenty of options. Luckily I came across synaptic.js, whose author explains the basics of neural networks and neurons quite clearly. It was also a plus for me that synaptic is written in JavaScript, the language I use in my current job.

Here I’ll describe my first attempt at a more or less serious task with a neural network: a toy search engine function that takes a query and returns a list of relevant articles. To get relevant responses, I trained the neural network on a small set of Wikipedia articles.

I prepared a db.json file with 20 Wikipedia URLs and post chunks. I started with a small number because it makes the results easier to obtain and debug, and it seemed enough to understand the basics.
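For reference, an entry of db.json might look like this; the field names url and text are my assumption, not necessarily the original schema:

```javascript
// Hypothetical shape of db.json: an array of 20 entries,
// each pairing a Wikipedia URL with a chunk of the article's text.
const db = [
  {
    url: 'https://en.wikipedia.org/wiki/Linux',
    text: 'Linux is a family of free and open-source software operating systems...'
  }
  // ...19 more entries
];
```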

My idea was the following:

  • During training, the neural network should learn to separate each article into its own category (20 posts = 20 categories);
  • At query time, the neural network should determine which category best matches the request.

First Approach

As I hadn’t worked with neural networks before, I started by reading articles. For text classification the approach may be as follows:

  • Gather a dictionary of words from all posts. It is also good to skip useless words such as as, of, a, an, or, and, and so on. In my case I gathered nearly 1000 words and simply skipped words with length < 3; of course, there are more advanced variants. I also used stemming from the natural library to treat different forms of a word as one word, for example text and texts both as text.
  • Assign each word a unique vector of 1’s and 0’s whose length equals the dictionary length. For example, if I have 5 words, engineer, people, invent, design, analyse, their vectors will be:
engineer = [0, 0, 0, 0, 1]
people = [0, 0, 0, 1, 0]
invent = [0, 0, 1, 0, 0]
design = [0, 1, 0, 0, 0]
analyse = [1, 0, 0, 0, 0]
  • Such indexing guarantees that if a post includes a different set of words than other posts, its vector will also be unique among the posts:
engineer people invent                = [0, 0, 1, 1, 1]
people design analyse                 = [1, 1, 0, 1, 0]
engineer people invent design analyse = [1, 1, 1, 1, 1]
  • On the other hand, if two posts include the same words in a different order, they will be considered the same :).
  • The vector of each post is passed to the input of the neural network, so the number of input neurons should equal the number of words in the dictionary (and the length of a post vector).
  • The output of the neural network should be the index (category) of the post. For category indexing I used the post’s serial number in binary format. For example, the post Artificial_neural_network has number 19 among the posts, so its category vector is [1, 0, 0, 1, 1].
  • For the neural network I used the simple Perceptron model with 1000 input neurons, 100 hidden neurons, and 5 output neurons. Unfortunately, this number of neurons was too much for my PC :). The network got through only 10 training iterations in several minutes and gave unfriendly results. Maybe I configured synaptic wrongly or lacked neural network theory; in any case, this approach failed for me, and I decided to look for something else.
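A minimal sketch of this encoding in plain JavaScript (the function names are mine, not from the original code; the network on top of it would be synaptic’s Architect.Perceptron(1000, 100, 5), trained with a Trainer on { input, output } pairs):

```javascript
// Build a sorted vocabulary from all posts, skipping words with length < 3.
function buildVocabulary(posts) {
  const vocab = new Set();
  for (const post of posts) {
    for (const word of post.toLowerCase().match(/[a-z]+/g) || []) {
      if (word.length >= 3) vocab.add(word);
    }
  }
  return [...vocab].sort();
}

// One-hot bag of words: 1 if the word occurs in the post, 0 otherwise.
// Word order is lost, which is exactly the limitation noted above.
function vectorize(post, vocab) {
  const words = new Set(post.toLowerCase().match(/[a-z]+/g) || []);
  return vocab.map(w => (words.has(w) ? 1 : 0));
}

// Category index as a fixed-width binary vector, e.g. 19 -> [1, 0, 0, 1, 1].
function categoryToBinary(index, bits) {
  return index.toString(2).padStart(bits, '0').split('').map(Number);
}
```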

Second Approach

I partially took it from articles about neural networks and partially arrived at it through my own thoughts.

The idea:

  • Gather a dictionary of words as before, and associate a float number in [0, 1] with each word.
  • Split each post into ngrams of 10 words each (just an empirical value).
  • Since an ngram is an array of words, it can be represented as a vector of the words’ float numbers.
  • Associate each category with a float number in the range [0, 1].
  • Train the neural network to classify each ngram of a post with the post’s category number.
  • The neural network then needs just 10 input neurons, 4 hidden neurons, and 1 output neuron: much smaller than in the previous variant.
  • 10,000 training iterations passed quite quickly (30.486s). Some results:
> var se = require('./nnSearch')
> se.search('american canadian engineer and business man')
[ 'https://en.wikipedia.org/wiki/Elon_Musk',
'https://en.wikipedia.org/wiki/Engineer',
'https://en.wikipedia.org/wiki/Sport',
'https://en.wikipedia.org/wiki/Russia',
'https://en.wikipedia.org/wiki/Usa' ]
> se.search("creator of linux")
[ 'https://en.wikipedia.org/wiki/Linus_Torvalds',
'https://en.wikipedia.org/wiki/JavaScript',
'https://en.wikipedia.org/wiki/Linux',
'https://en.wikipedia.org/wiki/Artificial_neural_network',
'https://en.wikipedia.org/wiki/Open-source_software' ]
> se.search('free operating system core created by finnish engineer')
[ 'https://en.wikipedia.org/wiki/Linus_Torvalds',
'https://en.wikipedia.org/wiki/JavaScript',
'https://en.wikipedia.org/wiki/Linux',
'https://en.wikipedia.org/wiki/Artificial_neural_network',
'https://en.wikipedia.org/wiki/Open-source_software' ]
> se.search("high level programming language")
[ 'https://en.wikipedia.org/wiki/JavaScript',
'https://en.wikipedia.org/wiki/Artificial_neural_network',
'https://en.wikipedia.org/wiki/Linus_Torvalds',
'https://en.wikipedia.org/wiki/Linux',
'https://en.wikipedia.org/wiki/Open-source_software' ]
> se.search("image recongition")
[ 'https://en.wikipedia.org/wiki/Artificial_neural_network',
'https://en.wikipedia.org/wiki/JavaScript',
'https://en.wikipedia.org/wiki/Linus_Torvalds',
'https://en.wikipedia.org/wiki/Linux',
'https://en.wikipedia.org/wiki/Open-source_software' ]
> se.search("free operating system")
[ 'https://en.wikipedia.org/wiki/Linux',
'https://en.wikipedia.org/wiki/Linus_Torvalds',
'https://en.wikipedia.org/wiki/Open-source_software',
'https://en.wikipedia.org/wiki/JavaScript',
'https://en.wikipedia.org/wiki/Node.js' ]
> se.search("third planet from sun")
[ 'https://en.wikipedia.org/wiki/Software',
'https://en.wikipedia.org/wiki/Node.js',
'https://en.wikipedia.org/wiki/Earth',
'https://en.wikipedia.org/wiki/Open-source_software',
'https://en.wikipedia.org/wiki/Sun' ]
> se.search("universe object with strong gravitational effect")
[ 'https://en.wikipedia.org/wiki/Black_hole',
'https://en.wikipedia.org/wiki/Albert_Einstein',
'https://en.wikipedia.org/wiki/Sun',
'https://en.wikipedia.org/wiki/Google',
'https://en.wikipedia.org/wiki/Earth' ]
> se.search("top russian it company and search engine")
[ 'https://en.wikipedia.org/wiki/Moscow',
'https://en.wikipedia.org/wiki/Tallinn',
'https://en.wikipedia.org/wiki/Usa',
'https://en.wikipedia.org/wiki/Yandex',
'https://en.wikipedia.org/wiki/Russia' ]
> se.search("the capital and large city of estonia")
[ 'https://en.wikipedia.org/wiki/Usa',
'https://en.wikipedia.org/wiki/Russia',
'https://en.wikipedia.org/wiki/Moscow',
'https://en.wikipedia.org/wiki/Sport',
'https://en.wikipedia.org/wiki/Tallinn' ]
> se.search("the capital of russia")
[ 'https://en.wikipedia.org/wiki/Russia',
'https://en.wikipedia.org/wiki/Sport',
'https://en.wikipedia.org/wiki/Usa',
'https://en.wikipedia.org/wiki/Engineer',
'https://en.wikipedia.org/wiki/Moscow' ]
> se.search("federal republic country with states")
[ 'https://en.wikipedia.org/wiki/Moscow',
'https://en.wikipedia.org/wiki/Usa',
'https://en.wikipedia.org/wiki/Tallinn',
'https://en.wikipedia.org/wiki/Russia',
'https://en.wikipedia.org/wiki/Yandex' ]
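The preprocessing for this second approach can be sketched as follows. The chunked, non-overlapping ngram split and the linear word-to-float mapping are my assumptions; the article doesn’t show the exact code. The network on top would be synaptic’s Architect.Perceptron(10, 4, 1).

```javascript
// Assign each dictionary word a float in [0, 1] by its index position.
function wordValues(vocab) {
  const values = new Map();
  vocab.forEach((word, i) => values.set(word, i / (vocab.length - 1)));
  return values;
}

// Chunk a post into ngrams of `size` words (10 in the article).
// Trailing words that don't fill a whole ngram are dropped in this sketch.
function toNgrams(words, size) {
  const ngrams = [];
  for (let i = 0; i + size <= words.length; i += size) {
    ngrams.push(words.slice(i, i + size));
  }
  return ngrams;
}

// An ngram becomes a vector of its words' floats -- the network input.
function encodeNgram(ngram, values) {
  return ngram.map(word => values.get(word) ?? 0);
}

// Each post category is a single float in [0, 1] as well:
// category k out of n posts -> k / (n - 1).
function categoryValue(index, total) {
  return index / (total - 1);
}
```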

Summary

  • With neural networks there are two big issues: how to prepare the dataset and how to configure the network.
  • Neural network parameters have to be chosen empirically.
  • Training a neural network can take a very long time.
  • The search engine gives somewhat relevant results, but not ideal ones. (In fact, an attempt to use the same network configuration for searching code documentation gave much worse results. I suppose this comes from my lack of deep theoretical knowledge and experience, plus primitive dataset preparation and network configuration.)
  • My PC still can’t think on its own, or it hides it well :).

P.S. Source code


Sergei

Software Engineer. Senior Backend Developer at Pipedrive. PhD in Engineering. My interests are IT, High-Tech, coding, debugging, sport, active lifestyle.