Google Engine from scratch: PageRank algorithm in Dart/Flutter

Ilia Gyrdymov
6 min readJul 5, 2022

--

Hi all,

I’m Ilia, and I’m a frontend developer with an interest in Machine Learning and Dart programming language.

A couple of years ago, I started a project — a library in Dart called ml_linalg. The library aims at efficient computations in the Linear Algebra field.

ml_linalg is based on SIMD architecture which allows doing the math fast.

I’d like to demonstrate to you the possibilities of the library by showing how to implement a famous PageRank algorithm in the Dart programming language.

But first, let’s look through the details of the algorithm.

What is PageRank?

It’s an algorithm which ranks websites for the search output. The more the rank, the more the probability of visiting the website, and the higher the position in the search output should have. The algorithm was developed by Larry Page and Sergey Brin, founders of Google, in 1996.

PageRank is still used in the google engine to rank websites.

Imagine we have a small internet consisting only of 4 sites. Let’s describe the relationships between the sites as a graph:

  • Site “A” has links to “B” and “C”
  • Site “B” has links to “C” and “D”
  • Site “D” has a link to “C”
  • Site “C” has a link to “D”

The graph is awesome, but I have some doubts if a computer can understand it, so let’s convert the graph into pure numbers. To do so, we need to create a matrix where rows will contain the probability of visiting the corresponding site, and columns will contain the probability of leaving the website:

The first column describes the probabilities of leaving site “A”. One can leave the website for “B” or “C” websites. In total, it’s possible to visit 2 websites from the “A” website. The probability of visiting each website is the same, so it can be written as 1/<total_amount_of_websites>, which is 1/2.

The second column describes the probabilities of leaving site “B”. One can leave the website for “C” or “D” websites. In total, it’s possible to visit 2 websites from the “B” website. The probability of visiting each website is the same, so it can be written as 1/<total_amount_of_websites>, which is 1/2 again.

The third column describes the probabilities of leaving site “C”. One can leave the website for the “D” website only. In total, it’s possible to visit 1 website, so the probability for the “D” position is 1.

The fourth column describes the probabilities of leaving site “D”. One can only leave the website for the “C” website. In total, it’s possible to visit 1 website, so the probability for the “C” position is 1.

Let’s think of ranking the websites. At the first glance, it seems that the task is pretty simple — we may just sum up all the probabilities of visiting a certain website. Let’s denote the function for getting the rank value as R(<website>) and do it:

  • R(A) = 0 + 0 + 0 + 0 = 0
  • R(B) = 1/2 + 0 + 0 + 0 = 1/2
  • R(C) = 1/2 + 1/2 + 0 + 1 = 2
  • R(D) = 0 + 1/2 + 1 + 0 = 1.5

Wait a minute. The rank definition tells us that not all websites are equally important, doesn’t it? But we treat all the sites equal. Apparently, it should be the weighted sum. Let’s rewrite our calculations:

  • R(A)= R(A) * 0 + R(B) * 0 + R(C) * 0 + R(D) * 0 = ?
  • R(B) = R(A) *1/2 + R(B) * 0 + R(C) * 0 + R(D) *0 = ?
  • R(C) = R(A) *1/2 + R(B) * 1/2 + R(C) * 0 + R(D) *1 = ?
  • R(D) = R(A) *0 + R(B) * 1/2 + R(C) * 1 + R(D) *0 = ?

Hmm, it seems that it is a self-referring problem: we use ranks to calculate the ranks. Let’s write it in a more generic form:

This is a formula for just one website’s rank. Let’s pack all ranks into a vector and rewrite the formula in an even more generic form:

Doesn’t it remind you of something? Ok, you might have forgotten it. Vector r is an eigenvector of the matrix L! The vector is our target, and we have to find it.

I think it's better to refresh our memory and try to remember what eigenvectors are.

Say we have a couple of vectors:

which are basically the unit vectors. Let’s draw them on the coordinate system:

As you may notice, the vectors form a span:

Imagine we need to apply a linear transformation. Say, we need to shear the space formed by the vectors:

The dashed lines are the new span.

Which of the original vectors didn’t change their direction? Right, vector b stays the same. The vector is a kind of “characteristic” of the transformation. It is the most stable thing in this small Universe.

These stable vectors that don’t change direction after linear transformation are called “eigenvectors”.

Even if the vector is scaled after a transformation, it is still the eigenvector:

Vector b points in the same direction but has a different length.

The scale factor is called “eigenvalue”.

That’s the basis for eigen-stuff.

It’s pretty simple to calculate eigenvectors with the ml_linalg library. The Matrix class has a method called eigen, which is exactly what we need. There are several ways to find eigenvectors and corresponding eigenvalues, but for our example, the simplest method works really well. It’s called the Power Iteration method, which is the default one for eigen function.

Let’s code. What we have to do:

  • We need to create a matrix to store probabilities from our example using the Matrix class
  • Finding eigenvectors is an iterative procedure, so we need to create a vector with initial ranks using the Vector class — it will be our starting point for the future iterations
  • Call the eigen method with the initial vector from the previous step

Let’s look at what we have:

Since there can be a lot of eigenvectors, eigen returns a collection of vectors and their eigenvalues. Since the Power Iteration method finds only one eigenvector, we may just call the getter first of the returned collection.

All computations behind eigen method are SIMD-powered, the method is highly efficient and fast.

The last instruction prints the following:

(0.0, 0.0, 0.7893522381782532, 0.6139405965805054)

It can be interpreted in the following way:

  • The rank value of “A” is 0.0
  • The rank value of “B” is 0.0
  • The rank value of “C” is 0.78
  • The rank value of “D” is 0.61

According to the vector, the third element, the “C” website, has the highest rank. It makes a lot of sense since “C” has the greatest amount of input links compared to the other websites.

That’s pretty much it,

Cheers :)

--

--

Ilia Gyrdymov

Frontend engineer (Dart, Vue/Vuex/Nuxt, React/Redux + Typescript) with an interest in Machine Learning, living in Cyprus