Analysis of Google’s Page Ranking Algorithm

Sriram Rella
Nov 6, 2022



The world we live in today is more innovative and technologically advanced than it has ever been, and those advancements keep accelerating by the day, and in some fields by the hour. Thanks to these ever-evolving ideas and inventions, AI, 3D design, and virtual reality are all becoming integral parts of our lives. Yet one of the most important creations to change the entire world is the search engine. A search engine lets anyone, anywhere, find the information they need within seconds of typing a query into a small box. But what exactly is a search engine?

As defined by Google, a search engine is “a program that searches for and identifies items in a database that correspond to keywords or characteristics specified by the user.” A search engine is essentially a finder that scrounges through hundreds of millions of documents to return the results the user wants. Google has been the leading search engine since its launch in 1998, largely because of its efficient PageRank algorithm. Many competing search engines struggled to match Google's speed and the relevance of its results, in part because they lacked an equivalent ranking algorithm.

Before explaining Google's PageRank algorithm and why it is so efficient, let us expound on the Google search engine model. The Google search engine, developed by Sergey Brin and Lawrence Page, initially ran on Stanford's computers; it later became a universal application largely because of its effective use of a crawling technique, in which programs called spiders explore the content of the web ahead of time. Because of this, no search engine actually searches the web in real time.

During the crawling process, the spiders gather keywords while excluding stop words such as “the” and “of.” Once these crucial words are collected, each word and its related document information are stored in a database called an inverted index. For each specific word, the index records the documents that contain it, along with a description of the word and its number of occurrences, and it keeps track of the URLs that contain a higher number of keywords. This process is repeated every month or so, when the spider revisits all previously crawled pages and looks for updates or changes. The spider does not simply search for keywords and store their URLs, though. It first visits a seed page, which serves as a base: the spider follows the links on that page to other pages, which in turn link to still more pages, and along the way it uses either a breadth-first or a depth-first traversal to collect metadata such as document types, dates, creators, and so on.

Now, let us explain Google's PageRank algorithm. PageRank is a function that assigns a real number to every page the spider has crawled. Once crawling is complete, the page with the highest number is considered the most essential/relevant page for the user's input. Although PageRank might seem like a fixed ranking, it is not: a simple change in the user's input can change the entire ranking. Much of PageRank's power comes from its efficient use of transition matrices, which we explain in further detail later on. Our main goal for this project was to use what we learned about Google's PageRank to implement our own search engine. Since we do not have servers like Google, Bing, and Yahoo, we limited our search engine to the bounds of the Georgia Tech website.
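
As a concrete (and much simplified) illustration of the inverted-index idea, here is a minimal PHP sketch; the stop-word list, page texts, and URLs are made up for the example and are not what Google actually stores:

<?php
// Minimal inverted-index sketch: maps each keyword to the documents
// (URLs) that contain it, along with the number of occurrences.
$stopWords = ['the', 'of', 'a', 'and'];   // assumed stop-word list

function buildInvertedIndex(array $documents, array $stopWords): array
{
    $index = [];
    foreach ($documents as $url => $text) {
        // Split the page text into lowercase words.
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (in_array($word, $stopWords, true)) {
                continue;   // skip stop words
            }
            // Count how often each keyword appears in each document.
            $index[$word][$url] = ($index[$word][$url] ?? 0) + 1;
        }
    }
    return $index;
}

// Hypothetical pages standing in for crawled documents.
$docs = [
    'https://www.gatech.edu/a' => 'The College of Computing offers computer science degrees',
    'https://www.gatech.edu/b' => 'Computer engineering and the science of computing',
];

print_r(buildInvertedIndex($docs, $stopWords));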

One of the most important applications of eigenvectors and eigenvalues is Google's PageRank algorithm. For example, let's say we're searching for “Elon Musk” (the face behind Tesla and SpaceX, aka the real Tony Stark, as if you didn't already know). What determines the order of the pages returned about Elon Musk (wikipedia.org, tesla.com, forbes.com, etc.)?

Well, the hyperlink structure of the web forms a directed graph: the nodes represent web pages and the directed edges are the hyperlinks between them. In simpler terms, the whole World Wide Web can be viewed as one enormous directed graph.

Here is a simple visual representation of a directed graph. We can see that A, B, C, and D are web pages, and there are several connections between them: B points to A, C points to A, A points to D, and so on. Now that we have a directed graph, how do we define an important web page? As stated earlier, if a web page has many other pages pointing toward it, we can assume it is an important page. To better understand the web's hyperlink structure, we also need to understand the types of links we can find. First, inbound links are links into a web page from other pages. Next, outbound links are links that point from a given web page to another page. Lastly, dangling links are links that point to a page with no outgoing links of its own. For example, web page A has two incoming links, from C and B, and outgoing links to pages B, C, and D. What we can infer is that the more important a given web page is, the higher its PageRank will be, and the higher it will be placed in Google's results. So if we search for Elon Musk, the first few web pages have the highest PageRank; Wikipedia sits at the very top, suggesting that many web pages point to it.
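
To make the link-type bookkeeping concrete, here is a tiny PHP sketch of this graph as an adjacency list. Only the edges explicitly described above are included; the figure may contain additional links for C and D:

<?php
// Directed graph from the figure, as an adjacency list of outbound links.
// Only the edges described in the text are listed here; the figure may
// contain more links for pages C and D.
$outLinks = [
    'A' => ['B', 'C', 'D'],   // A links out to B, C, and D
    'B' => ['A', 'D'],        // B links out to A and D
    'C' => ['A'],             // C links out to A (plus possibly more)
    'D' => [],                // D's outbound links are not described
];

// Count inbound links by inverting the adjacency list.
$inLinks = array_fill_keys(array_keys($outLinks), []);
foreach ($outLinks as $page => $targets) {
    foreach ($targets as $target) {
        $inLinks[$target][] = $page;
    }
}

foreach ($outLinks as $page => $targets) {
    printf("%s: %d outbound, %d inbound (from %s)\n",
        $page, count($targets), count($inLinks[$page]),
        implode(', ', $inLinks[$page]) ?: 'none');
}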

Breadth-first search, a concept taught in Data Structures & Algorithms, comes into play during web crawling. Here is a simple BFS graph:

We have a root node A, which has two children, B and E, and B has two children, C and D. BFS visits the nodes layer by layer: it starts at node A, then visits B and E, then visits their children C, D, and F (in that order). For our hands-on PHP implementation, we applied BFS starting from the root web page "www.gatech.edu".

(Code screenshots, in order: the PHP crawler script that generates the links found on the current website; the script that prints out the data and calculates the PageRank for each page; and the script that calculates the PageRank based on the links returned by the SQL query.)
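
For a rough idea of what the crawling step looks like, here is a minimal BFS crawler sketch. It is not the exact script shown above: it fetches pages with file_get_contents and pulls out links with a simple regular expression rather than a full HTML parser.

<?php
// Simplified breadth-first crawler sketch: starts at a seed URL and
// visits pages level by level, collecting the links found on each page.
function bfsCrawl(string $seedUrl, int $maxPages = 20): array
{
    $queue   = [$seedUrl];   // FIFO queue of pages to visit
    $visited = [];           // url => list of outbound links found on it

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);        // dequeue the next page
        if (isset($visited[$url])) {
            continue;                      // already crawled
        }
        $html = @file_get_contents($url);  // fetch the page (may fail)
        if ($html === false) {
            $visited[$url] = [];
            continue;
        }
        // Extract href targets with a simple regex (a real crawler would
        // use a proper HTML parser and resolve relative URLs).
        preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);
        $links = array_unique($matches[1]);
        $visited[$url] = $links;

        foreach ($links as $link) {
            // Stay within the gatech.edu domain, as in our project.
            if (!isset($visited[$link]) && strpos($link, 'gatech.edu') !== false) {
                $queue[] = $link;          // enqueue for a later level
            }
        }
    }
    return $visited;
}

$graph = bfsCrawl('https://www.gatech.edu');
print_r(array_map('count', $graph));

A real crawler would also resolve relative URLs, respect robots.txt, and store its results in a database, as our SQL-backed scripts do.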

Our code finds all the hyperlinks within gatech.edu and then visits those hyperlinks to find more hyperlinks. Crawling this way gives us the following result for the search term “Computer”:

We’ve successfully crawled and mapped the topology of gatech.edu. So what is the mathematics behind these results? This is the original formula behind the PageRank algorithm:
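
In its commonly written form (with N the total number of pages and d the damping factor), it reads:

PR(A) = (1 - d)/N + d * ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) )

where B, C, and D stand for the pages that link to A. (The original Brin and Page paper writes the first term simply as (1 - d), without dividing by N.)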

Here PR(A) is the rank of web page A, and it depends on the PageRank of the other pages: PR(B), PR(C), and PR(D) are the ranks of pages B, C, and D that link to page A, and L(B), L(C), and L(D) are the numbers of outbound links on those pages. The damping factor d is a number between 0 and 1. So, coming back to the simple BFS graph, if we want to know the PageRank of web page A, we need to know the PageRanks of C and D, since they point to A. In other words, the PageRanks of C and D determine the PageRank of web page A. Using the formula above, the PageRank of web page A would be:
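
PR(A) = (1 - d)/4 + d * ( PR(C)/L(C) + PR(D)/L(D) )

with four pages in total and with C and D as the pages linking to A.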

Iteration 0 assigns each web page a probability of 1/4, because a random surfer can start from any of the four pages, so every page is equally likely at the very beginning. As we run more and more iterations of the formula, each one based on the result of the previous iteration, we converge toward the precise PageRank of each page. In this case, web page A ends up with a higher PageRank, while pages B, C, and D share the same PageRank.

Applying the formula page by page is one way to find each PageRank, but we can also use a matrix representation, which lets us do all of the calculations at once. Here we introduce the transition matrix. The transition matrix is a stochastic matrix in which each column describes the probability of the surfer landing on each page after one step. Using the directed graph above, suppose a random surfer is on web page A. From A, the surfer can visit B, C, or D, so there is a 1/3 probability of landing on each of them. Now suppose the surfer starts on web page B. From B, there is a 1/2 probability of landing on D and a 1/2 probability of landing on A. The same idea applies to web pages C and D. If we sum up the values of each column, we get 1. This gives us the stochastic matrix below.
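
With rows and columns ordered A, B, C, D, the two columns described above look like this (the C and D columns depend on the remaining links in the figure, which is not reproduced here, so they are left as dots):

          A     B    C    D
  A   [   0    1/2   .    .  ]
  B   [  1/3    0    .    .  ]
  C   [  1/3    0    .    .  ]
  D   [  1/3   1/2   .    .  ]

Each fully specified column sums to 1, since it lists the probabilities of where the surfer goes next from that page.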

As we multiply each iteration by the stochastic matrix, we get a closer and closer approximation of the PageRank of each web page. The initial vector is:
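
v0 = ( 1/4, 1/4, 1/4, 1/4 )

written as a column vector: the uniform distribution over the four pages from iteration 0 above.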

To find the second iteration of the PageRank, we multiply the initial vector by the stochastic matrix. To find the third iteration, we multiply the second-iteration vector by the stochastic matrix, and so on. This gives the following formula (where v is the PageRank vector of the pages and P is the transition matrix), which is essentially a Markov chain:
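
v(k+1) = P v(k)

so after k steps v(k) = P^k v(0), and the limiting PageRank vector v satisfies P v = v.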

So, to find the final PageRank vector of this Markov chain, we just have to solve an eigenvector-eigenvalue problem: the PageRanks of the web pages are given by the eigenvector of P corresponding to the eigenvalue 1. We have walked through web pages that have both outgoing and incoming links; but what happens with dangling pages, that is, pages that are pointed to but have no outgoing links of their own? Suppose we have the following graph:

The probability of a random surfer starting on any of these three web pages is 1/3, and the transition matrix for such a graph would be:
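
Since the figure is not reproduced here, assume for concreteness that pages A and B each link only to page C and that C has no outgoing links (one graph consistent with the behavior described below). With rows and columns ordered A, B, C, the transition matrix is:

          A    B    C
  A   [   0    0    0 ]
  B   [   0    0    0 ]
  C   [   1    1    0 ]

Notice that the C column is all zeros: a dangling page passes no probability on to any other page, so not every column sums to 1 anymore.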

If we apply the method above to this transition matrix, our first iteration gives the vector:
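
v1 = P v0 = ( 0, 0, 2/3 )

Under the assumed matrix above, all of the remaining probability piles up on the dangling page C after one step, and some of it has already leaked away because C's column is empty.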

But when we solve for iteration 2, we get the zero vector. This clearly cannot be right, so why does it happen?
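
v2 = P v1 = ( 0, 0, 0 )

The only probability left was sitting on C, and C has no outgoing links to pass it along, so the dangling page acts as a sink that drains the PageRank of every page down to zero.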

To avoid such scenarios, we introduce the damping factor d, together with its teleportation term (1 - d):

G = dP + (1 - d)K

Here, G is the PageRank (or Google) matrix, d is the damping factor, P is the transition matrix, and K is the n x n matrix whose entries are all equal to 1/n. The damping factor d is the probability that a random surfer keeps clicking on the links in the current web page, while (1 - d) is the probability that the surfer leaves the current page and jumps to a random page anywhere on the web (teleportation). Google is said to use d = 0.85.
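
To tie everything together, here is a small sketch of the damped power iteration in PHP, written as the equivalent per-page update v <- d * P * v + (1 - d)/n. The transition matrix below is just the three-page dangling example assumed earlier, not data from our crawler:

<?php
// Damped PageRank by power iteration: v <- d * P * v + (1 - d) / n.
// The constant teleportation term keeps every page's rank above zero,
// even when a dangling page swallows part of the probability.
function pageRank(array $P, float $d = 0.85, int $iterations = 50): array
{
    $n = count($P);
    $v = array_fill(0, $n, 1 / $n);   // iteration 0: uniform vector

    for ($k = 0; $k < $iterations; $k++) {
        $next = array_fill(0, $n, (1 - $d) / $n);   // teleportation term
        for ($i = 0; $i < $n; $i++) {
            for ($j = 0; $j < $n; $j++) {
                $next[$i] += $d * $P[$i][$j] * $v[$j];   // d * (P v)
            }
        }
        $v = $next;
    }
    return $v;
}

// The assumed three-page dangling example from above: A and B link only
// to C, and C has no outgoing links. $P[$i][$j] is the probability of
// moving from page j to page i.
$P = [
    [0, 0, 0],   // row A
    [0, 0, 0],   // row B
    [1, 1, 0],   // row C
];

print_r(pageRank($P));   // C ends up with the highest rank; none are zero

With a dangling page present, the resulting ranks no longer sum to 1; production implementations also redistribute a dangling page's mass uniformly across all pages. The sketch is only meant to show why the teleportation term prevents the zero-vector collapse.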

The Google search engine is the world's fastest and most powerful search engine, and much of that is possible because of the invention of PageRank. PageRank is, at its core, an algorithm that effectively puts a collection of linear algebra concepts to work. The transition matrix is implemented alongside the BFS crawling algorithm, and Markov chains describe how the rank vector evolves as the spider traverses the web. Most importantly, the stochastic matrix forms the foundation for ranking each page at every iteration with the help of matrix multiplication. As shown above, we implemented our own search engine, bounded to Georgia Tech's website only because of the sheer depth and complexity of the World Wide Web. Still, we were able to combine what we learned from coding and from linear algebra to develop a PHP script that uses BFS, which in turn relies on matrices and their operations; even the code that prints the final PageRank results works with a matrix. All in all, although PageRank uses a wide range of linear algebra concepts, it is hardly the only algorithm that does: machine learning, 3D printing, and AI are just a few of the other technologies that build linear algebra into their programs. And linear algebra is not limited to coding and high-level technologies either; it is used to analyze data gathered in science experiments, in cinematography, in animation, and much more.

Photo by Nathana Rebouças on Unsplash

Co-authored by Rahul Sunkara, and Jared Zengo
