Google’s Prestigious PageRank Algorithm that Revolutionalized Web Surfing.
A long time ago, where Google did not even exist, searching the web was a whole other exercise by itself. It was like running a 10km run after having a complete cross-fit workout. The sites suggested by the older search engines were too often irrelevant and old, while the ones you really wanted were either buried way down in the list of results or missing altogether. This was mostly because of the fact that there wasn’t an automatic ranking algorithm based on the number of visits and relevance based on time. For all the computer science enthusiasts, you could even say that the web was just a queue.
The whole purpose of the web was to return the best and most relevant answers to all your problems as quickly as possible. Having your web represent a queue, which just constantly increases infinitely, can make searching up an answer as difficult as trying to look for your lucky spoon in a stack of dishes in the sink right after a party.
Though, with the help of linear algebra, more specifically matrices and vectors, link analysis came to the rescue. Whether you want to detect patterns in large data sets or perform gigantic computations involving millions of variables, linear algebra has the tools you need. For example, just like how Google revolutionalized the web with their PageRank Algorithm. This algorithm is the reason why today, we can find relevant answers to most of our questions quickly.
Worrying about content turned out to be an impractical way to rank web pages. Computers weren’t good at it, and human judges couldn’t keep up with the thousands of pages added each day. The approach taken by Larry Page and Sergey Brin, the grad students who co-founded Google, was to let webpages rank themselves by voting with their feet, or in other words, their links.
Google’s Algorithm assigns a fractional score between 0 and 1 to each page. That score is called its PageRank. It essentially measures how important that page is relative to the others by computing the proportion of time that a hypothetical Web surfer would spend there. Whenever there is more than one outgoing link to choose from, the surfer selects one at random, with equal probability. Under this interpretation, pages are regarded as more important if they are visited more frequently. And because the PageRanks are defined as proportions, they have to add up to 1 when summed over the whole network.
Sounds pretty easy to create such a powerful algorithm right? That’s what I thought. Over the past weekend, I attempted to create a very basic algorithm that reflects the general physics of the PageRank algorithm only with 3 nodes (these nodes represent websites).
Initially, the algorithm will start off by giving every website an equal rank of 1/3. After each iteration (10 in total in this simulation), the ranks of each vertex update based on 3 calculations:
x = z
y = (1/2)x
z = (1/2)x + y
*This update rule can be seen from the model above*
The rule is that each page takes its PageRank from the last round and distributes it equally to all the pages it links to. Therefore, after one round, the updated value of X would still equal 1/3, because that’s the PageRank it received from Z. However, Y’s score drops to 1/6, since it gets only half of X’s PageRank from the previous iteration. The other half goes to Z, which makes Z the highest-ranked website in this iteration since along with 1/6 it receives from X, it also gets the full 1/3 from Y, thus totaling to 1/2.
After 10 iterations, we can come to the conclusion that the algorithm’s final results follow a similar linear pattern, x = 2y= z. This implies that X and Z are equally important pages, even though Z has twice as many links coming in. This is turn, makes sense because X is just as important as Z because it gets the full endorsement of Z but reciprocates with only half its own rank. The other half is then sent to Y, which also explains why Y’s rank is only half as well as X and Z.
In terms of my code, it is not as complex as it sounds by the conceptual definition. The Graph data structure was used to make the underlying base for the algorithm’s computations to provide more flow.
Class Alg and
.PageRank() is where the actual algorithm resides. After initializing the graph with all the vertices, which represent different websites, we prompt a for loop to iterate 10 times to change each value of each vertex (the Fraction module was used to make calculations more accurate when playing with floats) based on their linear relation, which is defined above.
base_percentages was a copied list of
percentages, which was used to reference the updates of the ranks of each vertex, and at the end, to finish the algorithm off, we printed a simple statement printing each vertex’s value and their PageRank, which was converted into a percentage after the 10 iterations.
Overall, the complexity of this algorithm was O(N) and the algorithm does portray the ratio 2:1:2 between the vertices.
Here is the link to my project: https://github.com/GEEGABYTE1/PageRank_Alg_Simulation
Linear Algebra is great when you work with a plenitude of variables, as there are in the real web. In other words, it works great when trying to develop faster and faster algorithms for solving massive sets of equations. Each small difference makes a huge impact on the front-end ranging from multiple aspects, such as the performance of the web itself, to the accessibility of different sources on the web. This small project was great to get a grasp of what linear algebra can do, and how powerful Google’s PageRank algorithm really is. Although my algorithm only solves 3 basic equations, in reality, Google’s algorithm can solve billions at a time!