Nailing Internal Linking with Graph Theory

Paulo Suzart
Aug 6, 2020 · 10 min read

Simply put, Internal Linking is the practice of linking pages within the same domain. This way users get a more fluid browsing experience, while Search Engine bots can also understand your content based on how these pages are linked. To make it effective, one should consider at least two technical requirements: i) having as many pages of the site as possible linked, and ii) positioning the linked pages as few clicks as possible away from the home page.

In this article we will explore how we nailed SEO Internal Linking at Omio by applying Graph Theory algorithms to get ~99% of pages linked while keeping 90% of pages reachable within a certain depth.

TL;DR:
Search Engines “see” your pages through the lens of a huge graph processor. When they apply algorithms like PageRank, these three dimensions — the number of in-links, the quality of in-links and the distance from the home page — are what matter in how your pages rank (or don’t). Remember, your site is a subgraph of the whole internet; by presenting a “healthier” graph to users and bots, your chances of better ranking grow immediately.

Orphan and Deep Pages

At Omio we tried several different approaches over time, and none seemed to work properly. We always ended up with a high number of Orphan Pages and Too Deep Pages, that is, pages that lie too many clicks away from the home page.

The message you send to bots when you have pages buried deep down the layers of your site is: “hey, this is not a relevant page”. Orphan Pages are not even reachable by browsing from the home page, so the message there is a bit worse; it’s like: “Hey Google, don’t even consider those pages”.

The two requirements presented in the first paragraph of this article make total sense, but they should be governed by the main rule of SEO: relevant pages should get more links. You don’t want low-traffic or low-conversion pages getting the same amount of links as your blockbuster pages, right? In this equation, quality also matters. In other words, it is important for a page to have a high number of links pointing to it from other pages with high relevance.

Previous Approaches

In the past we would follow very simple rules. Things like:

rule A) If you are on a page “Mode from A to B”, like Trains from Berlin to Paris, show a box of links containing the top 30 destinations in Europe.

rule B) If you are on a page “Travel to A”, like Travel to London, show the top destinations leaving from London.

rule C) If you are on a page “mentioning a location A”, show the nearest connections from that location A. For example, if you are on a page similar to Travel to Berlin, show connections to locations nearby, like trains from Berlin to Potsdam.

There were many other rules, and they actually make sense from a business standpoint. The problem is that each one of them would heavily skew the number of links displayed and concentrate all the internal links on the same few pages.

Rule A would cause the same 30 destinations to get links from all pages mentioning locations in Europe. This is good for London, Amsterdam and Munich, but that’s it. Smaller destinations would simply be left behind. Rule B is also useful, but given that we have ~1800 pages ending in London, we would get 1800 pages linking to the same set of pages leaving from London.

The impact of rule C is questionable. First, the definition of what “nearby” means is a little fuzzy — is it 30km, 70km or 300km? What if the nearest city is in another country? Should we still display it as a nearby route? What about islands? Additionally, there aren’t always pages created for other cities around a certain location. There are many edge cases to consider, which can get very complex.

Enter Graphs

But wait a minute! We are clearly talking about nodes (locations) being connected (through a mode of transport). It seems that we have a Multi Directed Graph here. Something like:
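For illustration, here is a minimal sketch (in Python, using networkx) of how such a multi-directed graph could be modelled; the routes and transport modes below are made up:

```python
import networkx as nx

# Locations become nodes; pages such as "Trains from Berlin to Paris"
# become directed edges, keyed by transport mode.
G = nx.MultiDiGraph()
G.add_edge("Berlin", "Paris", mode="train")
G.add_edge("Berlin", "Paris", mode="bus")     # parallel edge: same route, different mode
G.add_edge("Paris", "London", mode="train")
G.add_edge("Berlin", "Potsdam", mode="train")

print(G.number_of_nodes(), G.number_of_edges())  # 4 locations, 4 pages
```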

Very good, this is a start. We are clearly dealing with a graph when we look at our websites, and our business is all about connecting locations by some mode of transportation. But still, how can this be useful?

After a deep analysis that we called the French Report, because it used our French domain, we identified a couple of graph indicators that could help us understand the health of our graphs of locations per domain.

Triangle Count — This measures how “strongly” a position is linked to two other positions that are themselves linked. A higher number of triangles means more pages are available to connect a triplet of positions.

Inbound and outbound degree — The number of connections to a given position and the number of connections from a given position. This is relevant because having a high number of triangles but an unbalanced number of in/out connections will cause a page to be isolated and unable to link to others.

All Pairs Shortest Path — This is the shortest path a bot would have to crawl to reach position B starting from position A. If we aim at a certain depth X, this indicator should stay close to that target depth X.
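To make these indicators concrete, here is a rough sketch of how they could be computed with networkx on a small, hypothetical route graph (an illustration only, not our production pipeline):

```python
import networkx as nx

# Hypothetical routes; in practice the graph is built from the published pages.
G = nx.MultiDiGraph([("Berlin", "Paris"), ("Paris", "Berlin"),
                     ("Paris", "London"), ("London", "Berlin"),
                     ("Berlin", "Potsdam")])

# Triangle count: networkx counts triangles on undirected simple graphs,
# so we collapse the multi-digraph first.
triangles = nx.triangles(nx.Graph(G))

# Inbound and outbound degree per location.
in_deg, out_deg = dict(G.in_degree()), dict(G.out_degree())

# All-pairs shortest path lengths, i.e. how many clicks a crawler needs
# to get from one location to another.
depths = dict(nx.all_pairs_shortest_path_length(G))

print(triangles["Berlin"], in_deg["Berlin"], out_deg["Berlin"], depths["Berlin"]["London"])
# -> 1 2 2 2
```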

Super cool, but what to do now? We understood that Internal Linking is a graph problem, our business is all about connected nodes, and we found some interesting indicators that showed the need to improve the connectivity of our nodes (create landing pages that would connect locations that were disconnected before). All this is great, but the biggest challenge remains to be solved: how to connect all the various pages?

Link Pie — Ultimate Internal Linking Algorithm at Omio

From the README of Link Pie, our Internal Linking Algorithm:

It uses Graph Theory to elaborate a multi-step algorithm that, given a domain and a list of published pages in that domain, links all pages in an optimised tree of pages where:

1) it has the minimal possible depth
2) the relevance of pages is preserved
3) there are no pages left unlinked
4) clusters of locations that are geographically isolated — whether accidentally or intentionally — are linked to the biggest possible portion of pages

Item 1) is very interesting because it should answer the question: “Which pages should then be linked from the home page in order to have the minimal amount of clicks to reach any other page?”

It’s tempting to consider using business rules again to define this. After all, the pages linked from the home page — which, by the way, is linked from every other page on your site and is thus a very relevant page — will get a high rank, so better choose wisely. The solution has to be based on the set of pages only, nothing else.

We tried several coefficients, ending up with the most natural one for us, the Betweenness Centrality, called BC here. That is, a node with a high BC is a node that requires the least amount of clicks/steps from it to reach any other node in this graph of pages. Important locations usually get more pages created for them, and this causes those locations to become great hubs for information propagation — namely, a good path for a crawler to walk through.

Check the image below; it represents a hypothetical graph with BC scores attached to it. Notice that we consider the Betweenness Centrality score of an edge to be the sum of the scores of its connecting vertices.

Now, getting back to our question: “Which pages should be linked from the home page?”. The answer is: the pages with the highest centrality coefficient.
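As a sketch of the idea (the graph and numbers are assumptions for illustration), we can compute the node BC with networkx, score each page (edge) as the sum of its endpoints' scores as described above, and pick the top-scoring pages for the home page:

```python
import networkx as nx

G = nx.MultiDiGraph([("Berlin", "Paris"), ("Paris", "Berlin"),
                     ("Paris", "London"), ("London", "Berlin"),
                     ("Berlin", "Potsdam")])

# Node betweenness centrality on the collapsed digraph
# (parallel edges do not change unweighted shortest paths).
bc = nx.betweenness_centrality(nx.DiGraph(G))

# Score a page A -> B as the sum of the scores of its two endpoints.
page_score = {(a, b): bc[a] + bc[b] for a, b, _ in G.edges(keys=True)}

# Pages to link from the home page: the highest-scoring ones.
home_links = sorted(page_score, key=page_score.get, reverse=True)[:2]
print(home_links)  # -> [('Berlin', 'Paris'), ('Paris', 'Berlin')]
```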

But how should pages link among themselves? The rules of the Line Graph of a Directed Graph are simply the natural way for pages (edges) to be connected in the generated graph. So if we simply follow the rules of the Line Graph, we get a resulting graph where pages are naturally connected and all properties of the original graph are preserved. We are good to go, right?

Not really. The problem is that the normal rule of the Line Graph of a Directed Graph causes an explosion of pages to be linked whenever there is a hyper-connected node in sight. Take, for example, this constellation: A → Paris → C. According to the Line Graph, every page ending in Paris needs to point to every page leaving from Paris. Now consider Paris to be a hyper-connected node with more than 1500 in-connections and another 1500 out-connections: simply using the Line Graph to connect our pages wouldn’t be usable at all. This is because we have a limited number of links to show per page; also, the Line Graph ignores the relevancy of pages and would simply link them blindly.
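A quick way to see the explosion is to build a hypothetical hub with k pages coming in and k pages going out, and count the links the plain Line Graph would create; the hub alone produces k * k links:

```python
import networkx as nx

k = 50  # with k = 1500, as in the Paris example, this becomes 2,250,000 links
G = nx.DiGraph()
G.add_edges_from((f"in_{i}", "Paris") for i in range(k))
G.add_edges_from(("Paris", f"out_{i}") for i in range(k))

# In the line graph, every edge ending in Paris connects to every edge leaving Paris.
L = nx.line_graph(G)
print(L.number_of_edges())  # -> 2500
```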

Connecting the Pages

A hyper-connected node is any location that gets a high number of pages mentioning it. How do we solve the connection problem given that we have a limit of 44 links per page and also need to stick to the relevance represented by the Betweenness Centrality of each location?

This is the heart of Link Pie. The way we managed to prevent hyper-connected nodes (like London, New York and Berlin) from skewing the expected outcome was to implement our own “Line Graph” algorithm that takes into account the coefficient of a page and guarantees that the page gets a proportionate number of in-links, while every other page still gets at least one in-link.

Given a page A → B, instead of blindly following the natural rule of the Line Graph of a digraph, we have the constraint of using n links per page. Thus n/2 of them link to pages that also go to position B (? → B), and n/2 of them link to pages leaving from A (A → ?). This gives us a fair amount of links and a nice way to show contextualised journeys to our users. The question mark here is just notation for any unknown location.

In general, the algorithm considers a set of pages S from and to a certain position A, first gives one link to each page in S, and then makes use of quantiles to designate the exact proportion of links a page P in S deserves. That is, if a page P is at percentile p95, then 95% of the pages in this set S will point to page P. This fulfills promises 2) and 3) of the algorithm.
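Here is a rough sketch of the quantile idea, not the actual Link Pie implementation: every page in S gets at least one in-link, and a page's percentile rank within S decides how many pages of S should point to it. The page names and BC scores below are hypothetical:

```python
import numpy as np

def deserved_in_links(pages, bc_scores):
    """For each page P in the set S (here: `pages`), how many pages of S should link to it:
    at least one, and a page at percentile p gets links from roughly p% of S."""
    scores = np.array([bc_scores[p] for p in pages])
    ranks = scores.argsort().argsort() / max(len(pages) - 1, 1)  # percentile rank in [0, 1]
    return {p: max(1, round(float(r) * len(pages))) for p, r in zip(pages, ranks)}

pages = ["Paris->London", "Paris->Lyon", "Paris->Nice", "Paris->Lille"]
bc = {"Paris->London": 0.9, "Paris->Lyon": 0.4, "Paris->Nice": 0.2, "Paris->Lille": 0.1}
print(deserved_in_links(pages, bc))
# -> {'Paris->London': 4, 'Paris->Lyon': 3, 'Paris->Nice': 1, 'Paris->Lille': 1}
```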

Those used to math will realise that we have a sort of matrix here: it starts filled with zeros and gets a 1 in every cell that represents a link to be created between two pages, as long as there is enough space for it.
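A tiny illustration of that view, with hypothetical pages: a zero matrix where setting cell [i][j] to 1 means page i links to page j, guarded by the per-page link budget:

```python
import numpy as np

pages = ["Berlin->Paris", "Paris->London", "London->Berlin"]
links_per_page = 44  # the per-page budget mentioned above

M = np.zeros((len(pages), len(pages)), dtype=int)

def add_link(i, j):
    # Only create the link if page i still has room in its budget.
    if M[i].sum() < links_per_page:
        M[i, j] = 1

add_link(0, 1)  # "Berlin->Paris" links to "Paris->London"
print(M)
```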

The algorithm does this repeatedly for each Strongly Connected Component (SCC) of the graph of locations, and finally attaches each SCC to the main graph by connecting the pages with the biggest BC in that SCC to pages in the main graph that still have enough space, respecting n. This fulfills promises 3) and 4) of the algorithm.
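As a rough illustration of the SCC step (again, not the production logic): find the strongly connected components with networkx and attach each secondary component to the main one through its highest-BC node, picking an attachment point that still has link budget left (here just a placeholder):

```python
import networkx as nx

# Hypothetical graph with an isolated cluster (say, an island) forming its own SCC.
G = nx.DiGraph([("Berlin", "Paris"), ("Paris", "Berlin"),
                ("Palma", "Ibiza"), ("Ibiza", "Palma")])

bc = nx.betweenness_centrality(G)
components = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
main, others = components[0], components[1:]

for scc in others:
    best = max(scc, key=bc.get)        # highest-BC node of the isolated component
    target = next(iter(main))          # placeholder for "a page that still has space"
    G.add_edge(target, best)
    G.add_edge(best, target)

print(nx.number_strongly_connected_components(G))  # -> 1
```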

Conclusion

The full implementation of Link Pie considers many more aspects of our business, like the fact that there are several different types of pages, several modes of transport, and rules for which type of page should link to which type of page, forming what we call Structural Linking. The algorithm is also pretty flexible and supports boosting certain pages so that we can still influence its output when needed.

With the best user experience as the guiding principle, a good SEO strategy encompasses an unimaginable amount of work and signals to watch. This post was intended to give a brief overview of how we tackled this core pillar of our SEO strategy by combining several Graph Theory concepts to produce a smooth browsing experience for our users, who can now find the best possible related pages when they visit us.

Search Engines “see” your pages through the lens of a huge Graph Processor. When they apply algorithms like PageRank, these three dimensions — the number of in-links, the quality of in-links and the distance from the home page — are what matter in how your pages rank (or don’t). Remember, your site is a subgraph of the whole internet; by presenting a “healthier” graph to users and bots, your chances of better ranking grow immediately.

With our domains now reaching ~99% of pages connected and the majority of pages reachable within depth 5, Link Pie helped us achieve the precise result we wanted, and it is rewarding to see our numbers improving dramatically in all the SEO tools we use. What is especially rewarding, though, is to see the search engines reacting positively to this complex effort.

If you liked this article and want to work on similar challenges at scale, why not consider joining forces :)
