Search Engine Optimization: How Site Architecture Affects SEO

Let’s say we have just created an amazing web application that solves a problem faced by many. The website is set up and the servers are running; all that is left is for potential users to visit. However, weeks pass and hits on the website remain low. The problem, we find, is that despite our amazing solution, no one can find us. In the overflowing abundance of alternatives on the Internet, our solution is simply swept away, drowned under the massive amount of information. Given this scenario, how can we increase the visibility of our application? This is where Search Engine Optimization comes in.

What is Search Engine Optimization?

Search engine optimization (SEO) is the process of improving a website’s position in a search engine’s unpaid (organic) results.

A search engine identifies many factors in a web page and uses an algorithm to determine the ranking of a page with regard to a user’s query. The actual implementation of the algorithm is usually unknown and changes from day to day, but there is a core set of factors that are revealed by search engines or accepted by the SEO community to be constant.

These factors can be split into three groups:

  1. On-The-Page SEO
  2. Off-The-Page SEO
  3. Violations

On-The-Page SEO

This refers to factors that are almost directly under a creator’s control, including the actual website content, the site architecture, and even the HTML tags of the created pages.

Off-The-Page SEO

The factors in this group are usually not directly under a creator’s control. They include the trust and authority the site holds, the links (which contribute to trust and authority) pointing to the site, and how the social media community views the site.

Violations

Unlike the previous two groups, which identify positive factors, this group identifies factors to which search engines attribute a negative score. Techniques that fall under this category tend to abuse how search engines determine a site’s ranking and are thus identified by search engines as “spam” or “black hat”. In the worst case, employing such techniques can cause a site to be banned from a search engine’s results.


Given the depth and complexity of SEO, I will only be exploring how the architecture of a site can improve its visibility in a search engine’s results.

Crawlability

For a site to be displayed to users in the results of a search engine, the site first has to be crawled by the search engine’s web crawler. Crawling, in this context, is the process of downloading the pages visited by the web crawler and then indexing these downloaded pages so the search engine can retrieve them efficiently for users.

Given the large number of sites on the Internet, each site is given a crawl budget, measured either in time or in the number of pages to be crawled. Since the amount of resources available is limited, we would like to maximize the budget allocated to our site and guide the web crawler towards crawling the right pages.

Use of robots.txt

The robots exclusion protocol (REP), or robots.txt, is a text file that site creators typically place in the root of the domain (e.g. www.example.com/robots.txt) to tell web crawlers which parts of the site not to crawl.

This is useful for saving crawl budget, by directing web crawlers away from directories unrelated to site content, such as /tmp/. Use of robots.txt can also prevent crawling of duplicate content, which will be explored in more detail later.
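As a minimal, illustrative sketch (the paths are placeholders, not recommendations for any specific site), a robots.txt that keeps crawlers out of non-content directories could look like this:

  User-agent: *
  Disallow: /tmp/
  Disallow: /scripts/

The wildcard User-agent line applies the rules to all crawlers, and each Disallow line blocks crawling of the given path prefix.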

To exclude individual pages from web crawler indexing, consider using <meta name="robots" content="noindex"> in the <head> section of the HTML pages instead of robots.txt. The meta tag allows the pages themselves to be crawled but not indexed, which lets these pages pass on their link benefits. With robots.txt, the pages are not crawled but could remain indexed, which prevents the passing on of their link benefits.
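For example, a page that should be crawled but kept out of the index (the page and its title here are hypothetical) might include the tag like this:

  <head>
    <title>Internal search results</title>
    <meta name="robots" content="noindex">
  </head>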

The syntax for robots.txt can be found here.

Sitemaps

In contrast to robots.txt and the robots meta tag, a sitemap is a way to tell web crawlers what to crawl. A sitemap is a list of the URLs of the individual pages of a site. It can be in HTML or XML form, though XML sitemaps are said to be the preferred means of data digestion by search engines.

Sitemaps are useful when pages are hidden from certain users of the site. By including their URLs in our sitemap, we let web crawlers know of these pages and crawl them.

Using XML sitemaps, we can also indicate to the web crawler the priority or hierarchy of site content, alongside information on when each page was last updated. Similarly, image, video, mobile and news XML information can be included in the XML sitemap. All of this information can be used by the search engine’s algorithm to generate our site’s ranking.
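To make this concrete, here is a minimal sketch of an XML sitemap following the sitemaps.org protocol (the URLs, dates and priorities are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.example.com/</loc>
      <lastmod>2017-06-01</lastmod>
      <changefreq>weekly</changefreq>
      <priority>1.0</priority>
    </url>
    <url>
      <loc>https://www.example.com/about</loc>
      <lastmod>2017-03-15</lastmod>
      <priority>0.5</priority>
    </url>
  </urlset>

The optional lastmod, changefreq and priority elements carry the freshness and hierarchy hints mentioned above.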

Sitemaps can be generated using free tools such as XML-Sitemaps, which normally come with a limit on the number of pages that can be indexed, or by using paid tools such as Sitemap Writer Pro.

Once the sitemap is generated, the XML file is ready to be placed in the root of the domain. The file can also be submitted to Google and Bing using Google Webmaster Tools and Bing Webmaster Tools respectively. Google Webmaster Tools also offers a feature that lets us test sitemap files prior to uploading them to Google. This is a good way to check for errors in sitemaps and to identify clashing instructions between sitemaps and robots.txt.
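In addition to submitting the sitemap through the webmaster tools, crawlers can also be pointed at it directly from robots.txt with a single line (the URL is a placeholder):

  Sitemap: https://www.example.com/sitemap.xml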

Canonicalization

When creating a site, various versions of the same page can arise due to:

  1. www and non-www versions of the site
  2. Allowing of search engines to index paginated pages
  3. Filtering parameters appended to URL
  4. Tracking code for analytics
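
For instance (all URLs here are placeholders), the following addresses could end up serving exactly the same content:

  http://example.com/shoes
  http://www.example.com/shoes
  http://www.example.com/shoes?sort=price
  http://www.example.com/shoes?utm_source=newsletter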

This creates a problem where search engines are unsure which indexed page is the correct one to return. Another problem arises when other people link to these different versions of the same page: the trust and authority that would be attributed to a single page is split up between the different versions, which lowers the actual score of the page.

In order to solve this problem, we can use canonical URLs. They are a way to inform search engines that multiple versions of a page actually refer to the same one.

301 Redirects

Perhaps the best way to have canonical URLs is to implement server-side 301 redirects. This ensures that both users and search engines are directed to the correct page, which lets only the canonical one be indexed and ranked.

However, beware of incorrect implementations of 301 redirects, which can slow down crawling due to redirect chains, or even create infinite redirect loops.
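As an example of what a server-side 301 redirect can look like, here is a minimal sketch assuming an nginx server (the domain names are placeholders; Apache rules or redirects in application code work just as well):

  # Permanently redirect the non-www host to the canonical www host
  server {
      listen 80;
      server_name example.com;
      return 301 http://www.example.com$request_uri;
  }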

Use the rel="canonical" link element

Given a canonical (preferred) URL (e.g. https://example.com/preferred-url) for a set of pages that are the same, we can put:

<link rel="canonical" href="https://example.com/preferred-url" />

in the <head> section of all pages in that set. This tells the search engine which URL is the canonical one for that particular set of addresses, which can help the canonical URL be the one displayed in the search results.

Sitemaps

Sitemaps can also be used as an indicator that a URL is the canonical URL. List only the URLs that you wish to be canonical in the sitemap.

Google Webmaster Tool: Parameter Handling

When URLs contain optional parameters, such as filtering parameters or tracking codes for analytics, we can use Google Webmaster Tools to tell Google which parameters to ignore. This could potentially increase crawl speed compared to the other methods. However, it only applies to Google’s search engine.

Google Webmaster Tool: Set Preferred Domain

As mentioned above, www and non-www versions of the site can be treated as separate pages.

For Google, by setting our preferred domain to either the www or the non-www version, Google will treat links to the www version exactly the same as links to the non-www version.

Site Speed

Another aspect that could improve the rank of the site is its speed. While not a major factor, every little bit helps. This means using a Content Delivery Network, minifying JavaScript files, bundling them, and ensuring that the algorithms in both the JavaScript and the backend logic have acceptable run times.
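As one small, hedged illustration of the minification step (assuming the UglifyJS tool is installed; the file names are placeholders):

  # Produce a minified, mangled copy of the bundled script
  uglifyjs bundle.js --compress --mangle -o bundle.min.js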

HTTPS / Secure Site

Google also rewards sites that use HTTPS with a small ranking boost, as part of an effort to have the whole web running securely. As with site speed, running HTTPS only offers a small boost.


In conclusion, SEO remains a field with many uncertainties, due to the lack of complete transparency around the algorithms used by search engines. The factors mentioned above are but a subset of all the variables, and can only influence search results to a certain extent.

However, the SEO community has been trying to identify the factors that could affect search rankings. Every two years, hundreds of well-regarded SEOs are asked to determine the importance of specific ranking factors. The survey results can be seen here.
