CHAPTER 2: SEO Crawling, Indexing, and Ranking

Faraaz Usmani · BlogsCord · May 29, 2020

The basic requirement for your website to show up in SERPs (Search Engine Results Pages) is that your content is visible to the search engine, i.e. the search engine must be able to find your content on the web, wherever it may be.

Search engine Functions

There are three main functions of a search engine:

1. Crawl: Scan the internet thoroughly for content, looking at the code/content of every URL the crawler comes across.

2. Index: Organize and store the content found during crawling. Once content is indexed, it is just one step away from being displayed in the SERPs.

3. Rank: Select the pieces of content that best answer a query, ordering the results from most relevant to least relevant.

Search engine Crawling

Scanning the internet for content is a lengthy and tiresome process, even for a machine. So how do search engines do it? They send out a team of ‘robots’ (known as crawlers or spiders) to find new and updated content. Content here can be anything: a webpage, an image, a video, a PDF, etc. The format does not matter to these robots; they find it all by following links.

Googlebot starts its crawling process by fetching a few web pages, then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler discovers new content and adds it to Caffeine, Google's massive index of discovered URLs, to be retrieved later when a searcher's query is a good match for the content at that URL.
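The link-hopping process described above can be sketched as a breadth-first traversal over a link graph. This is only a minimal illustration, not Googlebot's actual implementation; `get_links` here stands in for fetching a page and extracting its outgoing links, and the URLs are invented:

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """Discover URLs breadth-first by following links from each page."""
    seen = {start_url}          # URLs already discovered
    queue = deque([start_url])  # frontier of URLs still to visit
    discovered = []
    while queue and len(discovered) < max_pages:
        url = queue.popleft()
        discovered.append(url)  # this page is now "known" to the crawler
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return discovered

# Toy link graph standing in for the web (hypothetical URLs)
link_graph = {
    "https://a.example": ["https://b.example", "https://c.example"],
    "https://b.example": ["https://c.example"],
    "https://c.example": [],
}
print(crawl("https://a.example", lambda url: link_graph.get(url, [])))
```

Note that a page with no inbound links in the graph is never reached, which is exactly why unlinked pages stay invisible to crawlers.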

Search engine Indexing

Just as a book's index tells the reader what content exists and where to find it, search engines store the information they discover in an index: a huge database of all the content they have found and judged good enough to show as a result to a searcher.
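The book-index analogy can be made concrete with a toy inverted index, the core data structure behind a search engine's index: each word maps to the pages that contain it. The page URLs and texts below are invented for illustration:

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "/shoes": "buy running shoes online",
    "/hats": "buy summer hats online",
}
index = build_index(pages)
print(sorted(index["buy"]))    # pages containing "buy"
print(sorted(index["shoes"]))  # pages containing "shoes"
```

A lookup is then a cheap dictionary access rather than a scan of every page, which is what makes answering a query fast.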

Search engine Ranking

When a search is performed, the search engine looks through its index for the content most relevant to the query. This process of ordering search results by relevance is called ranking. The higher a piece of content is ranked, the more relevant the search engine believes it to be to the query.
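As a toy illustration of ranking, pages can be scored against a query and sorted from most to least relevant. Real search engines weigh hundreds of signals; this sketch uses plain term frequency, and the page texts are invented:

```python
def rank(query, pages):
    """Order page URLs by a naive relevance score: how often query terms appear."""
    terms = query.lower().split()

    def score(text):
        words = text.lower().split()
        return sum(words.count(term) for term in terms)

    # Highest-scoring (most relevant) pages first
    return sorted(pages, key=lambda url: score(pages[url]), reverse=True)

pages = {
    "/a": "running shoes and more running shoes",
    "/b": "shoes for sale",
    "/c": "gardening tools",
}
print(rank("running shoes", pages))  # ['/a', '/b', '/c']
```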

Google and other search engines for SEO

It is no secret that the SEO community has a special fondness for Google. Even though there are around 30 major search engines available to us, the SEO community prefers Google over any other. But why? Because Google handles the most searches across the globe: roughly 90% of web searches go through Google, far more than Bing and Yahoo combined.

Crawling

Making sure your site gets crawled and indexed is a prerequisite to showing up in the SERPs. For example, if you want to check how many pages of your website (or any website) are in the index, search for “site:yourdomain.com” on Google. This returns the pages that are indexed for that particular domain, along with an approximate count.

If your searched site is not showing up anywhere in the search results, there are a few possible reasons why:

· Your site is brand new and hasn’t been crawled yet.

· Your site isn’t linked to from any external websites.

· Your site’s navigation makes it hard for a robot to crawl it effectively.

· Your site contains some basic code called crawler directives that are blocking search engines.

· Your site has been penalized by Google for spammy tactics.

Optimizing how your website is crawled becomes necessary when pages that are important to your site are not being crawled and indexed by Googlebot, or when unnecessary pages are crawled and indexed that you would rather not show up in the SERPs. In such cases, you can guide Googlebot, telling it which pages you want crawled and which pages to keep away from, via Google Search Console. You can sign up for a free Google Search Console account and unlock various optimization techniques.

Robots.txt

Robots.txt files are located in the root directory of a website (e.g. yourdomain.com/robots.txt) and suggest, via specific directives, which parts of your site search engines should and shouldn’t crawl, as well as the speed at which they crawl it.
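You can check how a given robots.txt file would be interpreted using Python's standard-library `urllib.robotparser`. The rules below are a hypothetical example, not any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: a default group and a Googlebot-specific group
robots_txt = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler obeys its own group when one exists, so for Googlebot
# only /private/ is off-limits here
print(parser.can_fetch("Googlebot", "https://example.com/products/"))     # True
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # False
```

Crawlers without their own group fall back to the `User-agent: *` rules, so `can_fetch("SomeOtherBot", ".../admin/page")` would return False here.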

Some important GSC tactics

While searching for an item on any e-commerce site, the URL keeps changing as you apply filters to your search, e.g. www.xyz.com/products/men/shoes. The URL keeps updating with your search, and Google does a brilliant job of identifying which of these URLs to include and which to omit. You can guide these decisions yourself using GSC.
If you use this feature to tell Googlebot “crawl no URLs with _____ parameter”, you are instructing Googlebot to exclude URLs carrying that parameter, keeping them out of the search results. You can use the same feature to make sure a specific URL is included as well.
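The idea behind parameter handling can be sketched locally: decide whether a URL is worth crawling based on which query parameters it carries. The parameter names below are invented examples of filters you might want Googlebot to skip:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical parameters whose URLs we would exclude from crawling
EXCLUDED_PARAMS = {"sessionid", "sort"}

def should_crawl(url):
    """Skip URLs whose query string carries an excluded parameter."""
    params = parse_qs(urlparse(url).query)
    return not (EXCLUDED_PARAMS & params.keys())

print(should_crawl("https://www.xyz.com/products/men/shoes"))             # True
print(should_crawl("https://www.xyz.com/products/men/shoes?sort=price"))  # False
```

This keeps near-duplicate filtered views from competing with the canonical page, which is the same goal the GSC feature serves.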

Important Information about Crawling

If your website lets users access certain parts only after they log in, fill out forms, or answer surveys, then search engines won’t see those protected pages. (A crawler won’t log in on behalf of any human.)

Robots such as Googlebot do not use search forms. If content on your site can only be reached through a search box, no search engine will crawl it.

Text hidden inside images or GIFs has little chance of being indexed by the robots. Even though robots are getting better at image recognition, there is still no guarantee that text hidden behind an image will get indexed. It is always best to add text within the HTML markup of your webpage.

Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your site to guide it from page to page. If you’ve got a page you want search engines to find but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.

When you explore Google Search Console’s crawl errors report, you will come across various errors, such as “not found” errors and server errors.

The “not found” (404) error is most commonly encountered when the syntax of a URL is messed up: a poorly typed URL may end up giving the user a not found error.
Server errors are not technically your site’s fault; the server on which your site is hosted fails to provide the search engine with the desired response.

You have the freedom to implement a 301 (permanent) redirect for your site or any particular page, but make sure not to create long redirect chains: Googlebot may fail to reach your page if it has to follow too many redirects in a row.
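The risk of long chains can be illustrated with a sketch that follows redirect targets up to a hop limit and gives up beyond it, roughly as a crawler would. The URL-to-target mapping below is invented:

```python
def resolve(url, redirects, max_hops=5):
    """Follow redirect targets until a final URL, or give up after max_hops."""
    hops = 0
    while url in redirects:
        if hops >= max_hops:
            return None  # chain too long: a crawler may abandon the URL
        url = redirects[url]
        hops += 1
    return url

# Hypothetical two-hop chain: old page -> interim page -> new page
redirects = {
    "/old-page": "/interim-page",
    "/interim-page": "/new-page",
}
print(resolve("/old-page", redirects))  # /new-page
```

Collapsing each old URL so it points straight at the final destination keeps every chain to a single hop.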

Ranking by search engines

Using different algorithms and processes, the search engines go through your content, then segregate and rank it according to its relevance to a particular search.
