Robots.txt — Understanding Web-Crawling Restrictions

Prakhar Chaube
Whatfix Engineering Blog
4 min read · Sep 23, 2020

Ever wondered how Google finds all those “About 14,20,00,000 results (0.57 seconds)”?

In simple terms, to achieve this Google uses a technique called Web-Crawling: it indexes web pages across the Internet and stores them in humongous data centres spread across the globe; then, through its optimised search algorithm, it retrieves the most relevant web pages based on your search. Of course, it is not that simple.

Web-Crawling is how all search engines gather their content. To do so they use their own version of something called a “Crawler” or “Spider” or “Spiderbot”. A Crawler is an internet bot that systematically browses the Internet and indexes its content. The largest known crawler today, excluding the crawlers available on the deep web, is the Googlebot used by Google Inc. But how does a Crawler know what to crawl? Can a website restrict crawling?

This is where the Robots Exclusion Protocol, commonly known as “robots.txt”, comes into the picture.

Figure: Robots.txt — Author: Seobility — License: CC BY-SA 4.0

What is robots.txt?

As the name suggests, it is a text file, usually placed at the top level of the web server. It contains a series of instructions for Crawlers, defining which portions of the website are open to Web-Crawling.

Before we move forward with the details, try hitting the link below to check out robots.txt for Netflix, a popular binge-content provider.

https://www.netflix.com/robots.txt

You must have noticed that the text file has two prominent keywords, User-agent and Disallow. Together they form a group that can be used to define general as well as Crawler-specific instructions for Web-Crawling. Robots.txt can have multiple such groups, defining extensive and intricate rules.

User-agent: Refers to the specific Crawler that is trying to access the website. Example: Googlebot, Facebook’s crawler, etc. If a rule applies to all crawlers, the User-agent value is * (asterisk). Syntax — User-agent: {crawler_name}

Disallow: Refers to the relative paths that are not permitted for crawling. If a website blocks all Web-Crawling, the value of Disallow is / (a forward slash denoting the root directory). Sometimes a company might want to expose certain web pages to Crawlers, in which case it can use the keyword Allow and specify the path. Syntax — Disallow/Allow: {path}
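
Putting the two together, an illustrative group in a robots.txt file could look like this (the paths here are made up for the example):

    User-agent: *
    Disallow: /private/
    Allow: /private/press-kit/

This tells every crawler to stay out of /private/, while still permitting the /private/press-kit/ page.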

Sitemap: Along with the above, robots.txt usually contains another field called Sitemap. This field gives the URL where the sitemap for the website can be found. What is a Sitemap, you ask? Well, Sitemaps are an easy way for webmasters to tell search engines what needs to be crawled on their website. It is a blueprint of how your website is laid out, with additional information, in the form of an XML file.
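
In robots.txt this usually appears as a single line such as Sitemap: https://www.example.com/sitemap.xml (example.com is a placeholder here), and the referenced file is a rough sketch along these lines:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/about</loc>
        <lastmod>2020-09-01</lastmod>
      </url>
    </urlset>

Each <url> entry lists a page that the webmaster wants search engines to know about, optionally with metadata such as when it was last modified.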

Figure: Sample robots.txt file for stackoverflow.com

Rules for Making these Rules

  1. robots.txt should be a UTF-8 text file.
  2. There should be at least one group of rules in robots.txt.
  3. Each group consists of multiple instructions, with one directive per line.
  4. Groups are processed from top to bottom, and a user agent can match only one rule set, which is the first, most specific group that matches that user agent.
  5. Rules are case-sensitive. For instance, “Disallow: /file.html” will not block File.html or FILE.html from getting crawled (see the example below).
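
To make rules 4 and 5 concrete, here is an illustrative file (the crawler name and paths are invented for this example):

    User-agent: examplebot
    Disallow: /drafts/

    User-agent: *
    Disallow: /admin/

A crawler identifying itself as examplebot follows only the first group, the most specific one that matches it, so for that crawler only /drafts/ is off limits; every other crawler follows the * group instead. And since paths are case-sensitive, /Drafts/ would not be blocked by the first rule.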

The Whatfix experience

One might wonder why this feature is needed at all. Well, here is an example: imagine you once read your company’s article that stated, “Median age of India is 28 years”, but now you don’t remember the title or the link to the article. Is it lost forever? No. Whatfix provides its users with the option of crawling and indexing their content, which then becomes searchable. Meaning, if you searched “Median age of India is 28 years”, your results would show links to all articles that contain that very sentence. Saves your day, doesn’t it?

But this isn’t always simple, and robots.txt plays a very vital role in this regard. We found that numerous company websites either disallow Web-Crawling for all Crawlers or allow access only to very specific ones. In both cases, this search functionality fails to work unless the Whatfix Crawler is allowed in the website’s robots.txt.

Jurisdiction of robots.txt

Does robots.txt really restrict Web-Crawlers? Well, theoretically, yes! But sadly not so much in practice. The Robots Exclusion Protocol is purely advisory and is not legally binding. In fact, spam Crawlers routinely ignore robots.txt to obtain information. However, if an organization discovers that a certain Crawler is disrespecting robots.txt, it may block the Web-Crawling requests from the source IP address irrespective of the Crawler’s intent. This is why it is always better to program a bot to obey robots.txt.
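
As a small sketch of what obeying robots.txt can look like in practice, Python’s standard library ships a parser for the protocol; the crawler name and URLs below are placeholders:

    from urllib import robotparser

    # Download and parse the site's robots.txt.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # Ask whether our (hypothetical) crawler may fetch a page
    # before actually requesting it.
    user_agent = "examplebot"
    url = "https://www.example.com/private/report.html"

    if rp.can_fetch(user_agent, url):
        print("Allowed to crawl:", url)
    else:
        print("robots.txt disallows:", url)

A well-behaved bot runs a check like this before every request and simply skips the URLs it is not allowed to fetch.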

Other Measures to Restrict Web-Crawling

One might ask: if the Robots Exclusion Protocol is advisory, then what are other effective measures to restrict crawling? Well, the best and most effective way is adding advanced firewall rules to control access to the server. If sensitive information is held within a website, use a Virtual Private Network (VPN) or put the website behind some form of authentication to limit access. As far as the Whatfix Crawler is concerned, we surely respect and obey the Robots Exclusion Protocol.
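
To make the server-level idea a bit more concrete: one common measure is to reject requests whose User-Agent header matches a crawler that keeps ignoring robots.txt. A rough nginx sketch, with a made-up bot name, could look like this:

    # Inside the relevant server block: turn away a misbehaving bot.
    if ($http_user_agent ~* "badbot") {
        return 403;
    }

Since a User-Agent header can be spoofed, this is usually combined with the IP-address blocking and firewall rules mentioned earlier.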
