WHAT IS THE USE OF ROBOTS.TXT?

Suhailshaik
Oct 14, 2020


ROBOTS.TXT

What does robots.txt mean?
A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.

Why is robots.txt important?
Robots.txt is the method that lets webmasters instruct search engines to visit specific pages or directories on a website. Webmasters also have the freedom to allow specific bots to crawl only selected pages of a site.
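
For example, a site owner could allow every crawler in general while keeping one named bot out of a single directory. The folder below is only an illustrative placeholder:

User-agent: *
Disallow:

User-agent: Googlebot-Image
Disallow: /photos/

Here the first group lets all robots crawl everything, while the second group tells Googlebot-Image (Google’s image crawler) to stay out of the “photos” folder.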

Why should you learn about robots.txt?
Improper usage of the robots.txt file can hurt your ranking.
The robots.txt file controls how search engine spiders see and interact with your webpages.
This file is mentioned in several of the Google guidelines.
This file, and the bots it interacts with, are fundamental parts of how search engines work.

Where should robots.txt be located?
The robots.txt file must be located at the root of the website host to which it applies. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt.
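
As a quick sketch of that scope rule (example.com is a placeholder domain):

http://www.example.com/robots.txt applies to http://www.example.com/ and everything below it.
A file at http://www.example.com/folder/robots.txt is not a valid location; crawlers will not look for it there.
A subdomain such as http://blog.example.com/ needs its own file at http://blog.example.com/robots.txt.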

How do I block pages in robots.txt?
Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The line “User-agent: *” means the section applies to all robots, and “Disallow: /” tells the robot that it should not visit any pages on the site (see the examples below).

What will Disallow in robots.txt do?
A page that is disallowed in robots.txt might still be indexed without being crawled, and the robots.txt file itself can be viewed by anyone, potentially disclosing the location of your private content. To block everything, disallow crawling of the entire site (see the “Block all access” example below).

What is the size limit of a robots.txt file?
Your robots.txt file must be smaller than 500 KB. John Mueller of Google reminded webmasters via Google+ that Google can only process up to 500 KB of your robots.txt file.

How does robots.txt work?
Search engines have two main jobs:
Crawling the web to discover content;
Indexing that content so that it can be served up to searchers who are looking for information.

Why is robots.txt used?
To crawl sites, search engines follow links to get from one site to another, ultimately crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.” After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file before continuing through the site. Because the robots.txt file contains information about how the search engine should crawl, the information found there instructs further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file at all), the crawler will proceed to crawl the rest of the site.
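
To get a feel for this check, here is a minimal sketch using Python’s standard-library urllib.robotparser; the domain and the “ExampleBot” user-agent name are placeholders, not part of any real crawler:

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file, much as a crawler does before spidering

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("ExampleBot", "https://www.example.com/photos/mycar.jpg"))
print(rp.can_fetch("*", "https://www.example.com/"))

If the site serves no robots.txt at all, read() treats the 404 as permission to crawl everything, which matches the behavior described above.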

What are the priorities for your website?
There are three important things that any webmaster should do when it comes to the robots.txt file.
Determine whether you have a robots.txt file (a quick way to check is sketched after this list)
If you have one, make sure it is not harming your ranking or blocking content you don’t want blocked
Determine whether you actually need a robots.txt file
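
A rough way to do the first check is to request the file directly. The sketch below uses Python’s standard library and a placeholder domain; swap in your own site:

import urllib.error
import urllib.request

url = "https://www.example.com/robots.txt"  # placeholder; use your own domain

try:
    with urllib.request.urlopen(url) as response:
        body = response.read()
        print(f"robots.txt found (HTTP {response.status})")
        # Google only processes the first 500 KB, so keep an eye on the size
        print(f"size: {len(body)} bytes")
except urllib.error.HTTPError as err:
    print(f"no robots.txt served (HTTP {err.code})")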

Basic robots.txt examples
Here are some common robots.txt setups.

Allow full access

User-agent: *
Disallow:

Block all access

User-agent: *
Disallow: /

Block one folder

User-agent: *
Disallow: /folder/

Block one file

User-agent: *
Disallow: /file.html

How to test your robots.txt file?
1. Open the tester tool for your site and scroll through the robots.txt code to locate the highlighted syntax warnings and logic errors. The number of syntax warnings and logic errors is shown immediately below the editor.
2. Type the URL of a page on your site into the text box at the bottom of the page.
3. Select the user-agent you want to simulate in the dropdown list to the right of the text box.
4. Click the TEST button to test access.
5. Check whether the TEST button now reads ACCEPTED or BLOCKED to find out if the URL you entered is blocked from Google web crawlers.
6. Edit the file on the page and retest as necessary. Note that changes made on the page are not saved to your site! See the next step.
7. Copy your changes to your robots.txt file on your site. This tool does not make changes to the actual file on your site; it only tests against the copy hosted in the tool.
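
If you just want a rough local sanity check outside the tester tool, you can also feed a copy of your rules to Python’s standard-library urllib.robotparser. The rules, URL, and user-agent names below are placeholders only:

from urllib.robotparser import RobotFileParser

# A local copy of the rules to experiment with (placeholder content)
rules = """User-agent: *
Disallow: /photos/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the lines directly instead of fetching them from a site

# Simulate different user-agents against one URL, similar to step 3 above
for agent in ("Googlebot", "Bingbot", "SomeOtherBot"):
    allowed = rp.can_fetch(agent, "https://www.example.com/photos/mycar.jpg")
    print(agent, "->", "allowed" if allowed else "blocked")

With the single Disallow rule above, all three simulated agents come back blocked; the official tester remains the authoritative check for how Googlebot itself interprets your file.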

What is Google Webmaster Tools?
Google Webmaster Tools is a free service that helps you evaluate and maintain your website’s performance in search results (1). Offered as a free service to anyone who owns a website, Google Webmaster Tools (GWT) is a conduit of information from the largest search engine in the world to you, offering insights into how it sees your website and helping you uncover issues that need fixing. You do not need to use GWT for your website to appear in search results, but it can offer you valuable information that can help with your marketing efforts.

What is the use of Google Webmaster?
GWT is offered as a free service to anyone who owns a website. It is a conduit of information from the largest search engine in the world to you, offering insights into how it sees your website and helping you uncover issues that need fixing.

Why use Webmaster Tools?
One of the tool’s top applications is that it allows webmasters to make sure that their websites and pages are crawled and processed for Google indexing. Error reports enable them to discover issues that might prevent their site from doing well in Google search. Webmaster Tools also comes with a set of search tools that give data on which keywords are ranking on Google and which domains are linking to the given website.

Googlebot specific instructions

The robot that Google uses to index their search engine is called Googlebot. It understands a few more instructions than other robots.

In addition to “User-agent” and “Disallow”, Googlebot also uses the “Allow” instruction.

Allow

Allow:

The “Allow:” instruction lets you tell a robot that it is okay to see a file in a folder that has been “Disallowed” by other instructions. To illustrate this, let’s say you want to tell robots not to visit or index your photos. You put all the photos into one folder called “photos” and create a robots.txt file that looks like this…

User-agent: *
Disallow: /photos

Now let’s say there is a photo called mycar.jpg in that folder that you want Googlebot to index. With the “Allow:” instruction, we can tell Googlebot to do so. It would look like this…

User-agent: *
Disallow: /photos
Allow: /photos/mycar.jpg

This would tell Googlebot that it can visit “mycar.jpg” in the photos folder, even though the “photos” folder is otherwise excluded.
