How do we protect our site from bad bots?

Siva Prakash · MishiPay
5 min read · May 20, 2021

What is a crawler?

A crawler, also known as a bot, is one of the most important visitors to a website. Once you create a website, you generally share it with your friends, family, stakeholders and related circles, and those people become familiar with it. In the same way, some crawlers belong to market giants, and they help familiarise the world with your website by introducing it to various search engines.

If no crawlers crawl your website, or if you have manually blocked crawlers from crawling it, then you may not get any impressions from sources other than direct traffic.

How does it work?

Every search engine has its own bot to crawl websites (some of them share bots). Each bot first checks your robots.txt file before entering the site. Based on the rules there, it determines whether it has permission to crawl. If the crawler is permitted, it will enter your website and start crawling. If not, it will move on to the next site.

Webmasters have full control over compliant crawlers. In robots.txt, we can specify exactly which directories may or may not be crawled. For example, if you have copyrighted content in a promotional section of your website, it is better to ban crawlers from crawling that particular section.

Here is how it is done:

# To block only Googlebot (Google's crawler)

User-agent: Googlebot
Disallow: /promo

# To block all crawlers

User-agent: *
Disallow: /promo

This lets bots crawl the rest of your website, but not the /promo section.

How is MishiPay handling crawlers?

We, at MishiPay, are very clear about what should and should not be allowed to crawl. Therefore, we have allowed the major bots and blocked a few bad ones.

  • We make use of SEMrush site audits, so we have allowed the SEMrush bot to crawl our website (we allow and disallow it based on need)
  • We monitor Alexa rank and competitor analysis, so we have allowed the Alexa crawler
  • Social media bots like Facebook, Twitter and LinkedIn are allowed
  • To begin with, we explicitly listed all allowed bots and all blocked bots
  • Our sitemap is declared in XML format with a complete HTTPS URL. Thanks to this, we usually do not get bot traffic, and if we still get some, we have multiple ways to block it
  • We do not have Google AdSense or any other monetisation platform, but we are still very conscious about our site health and analytics data (if you have such monetisation platforms, you should be very careful about how you handle bots)
  • If we need a new crawler on the site, we allow that bot, let it do its job once, and then remove it again. This is because we do not want our competitors to crawl our website for their analysis.
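The allow/block approach above can be sketched in robots.txt roughly like this (the bot names, paths and sitemap URL are illustrative examples, not our actual configuration):

```txt
# Allow a specific audit bot (an empty Disallow permits everything)
User-agent: SemrushBot
Disallow:

# Allow the Alexa crawler
User-agent: ia_archiver
Disallow:

# Block every other bot
User-agent: *
Disallow: /

# Sitemap declared with a complete HTTPS URL
Sitemap: https://www.example.com/sitemap.xml
```

Note that compliant bots pick the most specific User-agent group that matches them, so the named bots above ignore the catch-all `*` block.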

What if we fail to protect against bad bots?

It’s simple: if we fail to handle these bad bots, our business or blog is at risk, because outsiders can scrape or monitor the site at will. Do you truly want that?

If our opponents or some opposing party fires hundreds of requests at our domain at the same time, the website will be overloaded and the site will immediately go down. Furthermore, this kind of traffic is so detrimental to SEO that it results in a 100% bounce rate and a session duration of 0 seconds. Once the Google bot picks up these stats, it drives our site back down the rankings in terms of authority, scores, and so on.

How did MishiPay handle bad bot traffic?

To be honest, we have had bad bot traffic only once: 120+ real-time visitors from a single location, spread across many pages and posts. That is really dangerous for both the analytics data and the server, so we took urgent steps to fix it. How did we keep it under control?

Initially, we modified our robots.txt file to allow just a few bots; all other bots are still banned.

Then we found the traffic source and blocked it in the .htaccess file, as seen below.

# To restrict access to a specific bot

RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://.*trafficreviews\.club/ [NC]
RewriteRule ^(.*)$ - [F,L]


# If we don't know the source, then block common bad bots

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(agent1|Cheesebot|Catall\ Spider).*$ [NC]
RewriteRule .* - [F,L]

Because these bad bots identify themselves as agent1, Cheesebot, or Catall Spider, this rule eliminates all of them.
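The same user-agent match can be sanity-checked outside Apache. Here is a minimal Python sketch (the sample user-agent strings are made up); the [NC] flag in the rule above corresponds to re.IGNORECASE:

```python
import re

# Same pattern as in the RewriteCond above; [NC] maps to re.IGNORECASE
BAD_BOTS = re.compile(r"^.*(agent1|Cheesebot|Catall\ Spider).*$", re.IGNORECASE)

def is_bad_bot(user_agent: str) -> bool:
    """Return True if the user-agent string matches the blocked-bot pattern."""
    return BAD_BOTS.match(user_agent) is not None

print(is_bad_bot("Mozilla/5.0 (compatible; cheesebot/1.0)"))          # True
print(is_bad_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # False
```

Testing the pattern like this before deploying it helps avoid accidentally locking out a legitimate crawler.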

  1. In addition, we added a filter in GA (Google Analytics) to keep this bot traffic out of our analytics data.
  2. We use the Cloudflare service, so we set up rules for bot traffic with threshold values to detect bad traffic and bad requests.
  3. As described above, we blocked the bad bot traffic and requests.

I hope this is useful for your company or website as well.

How to check whether bots are allowed or disallowed?

Is there a way to tell whether you have allowed a particular bot or not? Sometimes we want to ban a bot, but unfortunately it is still allowed; similarly, we want to allow a bot, but it is blocked by another rule. To be sure, you can test the status with a robots.txt testing tool: enter your site's address and the bot's user agent to see its status, and then rework your robots.txt file based on that status.
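One way to script such a check yourself is Python's standard urllib.robotparser module, which applies the same allow/disallow rules a compliant bot would (the robots.txt content below is a made-up example):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: block ia_archiver entirely, block everyone else from /promo
rules = """User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /promo
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check whether a given user agent may fetch a given URL
print(parser.can_fetch("Googlebot", "https://example.com/blog"))     # True
print(parser.can_fetch("Googlebot", "https://example.com/promo"))    # False
print(parser.can_fetch("ia_archiver", "https://example.com/blog"))   # False
```

This is handy for catching the "allowed by one rule, blocked by another" conflicts mentioned above before they reach production.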

Familiar web crawlers and their user-agent names

  • Google Bot — Googlebot (all Google products)
  • Bing Bot — Bingbot (Microsoft Bing)
  • Slurp Bot — Slurp (Yahoo)
  • DuckDuckBot — DuckDuckBot (DuckDuckGo)
  • Baiduspider — Baiduspider (Chinese search engine)
  • YandexBot — YandexBot (Russian search engine)
  • Sogou Spider — Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) (Chinese search engine)
  • Exabot — Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot) (France)
  • Facebookexternalhit — facebot facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php) (Facebook)
  • Alexa crawler — ia_archiver (Amazon’s Alexa internet rankings)

Some bots are exceptions:

Some bots do not respect your robots.txt file. Which ones are they?

  • Web light
  • Google Favicon
  • DuplexWeb-Google
  • Google-Read-Aloud
  • FeedFetcher-Google

Final words

As we all know, well-behaved bots are rule followers, not rule breakers, so be mindful of how you handle bots on your website and manage them properly. Get your business off to a good start.
