Protect Your Content: How to Block AI Web Crawlers with robots.txt

AI Rabbit
2 min read · Jul 14, 2024


As artificial intelligence and large language models become more prevalent, many website owners are concerned about their content being used, without attribution or compensation, to train these models. That can cost them audience and revenue. Here’s how you can take steps to protect your content from AI web crawlers.

Why Block AI Crawlers?

AI companies use web crawlers to gather data from websites to train their models. While this can lead to more advanced AI, it often happens without explicit permission or compensation to content creators. By blocking these crawlers, you maintain more control over how your content is used.

How to Block AI Crawlers

The simplest way to block AI crawlers is through your website’s robots.txt file. This file tells web crawlers which parts of your site they’re allowed to access. Keep in mind that robots.txt works on the honor system: it only stops crawlers that choose to respect it. Here’s a current list of known AI crawlers and how to block them:

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: OmgiliBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /
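
If you prefer a shorter file, the robots.txt standard (RFC 9309) also allows several User-agent lines to share a single rule group. The same policy can be written in this compact form:

User-agent: Google-Extended
User-agent: GPTBot
User-agent: CCBot
User-agent: FacebookBot
User-agent: OmgiliBot
User-agent: anthropic-ai
User-agent: PerplexityBot
Disallow: /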

To implement this, copy the directives above into your robots.txt file, which lives in the root directory of your website (e.g., https://example.com/robots.txt).
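
Once the file is live, you can sanity-check your rules. Here’s a minimal sketch using Python’s standard-library urllib.robotparser; example.com is a placeholder for your own domain:

from urllib import robotparser

# Point the parser at your live robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the file

# can_fetch() returns False for any user agent your rules disallow.
# Googlebot is included as a control: it should remain allowed.
for agent in ["GPTBot", "CCBot", "anthropic-ai", "PerplexityBot", "Googlebot"]:
    print(agent, "blocked:", not rp.can_fetch(agent, "https://example.com/"))

Each AI crawler in the list should print blocked: True, while Googlebot should print blocked: False, confirming that normal search indexing is unaffected.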

Staying Up-to-Date

As new AI models and crawlers emerge, it’s important to keep your robots.txt file updated. I’ll be regularly updating this list as new information becomes available.
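
One way to spot new crawlers yourself is to skim your server’s access logs for unfamiliar user agents. Here’s a rough sketch in Python, assuming the common “combined” log format, where the user agent is the final quoted field; the log path is an assumption, so adjust it for your server:

import re
from collections import Counter

# The user agent is the last double-quoted field in the "combined" format.
ua_pattern = re.compile(r'"([^"]*)"\s*$')
counts = Counter()

with open("/var/log/nginx/access.log") as log:  # path is an assumption
    for line in log:
        match = ua_pattern.search(line)
        if match:
            counts[match.group(1)] += 1

# The most frequent user agents; unfamiliar bot names stand out here.
for agent, hits in counts.most_common(20):
    print(f"{hits:8d}  {agent}")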

If you know of any other AI crawlers that should be added to this list, please private message me on Medium. Your input helps keep this resource current and valuable for the community.
