Block AI Bots from Crawling Websites Using Robots.txt

See the live dashboard showing which websites are blocking AI bots such as GPTBot, CCBot, Google-Extended, and Bytespider from crawling and scraping their content. Learn what the leading AI crawlers and scrapers do and how to block them using robots.txt.

Bots such as OpenAI’s GPTBot, Applebot, CCBot, Google-Extended, and Bytespider analyze, store, or scrape your website’s data to train more advanced LLMs.

At Originality.ai, we care about the responsible development (which includes ethical scraping) and responsible use (see our AI Detector) of generative AI writing tools like ChatGPT.

This article takes a deep dive into the purpose of AI bots, what they do, how to block them, and the interesting battle for the future of AI playing out in a little-known file called robots.txt.

What are AI Bots?

AI bots come in multiple forms, including AI Assistants, AI Data Scrapers, and AI Search Crawlers, all of which power leading AI tools and AI search engines. Each of these AI bots extracts data from the web. Many webmasters find these practices unacceptable and want to keep their data and website information from being scraped.

Why Does it Matter if Websites Block AI Bots?

As more of the internet blocks AI bots, the number of words available to AI companies like OpenAI and Anthropic for developing their latest LLMs (such as GPT-4o and Claude 3.5) will decline, slowing future improvement of AI tools.

How to Block AI Bots (Simple Version):

The most common way is to add the following text to your Robots.txt file:

User-agent: name-of-bot

Disallow: /

Example:

User-agent: GPTBot

Disallow: /

See the bottom of this article for a more in-depth explanation of the options for blocking AI bots and a sample robots.txt file.

OpenAI, in particular, has been active in trying to secure data partnerships to continue fueling its LLM training efforts.

What is robots.txt and What Does it Do?

Robots.txt is a plain-text file, defined by the Robots Exclusion Protocol, that tells crawlers which URLs they may access on your website. It is primarily used to prevent crawlers from overloading the domain with requests.

It is important to know that robots.txt is a request that bots should follow, not one they are forced to follow.

In terms of filtering and managing crawler traffic to your website, robots.txt serves a different purpose depending on the type of file being filtered:

Web Page Filtering

The leading purpose of robots.txt is to restrict crawler access to specific pages on your website. If you are concerned that your website's server is being overwhelmed by too many requests from Google, for example, you can prevent search engine crawlers from accessing specific pages to reduce the load.
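For example, a minimal rule of this kind looks like the following; the /private/ directory is just an illustration, so substitute the paths you actually want to protect:

User-agent: *
Disallow: /private/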

Media File Filtering

You can use the robots.txt file to manage the crawl traffic to images, videos, and audio files. This would prevent specific media from appearing in the SERP (Search Engine Results Page) on Google or other search engines.
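For instance, the following rule would keep Google's image crawler away from a hypothetical /photos/ directory (Googlebot-Image is Google's documented image user agent; the path is a placeholder):

User-agent: Googlebot-Image
Disallow: /photos/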

Resource File Filtering

If unimportant style files, images, or scripts are driving up server utilization, you can use the robots.txt file to restrict particular AI crawlers, scrapers, assistants, or other bots from accessing them.
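As a sketch, a rule like the following (the paths are placeholders) asks all compliant bots to skip script and style directories:

User-agent: *
Disallow: /scripts/
Disallow: /styles/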

Types of AI Bots

Let’s review the different types of AI bots and crawlers deployed by companies on the web:

AI Assistants

AI Assistants such as ChatGPT-User, owned by OpenAI, and Meta-ExternalFetcher, deployed by Meta, play a vital role in responding to user inquiries. The responses can be in either text or voice format and use the collected web data to construct the most helpful possible answer to the user’s prompt.

AI Data Scrapers

AI web scraping is a procedure conducted by AI Data Scraper bots to harvest as much useful data as possible for LLM training. Companies such as Apple, ByteDance, Common Crawl, OpenAI, and Anthropic use AI Data Scrapers to build large datasets of the web for LLMs to train on.

AI Search Crawlers

Many companies deploy AI Search Crawlers to gather information about specific website pages, titles, keywords, images, and referenced inline links. While AI search crawlers have the potential to send traffic to a website, some website owners still choose to block them.


Deep Analysis of Popular AI Bots

ChatGPT-User - AI Search Assistant 

Overview

ChatGPT-User is a search assistant crawler dispatched by OpenAI's ChatGPT in response to user prompts. Its answers typically include a summary of the website's content along with a reference link.

Type

The ChatGPT-User crawler's type is AI Assistant, as it is used to intelligently conduct tasks on behalf of the ChatGPT user.

Crawler Behaviour

The ChatGPT-User Search Assistant is expected to make one-off visits in response to user requests, rather than browsing the web automatically like other crawlers.

How to Block ChatGPT-User Search Assistant?

To block this crawler, you must include the following statement in the robots.txt of your website:

User-agent: ChatGPT-User
Disallow: /

Meta-ExternalFetcher - AI Search Assistant 

Overview

The Meta-ExternalFetcher crawler is dispatched by Meta AI products to fetch individual links when user prompts require them.

Type

Meta-ExternalFetcher is an AI Assistant, dispatched to intelligently perform tasks on behalf of the Meta AI user.

Crawler Behaviour

Similar to the ChatGPT-User crawler, Meta-ExternalFetcher generally makes one-off visits based on the user's request, rather than automatically crawling the web.

How to Block Meta-ExternalFetcher AI Assistant?

You must include the following command in your website's robots.txt to prevent Meta-ExternalFetcher's access:

User-agent: Meta-ExternalFetcher
Disallow: /

Amazonbot - AI Search Crawler 

Overview

The Amazonbot web crawler is used by Amazon to index web content, which allows the Alexa AI Assistant to answer questions more accurately. Most of Alexa's answers contain a reference to the source website.

Type

Amazonbot is an AI Search Crawler that is used for indexing web content for Alexa's AI-powered search results.

Crawler Behaviour

Search crawlers do not adhere to a fixed visitation schedule. Visit frequency depends on many factors, and for Amazonbot, crawling typically happens on demand in response to user queries.

How to Block Amazonbot Search Crawler?

You can limit Amazonbot's access to your website by adding the following lines to your website's robots.txt:

User-agent: Amazonbot
Disallow: /

Applebot - AI Search Crawler 

Overview

The Applebot Search Crawler indexes web content, allowing the Siri AI Assistant to answer user questions more effectively. Most of Siri's responses contain a reference to the websites crawled by Applebot.

Type

Applebot is an AI Search Crawler that indexes web content to construct AI-powered search results.

Crawler Behaviour

Applebot's behavior varies based on multiple factors, such as search demand, crawled websites, and user queries. By default, search crawlers do not rely on a fixed visitation schedule to provide results.

How to Block Applebot Search Crawler

While blocking search crawlers is generally not advised, you can use the following command in the website's robots.txt to prevent Applebot's access:

User-agent: Applebot
Disallow: /

OAI-SearchBot - AI Search Crawler

Overview

The OAI-SearchBot crawler is used to build an index of websites that can be surfaced as results in OpenAI's SearchGPT product.

Type

The OAI-SearchBot is an AI Search Crawler used for indexing web content to provide more accurate AI-powered search results for OpenAI's SearchGPT service.

Crawler Behaviour

The OAI-SearchBot's behavior is shaped by the frequency of web searches and user queries. Like any other search crawler, OAI-SearchBot does not rely on a fixed visitation schedule to provide results.

How to Block OAI-SearchBot Crawler?

Include the following command in the robots.txt file of your website to actively prevent the OAI-SearchBot's access:

User-agent: OAI-SearchBot
Disallow: /

PerplexityBot - AI Search Crawler

Overview

Perplexity uses the PerplexityBot web crawler to index web content so its AI Assistant can provide more effective answers. The answers provided by the assistant normally surface inline references to a variety of web sources.

Type

PerplexityBot is an AI Search Crawler designed to index web content for the Perplexity AI Assistant's AI-powered search results.

Crawler Behaviour

Like most search crawlers, the PerplexityBot does not depend on a fixed visitation schedule for the web sources it promotes. The frequency of visits can vary based on multiple factors, such as user queries.

How to Block PerplexityBot Search Crawler?

You can restrict PerplexityBot's access to your website by including the following agent token rule in the robots.txt:

User-agent: PerplexityBot
Disallow: /

YouBot - AI Search Crawler

Overview

YouBot is a search crawler deployed by You.com to index web content for more accurate answers from the You AI Assistant. The bot's answers generally cite the referenced websites via inline sources.

Type

The YouBot Search Crawler indexes web content to generate more accurate AI-powered search results.

Crawler Behaviour

The YouBot crawler does not have a set visitation schedule; visits often happen on demand or in response to a user query.

How to Block YouBot Search Crawler?

You must paste the following command into your website's robots.txt file to prevent the YouBot crawler's access:

User-agent: YouBot
Disallow: /

Applebot-Extended - AI Data Scraper

Overview

The Applebot-Extended AI Data Scraper is used to train Apple's lineup of LLMs that power the company's generative AI features. This Apple scraper has wide application across Apple Intelligence, Services, and Developer Tools.

Type

Applebot-Extended is an AI Data Scraper used to download web content and train AI models, including LLMs (Large Language Models).

Crawler Behaviour

While it remains unclear exactly how AI Data Scrapers choose which websites to crawl, sources with a higher information density are known to attract this Apple scraper. It would make sense for an LLM training pipeline to favor websites that regularly publish and update their on-page content.

How to Block Applebot-Extended AI Data Scraper?

Include the following command in your website's robots.txt file to block the Applebot-Extended:

User-agent: Applebot-Extended
Disallow: /

Bytespider - AI Data Scraper

Overview

Bytespider is an AI Data Scraper operated by ByteDance, the Chinese owner of TikTok. It is used to download training data for the company's LLMs.

Type

Bytespider is an AI Data Scraper used to train Large Language Models by downloading content from the web.

Crawler Behaviour

The Bytespider AI Data Scraper favors web sources with regularly updated and fact-rich information to supply its LLMs.

How to Block Bytespider AI Data Scraper?

Include the following user-agent token rule in your website's robots.txt:

User-agent: Bytespider
Disallow: /

CCBot - AI Data Scraper

Overview

CCBot is operated by Common Crawl to build an open repository of web crawl data that anyone can access and use.

Type

CCBot is an AI Data Scraper used to download web content for AI model training.

Crawler Behaviour

CCBot crawls information-rich web sources to support more effective LLM training.

How to Block CCBot AI Data Scraper?

Include the following rule in robots.txt to restrict CCBot's access:

User-agent: CCBot
Disallow: /

ClaudeBot - AI Data Scraper

Overview

The ClaudeBot AI Data Scraper is operated by Anthropic to supply Large Language Models like Claude with training data.

Type

ClaudeBot is an AI Data Scraper used to download web content and train AI models.

Crawler Behaviour

ClaudeBot chooses which websites to crawl based on the information density and the regularity of information updates.

How to Block ClaudeBot AI Data Scraper?

ClaudeBot's access can be blocked by including the following rule in the robots.txt:

User-agent: ClaudeBot
Disallow: /

Diffbot - AI Data Scraper

Overview

Diffbot is designed to structure, understand, and aggregate website data, and even sells properly structured web data for AI model training and real-time monitoring.

Type

Diffbot is an AI Data Scraper designed to download and structure web information for AI model training.

Crawler Behaviour

Diffbot's visit frequency depends on the quality of a source's information and how regularly it is updated.

How to Block Diffbot AI Data Scraper?

Diffbot's crawling can be prevented by applying the following rule in the robots.txt:

User-agent: Diffbot
Disallow: /

FacebookBot - AI Data Scraper

Overview

FacebookBot is deployed by Meta to train AI models and improve the efficiency of its AI speech recognition technology.

Type

FacebookBot is an AI Data Scraper used to download web content for LLM training.

Crawler Behaviour

FacebookBot does not have a fixed visitation schedule, but it appears to favor sources with rich, well-maintained information.

How to Block FacebookBot AI Data Scraper?

FacebookBot's access can be blocked with the following rule:

User-agent: FacebookBot
Disallow: /

Google-Extended - AI Data Scraper

Overview

The Google-Extended crawler is used to supply training data for Google's AI products, such as the Gemini assistant and the Vertex AI generative APIs.

Type

The Google-Extended crawler is an AI Data Scraper used to download information from the web for AI training.

Crawler Behaviour

The Google-Extended bot’s visitation schedule is also flexible, but its crawling is more directed than that of other crawlers, thanks to Google's rich index of reliable web sources.

How to Block Google-Extended AI Data Scraper?

The Google-Extended crawler can be blocked with the following rule:

User-agent: Google-Extended
Disallow: /

GPTBot - AI Data Scraper

Overview

The GPTBot is developed by OpenAI to crawl web sources and download training data for the company's Large Language Models and products like ChatGPT.

Type

The GPTBot is an AI Data Scraper designed to download and supply a wide range of data from the web.

Crawler Behaviour

Like other AI Data Scrapers, GPTBot favors information-rich sources and websites that supply more relevant information for AI training.

How to Block GPTBot AI Data Scraper?

You can block the GPTBot AI Data Scraper with the following robots.txt rule:

User-agent: GPTBot
Disallow: /

Meta-ExternalAgent - AI Data Scraper

Overview

Meta-ExternalAgent is a crawler technology developed by Meta to improve the company's AI technologies by downloading and indexing web content directly.

Type

Meta-ExternalAgent is an AI Data Scraper that downloads and indexes web content for AI training.

Crawler Behaviour

Similar to other crawlers developed by the company, the Meta-ExternalAgent uses a flexible crawling strategy to pinpoint information-rich web sources.

How to Block Meta-ExternalAgent AI Data Scraper?

This crawler developed by Meta can be restricted through the following robots.txt rule:

User-agent: Meta-ExternalAgent
Disallow: /

omgili - AI Data Scraper

Overview

The omgili crawler is owned by Webz.io and maintains a library of web crawl data that is then sold to other companies for AI training purposes.

Type

The omgili crawler is an AI Data Scraper that downloads AI training information from the web.

Crawler Behaviour

As the crawled information is sold on by Webz.io, the omgili crawler tracks credible, authoritative websites with relevant information.

How to Block omgili AI Data Scraper?

Use the following rule to prevent the omgili crawler's access to your website:

User-agent: omgili
Disallow: /

Anthropic-AI - Undocumented AI Agent 

Overview

The unconfirmed Anthropic-AI agent appears to be used for downloading relevant training data and supplying it to the company's AI-powered products, such as Claude.

Type

The exact type of the Anthropic-AI agent is still unknown, as the company has not disclosed it.

Crawler Behaviour

With little information available about the Anthropic-AI agent, it is difficult to tell what it does; the crawler may serve multiple purposes.

How to Block Anthropic-AI Agent?

The Anthropic-AI agent can be blocked with the following rule:

User-agent: anthropic-ai
Disallow: /

Claude-Web - Undocumented AI Agent

Overview

Claude-Web is another AI agent operated by Anthropic, without official documentation of its purpose. Claude-Web is expected to provide LLM training data for Anthropic.

Type

Claude-Web is likely either an AI Data Scraper or a standard search crawler for Anthropic's Claude 3.5 Large Language Model.

Crawler Behaviour

Anthropic is holding back information about Claude-Web's functionality; the crawler's behavior will become clearer once its type is fully disclosed.

How to Block Claude-Web Agent?

The following rule is used to suspend the Claude-Web agent's crawling access to your website:

User-agent: Claude-Web
Disallow: /

Cohere-AI - Undocumented AI Agent

Overview

Cohere-AI is an undocumented agent developed by Cohere to supply its generative AI tools with relevant information. It retrieves information from the web when prompted by users through Cohere's AI.

Type

As no documentation is available for this Cohere AI agent, the crawler's type remains unknown.

Crawler Behaviour

The Cohere-AI agent is suspected to be multi-purpose, exhibiting different behavioral patterns to supply Cohere users with relevant information and inline source links.

How to Block Cohere-AI Agent?

You can suspend the Cohere-AI Agent's access through the following rule:

User-agent: cohere-ai
Disallow: /

Ai2Bot - AI Search Crawler

Overview

The primary function of the Ai2Bot is to crawl “certain domains” and acquire web content for training language models.

Type

As reported by Ai2, the Ai2Bot is an AI Search Crawler, as it analyzes content, images, and videos on the crawled website.

Crawler Behavior

The Ai2Bot only crawls specific websites, as reported by the company, but its range of crawled domains may widen over time.

How to Block Ai2Bot AI Search Crawler?

Include the following rule in your website’s robots.txt to suspend the Ai2Bot:

User-agent: Ai2Bot

Disallow: /

Ai2Bot-Dolma - AI Search Crawler

Overview

Ai2Bot-Dolma is owned by Ai2 and respects robots.txt rules. The content it acquires is used to train a variety of language models owned by the company.

Type

Although the bot does not have a specific assigned type, we believe it has the behavior of a standard AI Search Crawler.

Crawler Behavior

The Ai2Bot-Dolma only crawls “certain domains” to find the web content required for training language models.

How to Block Ai2Bot-Dolma AI Search Crawler?

Use the following robots.txt line to restrict Ai2Bot-Dolma’s access:

User-agent: Ai2Bot-Dolma

Disallow: /

FriendlyCrawler - Unknown

Overview

While not much is known about this crawler, it respects the robots.txt and is used to acquire data for machine learning experiments.

Type

The type of the bot is still unknown, but we believe it’s either a Generic Crawler or an AI Data Scraper based on its web behavior.

Crawler Behavior

It’s unclear who operates this bot, but the data it collects is used for large language model training, machine learning, and dataset creation.

How to Block FriendlyCrawler?

Here’s how to suspend the access of FriendlyCrawler to your website:

User-agent: FriendlyCrawler

Disallow: /

GoogleOther - Search Engine Crawler

Overview

The GoogleOther crawler is a bot owned by Google, but it is still unclear whether it is AI-related or artificially intelligent at all. Currently, only a small percentage of top-performing websites have blocked the GoogleOther bot.

Type

The GoogleOther bot is a Search Engine Crawler that indexes web content for more accurate search engine results or the Google SERP.

Crawler Behavior

Like any other Search Engine Crawler, the GoogleOther bot does not adhere to a particular visitation schedule; visit frequency varies with website activity and content quality.

How to Block GoogleOther Search Engine Crawler?

Add the following rule to your website’s robots.txt file:

User-agent: GoogleOther

Disallow: /

GoogleOther-Image - Generic Crawler

Overview

GoogleOther-Image is the version of GoogleOther purposed to crawl, analyze, and index images on the web.

Type

Like GoogleOther, the GoogleOther-Image bot is a generic crawler used by various product teams to make website content publicly accessible. 

Crawler Behavior

This crawler does not stick to a fixed visitation schedule; it analyzes the most reliable image sources on the web and indexes the relevant information.

How to Block GoogleOther-Image Generic Crawler?

You can block this Google crawler with the following command:

User-agent: GoogleOther-Image

Disallow: /

GoogleOther-Video - Generic Crawler

Overview

Like the standard GoogleOther crawler and its image counterpart, GoogleOther-Video crawls and analyzes video content on the web.

Type

This bot is a generic crawler serving a variety of product teams, helping businesses get better reach for the videos uploaded to their websites.

Crawler Behavior

Like the other versions of GoogleOther, this crawler also has a flexible visitation schedule defined by the activity and quality of website video content.

How to Block GoogleOther-Video Generic Crawler?

This Google bot can be blocked with the following robots.txt line:

User-agent: GoogleOther-Video

Disallow: /

ICC-Crawler - Unknown

Overview

The ICC-Crawler is an agent that has not yet been categorized by its creator. It is still unknown whether this crawler is artificially intelligent or related to AI at all.

Type

As with its origin, the type of the ICC-Crawler bot is also unknown.

Crawler Behavior

The bot's behavior will depend on its type, particularly on whether it is a Data Scraper, Search Engine Crawler, or Archiver.

How to Block ICC-Crawler?

The ICC-Crawler can be blocked with the following command:

User-agent: ICC-Crawler

Disallow: /

ImagesiftBot - Intelligence Gatherer

Overview

The ImagesiftBot is owned by Hive, but it is currently unknown whether the crawler is AI-related or artificially intelligent.

Type

The ImagesiftBot is an intelligence gatherer that searches for useful insights on the web and registers or indexes the results in a database.

Crawler Behavior

The behavior of the Intelligence Gatherer crawlers depends on the goals of their clients. For instance, a client might be interested in popularizing their brand, which causes the bot to crawl social media more frequently than other unrelated websites.

How to Block ImagesiftBot Intelligence Gatherer?

This crawler can be blocked in the following way:

User-agent: ImagesiftBot

Disallow: /

PetalBot - Search Engine Crawler

Overview

PetalBot, owned by Huawei, is currently blocked on 2% of popular indexed websites. It is still unknown whether this crawler is artificially intelligent or related to AI in any way.

Type

PetalBot is a Search Engine Crawler that indexes web content and acquires data from search engine results.

Crawler Behavior

PetalBot’s behavior is defined by the quality of the content and the activity of the registered websites and domains. Search Engine Crawlers tend to crawl websites with a higher content quality and frequent activity.

How to Block PetalBot Search Engine Crawler?

PetalBot’s access can be suspended in the following way:

User-agent: PetalBot

Disallow: /

Scrapy - AI Scraper

Overview

Scrapy is owned by Zyte and is currently blocked on more than 3% of registered domains on the web.

Type

Scrapy is an AI Scraper, a category of crawler notorious for not respecting a website's robots.txt. It may still access and analyze a website's information even when the site's robots.txt contains a disallow rule for Scrapy.

Crawler Behavior

Predicting the visitation schedule of an AI Scraper is nearly impossible. These crawlers are dispatched for different purposes, and it's hard to tell which websites they will crawl or how often.

How to Block Scrapy AI Scraper?

Although robots.txt does little against AI Scrapers that ignore it, here's how to ask Scrapy to stay away:

User-agent: Scrapy

Disallow: /

Timpibot - AI Data Scraper

Overview

Timpibot is owned by Timpi and is currently blocked on 3% of popular indexed websites. The sole purpose of Timpibot is to acquire web data for AI model training.

Type

Unlike a standard Scraper, Timpibot is an AI Data Scraper that solely relies on artificial intelligence to crawl and index web content. 

Crawler Behavior

Similar to standard Scrapers, the visitation schedule of AI Data Scrapers is also unclear. These crawlers tend to choose websites with a higher information density and content value, depending on the required information for the AI model training.

How to Block Timpibot AI Data Scraper?

Timpibot’s access can be suspended with the following robots.txt rule:

User-agent: Timpibot

Disallow: /

VelenPublicWebCrawler - Intelligence Gatherer

Overview

The VelenPublicWebCrawler is operated by Hunter and has so far been blocked by 0% of top-indexed websites.

Type

The crawler is a standard Intelligence Gatherer, purposed to collect useful insights from web results.

Crawler Behavior

Intelligence Gatherers serve the goals of their clients and, in most cases, have specific instructions about what information to gather.

How to Block VelenPublicWebCrawler Intelligence Gatherer?

VelenPublicWebCrawler can be blocked with the following command:

User-agent: VelenPublicWebCrawler

Disallow: /

Webzio-Extended - AI Data Scraper

Overview

Webzio-Extended is another crawler owned by Webz.io, used to maintain the repository of the acquired crawl data. The information is then sold to other companies and is typically used for AI model training.

Type

Webzio-Extended is an AI Data Scraper that downloads web content for the purpose of training AI models.

Crawler Behavior

AI Data Scrapers such as Webzio-Extended do not stick to a fixed visitation schedule. Richer data sources tend to attract the scrapers and are crawled more often.

How to Block Webzio-Extended AI Data Scraper?

Webzio-Extended’s access to your website can be suspended with the following command:

User-agent: Webzio-Extended

Disallow: /

Facebookexternalhit - Fetcher

Overview

The facebookexternalhit crawler is a bot dispatched by Meta and blocked on more than 6% of popular registered websites. It is still unclear whether the bot is artificially intelligent or related to AI.

Type

Facebookexternalhit is a fetcher that retrieves web content on behalf of an application.

Crawler Behavior

Fetchers are typically dispatched to visit websites on demand. They are used to present the metadata of a particular link, such as its title or thumbnail image, to the requesting user.

How to Block Facebookexternalhit Fetcher?

This Meta crawler can be blocked with the following robots.txt rule:

User-agent: facebookexternalhit 

Disallow: /

Img2dataset - Unknown

Overview

Img2dataset is a web crawler purposed to download large numbers of images and convert them into datasets for training large language models.

Type

The type of img2dataset is still unknown, but we believe that it is an AI Search Engine Crawler, based on its behavior and visitation schedule.

Crawler Behavior

Img2dataset tends to crawl websites with richer image libraries spanning a variety of themes.

How to Block img2dataset Crawler?

This crawler can be suspended with the following rule:

User-agent: img2dataset

Disallow: /

How to Block AI Bots (Advanced Version):

There are multiple ways to restrict unwanted bot traffic to your website:

The most common way is to add the following text to your Robots.txt file:

User-agent: name-of-bot
Disallow: /

Example:

User-agent: GPTBot
Disallow: /

Use Robots.txt

Creating an easily accessible robots.txt file involves a few simple steps:

1. Create a File Named "robots.txt"

During this step, you must access the website's root directory; this is where the file has to be stored to effectively restrict crawler access to your website. Keep in mind that your website can have only a single robots.txt file.

2. Write the robots.txt Rules

You can use almost any text editor to create the file, such as Notepad on Windows, TextEdit, or vi. Ensure the file is saved with UTF-8 encoding, then add the rules. Google's crawlers respond to the following rules: user-agent, disallow, allow, and sitemap. Each rule has a different purpose in managing crawlers.
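For illustration, a small file using all four rules might look like the following; the domain and paths here are placeholders, not values from this article:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Sitemap: https://example.com/sitemap.xml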

3. Upload the robots.txt File

The next step is to upload the robots.txt file to your site's root directory so that it is publicly accessible. You can check that it is reachable by opening a private browsing window and navigating to your domain with "/robots.txt" appended, for example, "https://(site name)/robots.txt".

Sample Robots.txt Blocking AI Data Scraper Bots

As we’ve established earlier, including a restriction rule for a particular bot in your website’s robots.txt will neither deindex the page nor remove it from the SERP. A compliant bot checks the file before crawling and simply skips the disallowed URLs; since robots.txt is only a request, it does not technically prevent access.

The page will still be discoverable by users and by other AI bots, but bots carrying a user-agent token listed in the robots.txt file will not crawl it, provided they honor the file.

This example file blocks:

  • AI Search Assistants: not blocked
  • AI Search Crawlers: not blocked
  • AI Data Scrapers: all blocked

Blocking the data scrapers but not the search crawlers and assistants aims to keep a website’s data from being used for training while still allowing AI search and assistants to send traffic to the website. A sample file along these lines follows.
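Here is one version of such a file, assembled from the AI Data Scrapers profiled in this article. Treat it as a sketch rather than a canonical list; add or remove user-agent tokens to match your own policy.

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: omgili
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: Webzio-Extended
Disallow: /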

Firewall

There are multiple ways to suspend the access of AI bots with the assistance of a firewall. Let’s review each of them:

  • Set Up IP Blocking

If you know the IP addresses bots use to access your domain, you can blacklist those IPs through your website’s firewall. It’s a common practice for reducing unwanted traffic to your website and preserving resources. However, bots can cycle through multiple IPs.

  • Use CAPTCHA Software

Perhaps the most widely preferred method for suppressing bot traffic to your website is implementing CAPTCHA software that requires human validation. Each new visitor is prompted to complete a simple challenge, like matching puzzle pieces or identifying objects in a set of images, to gain access. Firewalls can be configured to trigger CAPTCHAs.

Use a CDN (Content Delivery Network)

CDN services such as Cloudflare or Amazon CloudFront can be integrated with your website’s firewall to reduce AI bot traffic. Similar to CAPTCHA challenges, only traffic the CDN judges to come from real users will be allowed to reach your website.

The Arising Challenges of Bot Name Changes

Since a wide variety of webmasters rely on robots.txt to suspend unwanted bot traffic, and each rule in the file targets a specific bot by name, some companies, after sensing the traffic decline of their bots, decided to assign them new names that did not yet appear in websites’ robots.txt files.

A recent example is how Anthropic merged its agents named “anthropic-ai” and “Claude-Web” into a new bot named “ClaudeBot.” It took a while for websites to find out, and in the meantime the renamed bot had access to many websites that believed they had blocked Anthropic’s crawlers.
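A defensive practice is to keep both the legacy and the current user-agent tokens in your robots.txt so that a rename does not silently reopen access. For the Anthropic example above, that would look like:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /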

Decline of AI Data Commons and Consequences

With the increasing and widespread use of public information by AI companies to train large language models, the ineffectiveness of existing web protocols is becoming evident. In response to large-scale scraping for AI training corpora such as C4, Dolma, and RefinedWeb, crawler access to the underlying web sources has declined by an estimated 28%-45%.

https://www.dataprovenance.org/Consent_in_Crisis.pdf

A full 45% of C4 is now restricted, with diverse restrictions that increasingly target general-purpose AI crawlers through robots.txt. The growing demand for data consent is becoming a challenge not only for commercial AI but for academic research and non-commercial AI use as well.

Conclusion

The collision of AI companies trying to scrape as much data as possible to train LLMs and publishers working to defend their data/bandwidth from abuse has resulted in an interesting struggle that is playing out in a little-known file on all websites called the robots.txt. 

This guide aims to track how this drama is playing out: our live dashboard shows which of the top 1000 websites are blocking AI bots, and the sections above explain how to block them and which ones you should.

If we are missing any bots you would like included, please reach out.
