Bots such as OpenAI's GPTBot, Applebot, CCBot, Google-Extended, and Bytespider analyze, store, or scrape your website's data in order to train ever more advanced LLMs.
At Originality.ai, we care about the responsible development (which includes ethical scraping) and responsible use (see our AI Detector) of generative AI writing tools like ChatGPT.
This article takes a deep dive into the purpose of AI bots, what they do, how to block them, and the interesting battle for the future of AI playing out in a little-known file called robots.txt.
AI bots come in multiple forms, including AI Assistants, AI Data Scrapers, and AI Search Crawlers, all of which power leading AI tools and AI search engines. Each of these bots extracts data from the web, and many webmasters find these practices unacceptable and want to keep their data or website information from being scraped.
As more of the internet blocks AI bots, the volume of text available to AI companies such as OpenAI and Anthropic for developing their latest LLMs (such as GPT-4o and Claude 3.5) will decline, resulting in slower future improvement of AI tools.
The most common way is to add the following text to your Robots.txt file:
User-agent: name-of-bot
Disallow: /
Example:
User-agent: GPTBot
Disallow: /
See the bottom of this article for a more in-depth explanation of the options for blocking AI bots and a sample robots.txt file.
OpenAI, in particular, has been active in trying to secure data partnerships to continue fueling its LLM training efforts.
Robots.txt is a file that implements the Robots Exclusion Protocol, which tells search engine crawlers which URLs they can access on your website. It is primarily used to prevent crawlers from overloading the domain with requests.
It is important to know that robots.txt is a request that bots are expected to follow, not a rule they are forced to obey.
In terms of filtering and managing crawler traffic to your website, robots.txt serves a slightly different purpose depending on the type of content being crawled:
The primary purpose of robots.txt is to restrict crawler access to specific pages on your website. If you are concerned that your website's server is being overwhelmed by too many Google requests, you can prevent search engine crawlers from accessing specific pages to reduce the load.
You can use the robots.txt file to manage the crawl traffic to images, videos, and audio files. This would prevent specific media from appearing in the SERP (Search Engine Results Page) on Google or other search engines.
If you think unimportant style files, images, or scripts are driving up server utilization, you can use the robots.txt file to restrict the access of particular AI crawlers, scrapers, assistants, or other bots to those resources.
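For example, a minimal rule of this kind, using a placeholder directory name, keeps a single crawler out of an image folder while leaving the rest of the site open:
User-agent: GPTBot
Disallow: /images/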
Let’s review the different types of AI bots and crawlers deployed by companies on the web:
AI Assistants such as ChatGPT-User, owned by OpenAI, and Meta-ExternalFetcher, deployed by Meta, play a vital role in responding to user inquiries. The responses can be in either text or voice format and use the collected web data to construct the most helpful answer possible to the user's prompt.
AI web scraping is a procedure conducted by AI Data Scraper bots to harvest as much useful data as possible for LLM training. Companies such as Apple, ByteDance, Common Crawl, OpenAI and Anthropic use AI Data Scrapers to build a large dataset of the web for LLMs to train on.
Many companies deploy AI Search Crawlers to gather information about specific website pages, titles, keywords, images, and referenced inline links. While AI Search Crawlers have the potential to send traffic to a website, some website owners still choose to block them.
ChatGPT-User is a search assistant crawler dispatched by OpenAI's ChatGPT in response to user prompts. Its answers typically include a summary of the website's content as well as a reference link.
The ChatGPT-User crawler's type is AI Assistant, as it is used to intelligently conduct tasks on behalf of the ChatGPT user.
The ChatGPT-User search assistant makes one-off visits in response to user requests rather than browsing the web automatically like other crawlers.
To block this crawler, you must include the following statement in the robots.txt of your website:
User-agent: ChatGPT-User
Disallow: /
The Meta-ExternalFetcher crawler is dispatched by Meta AI products in response to direct user prompts, whenever an individual link needs to be fetched.
The Meta-ExternalFetcher AI Assistant is deployed to intelligently perform tasks on behalf of the Meta AI user.
Similar to the ChatGPT-User crawler, Meta-ExternalFetcher generally makes one-off visits based on the user's request rather than automatically crawling the web.
You must include the following command in your website's robots.txt to prevent Meta-ExternalFetcher's access:
User-agent: Meta-ExternalFetcher
Disallow: /
The Amazonbot web crawler is used by Amazon to index web content, which allows the Alexa AI Assistant to answer questions more accurately. Most of Alexa's answers contain a reference to the source website.
Amazonbot is an AI Search Crawler that is used for indexing web content for Alexa's AI-powered search results.
Search crawlers, including Amazonbot, do not adhere to a fixed visitation schedule. Crawl frequency depends on many factors, and visits typically happen on demand in response to a user query.
You can limit Amazonbot's access to your website by adding the following lines to your website's robots.txt:
User-agent: Amazonbot
Disallow: /
The Applebot Search Crawler is used to register search results, allowing the Siri AI Assistant to answer user questions more effectively. Most of Siri's responses contain a reference to the websites crawled by Applebot.
Applebot is an AI Search Crawler that indexes web content to construct AI-powered search results.
Applebot's behavior varies based on multiple factors, such as search demand, crawled websites, and user queries. By default, search crawlers do not rely on fixed visitation schedules to provide results.
While it is not advised to block search crawlers, you can use the following command in the website's robots.txt to prevent the Applebot's access:
User-agent: Applebot
Disallow: /
The OAI-SearchBot crawler is used to construct an index of websites that can be surfaced as results in OpenAI's SearchGPT product.
The OAI-SearchBot is an AI Search Crawler used for indexing web content to provide more accurate AI-powered search results for OpenAI's SearchGPT service.
The OAI-SearchBot's behavior can be defined by the frequency of web searches and user queries. Like any other search crawler, the OAI-SearchBot does not rely on fixed website visitation to provide results.
Include the following command in the robots.txt file of your website to actively prevent the OAI-SearchBot's access:
User-agent: OAI-SearchBot
Disallow: /
Perplexity uses the PerplexityBot web crawler to index search results for a more effective answer for their AI Assistant. The answers provided by the assistant normally surface inline references to a variety of web sources.
PerplexityBot is an AI Search Crawler that indexes web content for the AI-powered search results surfaced by the Perplexity AI Assistant.
Like most search crawlers, the PerplexityBot does not depend on a fixed visitation schedule for the web sources it promotes. The frequency of visits can vary based on multiple factors, such as user queries.
You can restrict PerplexityBot's access to your website by including the following agent token rule in the robots.txt:
User-agent: PerplexityBot
Disallow: /
YouBot is a search crawler deployed by You.com to index search results for more accurate answers from the You AI Assistant. The assistant generally cites the referenced websites via inline sources.
The YouBot Search Crawler indexes web content to generate more accurate AI-powered search results.
The YouBot crawler does not have a set visitation schedule and the frequency of visits often happens on-demand or in response to a user query.
You must paste the following command into your website's robots.txt file to prevent the YouBot crawler's access:
User-agent: YouBot
Disallow: /
The Applebot-Extended AI Data Scraper is used to train Apple's lineup of LLMs that power the company's generative AI features. This scraper has a wide application across Apple Intelligence, Services, and Developer Tools.
Applebot-Extended is an AI Data Scraper used to download web content and train AI models and LLMs (Large Language Models).
While it remains unclear exactly how AI Data Scrapers choose which websites to crawl, sources with a higher information density are known to attract this scraper. It would make sense for an LLM pipeline to favor websites that regularly publish and update their on-page content.
Include the following command in your website's robots.txt file to block the Applebot-Extended:
User-agent: Applebot-Extended
Disallow: /
Bytespider is an AI Data Scraper operated by ByteDance, the Chinese owner of TikTok. It is used to download training data for the company's LLMs.
Bytespider is an AI Data Scraper used to train Large Language Models by downloading content from the web.
The Bytespider AI Data Scraper favors web sources with regularly updated, fact-rich information to supply its LLMs.
Include the following user-agent rule in your website's robots.txt to block Bytespider:
User-agent: Bytespider
Disallow: /
CCBot is operated by Common Crawl to build an open repository of web crawl data that anyone can access and use.
CCBot is an AI Data Scraper used to download web content for AI model training.
CCBot crawls information-rich web sources to enable more effective LLM training.
Include the following rule in robots.txt to restrict CCBot's access:
User-agent: CCBot
Disallow: /
The ClaudeBot AI Data Scraper is operated by Anthropic to supply Large Language Models like Claude with training data.
ClaudeBot is an AI Data Scraper used to download web content and train AI models.
ClaudeBot chooses which websites to crawl based on the information density and the regularity of information updates.
ClaudeBot's access can be blocked by including the following rule in the robots.txt:
User-agent: ClaudeBot
Disallow: /
The Diffbot is designed to structure, understand, aggregate, and even sell properly structured website data for AI model training and real-time monitoring.
Diffbot is an AI Data Scraper designed to download and structure web information for AI model training.
Diffbot's visit frequency depends on the quality of a source's information and how regularly it is updated.
The Diffbot's crawl can be prevented by applying the following rule in the robots.txt:
User-agent: Diffbot
Disallow: /
The FacebookBot is deployed by Meta to enhance the AI speech recognition technology's efficiency and to train AI models.
FacebookBot is an AI Data Scraper used to collect web content for LLM training.
FacebookBot does not have a fixed visitation schedule, but it appears to favor sources with richer, regularly updated information.
The FacebookBot's access can be revoked with the following rule:
User-agent: FacebookBot
Disallow: /
The Google-Extended crawler is used to supply training information for AI products owned by Google, such as the Gemini assistant and the Vertex AI generative APIs.
The Google-Extended crawler is an AI Data Scraper used to download information from the web for AI training.
The Google-Extended bot's visitation schedule is also flexible, but its crawling is more directed than that of other crawlers thanks to Google's rich database of reliable web sources.
The Google-Extended crawler can be blacklisted with the following rule:
User-agent: Google-Extended
Disallow: /
The GPTBot is developed by OpenAI to crawl web sources and download training data for the company's Large Language Models and products like ChatGPT.
The GPTBot is an AI Data Scraper designed to download and supply a wide range of data from the web.
Like other AI Data Scrapers, GPTBot favors information-rich sources and websites in order to supply more relevant information for AI training.
You can block the GPTBot AI Data Scraper with the following robots.txt rule:
User-agent: GPTBot
Disallow: /
Meta-ExternalAgent is a crawler technology developed by Meta to improve the company's AI technologies by downloading and indexing web content directly.
Meta-ExternalAgent is an AI Data Scraper that downloads and indexes web content for AI training.
Similar to other crawlers developed by the company, the Meta-ExternalAgent uses a flexible crawling strategy to pinpoint information-rich web sources.
This crawler developed by Meta can be restricted through the following robots.txt rule:
User-agent: Meta-ExternalAgent
Disallow: /
The omgili crawler is owned by Webz.io and is designed to maintain a library of web crawl data that is then sold to other companies for AI training purposes.
The omgili crawler is an AI Data Scraper that downloads AI training data from the web.
Because the crawled information is resold by Webz.io, the omgili crawler keeps track of credible, authoritative websites with relevant information.
Use the following rule to prevent the omgili crawler's access to your website:
User-agent: omgili
Disallow: /
The unconfirmed anthropic-ai agent appears to be used for downloading relevant training data and supplying it to AI-powered products owned by the company, such as Claude.
The exact type of the anthropic-ai agent is still unknown due to the absence of disclosure by the company.
With little information available about the anthropic-ai agent, the crawler may serve multiple purposes, but it is difficult to say for certain.
The Anthropic-AI agent can be blocked with the following rule:
User-agent: anthropic-ai
Disallow: /
Claude-Web is another AI agent operated by Anthropic, without official documentation on its purpose. Claude-Web is expected to provide relevant LLM training data for Anthropic.
Claude-Web is likely either an AI Data Scraper or a standard search crawler supporting Anthropic's Claude 3.5 Large Language Model.
Anthropic has not published details about Claude-Web's functionality; its behavior will become clearer once the agent's type is fully disclosed.
The following rule is used to suspend the Claude-Web agent's crawling access to your website:
User-agent: Claude-Web
Disallow: /
Cohere-AI is an undocumented agent developed by Cohere to supply its generative AI tools with relevant information. It retrieves information from the web when prompted by users through Cohere AI.
As no documentation is available for this Cohere AI agent, the crawler's type remains unknown to website owners.
It is suspected that the cohere-ai agent serves multiple purposes, supplying Cohere users with relevant information and inline source links.
You can suspend the Cohere-AI Agent's access through the following rule:
User-agent: cohere-ai
Disallow: /
The primary function of the Ai2Bot is to crawl “certain domains” and acquire web content for training language models.
As reported by Ai2, the Ai2Bot is an AI Search Crawler, as it analyzes content, images, and videos on the crawled website.
The Ai2Bot only crawls specific websites, as reported by the company, but it may widen its range of crawled domains over time.
Include the following rule in your website’s robots.txt to suspend the Ai2Bot:
User-agent: Ai2Bot
Disallow: /
The Ai2Bot-Dolma bot is owned by Ai2 and respects robots.txt rules. The acquired content is used to train a variety of language models owned by the company.
Although the bot does not have a specific assigned type, we believe it has the behavior of a standard AI Search Crawler.
The Ai2Bot-Dolma only crawls “certain domains” to find the web content required for training language models.
Use the following robots.txt line to restrict Ai2Bot-Dolma’s access:
User-agent: Ai2Bot-Dolma
Disallow: /
While not much is known about this crawler, it respects the robots.txt and is used to acquire data for machine learning experiments.
The type of the bot is still unknown, but we believe it’s either a Generic Crawler or an AI Data Scraper based on its web behavior.
It's unclear who operates this bot, but the data is used for large language model training, machine learning, and dataset creation.
Here’s how to suspend the access of FriendlyCrawler to your website:
User-agent: FriendlyCrawler
Disallow: /
The GoogleOther crawler is a bot owned by Google, but it is still unclear whether it is AI-related at all. Currently, only around 1% of top-performing websites block the GoogleOther bot.
The GoogleOther bot is a Search Engine Crawler that indexes web content for more accurate results on the Google SERP and other search engines.
Like any other Search Engine Crawler, the GoogleOther bot does not adhere to a particular visitation schedule and the visitation frequency varies on website activity and content quality.
Add the following rule to your website’s robots.txt file:
User-agent: GoogleOther
Disallow: /
GoogleOther-Image is the version of GoogleOther used to crawl, analyze, and index images on the web.
Like GoogleOther, the GoogleOther-Image bot is a generic crawler used by various product teams to make website content publicly accessible.
This crawler does not stick to a fixed visitation schedule; it analyzes the most reliable image sources on the web and indexes the relevant information.
You can block this Google crawler with the following command:
User-agent: GoogleOther-Image
Disallow: /
Like the standard GoogleOther crawler and its image-focused counterpart, GoogleOther-Video crawls and analyzes video content on the web.
This bot is a generic crawler serving a variety of product teams, helping businesses get better reach with the videos uploaded to their websites.
Like the other versions of GoogleOther, this crawler also has a flexible visitation schedule defined by the activity and quality of website video content.
This Google bot can be blocked with the following robots.txt line:
User-agent: GoogleOther-Video
Disallow: /
The ICC-Crawler is an agent that has not yet been categorized by its creator. It is still unknown whether this crawler is artificially intelligent or related to AI at all.
As with its purpose, the type of the ICC-Crawler bot is also unknown.
The behavior of the bot varies depending on the crawler’s type, particularly whether it is a Data Scraper, Search Engine Crawler, or Archiver.
The ICC-Crawler can be blocked with the following command:
User-agent: ICC-Crawler
Disallow: /
The Imagesift bot is owned by Hive, but it is currently unknown whether the crawler is AI-related or artificially intelligent.
The ImagesiftBot is an intelligence gatherer that searches for useful insights on the web and registers or indexes the results in a database.
The behavior of the Intelligence Gatherer crawlers depends on the goals of their clients. For instance, a client might be interested in popularizing their brand, which causes the bot to crawl social media more frequently than other unrelated websites.
This crawler can be blocked in the following way:
User-agent: ImagesiftBot
Disallow: /
PetalBot, owned by Huawei, is currently blocked on 2% of popular indexed websites. It is still unknown whether this crawler is artificially intelligent or related to AI in any way.
PetalBot is a Search Engine Crawler that indexes web content to supply search engine results.
PetalBot’s behavior is defined by the quality of the content and the activity of the registered websites and domains. Search Engine Crawlers tend to crawl websites with a higher content quality and frequent activity.
PetalBot’s access can be suspended in the following way:
User-agent: PetalBot
Disallow: /
Scrapy is owned by Zyte and is currently blocked on more than 3% of registered domains on the web.
Scrapy is an open-source web scraping framework maintained by Zyte, and scrapers built with it can be configured to ignore a website's robots.txt. A disallow rule is therefore not guaranteed to stop a Scrapy-based crawler from accessing your content.
Predicting the visitation schedule of such scrapers is nearly impossible. These crawlers are dispatched for different purposes, and it is hard to tell which websites they will crawl or how often.
Although robots.txt offers limited protection against scrapers that ignore it, here's how to block Scrapy:
User-agent: Scrapy
Disallow: /
Timpibot is owned by Timpi and is currently blocked on 3% of popular indexed websites. The sole purpose of Timpibot is to acquire web data for AI model training.
Unlike a standard Scraper, Timpibot is an AI Data Scraper that solely relies on artificial intelligence to crawl and index web content.
Similar to standard Scrapers, the visitation schedule of AI Data Scrapers is also unclear. These crawlers tend to choose websites with a higher information density and content value, depending on the required information for the AI model training.
Timpibot’s access can be suspended with the following robots.txt rule:
User-agent: Timpibot
Disallow: /
VelenPublicWebCrawler is operated by Hunter and has so far been blocked by 0% of top-indexed websites.
The crawler is a standard Intelligence Gatherer, used to collect useful insights from web results.
Intelligence Gatherers tend to meet the goals of their clients and in most cases, have specific tasks towards what information to gather.
VelenPublicWebCrawler can be blocked with the following command:
User-agent: VelenPublicWebCrawler
Disallow: /
Webzio-Extended is another crawler owned by Webz.io, used to maintain the repository of the acquired crawl data. The information is then sold to other companies and is typically used for AI model training.
Webzio-Extended is an AI Data Scraper that downloads web content for the purpose of training AI models.
AI Data Scrapers such as Webzio-Extended do not stick to a fixed visitation of websites. Richer data sources tend to attract the scrapers more and cause them to crawl more often.
Webzio-Extended’s access to your website can be suspended with the following command:
User-agent: Webzio-Extended
Disallow: /
The facebookexternalhit crawler is a bot dispatched by Meta and is blocked on more than 6% of popular registered websites. It is still unclear whether the bot is artificially intelligent or related to AI.
Facebookexternalhit is a fetcher that retrieves web content on behalf of an application.
Fetchers are typically dispatched to visit websites on demand. They are used to present the metadata of a particular link, such as a title or a thumbnail image, to the requesting user.
This Meta crawler can be blocked with the following robots.txt rule:
User-agent: facebookexternalhit
Disallow: /
Img2dataset is a web crawler used to download large numbers of images and convert them into datasets for training large AI models.
The type of img2dataset is still unknown, but based on its behavior and visitation schedule, we believe it is an AI Search Engine Crawler.
Img2dataset tends to crawl websites with richer image libraries and a variety of themes.
This crawler can be suspended with the following rule:
User-agent: img2dataset
Disallow: /
There are multiple ways to restrict unwanted bot traffic to your website:
The most common way is to add the following text to your Robots.txt file:
User-agent: name-of-bot
Disallow: /
Example:
User-agent: GPTBot
Disallow: /
Creating an easily accessible robots.txt file involves several easy steps:
During this step, you must access your website's root directory. This is where the file has to be stored to effectively restrict crawler access to your website. Keep in mind that your website can have only a single robots.txt file.
You can use any text editor to create the file, such as Notepad on Windows, TextEdit, or vi. Ensure the file is saved with UTF-8 encoding and proceed to adding the rules. Google's crawlers respond to the following set of rules: "user-agent," "disallow," "allow," and "sitemap." Each rule serves a different purpose in managing crawlers.
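For illustration, here is a minimal file that uses all four rules; the directory path, page name, and sitemap URL below are placeholders:
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Sitemap: https://www.example.com/sitemap.xml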
The next step is to confirm that the newly created robots.txt file is publicly accessible. Open a private browsing window in your browser and navigate to the file's location by typing your site's domain, for example "https://(site name)", and adding "/robots.txt" at the end.
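If you prefer to verify a rule programmatically, here is a small sketch using Python's standard urllib.robotparser module; the domain and the bot name are placeholder assumptions:

from urllib.robotparser import RobotFileParser

# Placeholder domain; replace it with your own site.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Prints False if the live robots.txt disallows GPTBot from fetching the homepage.
print(parser.can_fetch("GPTBot", "https://www.example.com/"))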
As we've established earlier, including a restriction rule for a particular bot in your website's robots.txt will neither deindex the page nor remove it from the SERP. A compliant bot that reads the rule will simply skip the disallowed pages rather than request their content.
The page will still be discoverable by users and other bots, but bots whose user-agent names are listed in the robots.txt file will not have access to its content.
The sample file below blocks the AI Data Scrapers covered in this article while leaving the AI Search Crawlers and AI Assistants unblocked.
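This is a representative sketch rather than an exhaustive list; adjust the user-agent names to match the bots you actually want to exclude:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: Timpibot
Disallow: /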
Blocking the data scrapers but not the search crawlers and assistants aims to keep a website's data from being used for training while still allowing AI search engines and assistants to send traffic to the website.
There are several ways to block AI bots with the help of a firewall. Let's review each of them:
If you know the IP addresses used by bots to access your domain, you can blacklist those IPs through your website's firewall. It is a common practice for reducing unwanted traffic to your website and preserving resources. However, bots can cycle through multiple IPs.
Perhaps the most widely preferred method for suspending bot traffic to your website is implementing CAPTCHA software that requires human validation. Each new visitor is prompted to complete a simple challenge, like matching puzzle pieces or identifying objects in a set of images, to gain access. Firewalls can be configured to trigger CAPTCHAs.
CDN services such as Cloudflare or Amazon CloudFront can be integrated with your website's firewall to reduce the traffic of AI bots. Similar to CAPTCHA challenges, only traffic that the CDN identifies as coming from real users will be allowed to access your website.
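The same idea can also be enforced at the application level rather than at the firewall or CDN. The sketch below uses Flask purely for illustration; the blocked user-agent list and the route are placeholder assumptions, and this approach only stops bots that identify themselves honestly in the User-Agent header:

from flask import Flask, request, abort

app = Flask(__name__)

# Hypothetical denylist; extend it with any user-agent substrings you want to refuse.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider")

@app.before_request
def refuse_blocked_bots():
    user_agent = request.headers.get("User-Agent", "")
    # Return 403 Forbidden when a blocked bot identifies itself.
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        abort(403)

@app.route("/")
def index():
    return "Hello, human visitor!"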
Since so many webmasters rely on robots.txt to suspend unwanted bot traffic, each rule within the file targets a specific bot by name. After noticing the decline in their bots' access, some companies assigned their bots new names that did not yet appear in the robots.txt files of most websites.
A recent example is how Anthropic merged its AI agents named "ANTHROPIC-AI" and "CLAUDE-WEB" into a new bot named "CLAUDEBOT." It took a while for websites to find out, and in the meantime the new bot had unhindered access to websites that had only blocked the old names.
With the increasing use of public web data by AI companies to train large language models, the limits of web protocols are becoming evident. In response to large-scale scraping for AI training corpora such as C4, Dolma, and RefinedWeb, crawler access to the most important data sources has declined by an estimated 28%-45%.
An estimated 45% of C4 is now restricted, with the restrictions varying widely and largely aimed at general-purpose AI crawlers through robots.txt. The demand for data consent is becoming more and more challenging not only for commercial AI but for all types of academic research and non-commercial AI use.
The collision between AI companies trying to scrape as much data as possible to train LLMs and publishers working to defend their data and bandwidth from abuse has resulted in an interesting struggle playing out in a little-known file, found on most websites, called robots.txt.
This guide aims to track how this drama is playing out with our live dashboard showing which of the top 1000 websites are blocking AI bots, how to block them, and which ones you should block.
If we are missing any bots you would like included, please reach out.