AI Bot Blocking

We analyzed the top 1000 websites in the world to identify which sites are already blocking GPTBot. We updated this study Sep 22, 2023. OpenAI shared details on how to block its GPTBot on Aug 7, 2023 and this is how the Top 1000 websites are responding.

OpenAI shared details on how to block its GPTBot on Aug 7, 2023, and this study looks at how the top 1000 websites are responding. One concern for us, as an AI detector, is the risk that LLMs will scrape even more content and power AI writing tools that become undetectable to AI checkers.

This study was first published here on Aug 22, 2023, updated on Aug 29, and updated again on Sep 22, 2023.

Summary of Key Findings (Sep 22, 2023 Update):

  1. 25.9% of the Top 1000 websites are blocking GPTBot
  2. Top Websites Now Blocking GPTBot Are: Pinterest (most recently added - Sep 11, 2023), Amazon, Quora, Indeed
  3. Most Large Media/News Publishers Are Now Blocking GPTBot: NYTimes, TheGuardian, CNN.com, USAToday, BusinessInsider, Reuters, WashingtonPost, NPR, CBS, NBC, Bloomberg, CNBC, ESPN
  4. The Top 6 Biggest Websites That First Blocked GPTBot are:
         - Amazon.com - Aug 17, 2023
         - Quora.com - by Aug 22, 2023
         - NYTimes.com - Aug 17, 2023
         - Shutterstock.com - Aug 21, 2023
         - Wikihow.com - Aug 12, 2023
         - CNN.com - Aug 22, 2023
  5. The Common Crawl Bot (CCBot) is being blocked 13.9% of the time. CCBot was around before GPTBot, but only 5% of websites blocked it as of Aug 1, 2023
  6. Only 2 Websites are Attempting to Block Anthropic AI: Reuters (blocking both anthropic-ai and claude-web on Sep 11, 2023) and Corriere.it

Google Extended Update - Sep 29, 2023

Google now gives site owners more control over how Google's AI bots use the content on their websites.

Google announced on Sep 28 that Google Extended will be used: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers

The first 2 websites to block it are abc.net.au and francebleu.fr.


Differences from 1 month ago…

    When we ran this study 1 month ago, the Top 100 websites were more likely to block GPTBot than the rest of the Top 1000. That difference is now gone: 26% of both the top 100 and the top 1000 sites are blocking GPTBot.

Websites are rushing to block GPTBot, but other crawlers continue to scrape their content...

It is not clear why some sites would block one crawler bot but allow others.

Updated Study Results Data & File:

  • Top 1000 Websites Checked Sep 20, by 5pm EST
  • The robots.txt files of 933 of the 1000 websites were checked
  • GPTBot, ChatGPT-User, CCBot and Anthropic AI bot were all checked
  • 242 Blocking GPTBot
  • 61 Blocking ChatGPT-User
  • 130 Blocking CCBot
  • 2 Blocking Anthropic AI Bot
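
As a rough check on the arithmetic, the counts above can be turned into percentages. This is only a sketch: it assumes the shares are calculated over the 933 sites whose robots.txt could be checked rather than over all 1000, which is an assumption on our part that happens to reproduce the 25.9% and 13.9% figures in the summary. The names `CHECKED`, `counts` and `share` are made up for this example.

```python
# Counts taken from the Sep 20 results listed above.
CHECKED = 933  # assumption: sites with a readable robots.txt are the denominator
counts = {
    "GPTBot": 242,
    "ChatGPT-User": 61,
    "CCBot": 130,
    "Anthropic AI": 2,
}


def share(blocking: int, checked: int) -> float:
    """Return the percentage of checked sites that block a bot, to 1 decimal."""
    return round(100 * blocking / checked, 1)


shares = {bot: share(n, CHECKED) for bot, n in counts.items()}

for bot, pct in shares.items():
    print(f"{bot}: {pct}% of checked sites")
```

Dividing by 1000 instead of 933 would give lower numbers (for example 24.2% for GPTBot), so the choice of denominator matters when comparing updates.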

See Updated Results Here

Text to Block ALL AI Bots:

User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /

It is not clear whether "anthropic-ai" and "claude-web" would be effective, as Anthropic has not published any documentation.
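
One way to sanity check the rules above before publishing them is to run them through Python's standard `urllib.robotparser` module. The sketch below is not part of the study's tooling; the helper name `blocked_agents` and the constants are made up for illustration. It only shows how a parser that follows robots.txt conventions reads the rules. Whether a given crawler actually honors them is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The "block all AI bots" text from this article.
BLOCK_ALL_AI = """\
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
"""

AI_AGENTS = ["GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web"]


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> list[str]:
    """Return the agents that the given robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


print(blocked_agents(BLOCK_ALL_AI, AI_AGENTS))    # all five AI agents
print(blocked_agents(BLOCK_ALL_AI, ["Googlebot"]))  # empty list: not affected
```

Because each bot gets its own `User-agent` group, crawlers that are not listed (such as a regular search engine bot) are left untouched.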

Study Method and Data:

If you have any questions about this study please contact us.

Study Method:

  1. The top 1000 most popular websites in the world were identified
  2. Each website's robots.txt file was inspected to determine if it was blocking GPTBot or other bots
         - Some websites' robots.txt files could not be found or inspected. They were excluded from our analysis.
  3. If the site was found to be blocking GPTBot then Archive.org was used to verify when the website started blocking it.
        - Some websites blocked archive.org, which made it impossible to verify the date the site started blocking GPTBot.
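
Steps 2 and 3 above can be sketched roughly as follows. This is an illustration and not the script used for the study: the function names, the sample robots.txt text and the snapshot dates are all hypothetical, and the step of downloading live or archived robots.txt files is left out. It also assumes a site counts as "blocking" only when the bot is disallowed from the root path "/".

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str = "GPTBot") -> bool:
    """Step 2: True if this robots.txt text disallows `agent` from "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def first_block_date(snapshots, agent: str = "GPTBot"):
    """Step 3: earliest snapshot date whose robots.txt blocks `agent`.

    `snapshots` is an iterable of (date, robots_txt) pairs with dates
    written as "YYYY-MM-DD" so that sorting them as strings sorts them
    by time. Returns None when no snapshot blocks the agent.
    """
    for date, text in sorted(snapshots):
        if blocks_agent(text, agent):
            return date
    return None


# Hypothetical archived copies of one site's robots.txt.
OPEN_RULES = "User-agent: *\nDisallow: /private/\n"
GPTBOT_RULES = OPEN_RULES + "\nUser-agent: GPTBot\nDisallow: /\n"

snapshots = [
    ("2023-08-05", OPEN_RULES),
    ("2023-08-10", OPEN_RULES),
    ("2023-08-20", GPTBOT_RULES),
    ("2023-08-12", GPTBOT_RULES),
]

print(first_block_date(snapshots))
```

The date found this way is only an upper bound: the site may have added the rule any time between the last snapshot without the block and the first snapshot with it.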

Download Complete Dataset Here:

Download Results Here

GPTBot Study Findings:

26.15% of the Top 1000 Websites Have Blocked GPTBot

Since the launch of GPTBot, 26.15% of the top 1000 websites have started blocking it.

The First “Top 100” Website to Block GPTBot was Reuters.com on Aug 8, 2023

OpenAI launched GPTBot on August 7th, and shortly after, the first “Top 100” website to block GPTBot was Reuters.com.

This can be verified by looking at Archive.org and inspecting the timestamp for the Reuters robots.txt page: Click Here

Inspecting the timestamp for Reuters Robots.txt in Archive.org

Reuters is also the only website that seems to be trying to block Anthropic and Claude2.

Top 6 Websites That Blocked GPTBot Within The First 2 Weeks

Within the first 2 weeks after GPTBot launched, these were the biggest websites in the world that had blocked GPTBot from accessing their sites.

  • #8 Amazon.com - Aug 17, 2023 (corrected on Aug 24 - it used to say Aug 19)
  • #12 Quora.com - by Aug 22, 2023
  • #21 NYTimes.com - Aug 17, 2023
  • #30 Shutterstock.com - Aug 21, 2023
  • #36 Wikihow.com - Aug 12, 2023
  • #37 CNN.com - Aug 22, 2023

GPTBot Most Likely to be Blocked Followed by CCBot

26% of the top 1000 websites are blocking GPTBot while 14% are blocking CCBot, 7% blocking ChatGPT-User and only 0.2% are attempting to block Anthropic.

The Number of Websites Blocking GPTBot Has Continued to Increase

GPTBot launched Aug 7, and since then more and more sites have blocked it. The launch has also driven increased interest in blocking other LLM bots.

List of Websites Blocking GPTBot & CCBot:

Updated Sep 20

See Most Up To Date List Here


How To Block GPTBot:

To block the OpenAI GPTBot from your site, add the following lines to your robots.txt file:

# Disallow GPTBot
User-agent: GPTBot
Disallow: /

What robots.txt file looks like to block GPTBot

For additional details see OpenAI’s documentation: https://platform.openai.com/docs/gptbot

Why Did OpenAI Launch GPTBot Now?

Given that OpenAI has already consumed much of the internet using datasets like Common Crawl, why launch GPTBot?

There are several possible reasons:

  1. OpenAI Lawsuits? OpenAI is facing an increasing number of lawsuits, some of which are related to using content without proper permission. See an up-to-date list of OpenAI ChatGPT Lawsuits.

  2. Public/Government Pressure? Potentially it is part of their follow-through based on the White House open letter they signed.

  3. Improve Future Models? They state, somewhat vaguely, that “Web pages crawled with the GPTBot user agent may potentially be used to improve future models”. This statement could mean…
         - Connecting ChatGPT to the Internet: The GPT-4 knowledge cutoff date is September 2021, so this could be one of the steps in fully connecting ChatGPT to the Internet.
         - Build a Training Dataset: They could use their GPTBot to scrape the web and build a dataset, although I would assume they have very likely already been doing this.

Does this Remove Content from Current LLM Models like ChatGPT?

No, blocking GPTBot does not remove the knowledge that LLMs have gained by training on existing web content they have already accessed.

What is Web Crawling or Scraping and is it Legal?

Web crawling or scraping is a process where automated software visits websites to gather specific information from their pages. This is commonly used by search engines to index content. While useful, this practice can sometimes be contentious, especially if done without the website owner's consent.

In a 2019 case between LinkedIn and hiQ Labs, the ability to scrape publicly available websites was upheld. See the TechCrunch article or the decision.

However, some of the current lawsuits against OpenAI seem to be challenging this.

Downside to Blocking GPTBot? Consider the Google Analogy.

While it may seem premature to compare GPTBot to giants like Google, the analogy isn't without merit. The most significant concern for websites considering blocking GPTBot is the potential missed opportunity. As ChatGPT evolves and integrates with the internet more intimately, it could serve a role similar to that of a search engine. By providing users with direct links or references from web sources, ChatGPT can direct significant traffic to those sites. If GPTBot is blocked, that site's content may not be among the recommended sources, essentially sidelining potential visitors. In essence, just as blocking Google would prevent a website from appearing in one of the world's most popular search engines, blocking GPTBot might mean missing out on a burgeoning channel of web traffic.

Any Questions - Contact Us:

Hopefully, this study was helpful.

If you have any questions please don’t hesitate to contact us.

Other Generative AI Studies:

Over 30% of Reviews on Capterra are AI Generated

List of Companies Banning ChatGPT

Up to Date List of OpenAI and ChatGPT Lawsuits

400% Increase in AI Generated Amazon Reviews
