Websites That Have Blocked OpenAI’s GPTBot CCBot Anthropic Google Extended - 1000 Website Study
We analyzed the top 1000 websites in the world to identify which sites are already blocking GPTBot. We updated this study Sep 22, 2023.
OpenAI shared details on how to block its GPTBot on Aug 7, 2023. This study looks at how the top 1000 websites are responding. One concern for us, as an AI detector, is the risk that LLMs will scrape even more content and power AI writing tools that become undetectable to AI checkers.
It was first published here on Aug 22, 2023, updated on Aug 29, and updated again on Sep 22, 2023.
Summary of Key Findings (Sep 22, 2023 Update):
25.9% of the Top 1000 websites are blocking GPTBot
Top Websites Now Blocking GPTBot Are: Pinterest (most recently added - Sep 11, 2023), Amazon, Quora, Indeed
Most Large Media/News Publishers Are Now Blocking GPTBot: NYTimes, TheGuardian, CNN.com, USAToday, BusinessInsider, Reuters, WashingtonPost, NPR, CBS, NBC, Bloomberg, CNBC, ESPN
The Top 6 Biggest Websites That First Blocked GPTBot are:
- Amazon.com - Aug 17, 2023
- Quora.com - by Aug 22, 2023
- NYTimes.com - Aug 17, 2023
- Shutterstock.com - Aug 21, 2023
- Wikihow.com - Aug 12, 2023
- CNN.com - Aug 22, 2023
The Common Crawl bot (CCBot) is blocked by 13.9% of the websites. CCBot was around before GPTBot, but only 5% of websites blocked it as of Aug 1, 2023
Only 2 Websites are Attempting to Block Anthropic AI: Reuters (blocking both anthropic-ai and claude-web as of Sep 11, 2023) and Corriere.it
Google Extended Update - Sep 29, 2023
Google now gives website owners more control over how Google's AI bots use the content on their websites.
The Top 100 websites were more likely to block GPTBot than the rest of the Top 1000 websites when we ran this study 1 month ago. That difference is now gone: 26% of both the top 100 and the top 1000 sites are blocking GPTBot.
Websites are rushing to block GPTBot, but other crawlers continue to scrape their content...
It is not clear why some sites would block one crawler bot but allow others.
Updated Study Results Data & File:
Top 1000 Websites Checked Sep 20, by 5pm EST
The robots.txt files of 933 of the 1000 websites were checked
GPTBot, ChatGPT-User, CCBot and Anthropic AI bot were all checked
It is not clear whether blocking "anthropic-ai" and "claude-web" is effective, as Anthropic has published no documentation on these user agents.
Study Method and Data:
If you have any questions about this study please contact us.
The top 1000 most popular websites in the world were identified
Each website's robots.txt file was inspected to determine if it was blocking GPTBot or other bots - Some websites' robots.txt files could not be identified or inspected. They were excluded from our analysis.
If the site was found to be blocking GPTBot, then Archive.org was used to verify when the website started blocking it. - Some websites blocked Archive.org, which made it impossible to verify the date the site started blocking GPTBot.
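The two checks described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the study: it relies on the standard library's urllib.robotparser, treats a bot as "blocked" when it may not fetch the site root (so a site-wide "User-agent: *" block also counts), and the function names, the BEFORE/AFTER robots.txt contents and the snapshot dates are all hypothetical. The dated snapshots are assumed to have been collected separately, for example by hand from Archive.org.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the study; others could be appended the same way.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "claude-web"]


def blocked_agents(robots_txt, agents=AI_AGENTS):
    """Map each agent to True when it may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, "/") for agent in agents}


def first_blocked_date(snapshots, agent="GPTBot"):
    """Return the earliest snapshot date whose robots.txt blocks the agent.

    snapshots is an iterable of (iso_date, robots_txt) pairs.
    Returns None when no snapshot blocks the agent.
    """
    for date, text in sorted(snapshots):
        if blocked_agents(text, [agent])[agent]:
            return date
    return None


# Hypothetical robots.txt contents, invented for illustration only.
BEFORE = "User-agent: *\nDisallow: /private/\n"
AFTER = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: CCBot\nDisallow: /\n\n" + BEFORE
)

if __name__ == "__main__":
    print(blocked_agents(AFTER))
    print(first_blocked_date([("2023-08-20", AFTER), ("2023-08-01", BEFORE)]))
```

Checking only the root path is a simplification: a site that disallows a bot from a few directories but not from "/" would be reported as not blocking it.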
Public/Government Pressure? Potentially it is part of their follow-through based on the White House open letter they signed.
Improve Future Models? They state, somewhat vaguely, that “Web pages crawled with the GPTBot user agent may potentially be used to improve future models.” This statement could mean:
- Connecting ChatGPT to the Internet: The GPT-4 knowledge cutoff date is September 2021, so this could be one of the steps in fully connecting ChatGPT to the Internet.
- Build a Training Dataset: They could use their GPTBot to scrape the web and build a dataset, although I would assume they have very likely already been doing this.
Does this Remove Content from Current LLM Models like ChatGPT?
No, blocking GPTBot does not remove the knowledge that LLMs have gained by training on existing web content they have already accessed.
What is Web Crawling or Scraping and is it Legal?
Web crawling or scraping is a process where automated software visits websites to gather specific information from their pages. This is commonly used by search engines to index content. While useful, this practice can sometimes be contentious, especially if done without the website owner's consent.
In a 2019 case between LinkedIn and hiQ Labs, the ability to scrape publicly available websites was upheld. See the TechCrunch article or the decision.
Downside to Blocking GPTBot? Consider the Google Analogy.
While it may seem premature to compare GPTBot to giants like Google, the analogy isn't without merit. The most significant concern for websites considering blocking GPTBot is the potential missed opportunity. As ChatGPT evolves and integrates with the internet more intimately, it could serve a role similar to that of a search engine. By providing users with direct links or references from web sources, ChatGPT can direct significant traffic to those sites. If GPTBot is blocked, that site's content may not be among the recommended sources, essentially sidelining potential visitors. In essence, just as blocking Google would prevent a website from appearing in one of the world's most popular search engines, blocking GPTBot might mean missing out on a burgeoning channel of web traffic.
Any Questions - Contact Us:
Hopefully, this study was helpful.
If you have any questions please don’t hesitate to contact us.
Founder / CEO of Originality.AI. I have been involved in the SEO and Content Marketing world for over a decade. My career started with a portfolio of content sites. More recently I sold 2 content marketing agencies, and I am the Co-Founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences I understand what web publishers need when it comes to verifying that content is original. I am not For or Against AI content; I think it has a place in everyone's content strategy. However, I believe you as the publisher should be the one making the decision on when to use AI content. Our Originality checking tool has been built with serious web publishers in mind!