Amount of AI Content in Google Search Results - Ongoing Study

In the current media landscape, it’s easy to feel as if AI-generated content is everywhere, and that concern is not without basis. What started as psychedelic dog-like face images and robotic, nonsensical text monologues has morphed into the modern, intelligent, and increasingly difficult-to-detect AI content that we encounter every day. Every article consumed, image viewed, and video watched now triggers a small paranoid question in any responsible editor’s or consumer’s mind: Did a human make this? From helpful editing and rewording to full-on farmed AI-generated content written to game Google’s search ranking algorithm, AI content is present in Google’s search results, and it’s here to stay.

In response, we set out to give a data-backed, statistical count of AI’s presence in Google rankings to answer the question: how much AI content is present in Google? By sampling the top 20 search results for 500 popular keywords from the beginning of 2019 to the present day, we’re tracking the saturation of AI content in search results all the way back to the release of GPT-2.

How much AI Content is Present in Google?

Update: Our Results - May 22, 2024

In our latest analysis, conducted on May 22nd, 2024, we observed a continued increase in the prevalence of AI-generated content within Google’s search results. In our previous findings from April 22nd, 2024, 11.3% of Google’s top-ranked content was suspected to be AI-generated. Our latest data shows a further rise of 0.2 percentage points, with AI content now comprising 11.5% of the total!

Update: Our Results - Apr 22, 2024

How much AI Content is Present in Google?

In our latest analysis, conducted on April 22nd, 2024, we observed a clear uptick in the prevalence of AI-generated content within Google’s search results. In our previous findings from March 23rd, 2024, 10.2% of Google’s top-ranked content was suspected to be AI-generated. Our latest data shows a further increase of 1.1 percentage points, with AI content now comprising 11.3% of the total!

This represents a noticeable increase in the amount of AI content in Google’s search results. In the span of about a month, the share of suspected AI-generated content rose from 10.2% to 11.3%.

To contextualize this growth, consider that before the public release of GPT-2, AI content accounted for just 2.3% of sampled websites. Now, as of April 22nd, 2024, this figure has surged to 11.3%.

Our Results

How much AI Content is Present in Google?

In general, we found a continuously increasing presence of AI content. Before GPT-2 was released to the public, AI content was detected in only 2.3% of our sampled websites. Around five years and three OpenAI GPT models later, that percentage had increased more than fourfold, to 10.2% in March 2024 (10.2 / 2.3 ≈ 4.4).

We also found that machine-content saturation kept increasing even after Google implemented its helpful content policy, first introduced in August 2022. Notably, we expected the introduction of AI to the general public via ChatGPT to produce a spike in AI content in Google, but our data does not show one. This suggests that Google’s helpful content and spam policies have been at least partially successful at keeping AI spam at bay. To learn more about Google’s interactions with AI content, check out our study on the March 2024 Google update and its repercussions.

What are the Consequences of AI Content in Google?

The increasing presence of AI in Google should not be understated. AI tools based on Large Language Models (LLMs) are trained on very large datasets of human content, often sourced from the internet via an automated scraping/crawling process. As the internet becomes more and more saturated with AI-generated content, these training datasets will as well unless they are curated carefully. A study from May 2023 (Shumailov et al.) suggests that through multiple iterations of LLM training on datasets containing machine-generated content, models generate more generic and predictable results and become more likely to misperceive their learning task over time. To offset this, the authors stress the importance of the continuing availability of non-machine-generated training materials for future model learning. With future versions of GPT and other AI models well on their way, and the share of AI content on websites continuing to grow, such datasets will become more and more difficult to source.

Image: An AI model becoming more uniform over time (Ilia Shumailov et al.)

Data Collection and Analysis 

The data collection phase of this project was designed to generate a representative sample of the results one would typically see on Google after searching an informational keyword. We automated each step, allowing us to analyze large quantities of data. The process is novel to us and, we believe, to the general search engine community as well.

Choosing the Keywords

To seed the data points for our study, we chose 500 Google Search keywords so that the resulting set had the following properties:

  1. Keywords were informational; i.e., they are searched when looking for an answer to a question. Informational keywords generate search results with large amounts of article text, making them good targets for AI scanning. Example keywords include: “how to screenshot on mac”, “best albums of all time”, and “what are carbohydrates”.
  2. The set of chosen keywords has a similar search volume (read: popularity) distribution to that of the set of all informational keywords. To illustrate: Imagine we are choosing 10 keywords to represent the top 100 informational keywords. If 10% (i.e., ten) of the top keywords had a search volume of 2,000/month, then following our methodology, 10% (i.e., one) of our chosen 10 keywords would have a search volume of 2,000/month.
  3. Keywords did not have large fluctuations in popularity and/or search volume over time. Keywords such as those involving movies, sports events, video game releases, etc., were not considered.
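
The second property can be sketched in code. The following is a minimal illustration, not the selection procedure used in the study: it assumes the keyword pool is a list of (keyword, monthly volume) pairs and, as a simplification, treats each distinct volume value as its own bucket. The function name `sample_by_volume` is hypothetical.

```python
import random
from collections import defaultdict


def sample_by_volume(keywords, sample_size, seed=0):
    """Pick keywords so each search-volume bucket keeps its share of the pool.

    `keywords` is a list of (keyword, monthly_volume) pairs. Using the exact
    volume as the bucket is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for keyword, volume in keywords:
        buckets[volume].append(keyword)

    chosen = []
    for volume, members in sorted(buckets.items()):
        # The bucket's quota is proportional to its share of the whole pool.
        quota = round(sample_size * len(members) / len(keywords))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen


# The worked example from the list: 100 keywords, 10% of them at 2,000/month.
pool = [(f"kw{i}", 2000) for i in range(10)] + [(f"kw{i}", 500) for i in range(10, 100)]
picked = sample_by_volume(pool, 10)
```

Because each quota is rounded independently, the total can differ slightly from `sample_size` when bucket shares do not divide evenly; a real selection would need a rule for distributing the remainder.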

Finding Keyword Search Results Over Time

For each keyword, we used a search engine optimization (SEO) tool to find its top 20 search results. We repeated this process for every second month from January 2019 to the present day, resulting in a list of 10,000 websites (500 keywords × 20 results) for each two-month period.
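
As a rough sketch of the sampling schedule, the snippet below builds the two-month periods and the per-period result count. The helper name `two_month_periods` and the choice to represent a period as a (start, exclusive end) pair are assumptions for illustration; the SEO tool lookup itself is not shown.

```python
from datetime import date


def two_month_periods(start, end):
    """Return a list of (first_day, next_period_first_day) pairs, two months each."""
    periods = []
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        # Advance two months, rolling over into the next year when needed.
        next_month = month + 2
        next_year = year + (next_month - 1) // 12
        next_month = (next_month - 1) % 12 + 1
        periods.append((date(year, month, 1), date(next_year, next_month, 1)))
        year, month = next_year, next_month
    return periods


KEYWORDS = 500
RESULTS_PER_KEYWORD = 20
results_per_period = KEYWORDS * RESULTS_PER_KEYWORD  # 10,000 per two-month period

periods_2019 = two_month_periods(date(2019, 1, 1), date(2019, 12, 31))
```

Each period's end date is exclusive, so consecutive periods do not overlap.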

Scraping Article Text

For each list of websites and its time period, we checked the Internet Archive to see if a snapshot of each website was available within that period. If a snapshot was available, we used a port of the Arc90 Readability algorithm to extract the main article text from the Archive snapshot.
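
The snapshot check might look something like the sketch below. The text does not say how the Internet Archive was queried; this example assumes the public Wayback Machine availability endpoint, which returns the snapshot closest to a requested timestamp. All function names are hypothetical, and the Readability extraction step is left out.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Wayback Machine availability endpoint is used for the lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, period_start):
    """Build a lookup URL asking for the snapshot closest to the period start."""
    query = urlencode({"url": page_url, "timestamp": period_start.strftime("%Y%m%d")})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def snapshot_in_period(response, period_start, period_end):
    """Return the closest snapshot URL if it falls inside the period, else None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    if period_start <= taken < period_end:
        return closest["url"]
    return None


def find_snapshot(page_url, period_start, period_end):
    """Query the archive over the network and apply the period check."""
    with urlopen(availability_url(page_url, period_start), timeout=30) as reply:
        return snapshot_in_period(json.load(reply), period_start, period_end)
```

Keeping the period check in `snapshot_in_period` separate from the network call means it can be tested against a saved response without contacting the archive.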

AI Scanning the Article Text

All scraped text was run through a data cleaning process that fixed extraneous white space and other punctuation artifacts, removed most non-article text such as citations and footnotes, and ensured that all text was long enough to be run through the Originality.AI detector. Text from websites that were not article-based, like YouTube and Reddit, was removed.
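
A simplified version of such a cleaning pass is sketched below. The specific rules are assumptions for illustration: the excluded domains are only the two examples named above, the citation pattern covers numeric markers like [12] only, and the 100-word minimum is a hypothetical placeholder, not the detector's actual requirement.

```python
import re
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"youtube.com", "reddit.com"}  # the two examples from the text
MIN_WORDS = 100  # hypothetical minimum length, chosen for illustration


def is_article_site(url):
    """Return False for hosts on, or under, an excluded domain."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def clean_text(text):
    """Strip numeric citation markers and normalize spacing."""
    text = re.sub(r"\[\d+\]", "", text)           # citation markers such as [12]
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # stray space before punctuation
    text = re.sub(r"\s+", " ", text)              # collapse runs of white space
    return text.strip()


def prepare(url, text, min_words=MIN_WORDS):
    """Return cleaned text ready for scanning, or None if it should be skipped."""
    if not is_article_site(url):
        return None
    cleaned = clean_text(text)
    return cleaned if len(cleaned.split()) >= min_words else None
```

Returning `None` for skipped pages lets the caller drop them before any text is sent to the detector.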

The text was then run through the Originality.AI detector, and its score was recorded. The Originality score ranges from 0 to 1 and represents the detector’s confidence that a text contains AI-generated content, so we consider a score of 0.5 or above to be a positive AI detection.
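
The thresholding rule and the resulting percentage can be expressed as a short sketch. The scores below are made-up placeholders, not study data, and the function names are hypothetical; the detector call itself is not shown.

```python
AI_THRESHOLD = 0.5  # scores at or above this count as a positive AI detection


def is_ai(score, threshold=AI_THRESHOLD):
    """Apply the 0.5 cut-off described above to a single detector score."""
    return score >= threshold


def ai_share(scores, threshold=AI_THRESHOLD):
    """Return the percentage of scanned texts flagged as AI."""
    if not scores:
        raise ValueError("no scores to summarize")
    flagged = sum(1 for score in scores if is_ai(score, threshold))
    return 100 * flagged / len(scores)


# Placeholder scores for illustration only.
example_scores = [0.1, 0.5, 0.9, 0.2]
share = ai_share(example_scores)
```

Computed over all scanned texts in one two-month period, this percentage is the kind of figure reported in the results.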

Image: Text from an Archive snapshot gets scraped and run through the AI detector

In summary: How can I offset the increasing AI content in Google?

Two things are clear: AI content is becoming more present, and Google is increasing its efforts to keep unhelpful AI content at bay. Using tools like Originality.AI can help ensure that the content on your website is original, helpful, and compliant with Google’s rules for search engine results.

Similar Articles:

AI In Reddit Writing Communities

Does Google Penalize AI Content? 

How Does AI Content Detection Work? 

Sources: 

Ilia Shumailov et al. (2023). The Curse of Recursion: Training on Generated Data Makes Models Forget. Retrieved from arXiv:2305.17493