In the current media landscape, it’s easy to feel as if AI-generated content is everywhere, and that concern is not without precedent. What started as psychedelic dog-like face images and robotic, nonsensical text monologues has morphed into the modern, intelligent, and increasingly difficult-to-detect AI content that we encounter every day. Every article consumed, image viewed, and video watched now triggers a small paranoid question in the mind of any responsible editor or consumer: Did a human make this? From helpful editing and rewording to full-on farmed AI-generated content designed to game Google’s search rankings, AI content is present in Google’s search results, and it’s here to stay.
In response, we set out to give a data-backed, statistical count of AI’s presence in Google’s search results to answer the question: how much AI content is present in Google? By sampling the top 20 search results for 500 popular keywords from the beginning of 2019 to the present day, we’re tracking the saturation of AI in search results, all the way back to the release of GPT-2.
Update: Our Results - September 24, 2024
In our latest findings, reviewed on September 24th, 2024, the amount of AI content appearing in Google Search Results is steadily increasing.
In July 2024, AI content briefly dropped to 12.92%, but the decrease did not last. As of September 2024, it stands at 13.97%, surpassing the previous peak of 13.95% in June 2024.
We'll continue monitoring the presence of AI content in Google, so check back on this live dashboard to stay up to date!
Update: Our Results - August 26, 2024
In our latest findings, reviewed on August 26th, 2024, the amount of AI-generated content in Google Search Results has moderately increased to 13.08% in August 2024, after a slight dip to 12.92% in July 2024.
Update: Our Results - July 24, 2024
Our latest review of this live study, on July 24th, 2024, found that for the first time since March 2024, the quantity of AI content appearing in Google's results dropped!
As of our July 2024 analysis, AI-generated content in Google Search Results dipped from its all-time high of 13.95% in June 2024 to 12.92% in July 2024.
Update: Our Results - June 26, 2024
As we continue to monitor the quantity of AI-generated content that appears in Google search results, our most recent analysis, as of June 24th, 2024, notes a substantial increase. Our findings reveal that AI-generated content appearing in Google’s top-rated results rose from 11.5% on May 22nd, 2024, to 13.95% as of June 24th, 2024!
For further context, when we reviewed this data on June 15th, 2023 — approximately one year ago — only 7.12% of content appearing in Google results was AI-generated. That means AI-generated content in Google has almost doubled over the last 12 months.
Update: Our Results - May 22, 2024
In our latest analysis conducted as of May 22nd, 2024, we've observed a continued increase in the prevalence of AI-generated content within Google's search results. Building upon our previous findings from April 22nd, 2024, where we noted that 11.3% of Google’s top-rated content was suspected to be AI-generated, our latest data reveals a further rise, with AI content now comprising 11.5% of the total!
Update: Our Results - Apr 22, 2024
In our latest analysis conducted as of April 22nd, 2024, we've observed a significant uptick in the prevalence of AI-generated content within Google's search results. Building upon our previous findings from March 23rd, 2024, where we noted that 10.2% of Google’s top-rated content was suspected to be AI-generated, our latest data reveals a further increase, with AI content now comprising 11.3% of the total!
This represents a noticeable escalation in AI content integration into Google's search results. In the span of less than a month, we've witnessed a tangible rise in the presence of AI-generated content.
To contextualize this growth, consider that before the public release of GPT-2, AI content accounted for just 2.3% of sampled websites. Now, as of April 22nd, 2024, this figure has surged to 11.3%.
In general, we found a continuously increasing presence of AI content. Before GPT-2 was released to the public, AI content was detected in only 2.3% of our sampled websites. Around five years and three OpenAI GPT models later, that percentage had increased more than fourfold, to 10.2% in March 2024.
We also found that machine-content saturation kept increasing even after Google implemented its helpful content policy, first introduced in August 2022. Notably, we expected to see a spike in AI content in Google once ChatGPT introduced AI to the general public, but our data does not show one. This suggests that Google’s helpful content and spam policies have been at least partially successful at keeping AI spam at bay. To learn more about Google’s interactions with AI content, check out our study on the March 2024 Google update and its repercussions.
The increasing presence of AI in Google is not to be understated. AI tools based on Large Language Models (LLMs) are trained on very large datasets of human content, often sourced from the internet via an automated scraping/crawling process. If these training datasets are not curated carefully, they will become more and more saturated with AI-generated content as the internet does. A study from May 2023 suggests that through multiple iterations of LLM training on datasets containing machine-generated content, models generate more generic and predictable results, and become more likely to misperceive their learning task over time. To offset this, the authors stress the importance of the continuing availability of non-machine-generated training materials for future model learning. With future versions of GPT and other AI models well on their way, and the continuously increasing rate of AI content in websites, such datasets will become more and more difficult to source.
Image: An AI model becoming more uniform over time (Ilia Shumailov et al.)
The data collection phase of this project was designed with the goal of generating a representative sample of the average results one would see on Google after searching an informational keyword. We automated each step, allowing us to analyze large quantities of data. The process is novel to us and, we believe, to the general search engine community as well.
To seed the data points for our study, 500 Google Search keywords were chosen, with the resulting set having the following properties:
For each keyword, we used a search engine optimization (SEO) tool to find its top 20 search results. We repeated this process every second month, from January 2019 to the present day, resulting in a list of 10,000 websites for each two-month period.
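The sampling schedule described above can be sketched in a few lines of Python. This is only an illustration: the alignment of the two-month periods to calendar months (January, March, May, and so on) and the end date used below are assumptions, and `sampling_periods` is a hypothetical helper, not part of the study's tooling.

```python
from datetime import date

KEYWORDS = 500             # from the study description
RESULTS_PER_KEYWORD = 20   # top 20 results per keyword


def sampling_periods(start: date, end: date, step_months: int = 2) -> list[tuple[date, date]]:
    """Return (period_start, period_end) pairs; period_end is exclusive."""
    periods = []
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        period_start = date(year, month, 1)
        # Advance by step_months, rolling over into the next year when needed.
        month_index = year * 12 + (month - 1) + step_months
        year, month = divmod(month_index, 12)
        month += 1
        periods.append((period_start, date(year, month, 1)))
    return periods


# Assumed end date for illustration: the most recent update on this page.
periods = sampling_periods(date(2019, 1, 1), date(2024, 9, 24))
urls_per_period = KEYWORDS * RESULTS_PER_KEYWORD
print(len(periods), "periods,", urls_per_period, "result URLs per period")
```

With 500 keywords and 20 results each, every period yields the 10,000 result URLs mentioned above.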
For each list of websites and its time period, we checked the Internet Archive to see if a website snapshot was available within that period. If a snapshot was available, we used a port of the Arc90 Readability Algorithm to extract the main article text from the Archive snapshot.
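As a rough sketch of the snapshot check, the Python below builds a query for the Internet Archive's public Wayback availability endpoint and tests whether the closest capture falls inside a sampling period. Whether the study used this endpoint is an assumption, and `availability_query` and `snapshot_in_period` are hypothetical helper names. The network request and the Readability extraction step are left out so the example stays self-contained.

```python
from datetime import date, datetime
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(site_url: str, target: date) -> str:
    """Build a query asking for the snapshot closest to the target date."""
    params = {"url": site_url, "timestamp": target.strftime("%Y%m%d")}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def snapshot_in_period(response: dict, start: date, end: date) -> Optional[str]:
    """Return the snapshot URL if the closest capture falls in [start, end)."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    captured = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S").date()
    return closest["url"] if start <= captured < end else None


# Hand-written response in the shape the availability endpoint returns.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20190215120000/http://example.com/",
            "timestamp": "20190215120000",
            "status": "200",
        }
    }
}
print(availability_query("example.com", date(2019, 2, 1)))
print(snapshot_in_period(sample, date(2019, 1, 1), date(2019, 3, 1)))
```

Because the endpoint returns the capture closest to the requested date, the explicit range check matters: the closest capture may lie outside the period, in which case the site is treated as having no snapshot for it.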
All scraped text was run through a data cleaning process that fixed extraneous white space and other punctuation artifacts, removed most non-article text such as citations and footnotes, and ensured that all text was long enough to be run through the Originality.AI detector. Text from websites that were not article based, like YouTube and Reddit, was removed.
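A minimal sketch of a cleaning pass like the one described above, in Python. The study does not publish its exact rules, so the minimum word count, the excluded-domain list, and the regular expressions here are all assumptions made for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical values: the study does not state its minimum length or exclusion list.
MIN_WORDS = 50
EXCLUDED_DOMAINS = {"youtube.com", "reddit.com"}

CITATION_MARKER = re.compile(r"\[\d+\]")           # inline markers such as [12]
WHITESPACE_RUN = re.compile(r"\s+")                # any run of spaces or newlines
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")  # stray space before punctuation


def is_excluded(url: str) -> bool:
    """True when the URL's host is an excluded domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def clean_text(raw: str) -> str:
    """Drop citation markers, collapse whitespace, and fix spaces before punctuation."""
    text = CITATION_MARKER.sub("", raw)
    text = WHITESPACE_RUN.sub(" ", text)
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)
    return text.strip()


def prepare(url: str, raw: str) -> Optional[str]:
    """Return cleaned text, or None if the page is excluded or the text is too short."""
    if is_excluded(url):
        return None
    text = clean_text(raw)
    return text if len(text.split()) >= MIN_WORDS else None


print(clean_text("AI  content is rising [3] .\n\n It is   everywhere ."))
# prints: AI content is rising. It is everywhere.
```

Note that `is_excluded` expects URLs with a scheme (for example `https://`); without one, `urlparse` reports no hostname and the page is not excluded.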
The text was then run through the Originality.AI detector, and its score was recorded. As the Originality.AI score represents the detector’s confidence, on a scale from 0 to 1, that a text contains AI-generated content, we consider a score of 0.5 or above to be a positive AI detection.
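The thresholding rule is simple enough to show directly. The Python sketch below applies the 0.5 cutoff stated above and reports the share of flagged pages. Aggregating as a plain percentage of scored pages is an assumption about how the headline figures are computed, and the scores are made up for illustration.

```python
AI_THRESHOLD = 0.5  # from the study: scores of 0.5 or above count as AI


def is_ai(score: float, threshold: float = AI_THRESHOLD) -> bool:
    """A score at or above the threshold is a positive AI detection."""
    return score >= threshold


def ai_share(scores: list[float], threshold: float = AI_THRESHOLD) -> float:
    """Percentage of scored pages at or above the threshold."""
    if not scores:
        raise ValueError("no scores to aggregate")
    flagged = sum(is_ai(s, threshold) for s in scores)
    return 100.0 * flagged / len(scores)


# Made-up scores for illustration only, not data from the study.
example_scores = [0.02, 0.10, 0.49, 0.50, 0.97, 0.03, 0.08, 0.31]
print(f"{ai_share(example_scores):.2f}% flagged as AI")
# prints: 25.00% flagged as AI
```

The comparison is inclusive, so a score of exactly 0.5 counts as a detection, matching the "0.5 or above" wording.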
Image: Text from an Archive snapshot gets scraped and run through the AI detector
Two things are clear: AI content is becoming more present and Google is increasing its efforts to keep unhelpful AI content at bay. Using tools like Originality.ai can help ensure that the content on your website is original, helpful, and plays well within Google’s rules for search engine results.
Ilia Shumailov et al. (2023). The Curse of Recursion: Training on Generated Data Makes Models Forget. Retrieved from arXiv:2305.17493