AI Content Detector Accuracy Review + Open Source Dataset and Research Tool
We believe that it is crucial for AI content detectors' reported accuracy to be open, transparent, and accountable. The reality is, each person seeking AI-detection services deserves to know which detector is the most accurate for their specific use case.
The world needs reliable AI detection tools, but no AI detection tool is ever going to be 100% perfect. Users should understand the individual limitations of these tools so that they can wield them responsibly, which means the developers of AI detectors should be as transparent as possible about the capabilities and limitations of their detectors.
That’s why we here at Originality have made this guide, which aims to answer the question: Which AI content detector is the most accurate? Additionally, we are proposing a standard for testing AI detector effectiveness, along with the release of an open-source tool to help increase transparency and accountability across all AI content detectors.
Open-sourcing a research tool we developed to assist anyone (researcher, journalist, customer, or other AI detector developer) in testing multiple AI detectors on their own (or our) benchmark dataset. An even easier-to-use open-source AI detector efficacy tool is available here.
Providing detailed instructions, and including the calculations in the tool, to help identify the most important AI vs. original human classifier efficacy metrics.
Transparently reporting our own tool's accuracy on multiple publicly available datasets.
If you have been asked to evaluate (or want to evaluate) an AI content detector's potential use case for your organization, this article is for you.
This guide will help you understand AI detectors and their limitations by showing you…
How AI detectors work
How to calculate an AI detector's effectiveness
How to complete your own tests (using one of the open-sourced tools we provide)
What we think should and should not be considered AI content
How accurate our AI content detector is based on the testing we have done
Whether you can trust our AI detector's effectiveness
How all AI detectors stack up in terms of effectiveness by type of content
If you have any questions, suggestions, research questions or potential commercial use cases please contact us.
● Originality.AI Launches Version 3.0, replacing Version 2.0 and resulting in an improvement on the most challenging dataset, created from the newest LLMs.
Accuracy improved from 90.2% to 98.8%
False Positives reduced from 2.9% to 2.8%
● Originality.ai tested against our own publicly available benchmark and 5 additional publicly available datasets, resulting in the following F1 scores. F1 is the best single measure of a tool's performance, as it accounts for both the times it accurately predicted AI content and the times it accurately predicted human content. All data and testing results are available here.
“Originality Adversarial Dataset” F1 = 97.5
“How Close is ChatGPT” F1 = 98.9
“Benchmark… Scientific Papers” F1 = 90.8
“Ghostbuster” F1 = 99.4
“One Class Learning” F1 = 1
“Check Me if You Can”: F1 = 94.6
● Across all tests Originality.ai has increased its accuracy further establishing Originality.ai as the most accurate AI checker.
The oversimplified views that “AI detectors are perfect” and “AI detectors don't work” are equally wrong.
We still have a standing offer to OpenAI (or anyone willing to take us up on it) to back up their claim that AI detectors don't work.
Why did we create this guide and tools? We believe…
In the transparent and accountable development and use of AI.
That AI detectors have a role to play in mitigating the potential negative societal impacts of generative AI.
AI detection tools' “accuracy” should be communicated with the same transparency and accountability that we want to see in AI’s development and use. Our hope is this study will move us all closer to that ideal.
At Originality.AI we love AI-generated content… but we believe in transparency and accountability in its development, use, and detection. Personally, I don’t want a writer or agency I have hired to create content for my audience to generate it with AI without my knowledge.
Originality.AI helps ensure there is trust in the originality of the content being produced by writers, students, job applicants or journalists.
Why are transparency and accountability important...
FTC Warns Against Unsupported AI Content Detection Accuracy Claims
Claimed accuracy rates with no supporting studies are clearly a problem.
We hope the days of AI detection tools claiming 99%+ accuracy with no data to support it are over. A single number is not good enough in the face of the societal problems AI content can produce and the important role AI content detectors have to play.
The FTC has warned on multiple occasions against tools making unsubstantiated claims about AI detection accuracy or AI efficacy.
“If you’re selling a tool that purports to detect generative AI content, make sure that your claims accurately reflect the tool’s abilities and limitations.” source
“you can’t assume perfection from automated detection tools. Please keep that principle in mind when making or seeing claims that a tool can reliably detect if content is AI-generated.” source
“Marketers should know that — for FTC enforcement purposes — false or unsubstantiated claims about a product’s efficacy are our bread and butter” source
We fully agree with the FTC on this and have provided the tool needed for others to replicate this study for themselves.
Societal Impacts of Undetectable AI-Generated Content are Real
AI content detectors need to be part of the solution. The current unsupported AI detection accuracy claims, and the research papers that have tackled this problem, are simply not good enough in the face of the societal risks that LLM-generated content poses, including…
Academic Dishonesty / AI Plagiarism
Fake Product Reviews
Fake Job Applications
Fake University Application Essays
Fake Scholarship Applications
Originality.AI Version History:
Along with this study we are releasing the latest version of our AI content detector. Below is our release history.
● 1.1 – Nov 2022 BETA (released before ChatGPT)
Accurate detection of GPT-2, GPT-Neo, GPT-J, and GPT-3, but could be “tricked” with paraphrasing
First GPT-3 trained detector
● 1.4 – Apr 2023
Improved ChatGPT detection
Accurate on GPT4 Generated Content
Only tool capable of accurately detecting Paraphrased content.
Reduced the number of false positives with increased training on human-generated content
● 2.0 - Aug 2023
Reduced False Positives
Improved Accuracy on the Hardest to Detect AI Content (GPT4, ChatGPT & Paraphrased)
Release of Open Source Benchmark Dataset.
Release of Open Source AI Detection Efficacy Testing Tool(s).
Between 1.4 and 2.0 our team built many models that slightly increased AI detection capabilities, but we were not going to release a model until it materially reduced false positives.
● 3.0 - Feb 2024
Trained on the newest LLMs (Grok, Mixtral, GPT-4 Turbo, Gemini, Claude 2)
Accuracy increased on our toughest testing dataset from 90.2% to 98.8%
False Positives slightly reduced from 2.9% to 2.8%
Below is a brief summary of the 3 general approaches that an AI detector (or, in machine learning terms, a “classifier”) can use to distinguish between AI-generated and human-generated text.
1. Feature-Based Approach:
The feature-based approach relies on the fact that there can be consistently identifiable, known differences between all text generated by an LLM like ChatGPT and human-written text. Some of the features that tools look to use are explained below, followed by a short illustrative sketch.
Burstiness: Burstiness in text refers to the tendency of certain words to appear in clusters or "bursts" rather than being evenly distributed throughout a document. AI-generated text can potentially have more predictability (less burstiness) since AI models tend to re-use certain words or phrases more often than a human writer would. Some tools attempt to identify AI text using burstiness (more burstiness = human, less burstiness = AI).
Perplexity: Perplexity is a measure of how well a probability model predicts the next word. In the context of text analysis, it quantifies the uncertainty of a language model by calculating the likelihood of the model producing a given text. Lower perplexity means that the model is less surprised by the text, indicating the text was more likely AI-generated. High perplexity scores can indicate human-generated text.
Frequency Features: Frequency features refer to counts of how often certain words, phrases, or types of words (like nouns, verbs, etc.) appear in a text. For example, AI-generated text might overuse certain words, underuse others, or use certain types of words at rates inconsistent with human writing. These features might be able to help detect AI-generated text.
Readability or Fluency Features: Studies have shown that earlier (i.e., 2019) LLMs would generate text with consistently similar readability scores.
Punctuation: This pertains to the use and distribution of various punctuation marks in a text. AI-generated text often exhibits correct and potentially predictable use of punctuation. For instance, it might use certain types of punctuation more often than a human writer would, or it might use punctuation in ways that are grammatically correct but stylistically unusual. By analyzing punctuation patterns, someone might attempt to create a detector that can predict AI-generated content.
Advantages - Once patterns are identified, they can be detected repeatedly in a fast, cost-effective manner.
Disadvantages - Modern LLMs such as GPT-4 and Bard can produce content varied enough that these detectors can be bypassed with clever ChatGPT prompts.
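To make the feature-based approach concrete, here is a minimal sketch of how simple stylometric features could be extracted. The specific features, names, and the burstiness proxy below are illustrative assumptions, not any particular detector's actual implementation.

```python
# Minimal sketch of feature-based detection signals (illustrative only).
import re
from collections import Counter

def stylometric_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    total = len(words)

    # Burstiness proxy: variance of per-word counts. Human writing tends to
    # reuse a handful of words in "bursts", giving a heavier-tailed distribution.
    mean = total / len(counts)
    burstiness = sum((c - mean) ** 2 for c in counts.values()) / len(counts)

    # Vocabulary diversity (type-token ratio).
    ttr = len(counts) / total

    # Punctuation marks per word.
    punct_rate = len(re.findall(r"[.,;:!?]", text)) / total

    return {
        "burstiness_proxy": burstiness,
        "type_token_ratio": ttr,
        "punctuation_per_word": punct_rate,
    }

# These features would then be fed to a simple classifier (e.g. logistic
# regression) trained on labeled human and AI samples.
print(stylometric_features("The quick brown fox jumps over the lazy dog. The dog slept."))
```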
2. Zero-Shot Approach:
Uses a pre-trained language model to identify text generated by a model similar to itself, essentially asking how likely it is that the content it is seeing was generated by a similar version of itself (note: don’t try asking ChatGPT… it doesn’t work like that). A minimal perplexity-based sketch follows below.
Advantages - Easier to build and does not require supervised training
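As a rough illustration of the zero-shot idea, here is a minimal sketch that scores text by its perplexity under GPT-2, assuming the Hugging Face transformers and PyTorch libraries. The threshold of 20 is an arbitrary placeholder, not a calibrated cutoff.

```python
# Minimal zero-shot sketch: score text by perplexity under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids makes the model return the mean cross-entropy of
        # predicting each next token; perplexity is exp of that loss.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

ppl = perplexity("Sample passage to score for AI-likeness.")
# Lower perplexity = more predictable to the model = more likely AI-generated.
print(f"perplexity={ppl:.1f}:", "likely AI" if ppl < 20 else "likely human")
```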
3. Fine-Tuned Classifier Approach:
Uses a large language model such as BERT or RoBERTa, trained on a set of human and AI-generated text. It learns to identify the differences between the two in order to predict whether content is AI or original. A brief training sketch follows below.
Advantages - Can produce the most effective detection
Disadvantages - These can be more expensive to train and operate. They can also lag behind in detection capabilities for the newest AI tools until their training is updated.
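For context, here is a heavily simplified sketch of the fine-tuned classifier approach, assuming the Hugging Face transformers library and roberta-base. The two passages and labels are made-up placeholders; a real detector trains on a large labeled corpus over many steps.

```python
# Heavily simplified sketch of a fine-tuned AI-text classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical labeled data: 1 = AI-generated, 0 = human-written.
texts = ["An example AI-written passage...", "An example human-written passage..."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One supervised training step: the model learns to separate the two classes.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: softmax over the two logits gives P(human), P(AI) per text.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs)
```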
The first test involved hundreds of thousands of samples and compared Originality V2.0 vs V3.0. The second proposes a smaller challenging benchmark dataset, against which we compared multiple AI content detectors' performance. The third is a series of tests run on other publicly available datasets to test Originality.AI's effectiveness.
Tests on the second and third datasets were run the week of July 24, 2023 (our latest model was re-run on Feb 13th, 2024; see the limitations of Test #2), all using our open-sourced AI content detection accuracy tool or, if an API was not available, via humans entering the text and recording the results.
The second test can be replicated using the benchmark dataset and our open-sourced tool.
Introducing Our Benchmark Adversarial AI Detection Dataset:
In the spirit of openness and contributing to the understanding of the effectiveness and limitations of AI detectors, we are open-sourcing this “challenging” benchmark dataset to help with the evaluation of different AI detection methods. If someone were working to make AI writing undetectable, this is the type of content they would produce.
This benchmark dataset includes samples from some of the most challenging prompts and settings for LLMs, including ChatGPT-4, GPT-4, paraphrasers, etc. Additionally, it includes known human content.
Disclaimer from Our AI Research Team: This is a very small dataset, randomly sampled from the test dataset we built for different experiments. It is completely unrelated to, and not part of, our training/validation/test sets, and it was sampled completely at random and without bias to ensure fairness and no "cherry-picking."
The table below shows the datasets and a brief explanation of each.
The dataset(s) provided may be applicable to your use case; if you are evaluating AI detection tools' effectiveness for another type of content, you will need to produce your own dataset. For example, I would not rely solely on these results if you are looking for an AI detector to identify fake social media messages or online reviews. Use our open-source tool to make running your data and evaluating detectors' performance much easier.
Testing Method & New Open-Source Testing Tools:
To make running tests easy, repeatable, and accurate, we created our tools and decided to open-source them to help others do the same. The main tool allows you to enter the API keys for multiple AI content detectors and plug in your own data, receiving not just the results from each tool but also a complete statistical analysis of the detection effectiveness calculations.
This tool makes it incredibly easy for you to run your test content against all AI content detectors that have an available API.
We built and open-sourced this tool to run tests so that we can increase transparency into testing (a brief sketch of the workflow follows this list) by…
Running all tests at basically the same time on the same day
Ensuring the exact same text with no difference in formatting is sent to each tool
Quickly testing datasets as they become available
Providing an opportunity for potential customers or researchers to test their own data and make an informed decision about which AI detector is ideal for their use case.
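As a rough picture of what such a harness does, here is a minimal sketch that sends identical text samples to several detector APIs and records the scores. The endpoint URLs, the response field, and the CSV layout are hypothetical placeholders; the real open-source tool's interface may differ.

```python
# Hedged sketch of a multi-detector test harness (placeholders throughout).
import csv
import requests

DETECTORS = {
    # name -> (hypothetical endpoint, API key)
    "detector_a": ("https://api.example-detector-a.com/v1/detect", "KEY_A"),
    "detector_b": ("https://api.example-detector-b.com/v1/detect", "KEY_B"),
}

def score_sample(text: str) -> dict:
    """Send the exact same text to every detector and collect its AI score."""
    results = {}
    for name, (url, key) in DETECTORS.items():
        resp = requests.post(
            url,
            json={"text": text},
            headers={"Authorization": f"Bearer {key}"},
            timeout=30,
        )
        results[name] = resp.json().get("ai_score")  # hypothetical field name
    return results

# Run every sample in a labeled dataset through every detector at once,
# so each tool sees identical text with no formatting differences.
with open("benchmark.csv", newline="") as f:
    for row in csv.DictReader(f):  # expects columns: text, label
        print(row["label"], score_sample(row["text"]))
```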
The speed at which new LLMs are launching, and the speed at which AI detection is evolving, means that accuracy studies which take 4 months from test to publication are hopelessly outdated.
Features of This Tool:
Free & Open Sourced
Able to Scan A Text Dataset With Multiple AI Detectors
Our View on the Use of AI Detectors Within Academia & False Positives in General.
We do not believe that AI detection scores alone should be used for academic honesty purposes and disciplinary action.
The rate of false positives (even if low) is still too high to be relied upon for disciplinary action.
Here is a guide we created to help writers and students reduce false positives when using AI content detectors. Plus, we created a free AI detector Chrome extension to help writers/editors/students/teachers visualize the creation process and prove originality.
How To Evaluate AI Detectors “Accuracy”:
Below are the best practices and methods used to evaluate the effectiveness of AI classifiers (i.e., AI content detectors). There is some nerdy data below, but if you are looking for even more info, here is a good primer on evaluating the performance of a classifier.
One single number related to a detector's effectiveness without additional context is useless!
Don’t trust a SINGLE “accuracy” number without additional context.
Here are the metrics we look at to evaluate a detector's efficacy…
The confusion matrix and the F1 score (more on it later) together are the most important measures we look at. In one image you can quickly see the ability of an AI model to correctly identify both original and AI-generated content.
True Positive (TP) – AI detector correctly identified content as AI.
False Negative (FN) – AI detector incorrectly identified AI content as Human.
False Positive (FP) – AI detector incorrectly identified human content as AI.
True Negative (TN) – AI detector correctly identified human content as Human.
True Positive Rate - AI Text Detection Capabilities
Identifies AI content correctly x% of the time. True Positive Rate TPR (also known as sensitivity, hit rate or recall) = TP / (TP + FN).
True Negative Rate - Human Text Detection Capabilities
Identifies human content correctly x% of the time. True Negative Rate TNR (also known as specificity or selectivity).
True Negative Rate TNR = TN / (TN + FP) = 1 - FPR
Accuracy - What % of All Predictions Were Correct
Accuracy = (TP + TN) / (TP + TN + FP + FN). Accuracy alone can be a misleading number, which is partly why you should be skeptical of AI detectors' claimed “accuracy” figures if no additional detail is provided. The following metric is what we use, along with our open-source tool, to measure accuracy.
F1 Score - Combined Measure of Detection Ability
Combines recall and precision into one measure to rank all detectors, and is often used when ranking multiple models. It is the harmonic mean of precision and sensitivity.
F1 = 2 x (PPV x TPR) / (PPV + TPR) where Precision (PPV) = TP / (TP + FP)
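To make these formulas concrete, here is a short worked example that computes every metric above from raw confusion-matrix counts. The counts themselves are made-up illustration values, not results from any tested detector.

```python
# Worked example of the classifier efficacy metrics defined above.
def efficacy_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    tpr = tp / (tp + fn)                # sensitivity / recall: AI caught as AI
    tnr = tn / (tn + fp)                # specificity: human kept as human
    ppv = tp / (tp + fp)                # precision
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * (ppv * tpr) / (ppv + tpr)  # harmonic mean of precision and recall
    return {"TPR": tpr, "TNR": tnr, "Precision": ppv,
            "Accuracy": accuracy, "F1": f1}

# Hypothetical detector: catches 950 of 1,000 AI samples and wrongly flags
# 28 of 1,000 human samples (a 2.8% false positive rate).
print(efficacy_metrics(tp=950, fn=50, fp=28, tn=972))
```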
Metrics Considered but Not Used:
ROC & AUROC: Not used since we can't adjust the sensitivity of other tools and some tools do not provide a percentage.
Precision: PPV = TP / (TP + FP) - Not used on its own, since it is already incorporated into the F1 score.
But… What Should be Considered AI Content?
So what should and should not be considered AI content? As “cyborg” writing that combines humans and AI assistants rises, deciding what should and shouldn’t be considered AI content is tricky!
Some studies have made some really strange decisions on what to claim as “ground truth” human or AI-generated content.
In fact, one study used human-written text in multiple languages that was then translated (using AI tools) to English and called the result “ground truth” human content.
Description of Dataset:
Classifying the AI Translated Dataset (02-MT) as Human-written???
We think this approach is crazy!
Our position is that if the effect of putting content into a machine is that the output is unrecognizable when the two documents are compared, then an AI detector should aim to identify the output text as AI-generated.
The alternative is that any content could be translated and presented as Original work since it would pass both AI and Plagiarism detection.
Here is what we think should and should not be classified as AI-generated content:
AI-Generated and Not Edited = AI-Generated Text
AI-Generated and Human Edited = AI-Generated Text
AI Outline, Human Written and heavily AI Edited = AI-Generated Text
AI Research and Human Written = Original Human-Generated
Human Written and Edited with Grammarly = Original Human-Generated
Human Written and Edited = Original Human-Generated
Test #1 - Benchmark Testing Originality.AI V2.0 vs New V3.0:
Finally! Let's get to the tests.
As new and improved LLMs are released, we need to update both our model and our benchmark testing.
With our most recent and most challenging benchmark to date, here is the performance improvement of Originality.ai moving from Version 2.0 to Version 3.0. This test includes hundreds of thousands of AI- and human-generated samples that our detector has never trained on.
1. General Performance: Version 3.0 demonstrated a marked improvement in overall accuracy and F1 scores, suggesting a more accurate and balanced detection capability in distinguishing between human and AI-generated texts.
2. Dataset-Specific Performance:
Total Dataset: Version 3.0 exhibited a notable increase in the true positive rate and a decrease in the false negative rate, indicating enhanced detection of AI-generated content.
AI-Generated Text: Significant advancements were observed in the detection of AI-generated content, with a substantial increase in the true positive rate and a reduction in the false negative rate.
Model-Specific Analysis: The analysis of texts generated by specific models such as GPT-4 Turbo and ChatGPT showed almost perfect performance metrics in version 3.0, highlighting its improved capability to identify content generated by the latest AI technologies.
3. Improvement in Detection of Human-Written Texts: Although both versions maintained a high level of accuracy in identifying human-written texts, version 3.0 showed slight improvements in this area, further refining its ability to differentiate between human and AI-generated content effectively.
In conclusion, the AI Detector model version 3.0 significantly outperforms its predecessor, particularly in its ability to identify AI-generated content, including those generated by state-of-the-art models like GPT-4 Turbo. These findings underscore the importance of continuous model updates and benchmarking against newly developed AI technologies to enhance the reliability and accuracy of AI detection tools.
Test #2 - Adversarial AI Detection Dataset - 6 Tools Compared
In the next 2 tests we will look at the performance of many AI content detectors to evaluate their relative effectiveness.
To complete the tests and make them repeatable for others to execute we used…
For Test #2 it is important to remember this is a “challenging” dataset with adversarial settings on the GPT-3, GPT-4, ChatGPT, and Paraphraser data. It is not an accurate reflection of AI detection tools' performance on most “generic” AI-generated content.
Test #2 - Results & Raw Data Shared
Results Including Data and Scores Can be downloaded and viewed here:
The table is sorted by F1 (a number that balances a detector's ability to correctly identify AI content with its ability to correctly identify human content).
All tools performed reasonably well on False Positives ranging from a low of 0.8% to a high of 7.6% False Positive Rate.
The ability to identify AI content (True Positive Rate) varied wildly from 19.8% to 98.4%.
Test #2 - Confusion Matrix for Each AI Detector:
Originality.AI - Confusion Matrix - Test #2 Adversarial Dataset Testing
Winston.AI - Confusion Matrix - Test #2 Adversarial Dataset Testing
Sapling.AI - Confusion Matrix - Test #2 Adversarial Dataset Testing
GPTZero - Confusion Matrix - Test #2 Adversarial Dataset Testing
Content at Scale - Confusion Matrix - Test #2 Adversarial Dataset Testing
CopyLeaks - Confusion Matrix - Test #2 Adversarial Dataset Testing
Limitations of Test #2:
Human Data Entry vs API: We did not have API access to several tools, so a team manually checked the results, which could introduce error. For ContentAtScale, TurnItIn, and WinstonAI, the results could include some human error. False positives were double-checked.
Dataset Quality: This benchmark dataset came from a MUCH larger dataset and did not get a human review to clean it. The result is that there are some entries that are not great samples.
New Updates to Detectors: Our model was run on Feb 13th, 2024, and all other tests were run within a 1-week window between July 24th and July 28th, 2023. These results are a snapshot of performance at a moment in time and are not reflective of future performance.
Limited Dataset Size: As our AI research team wrote, 2000 samples should not be considered a conclusive efficacy test.
If you would like to run your own or other datasets to test the accuracy of AI detectors easily you can use our Open Source tool and pick any of the datasets below…
Open-Source AI Detector Comparison Tool & Dataset: Here
Test #3 - List of other AI Detection Datasets & 5 More Tests:
Here are some additional datasets that you can use in your own testing.
We did not run ALL the tools through these datasets but did run Originality.AI through each of them and have shared the results for how Originality performed below.
Each of these datasets comes from a publicly available research paper.
Test 3-A - How Close is ChatGPT to Human Experts?
Our results clearly show that our model beats all human evaluators; it beats the top category, “pair-experts,” by 8.8%. Our model also beats both models presented in this paper in average F1 score across all possible settings, where all combinations of input types are considered (averaged).
Even though our model was never trained on these research articles, it still outperforms 4 of the models when accuracy is averaged in proportion to the size of the test dataset. Three of the models beat our model by a small margin, but they were trained on datasets that have almost all categories of the test set in the training set.
Test 3-E - Check Me If You Can: Detecting ChatGPT-Generated Academic Writing using CheckGPT
This is a very challenging dataset, as it is focused solely on academic writing, which we are not built for, especially in the challenging fields of Physics, Computer Science, and the Humanities and Social Sciences.
We randomly sampled 9k samples from the dataset provided for testing.
Our model performed well, with an accuracy of 94.5%, which is very near the result of the best model introduced in this paper, considering that they trained their model on the same dataset (with mutually exclusive train and test samples).
In the paper they also tested the efficacy of GPTZero, which, like our model, was not trained on this dataset. We achieved 94.5% accuracy compared to GPTZero's 61.2%, and on detecting AI text specifically, 96.7% compared to GPTZero's 24.2%.
This study (The effectiveness of software designed to detect AI-generated writing: A comparison of 16 AI text detectors, William Walters) identified Originality.ai as the most popular AI detector (included in the most "best AI detector" articles) and, in testing, found it to have "perfect or near-perfect accuracy with all three sets of documents: GPT-3.5 papers, GPT-4 papers, and human-generated papers."
The end result: we have run tests across our own dataset and all publicly available datasets, which continue to demonstrate the efficacy of Originality.AI's AI detection.
As these tests have shown, not all tools are created equal! There have been many quickly created tools that simply use a popular open-source GPT-2 detector (195k downloads last month).
Why is our model more accurate?
Below are a few of the main reasons we suspect Originality.AI’s AI detection performance is significantly better than alternatives…
Larger Model - We suspect (we can't confirm) that we use a much larger model… there is no way we could offer a free or ad-supported option given our model's compute cost per scan.
Focus on Content Writers - The datasets we have constructed focus on one main use case (content that is published online); we are not a generalist AI detector. Our detector is trained exclusively on online publications like blog posts, articles, and website copy, so it can more accurately discern differences between human and AI-generated content in these types of writing. Our model is not trained on classic literature, which is not reflective of modern writing.
Train on Harder Datasets - The datasets we continue to create and train our AI on focus on increasingly adversarial detection-bypassing methods. The better our AI gets, the more clever the prompt engineering or playground settings need to be to bypass it, and then we train on that new, more challenging dataset.
The AI/ML team and core product team at Originality.AI have worked relentlessly to build and improve on the most effective AI content detector!
Originality.AI Launches Version 3.0, improving on the previous Version 2.0
Across 6 Datasets Originality’s Latest Version Was the Most Accurate & Effective Detector in Each Test
6 AI Content Detectors Were Tested on a new Challenging Benchmark Dataset with Originality.ai being the most accurate
Open Source Tool and Benchmark Dataset for Efficient Detector Testing Developed and Released
We hope this post will help you understand more about AI detectors and give you the tools to complete your own analysis if you want to.
In transparent and accountable development and use of AI.
AI detectors have a role to play in mitigating some of the potential negative societal impacts of generative AI.
AI detection tools' “accuracy” should be communicated with the same transparency and accountability that we want to see in AI’s development and use.
Our hope is this study has moved us closer to achieving this and that our open-source initiatives will help others to be able to do the same.
If you have any questions about whether Originality.AI would be the right solution for your organization, please contact us.
If you are looking to run your own tests please contact us. We are always happy to support any study (academic or journalist or curious mind).
Founder / CEO of Originality.AI. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; recently I sold 2 content marketing agencies, and I am the Co-Founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone's content strategy. However, I believe you, as the publisher, should be the one making the decision on when to use AI content. Our originality checking tool has been built with serious web publishers in mind!