We believe it is crucial for AI content detectors’ reported accuracy to be open, transparent, and accountable. The reality is that each person seeking AI-detection services deserves to know which detector is the most accurate for their specific use case.
The world needs reliable AI detection tools, but no AI detection tool is ever going to be 100% perfect. Users should understand the individual limitations of these tools so that they can wield them responsibly, which means the developers of AI detectors should be as transparent as possible about the capabilities and limitations of their detectors.
That’s why we at Originality have created this guide, which aims to answer the question: which AI content detector is the most accurate? We are also proposing a standard for testing AI detector effectiveness and releasing an open-source tool to help increase the transparency and accountability of all AI content detectors.
We hope to achieve this idealistic goal by…
If you have been asked to evaluate, or want to evaluate, an AI content detector's potential use case for your organization, this article is for you.
This guide will help you understand AI detectors and their limitations by showing you…
If you have any questions, suggestions, research questions, or potential commercial use cases, please contact us.
● Originality.AI Launches Version 2.0, Improving on the Previous Version 1.4
● Across 4 Datasets Originality’s Latest Version Was the Most Accurate & Effective Detector in Each Test
● Open Source Tool and Benchmark Dataset for Efficient Detector Testing Developed and Released
AI detection tools' “accuracy” should be communicated with the same transparency and accountability that we want to see in AI’s development and use. Our hope is this study will move us all closer to that ideal.
At Originality.AI we love AI-generated content… but believe in transparency and accountability in its development, use, and detection. Personally, I don’t want a writer or agency I have hired to create content for my audience to generate it with AI without my knowledge.
Originality.AI helps ensure there is trust in the originality of the content being produced by writers, students, job applicants or journalists.
Why are transparency and accountability important...
Claimed accuracy rates with no supporting studies are clearly a problem.
We hope the days of AI detection tools claiming 99%+ accuracy with no data to support it are over. A single number is not good enough in the face of the societal problems AI content can produce and the important role AI content detectors have to play.
The FTC has warned on multiple occasions against tools making unsubstantiated claims about AI detection accuracy or AI efficacy.
“If you’re selling a tool that purports to detect generative AI content, make sure that your claims accurately reflect the tool’s abilities and limitations.” source
“you can’t assume perfection from automated detection tools. Please keep that principle in mind when making or seeing claims that a tool can reliably detect if content is AI-generated.” source
“Marketers should know that — for FTC enforcement purposes — false or unsubstantiated claims about a product’s efficacy are our bread and butter” source
We fully agree with the FTC on this and have provided the tool needed for others to replicate this study for themselves.
The misunderstanding of how to detect AI-generated content has already caused significant harm, including a case where a professor incorrectly failed an entire class.
AI content detectors need to be part of the solution. The current unsupported AI detection accuracy claims, and the research papers that have tackled this problem so far, are simply not good enough in the face of the societal risks that LLM-generated content poses, including…
Along with this study we are releasing the latest version of our AI content detector. Below is our release history.
● 1.1 – Nov 2022 BETA (released before ChatGPT)
● 1.4 – Apr 2023
● 2.0 – Aug 2023
Our AI detector works through supervised fine-tuning of a large AI language model.
We take a large language model (LLM) and train it on millions of carefully selected records of known AI and known human content, so it learns to recognize the patterns that distinguish the two.
More details on our AI content detection here: https://originality.ai/blog/how-does-ai-content-detection-work
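To make the fine-tuning approach concrete, here is a minimal sketch using the open-source Hugging Face libraries. This is an illustration of the general technique only, not our production training code; the base model (roberta-base), the toy data, and the hyperparameters are placeholder assumptions.

```python
# Minimal sketch of the fine-tuning approach (illustration only, not our
# production pipeline): a pretrained language model gets a binary
# classification head and is trained on labeled human/AI text.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy labeled data; in practice this would be millions of curated records.
data = Dataset.from_dict({
    "text": ["A sample of known human-written text...",
             "A sample of known AI-generated text..."],
    "label": [0, 1],  # 0 = human, 1 = AI
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # placeholder base model
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector-sketch", num_train_epochs=1),
    train_dataset=data.map(tokenize, batched=True),
)
trainer.train()  # the resulting model scores new text as human vs. AI
```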
Below is a brief summary of the 3 general approaches that an AI detector (or, in machine-learning speak, a “classifier”) can use to distinguish between AI-generated and human-generated text.
1. Feature-Based Approach:
2. Zero-Shot Approach (see the sketch after this list):
3. Fine Tuning AI Model Approach:
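For contrast with the fine-tuning sketch above, here is a minimal sketch of the zero-shot approach: score a passage with an off-the-shelf language model and flag text whose perplexity falls below a chosen threshold, since AI-generated text tends to look more "predictable" to such a model. The GPT-2 model choice and the threshold value are illustrative assumptions, not settings used by any particular detector.

```python
# Minimal sketch of the zero-shot idea (illustration only): AI-generated
# text tends to be more "predictable" to a language model, so low
# perplexity can be used as a (rough) AI signal.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under GPT-2."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

THRESHOLD = 40.0  # hypothetical cut-off; would need tuning on labeled data
sample = "Paste a passage here to score it."
print("Likely AI" if perplexity(sample) < THRESHOLD else "Likely human")
```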
The test below looks at the performance of multiple detectors using all of the strategies identified above.
This post covers 4 main tests and some supporting tests that were all completed on the latest version of the Originality.AI AI Content Detector.
The first test involved hundreds of thousands of samples and compared Originality V1.4 vs V2.0. The second proposes a smaller Challenging Benchmark Dataset, against which we compared the performance of multiple AI content detectors. The third uses a published open-source dataset for testing AI content detectors' effectiveness.
Tests on the second and third datasets were run during the week of July 24, all using our open-source AI content detection accuracy tool or, where an API was not available, by humans entering the text and recording the results.
The second test can be replicated using the benchmark dataset and our open-sourced tool.
The 4th test is a series of tests run on other available datasets to evaluate Originality.AI's effectiveness.
In the spirit of openness and of contributing to the understanding of the effectiveness and limitations of AI detectors, we are open-sourcing this “challenging” benchmark dataset to help with the evaluation of different AI detection methods. If someone were working to make AI writing undetectable, this is the type of content they would produce.
This benchmark dataset includes samples from some of the most challenging prompts and settings for LLMs, including ChatGPT, GPT-4, and paraphrasers. Additionally, it includes known human content.
The table below shows the datasets and a brief explanation of each.
Download the dataset here
The dataset(s) provided might be applicable to your use case, but if you are evaluating AI detection tools' effectiveness on another type of content, you will need to produce your own dataset. For example, I would not rely solely on these results if you are looking for an AI detector to identify fake social media messages or online reviews. Use our Open-Source Tool to make running your data and evaluating detectors' performance much easier.
To make running tests easy, repeatable, and accurate, we created our tooling and decided to open-source it to help others do the same. The main Tool allows you to enter the API keys for multiple AI content detectors, plug in your own data, and then receive not just each detector's results but also a complete statistical analysis of detection effectiveness.
This tool makes it incredibly easy for you to run your test content against all AI content detectors that have an available API.
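For readers who want a sense of what the tool automates, here is a rough sketch of the underlying workflow: read labeled samples, send each one to a detector's API, and collect the predictions for later scoring. The endpoint URL, request fields, and response schema below are hypothetical placeholders, not the actual code or API of any specific detector.

```python
# Rough sketch of the workflow the tool automates (not the tool's actual
# code): read labeled samples, call a detector's API for each one, and
# collect predictions for scoring. Endpoint, fields, and response schema
# are hypothetical placeholders.
import csv
import requests

API_URL = "https://api.example-detector.com/v1/detect"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def classify(text: str) -> int:
    """Return 1 if the (hypothetical) detector flags the text as AI, else 0."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"content": text},
        timeout=30,
    )
    resp.raise_for_status()
    return 1 if resp.json()["ai_probability"] >= 0.5 else 0

pairs = []  # (ground truth, prediction) for each sample
with open("benchmark.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # expects columns: text, label (1 = AI, 0 = human)
        pairs.append((int(row["label"]), classify(row["text"])))
```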
We built and open-sourced this test-running tool so that we can increase the transparency of testing by…
The speed at which new LLMs are launching, and the speed at which AI detection is evolving, mean that accuracy studies that take 4 months from testing to publication are hopelessly outdated by the time they appear.
Features of This Tool:
Link to GitHub: https://github.com/OriginalityAI/AI-detector-research-tool
In addition to the tool mentioned above we have provided 3 additional ways to easily run a dataset through our tool…
We do not believe that AI detection scores alone should be used for academic honesty purposes and disciplinary action.
The rate of false positives (even if low) is still too high to be relied upon for disciplinary action.
Here is a guide we created to help writers or students reduce false positives in AI content detector usage. Plus, we created a free AI detector Chrome extension to help writers/editors/students/teachers visualize the creation process and prove originality.
Below are the best practices and methods used to evaluate the effectiveness of AI classifiers (i.e., AI content detectors). There is some nerdy data below, but if you are looking for even more info, here is a good primer on evaluating the performance of a classifier.
One single number related to a detector's effectiveness without additional context is useless!
Don’t trust a SINGLE “accuracy” number without additional context.
Here are the metrics we look at to evaluate a detector's efficacy…
The confusion matrix and the F1 score (more on it later) are together the most important measures we look at. In one image, you can quickly see the ability of an AI model to correctly identify both Original and AI-generated content.
True Positive Rate (TPR), also known as sensitivity, hit rate, or recall: identifies AI content correctly x% of the time.
True Negative Rate (TNR), also known as specificity or selectivity: identifies human content correctly x% of the time.
Accuracy: what % of your predictions were correct? Accuracy alone can be a misleading number. This is in part why you should be skeptical of AI detectors' claimed “accuracy” figures when no additional detail is provided. The following metric is what we use, along with our open-source tool, to measure accuracy.
F1 Score: combines recall and precision into a single measure that can be used to rank all detectors, and it is often used when ranking multiple models. It is calculated as the harmonic mean of precision and recall (sensitivity).
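To make these metrics concrete, here is a minimal sketch of how they can be computed from a set of predictions with scikit-learn, the kind of calculations the open-source tool reports; the example labels are placeholders.

```python
# Minimal sketch: computing the metrics above from labeled predictions
# (1 = AI-generated, 0 = human). The example labels are placeholders.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # detector output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)            # sensitivity / recall: AI content caught
tnr = tn / (tn + fp)            # specificity: human content correctly cleared
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)   # harmonic mean of precision and recall

print(f"TPR={tpr:.3f}  TNR={tnr:.3f}  accuracy={accuracy:.3f}  F1={f1:.3f}")
```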
So what should and should not be considered AI content? As “cyborg” writing that combines humans and AI assistants rises, the answer gets tricky!
Some studies have made really strange decisions about what to claim as “ground truth” human or AI-generated content.
In fact, one study used human-written text in multiple languages that was then translated to English using AI tools, and called the result “ground truth” human content.
Source…
Description of Dataset:
Classifying the AI Translated Dataset (02-MT) as Human-written???
We think this approach is crazy!
Our position is that if putting content through a machine produces output that is unrecognizable when the two documents are compared, then an AI detector should aim to identify that output as AI-generated.
The alternative is that any content could be translated and presented as original work, since it would pass both AI and plagiarism detection.
Here is what we think should and should not be classified as AI-generated content:
Some journalists, such as Kristi Hines, have done a great job of evaluating what AI content is and whether AI content detectors should be trusted by reviewing several studies: https://www.searchenginejournal.com/should-you-trust-an-ai-detector/491949/.
Finally! let's get to the tests. These are the results of the latest Originality.AI AI Detector that we have deployed testing against our very large (hundreds of thousands)) and continually more challenging benchmark dataset…
Version v2.0 of the model shows improved performance compared to v1.4. The accuracy score increased from 0.9387 to 0.9562, and the F1 score improved from 0.9508 to 0.9645.
In v2.0:
Overall, v2.0 demonstrates better performance across datasets, resulting in higher accuracy and F1 scores.
In the next 2 tests we will look at the performance of many AI content detectors to evaluate their relative effectiveness.
To complete the tests and make them repeatable for others to execute we used…
For Test #2, it is important to remember that this is a “challenging” dataset with adversarial settings on the GPT-3, GPT-4, ChatGPT, and Paraphraser data. It is not an accurate reflection of AI detection tools' performance on most “generic” AI-generated content.
Results Including Data and Scores Can be downloaded and viewed here:
Human Data Entry vs API: We did not have API access to several tools and had a team manually check the results, which could introduce error. For ContentAtScale, TurnItIn, and WinstonAI, the results could include some human error. False positives were double-checked.
Dataset Quality: This benchmark dataset came from a MUCH larger dataset and did not receive a human review to clean it. As a result, some entries are not great samples.
New Updates to Detectors: All tests were run within a one-week window between July 24 and July 28, but these results are a snapshot of performance at a moment in time and not necessarily reflective of future performance.
Limited Dataset Size: As our AI research team wrote, 2000 samples should not be considered a conclusive efficacy test.
If you would like to run your own or other datasets to test the accuracy of AI detectors easily you can use our Open Source tool and pick any of the datasets below…
Anytime we develop the test (i.e., the dataset) used to judge our own work, there is a risk of perceived or actual bias.
Therefore, we have provided a list of other publicly available datasets (below) that we tested Originality.AI on, and for Test #3 we ran our Open Source Tool on a publicly available dataset against the other detectors that have an API.
Dataset Used for Test #3: Are ChatGPT Detectors Biased
The dataset consists of 749 samples taken from student essays and ChatGPT prompts including some adversarial prompt engineering.
Results Including Data and Scores Can be downloaded and viewed here:
Potentially Cyborg Writing Causes Unacceptably High False Positives: This paper helped surface a potential bias against non-native English speakers. One current theory we are working to prove or resolve is that this bias, and the very high false positive rate across all detectors, is caused by what we call “cyborg” writing, where there is heavy reliance on writing aids that involve early AI, such as Grammarly.
Limited Number of Samples: This dataset is much better than the incredibly small datasets used in many studies; however, even at 749 samples it is very small and should not be considered conclusive.
Limited Number of Tools Compared: We ran the test against all tools with API access but did not (yet) manually check the dataset against other tools.
Here are some additional datasets that you can use in your own testing.
We did not run ALL the tools through these datasets, but we did run Originality.AI through each of them and have shared how Originality performed below.
Studies/datasets we chose not to list face similar issues…
This study (The effectiveness of software designed to detect AI-generated writing: A comparison of 16 AI text detectors, by William Walters) identified Originality.ai as the most popular AI detector (included in the most "best AI detector" articles) and, in testing, found it to have "perfect or near-perfect accuracy with all three sets of documents: GPT-3.5 papers, GPT-4 papers, and human-generated papers."
The end result is that we have run tests across our own dataset and all publicly available datasets, which continue to demonstrate the efficacy of Originality.AI's AI detection.
Below is a list of all AI content detectors and a link to a review of each. For a more thorough comparison of all AI detectors and their features have a look at this post: 22 AI Content Detection Tools
List of Tools:
As these tests have shown, not all tools are created equal! Many quickly created tools simply use a popular open-source GPT-2 detector (195k downloads last month).
Below are a few of the main reasons we suspect Originality.AI’s AI detection performance is significantly better than alternatives…
The AI/ML team and core product team at Originality.AI have worked relentlessly to build and improve on the most effective AI content detector!
The Results…
We hope this post will help you understand more about AI detectors and give you the tools to complete your own analysis if you want to.
We believe…
Our hope is that this study has moved us closer to achieving this and that our open-source initiatives will help others do the same.
If you have any questions about whether Originality.AI would be the right solution for your organization, please contact us.
If you are looking to run your own tests, please contact us. We are always happy to support any study, whether from an academic, a journalist, or a curious mind.