
AI Content Detector Accuracy Review + Open Source Dataset and Research Tool

We believe it is crucial for AI content detectors’ reported accuracy to be open, transparent, and accountable. Everyone seeking AI-detection services deserves to know which detector is the most accurate for their specific use case.

Trusted By Industry Leaders


The world needs reliable AI detection tools, but no AI detection tool is ever going to be 100% perfect. Users should understand the individual limitations of these tools with regard to accuracy so that they can wield them responsibly, which means the developers of AI detectors should be as transparent as possible about the capabilities and limitations of their detectors.

Below is our own analysis of our detector’s efficacy. To review third-party data on Originality.ai's AI detector accuracy, see this meta-analysis of multiple academic studies on AI text detection.

You can try the Originality.ai AI Detector for free here.


This guide aims to answer the question: which AI content detector is the most accurate? Additionally, we propose a standard for testing AI detector effectiveness and accuracy, along with the release of an open-source tool to help increase the transparency and accountability of all AI content detectors.

We hope to achieve this idealistic goal by…

  1. Open-sourcing a benchmark dataset to help researchers identify AI detection effectiveness
  2. Open-sourcing a research tool we developed to assist anyone (researcher, journalist, customer, or other AI detector) in testing multiple AI detectors on their own (or our) benchmark dataset. There is also an even easier-to-use open-source AI detector efficacy tool here.
  3. Providing detailed instructions and including the calculations in the tool to help identify the most important AI vs. Original Human classifier efficacy metrics.
  4. Transparently reporting our own tool’s accuracy on multiple publicly available datasets.

If you have been asked or want to evaluate an AI content detector's potential use case for your organization, this article is for you.

This guide will help you understand AI detectors and their limitations by showing you…

  • How AI detectors work
  • How to calculate an AI detector's effectiveness
  • How to complete your own tests (using one of the open-sourced tools we provide)
  • What we think should and should not be considered AI content
  • How accurate our AI content detector is based on the testing we have done
  • If you can trust our AI detector's effectiveness
  • How all AI detectors stack up in terms of effectiveness by type of content

If you have any questions, suggestions, research questions, or potential commercial use cases please contact us.

TL;DR

●    You asked, and we listened. With the overwhelming appreciation for Lite, which allows for some light AI editing (think Grammarly), we’re making it our default model!

  • With Lite becoming the new default, Standard 2.0.0 and Standard 2.0.1 have been retired.

●    Originality.ai launches Version 3.0.1 Turbo, the most accurate AI detector ever, with improved performance on the most challenging dataset created from the newest LLMs.

●    Use Version 3.0.1 Turbo if your risk tolerance for AI is ZERO. It is designed to identify any use of AI, even light AI editing.

●    Use Version Lite if you want to minimize false positives and are okay with light AI editing.

●    Across all tests, Originality.ai has increased its accuracy, further establishing Originality.ai as the most accurate AI checker.

Do AI Detectors Work? OpenAI Says No?

Oversimplistic views that “AI detectors are perfect” or “AI detectors don't work” are equally bad. 

We still have an offer to OpenAI (or anyone willing to take us up on it) to back up their claim that AI detectors don't work.

Do AI Detectors Work

Why did we create this guide and tools? We believe…

  • In the transparent and accountable development and use of AI. 
  • That AI detectors have a role to play in mitigating the potential negative societal impacts of generative AI.

AI detection tools' “accuracy” should be communicated with the same transparency and accountability that we want to see in AI’s development and use. Our hope is this study will move us all closer to that ideal.

At Originality.ai we love AI-generated content… but believe in transparency and accountability in its development, use, and detection. Personally, I don’t want a writer or agency I have hired to create content for my audience to generate it with AI without my knowledge.

Originality.ai helps ensure there is trust in the originality of the content being produced by writers, students, job applicants or journalists. 

Why are transparency and accountability important?

FTC Warns Against Unsupported AI Content Detection Accuracy Claims

Claimed accuracy rates with no supporting studies are clearly a problem. 

We hope the days of AI detection tools claiming 99%+ accuracy with no data to support it are over. A single number is not good enough in the face of the societal problems AI content can produce and the important role AI content detectors have to play.

The FTC has warned on multiple occasions against tools making unsubstantiated claims about AI detection accuracy or efficacy.

“If you’re selling a tool that purports to detect generative AI content, make sure that your claims accurately reflect the tool’s abilities and limitations.” source

“you can’t assume perfection from automated detection tools. Please keep that principle in mind when making or seeing claims that a tool can reliably detect if content is AI-generated.” source

“Marketers should know that — for FTC enforcement purposes — false or unsubstantiated claims about a product’s efficacy are our bread and butter” source

We fully agree with the FTC on this and have provided the tool needed for others to replicate this study for themselves. 

The misunderstanding of how to detect AI-generated content has already caused significant pain, including a professor who incorrectly failed an entire class.

Societal Impacts of Undetectable AI-Generated Content Are Real

AI content detectors need to be part of the solution to undetectable AI-generated content. The current unsupported AI detection accuracy claims and the research papers that have tackled this problem are simply not good enough in the face of the societal risks LLM-generated content poses, including…

  1. Mass Propaganda
  2. Fake News
  3. Toxic Spam
  4. Academic Dishonesty / AI Plagiarism
  5. Hallucinations
  6. Cheating Writers
  7. Cheating Agencies
  8. Fake Product Reviews
  9. Fake Job Applications
  10. Fake University Application Essays
  11. Fake Scholarship Applications

Societal Impacts of Undetectable AI-Generated Content

Originality.ai Version History:

Along with this study, we are releasing the latest version of our AI content detector. Below is our release history.

●    1.1 – Nov 2022 BETA (released before ChatGPT)

  • Accurate detection of GPT-2, GPT-NEO, GPT-J, and GPT-3, but could be “tricked” with paraphrasing
  • First GPT-3 trained detector

●    1.4 – Apr 2023

  • Improved ChatGPT detection
  • Accurate on GPT4 Generated Content
  • Only tool capable of accurately detecting Paraphrased content.
  • Reduced the number of false positives with increased training on human-generated content

●    2.0 Standard — Aug 2023

  • Reduced False Positives
  • Improved Accuracy on the Hardest to Detect AI Content (GPT4, ChatGPT & Paraphrased)
  • Release of Open Source Benchmark Dataset.
  • Release of Open Source AI Detection Efficacy Testing Tool(s).
  • Between 1.4 and 2.0, our team built many models that slightly increased AI detection capabilities, but we were not going to release a model until it materially reduced false positives.
  • September 2024 Update: This model has been retired from our platform. Sign up to try out Lite, our latest model and your top pick!

●    3.0 Turbo — Feb 2024

  • Trained on the newest LLMs (Grok, Mixtral, GPT-4 Turbo, Gemini, Claude 2)
  • Accuracy increased on our toughest testing dataset from 90.2% to 98.8%
  • The false positive rate improved slightly from 2.9% to 2.8%

An even easier-to-use open-source AI detection efficacy research tool was also released.

●    2.0.1 Standard (BETA) — July 2024

  • Improved version of the flagship 2.0.0 Standard model.
  • We’re releasing this version in BETA testing.
  • September 2024 Update: Thank you for your feedback! BETA testing has now concluded, and with Lite being your top choice, this model has been retired.

 ●    1.0.0 Lite — July 2024

  • Highly accurate with 98% accuracy in detecting AI content.
  • An under 1% false positive rate
  • Allows for lightly AI-edited content (like Grammarly’s grammar and spelling suggestions) while still differentiating between light AI editing and fully generated AI content.

 ●    3.0.1 Turbo — October 2024

  • Highly accurate with 99%+ accuracy in detecting AI content.
  • An under 3% false positive rate. 
  • Best for use cases where there is a zero-tolerance policy for AI content.
  • Robust against bypassing methods — extremely challenging to bypass.

Basic Explanation of How Our AI Detector Works

Our AI detector works by applying supervised learning to a carefully fine-tuned large language model.

We take a large language model (LLM) and feed it millions of carefully selected records of known AI and known human content. The model learns to recognize the patterns that distinguish the two.
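To make the supervised-learning idea concrete, here is a deliberately tiny sketch: a bag-of-words perceptron trained on a handful of labeled samples. This illustrates only the training loop, not our actual model, which fine-tunes a large language model on millions of records; the sample texts and feature scheme are invented for the example.

```python
# Toy sketch of supervised training: a bag-of-words perceptron learns
# weights from labeled (text, label) pairs, label 1 = AI, 0 = human.
# Sample texts and features are invented for illustration only.
from collections import Counter

def features(text):
    return Counter(text.lower().split())

def train(samples, epochs=20):
    weights = Counter()
    for _ in range(epochs):
        for text, label in samples:
            feats = features(text)
            score = sum(weights[w] * c for w, c in feats.items())
            pred = 1 if score > 0 else 0
            if pred != label:                    # mistake: nudge the weights
                sign = 1 if label == 1 else -1
                for w, c in feats.items():
                    weights[w] += sign * c
    return weights

def predict(weights, text):
    score = sum(weights[w] * c for w, c in features(text).items())
    return "AI" if score > 0 else "Human"

train_set = [
    ("delve into the ever-evolving landscape of innovation", 1),
    ("in the realm of technology we must delve deeper", 1),
    ("i spilled coffee on my keyboard again this morning", 0),
    ("my dog chewed through the garden hose yesterday", 0),
]
w = train(train_set)
print(predict(w, "delve into the landscape of innovation"))  # AI
```

The real detector differs enormously in scale and architecture, but the principle is the same: labeled examples adjust the model until it separates the two classes.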

More details on our AI content detection.

How Originality.ai Detector Works

How AI Content Detectors Work:

Below is a brief summary of the three general approaches an AI detector (or, in machine-learning speak, a “classifier”) can use to distinguish between AI-generated and human-generated text.

1. Feature-Based Approach:

  • The feature-based approach relies on the idea that there may be consistent, identifiable differences between text generated by an LLM like ChatGPT and text written by a human. Some of the features tools look for are explained below.
  • Burstiness: Burstiness in text refers to the tendency of certain words to appear in clusters or "bursts" rather than being evenly distributed throughout a document. AI-generated text can potentially have more predictability (less burstiness) since AI models tend to reuse certain words or phrases more often than a human writer would. Some tools attempt to identify AI text using burstiness (more burstiness = human, less burstiness = AI).
  • Perplexity: Perplexity is a measure of how well a probability model predicts the next word. In the context of text analysis, it quantifies the uncertainty of a language model by calculating the likelihood of the model producing a given text. Lower perplexity means that the model is less surprised by the text, indicating the text was more likely AI-generated. High perplexity scores can indicate human-generated text.
  • Frequency Features: Frequency features refer to the count of how often certain words, phrases, or types of words (like nouns, verbs, etc.) appear in a text. For example, AI generation might overuse certain words, underuse others, or use certain types of words at rates that are inconsistent with human writing. These features might be able to help detect AI-generated text.
  • Readability or Fluency Features: Studies have shown that earlier (i.e., 2019-era) LLMs would generate text with similar readability scores.
  • Punctuation: This pertains to the use and distribution of various punctuation marks in a text. AI-generated text often exhibits correct and potentially predictable use of punctuation. For instance, it might use certain types of punctuation more often than a human writer would, or it might use punctuation in ways that are grammatically correct but stylistically unusual. By analyzing punctuation patterns, someone might attempt to create a detector that can predict AI-generated content.
  • Advantages: Once patterns are identified, they can be repeatedly identified in a very cost-effective and fast manner. 
  • Disadvantages: Modern LLMs such as GPT-4 and Bard can produce varied enough content that these detectors can be bypassed with clever prompts.
  • Examples: GPTZero, Winston AI
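As a rough illustration of two of the feature-based signals above, the snippet below computes toy proxies: perplexity from a list of per-token probabilities (which a separate language model would supply) and a crude burstiness measure based on sentence-length variation. Neither is the exact formula any named detector uses.

```python
# Illustrative proxies for two feature-based signals; not the exact
# formulas any particular detector uses.
import math
import statistics

def perplexity(token_probs):
    # exp of the mean negative log-probability the language model assigned
    # to each token; lower = more predictable text.
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def burstiness(text):
    # Crude proxy: spread of sentence lengths. Human prose tends to mix
    # short and long sentences more than generated text.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.stdev(lengths) if len(lengths) > 1 else 0.0

print(round(perplexity([0.5, 0.5, 0.5]), 2))  # 2.0 (uniform coin-flip tokens)
```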

2. Zero-Shot Approach:

  • Uses a pre-trained language model to identify text generated by a model similar to itself. Essentially, the model asks how likely it is that the content it is seeing was generated by a similar version of itself (note: don’t try asking ChatGPT… it doesn’t work like that).
  • Advantages: Easier to build and does not require supervised training
  • Disadvantages: Susceptible to bypassing with paraphrasing
  • Examples: GPTZero, ZeroGPT
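The zero-shot idea can be sketched in miniature, assuming you have some way to obtain a language model's average log-probability for a text. The `avg_logprob_fn` callable and the threshold below are stand-ins for illustration, not a real detector API.

```python
# Sketch of the zero-shot idea: no training, just threshold the language
# model's own average log-probability of the text. `avg_logprob_fn` and
# the threshold are illustrative stand-ins, not a real API.
def zero_shot_is_ai(text, avg_logprob_fn, threshold=-2.5):
    # Higher (less negative) average log-prob means the text is highly
    # predictable to the model, hinting it was generated by a similar model.
    return avg_logprob_fn(text) > threshold

# Stub scorer: pretend formulaic phrasing is highly predictable.
stub = lambda t: -1.0 if "tapestry" in t else -4.0
print(zero_shot_is_ai("a rich tapestry of ideas", stub))  # True
print(zero_shot_is_ai("weird gnarly rant lol", stub))     # False
```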

3. Fine Tuning AI Model Approach:

  • Uses a large language model such as BERT or RoBERTa and trains on a set of human and AI-generated text. It learns to identify the differences between the two in order to predict if the content is AI or Original. 
  • Advantages: Can produce the most effective detection 
  • Disadvantages: These can be more expensive to train and operate. They can also lag behind in detection capabilities for the newest AI tools until their training is updated.
  • Examples: Originality.ai AI Detector, OpenAI Text Classifier (taken offline)

The test below looks at the performance of multiple detectors using all of the strategies identified above. 

Testing Plan:

This post covers four main tests and some supporting tests that were all completed on the latest version of the Originality.ai AI Content Detector.

One test involved hundreds of thousands of samples and compared Originality V1.4, V2.0, and V3.0. The second proposes a smaller challenging benchmark dataset, against which we compared multiple AI content detectors’ performance. The third uses a published open-source dataset for testing AI content detectors’ effectiveness.

Tests on the second and third datasets were run the week of July 24, all using our open-sourced AI content detection accuracy tool or, where an API was not available, via humans entering the text and recording the results.

The second test can be replicated using the benchmark dataset and our open-sourced tool.

The fourth test is a series of tests completed on other publicly available datasets, testing Originality.ai’s effectiveness.

Introducing Our Benchmark Adversarial AI Detection Dataset:

In the spirit of openness and contributing to the understanding of the effectiveness and limitations of AI detectors, we are open-sourcing this “challenging” benchmark dataset to help with the evaluation of different AI detection methods. If someone were working to make AI writing undetectable, this is the type of content they would produce.

This benchmark dataset includes samples from some of the most challenging prompts and settings for LLM models, including ChatGPT, GPT-4, and paraphrasers. Additionally, it includes known human content.

  • Disclaimer from Our AI Research Team: This dataset is a very small (statistically insignificant) dataset randomly sampled from the test dataset that we built for different experiments. It is completely unrelated to and not part of our training/validation/test set: completely random and unbiased, to ensure fairness and no “cherry-picking.”

The table below shows the datasets and a brief explanation of each.

Download the dataset here

Adversarial AI Detection Dataset

What is the Best Test? Use Your Own Data!

The dataset(s) provided might be applicable to your use case, but if you are evaluating AI detection tools’ effectiveness for another type of content, you will need to produce your own dataset. For example, we would not rely solely on these results if you are looking for an AI detector to identify fake social media messages or online reviews. Use our open-source tool to make running your data and evaluating detectors’ performance much easier.

Testing Method & New Open-Source Testing Tools:

To make running tests easy, repeatable, and accurate, we created and decided to open-source our tools to help others do the same. The main tool allows you to enter the API keys for multiple AI content detectors and plug in your own data, returning not just the results from each tool but also a complete statistical analysis of detection effectiveness.

This tool makes it incredibly easy for you to run your test content against all AI content detectors that have an available API. 

The reason we built and open-sourced this tool is to increase the transparency of tests by…

  1. Running all tests at basically the same time on the same day
  2. Ensuring the exact same text with no difference in formatting is sent to each tool
  3. Quickly testing datasets as they become available
  4. Providing an opportunity for potential customers or researchers to test their own data and make an informed decision about which AI detector is ideal for their use case.

The speed at which new LLMs are launching and the speed AI detection is evolving means that accuracy studies which take 4 months from test to publication are hopelessly outdated.

Features of This Tool:

  • Free & Open Sourced
  • Able to Scan A Text Dataset With Multiple AI Detectors
  • Quickly Provides Results
  • Automatically Calculates Detector Efficacy Metrics (confusion matrix, accuracy, false positive rates, etc.)

Link to GitHub: https://github.com/OriginalityAI/AI-detector-research-tool
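For intuition, here is a minimal sketch of what the testing tool automates: running every sample in a labeled dataset through a detector and tallying the confusion-matrix outcomes. The `detect_fn` callable stands in for a real detector API call (e.g., an HTTP request with your API key), and the CSV layout is illustrative only.

```python
# Sketch of what the testing tool automates: run each labeled sample
# through a detector and tally confusion-matrix outcomes. `detect_fn`
# stands in for a real API call; the CSV layout is illustrative.
import csv
import io

def scan_dataset(csv_text, detect_fn):
    tally = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        predicted_ai = detect_fn(row["text"])
        if row["label"] == "ai":
            tally["tp" if predicted_ai else "fn"] += 1
        else:
            tally["fp" if predicted_ai else "tn"] += 1
    return tally

# Stub detector for demonstration: flags any text containing "delve".
demo = "text,label\nlet us delve into synergy,ai\nmy cat knocked over a plant,human\n"
print(scan_dataset(demo, lambda t: "delve" in t))  # {'tp': 1, 'fn': 0, 'fp': 0, 'tn': 1}
```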

In addition to the tool mentioned above, we have provided three additional ways to easily run a dataset through our tool…

  1. Check for AI Content in Microsoft Excel 
  2. Check for AI Content in Google Sheets
  3. Check for AI Content in AirTable

Our View on the Use of AI Detectors Within Academia & False Positives in General.

We do not believe that AI detection scores alone should be used for academic honesty purposes and disciplinary action. 

The rate of false positives (even if low) is still too high to be relied upon for disciplinary action.

Here is a guide we created to help writers or students reduce false positives in AI content detector usage. Plus, we created a free AI detector Chrome extension to help writers/editors/students/teachers visualize the creation process and prove originality.

Our newly released Version 1.0.0 Lite model is best for educators and academic settings as it allows for light AI editing with popular tools like Grammarly (grammar and spelling suggestions).

How To Evaluate AI Detectors “Accuracy”:

Below are the best practices and methods used to evaluate the effectiveness of AI classifiers (i.e., AI content detectors). There is some nerdy data below, but if you are looking for even more info, here is a good primer on evaluating the performance of a classifier.

One single number related to a detector's effectiveness without additional context is useless!

Don’t trust a SINGLE “accuracy” number without additional context. 

Here are the metrics we look at to evaluate a detector's efficacy… 

Confusion Matrix 

The confusion matrix and the F1 score (more on it later) together are the most important measures we look at. In one image, you can quickly see the ability of an AI model to correctly identify both Original and AI-generated content.

  • True Positive (TP) – AI detector correctly identified content as AI.
  • False Negative (FN) – AI detector incorrectly identified AI content as Human.
  • False Positive (FP) – AI detector incorrectly identified human content as AI.
  • True Negative (TN) – AI detector correctly identified human content as human.
Version 1.4 Confusion Matrix on a GPT-4 & Human Dataset Test


True Positive Rate — AI Text Detection Capabilities

Identifies AI content correctly x% of the time. True Positive Rate (TPR) is also known as sensitivity, hit rate, or recall.

  • True Positive Rate TPR = TP / (TP + FN)

True Negative Rate — Human-Text Detection Capabilities:

Identifies human content correctly x% of the time. True Negative Rate (TNR) is also known as specificity or selectivity.

  • True Negative Rate TNR = TN / (TN + FP) = 1 − FPR

Accuracy:

What % of your predictions were correct? Accuracy alone can be a misleading number. This is in part why you should be skeptical of AI detectors’ claimed “accuracy” numbers if they do not provide additional details. The following metric is what we use, along with our open-source tool, to measure accuracy.

  • Accuracy = True / (True + False) = (TP + TN) / (TP + TN + FP + FN)

F1

Combines recall and precision to create one measure to rank all detectors, often used when ranking multiple models. It is the harmonic mean of precision and sensitivity.

  • F1 = 2 x (PPV x TPR) / (PPV + TPR) where Precision (PPV) = TP / (TP + FP)
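Putting the formulas above together, a small helper can compute every metric from the four confusion-matrix counts. The example counts here are invented purely for illustration.

```python
# The formulas above computed from raw confusion-matrix counts; the
# example counts are invented for illustration.
def efficacy(tp, fn, fp, tn):
    tpr = tp / (tp + fn)                        # sensitivity / recall
    tnr = tn / (tn + fp)                        # specificity = 1 - FPR
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                  # PPV
    f1 = 2 * (precision * tpr) / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy, "F1": f1}

print(efficacy(tp=90, fn=10, fp=5, tn=95))
# TPR 0.9, TNR 0.95, accuracy 0.925, F1 about 0.923
```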

Metrics Considered but Not Used:

  • ROC & AUROC: Not used since we can't adjust the sensitivity of other tools and some tools do not provide a percentage. 
  • Precision: PPV = TP / (TP + FP). Not used on its own, although it is a component of F1.

But… What Should be Considered AI Content?

So, what should and should not be considered AI content? As “cyborg” writing that combines humans and AI assistants rises, deciding what should and shouldn’t be considered AI content is tricky!

Some studies have made some really strange decisions about what to claim as “ground truth” human or AI-generated content.

In fact, one study took human-written text in multiple languages, translated it to English using AI tools, and called the result “ground truth” Human content.

Source…

Description of Dataset:

Description of Dataset

Classifying the AI-Translated Dataset (02-MT) as Human-Written?

Classifying the AI Translated Dataset (02-MT)

https://arxiv.org/pdf/2306.15666.pdf

We think this approach is crazy! 

Our position is that if putting content into a machine produces output that is unrecognizable when the two documents are compared, then an AI detector should aim to identify that output as AI-generated.

The alternative is that any content could be translated and presented as Original work, since it would pass both AI and plagiarism detection.

Here is what we think should and should not be classified as AI-generated content:

  • AI-Generated and Not Edited = AI-Generated Text
  • AI-Generated and Human Edited = AI-Generated Text
  • Human Written and Heavily AI Edited = AI-Generated Text
  • AI Outline*, Human Written, and Heavily AI Edited = AI-Generated Text
  • Human Written and Lightly Edited with AI = Results vary depending on level of editing done by AI
  • AI Research and Human Written = Original Human-Generated
  • Human Written and Human Edited = Original Human-Generated

*AI Outline is defined as using AI (an LLM) to create a content idea, do some research, and/or create an outline. The level at which AI is used during this process may vary and could affect the likelihood that the text is detected as AI or human.

Some journalists, such as Kristi Hines, have done a great job of evaluating what AI content is and whether AI content detectors should be trusted by reviewing several studies: https://www.searchenginejournal.com/should-you-trust-an-ai-detector/491949/

Review a meta-analysis of AI-detector accuracy studies for further insight into the efficacy of AI-detectors.

Test #1 — Benchmark Testing Originality.ai for the July 2024 Release of Lite:

Finally! Let’s get to the tests.

As new and improved LLMs are released, we need to update our models and our benchmark testing. 

Introducing Version Lite 1.0.0 

Our July 2024 release also includes the launch of Version Lite 1.0.0; see the results from our benchmark testing of the new Lite model.

1. General Performance: The Lite model focuses on accurately identifying human-written content (i.e., minimizing the false positive rate) and exhibits outstanding performance on ‘human’ and ‘AI-lightly-edited’ sources.

2. Dataset-Specific Performance

  • Version Lite 1.0.0 is capable of differentiating lightly AI-edited content from content that is entirely AI-generated. 
  • It achieved a remarkable ~1% False Positive Rate (0.007 on our internal benchmark), half that of the Version 2.0.0 Standard model (retired), and an acceptable False Negative Rate of 0.02.
  • It demonstrated the best accuracy on 4 out of 10 of the AI model datasets.

3. Key Findings — Exceptional False Positive Rate and Excels at Identifying Lightly AI-Edited Content

The accuracy of Version Lite 1.0.0 in detecting content that is lightly edited with AI tools (such as Grammarly’s suggestions), combined with its low false positive rate, makes it an asset for academic and educational settings.

Test #2 — Adversarial AI Detection Dataset: 6 Tools Compared

In the next tests we will look at the performance of many AI content detectors to evaluate their relative effectiveness.

To complete the tests and make them repeatable for others to execute, we used…

  1. Our newly introduced Open Source AI Detector Accuracy Tool
  2. Publicly Available Datasets to Compare the Tools:

For Test #2, it is important to remember that this is a “challenging” dataset with adversarial settings for the GPT-3, GPT-4, ChatGPT, and Paraphraser data. It is not an accurate reflection of AI detection tools’ performance on most “generic” AI-generated content.

Test #2 — Results & Raw Data Shared 

Results, including data and scores, can be downloaded and viewed here:

Test #2 — AI Detector Efficacy Results:

Note: This test was completed with 2.0.0 Standard; this model is being retired. 

See the updated results for Lite, our newest model, below.

AI Detector Efficacy Results
  • The table is sorted by F1 (a number that balances both a detector’s ability to correctly identify AI content and correctly identify human content).
  • All tools performed reasonably well on False Positives, ranging from a low of 0.8% to a high of 7.6% False Positive Rate. 
  • The ability to identify AI content (True Positive Rate) varied wildly from 19.8% to 98.4%.

Test #2 — Confusion Matrix for Each AI Detector:

Originality.ai — Confusion Matrix — Test #2 Adversarial Dataset Testing

[Turbo Model 3.0.1]
[Lite Model 1.0.0]

Winston.AI — Confusion Matrix — Test #2 Adversarial Dataset Testing

Winston.AI - Confusion Matrix Test 2 Adversarial Dataset Testing
Winston.ai F1 = 52.0

Sapling.AI — Confusion Matrix — Test #2 Adversarial Dataset Testing

Sapling.AI - Confusion Matrix Test 2 Adversarial Dataset Testing
Sapling.ai F1 = 37.9

GPTZero — Confusion Matrix — Test #2 Adversarial Dataset Testing

GPTZero - Confusion Matrix Test 2 Adversarial Dataset Testing
GPTZero F1 = 34.2

Content at Scale — Confusion Matrix — Test #2 Adversarial Dataset Testing

Content at Scale - Confusion Matrix Test 2 Adversarial Dataset Testing
ContentatScale F1 = 33.4

CopyLeaks — Confusion Matrix — Test #2 Adversarial Dataset Testing

CopyLeaks - Confusion Matrix Test 2 Adversarial Dataset Testing
CopyLeaks F1 = 32.9

Limitations of Test #2:

Human Data Entry vs API: We did not have API access to several tools, so a team manually checked the results, which could introduce errors. For ContentAtScale, TurnItIn, and WinstonAI, the results could contain some human error. False positives were double-checked.

Dataset Quality: This benchmark dataset came from a much larger dataset and did not receive a human review to clean it. As a result, some entries are not great samples.

New Updates to Detectors: Our model was run on Feb 13, 2024, and all other tests were run within a one-week window between July 24 and July 28, 2023. These results are a snapshot of moment-in-time performance and not reflective of future performance.

Limited Dataset Size: As our AI research team wrote, 2000 samples should not be considered a conclusive efficacy test. 

If you would like to run your own or other datasets to test the accuracy of AI detectors easily, you can use our Open Source tool and pick any of the datasets below…

  • Open-Source AI Detector Comparison Tool & Dataset: Here

Test #3 — List of other AI Detection Datasets & 5 More Tests:

Here are some additional datasets that you can use in your own testing. 

We did not run ALL the tools through these datasets but did run Originality.ai through each of them, and have shared the results for how Originality performed below.

Each of these datasets comes from a publicly available research paper.

Below, Models Lite and Turbo are compared.

Test 3-A — How Close is ChatGPT to Human Experts? 

Originality.ai — Confusion Matrix — Test #3-A — ChatGPT to Human

[Turbo Model 3.0.1]

[Lite Model 1.0.0]

Dataset & Results

Test 3-B — Benchmark Dataset for Identifying Machine-Generated Scientific Papers

Originality.ai — Confusion Matrix — Test #3-B — Identifying Machine-Generated Papers

[Turbo Model 3.0.1]

[Lite Model 1.0.0]

Dataset and Results

Test 3-C — Detecting Text Ghostwritten by Large Language Models

Originality.ai — Confusion Matrix — Test #3-C — Ghostbuster

[Turbo Model 3.0.1]

[Lite Model 1.0.0]

Dataset and Results

Test 3-D — One-Class Learning for AI-Generated Essay Detection

Originality.ai — Confusion Matrix — Test #3-D — AI-Generated Essay Detection

[Turbo Model 3.0.1]

Dataset & Results

Test 3-E — Check Me If You Can: Detecting ChatGPT-Generated Academic Writing using CheckGPT

  • This is a very challenging dataset, as it is focused solely on academic writing, which we are not built for, spanning the difficult fields of physics, computer science, humanities, and social sciences.
  • We randomly sampled 9k entries from the provided dataset for testing.
  • Our model performed well, with an accuracy of 94.5%, very close to the result of the best model introduced in the paper, which was trained on this same dataset (with mutually exclusive train and test splits).
  • The paper also tested GPTZero, which, like our model, was not trained on this dataset. We achieved 94.5% accuracy compared to GPTZero’s 61.2%, including 96.7% accuracy on AI text specifically versus GPTZero’s 24.2%.
  • Paper: https://arxiv.org/pdf/2306.05524.pdf
  • Dataset: https://huggingface.co/datasets/julianzy/GPABenchmark

Originality.ai — Confusion Matrix — Test #3-E — Check Me If You Can

[Turbo Model 3.0.1]

Dataset & Results

Other 3rd Party Studies 

Studies/datasets we chose not to list face issues such as the following…

  • Small Sample Size (using a 100-sample test is simply crazy!)
  • Big Delay — if there is a long delay between testing and the paper being published, the results are outdated given the rate of progress in the industry
  • Dataset is not publicly available for many papers. This is always unfortunate! Anytime tools are compared or accuracy claims made the appropriate dataset should be made available.
  • Easy AI Content — It is expensive and tricky to build challenging AI vs Human datasets while it is very easy to build a simple AI dataset. A GPT-2 generated dataset test shows nothing.
  • Other Benchmark Datasets:

Here are 6 additional studies completed by 3rd parties and their findings showing Originality to be the most accurate…

The end result: we have run tests across our own dataset and all publicly available datasets, which continue to demonstrate the efficacy of Originality.ai’s AI detection.

Complete List of All AI Content Detectors:

Below is a list of all AI content detectors and a link to a review of each. For a more thorough comparison of all AI detectors and their features, have a look at this post: 22 AI Content Detection Tools

List of Tools:

  1. HuggingFace
  2. GLTR.io AI 
  3. Passed.AI
  4. Writer.com 
  5. Willieai.com 
  6. GPTZero
  7. ContentAtScale
  8. CopyLeaks
  9. POE Poem of Quotes
  10. DetectGPT
  11. On-Page.AI
  12. GPTRadar.com
  13. Percent Human
  14. Grover 
  15. KazanSEO
  16. Sapling
  17. CrossPlag
  18. CheckForAI.com
  19. Draft & Goal
  20. GPTkit.ai
  21. ParaphrasingTool.ai 
  22. OpenAI Text Classifier (removed)
  23. AI Writing Check
  24. Winston AI
  25. InkForAll
  26. ContentDetector.ai
  27. WriteFull
  28. ZeroGPT
  29. TurnItIn
  30. Originality.ai

As these tests have shown, not all tools are created equal! Many quickly created tools simply wrap a popular open-source GPT-2 detector (195k downloads last month).

Why is our model more accurate?

Below are a few of the main reasons we suspect Originality.ai’s AI detection performance and overall AI detector accuracy are significantly better than alternatives… 

  1. Larger Model: We suspect (but can’t confirm) that we use a much larger model… there is no way we could offer a free or ad-supported option given our model’s compute cost per scan.
  2. Focus on Content Writers: The datasets we have constructed focus on our main use case (content published online); we are not a generalist AI detector. Our detector is trained exclusively on online publications such as blog posts, articles, and website copy, so it can more accurately discern human from AI-generated content in these types of writing. Our model is not trained on classic literature, which is not reflective of modern writing.
  3. Train on Harder Datasets: The datasets we continue to create and train on focus on increasingly adversarial detection-bypassing methods. The better our AI gets, the more clever the prompt engineering or playground settings need to be to bypass it, and we then train on that new, more challenging dataset.
Originality.ai Model is More Accurate

Final Thoughts

The AI/ML team and core product team at Originality.ai have worked relentlessly to build and improve on the most effective AI content detector!

The Results… 

●    Originality.ai Launches Lite 1.0.0.

  • Lite 1.0.0 has 98% accuracy, a 1% False Positive Rate, allows for light AI editing (such as Grammarly suggestions), and is ideal for academia.

●    Originality.ai Launches Version 3.0.1 Turbo

  • Turbo 3.0.1 has 99%+ accuracy, an under-3% false positive rate, and is robust against bypassing methods.
  • When your tolerance for AI in writing is near 0, use Version 3.0.1 Turbo.
  • If you allow for some minor AI-assisted edits, use Lite.

●    Across 6 Datasets, Originality’s Latest Version Was the Most Accurate & Effective Detector in Each Test

●    6 AI Content Detectors Were Tested on a new, Challenging Benchmark Dataset, with Originality.ai being the most accurate

●    Open Source Tool and Benchmark Dataset for Efficient Detector Testing Developed and Released 

We hope this post will help you understand more about AI detectors, AI detector accuracy, and give you the tools to complete your own analysis if you want to. 

We believe…

  1. In transparent and accountable development and use of AI. 
  2. AI detectors have a role to play in mitigating some of the potential negative societal impacts of generative AI.
  3. AI detection tools’ “accuracy” should be communicated with the same transparency and accountability that we want to see in AI’s development and use.

Our hope is that this study has moved us closer to achieving this, and that our open-source initiatives will help others do the same.

If you have any questions on whether Originality.ai would be the right solution for your organization, please contact us.

If you are looking to run your own tests, please contact us. We are always happy to support any study (academic, journalist, or curious mind).

Additionally, to learn more about how Originality.ai performs in third-party academic research and studies, review our meta-analysis of accuracy studies.

Try our AI detector for yourself.

Jonathan Gillham

Founder / CEO of Originality.ai. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; I recently sold two content marketing agencies, and I am the co-founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone’s content strategy. However, I believe you, the publisher, should be the one deciding when to use AI content. Our originality checking tool has been built with serious web publishers in mind!


In The Press

Originality.ai has been featured for its ability to accurately detect GPT-3, ChatGPT, and GPT-4 generated content. See some of the coverage below…

View All Press
Featured by Leading Publications

Originality.ai did a fantastic job on all three prompts, precisely detecting them as AI-written. Additionally, after I checked with actual human-written textual content, it did determine it as 100% human-generated, which is important.

Vahan Petrosyan

searchenginejournal.com

I use this tool most frequently to check for AI content personally. My most frequent use-case is checking content submitted by freelance writers we work with for AI and plagiarism.

Tom Demers

searchengineland.com

After extensive research and testing, we determined Originality.ai to be the most accurate technology.

Rock Content Team

rockcontent.com

Jon Gillham, Founder of Originality.ai came up with a tool to detect whether the content is written by humans or AI tools. It’s built on such technology that can specifically detect content by ChatGPT-3 — by giving you a spam score of 0-100, with an accuracy of 94%.

Felix Rose-Collins

ranktracker.com

ChatGPT lacks empathy and originality. It’s also recognized as AI-generated content most of the time by plagiarism and AI detectors like Originality.ai

Ashley Stahl

forbes.com

Originality.ai Do give them a shot! 

Sri Krishna

venturebeat.com

For web publishers, Originality.ai will enable you to scan your content seamlessly, see who has checked it previously, and detect if an AI-powered tool was implored.

Industry Trends

analyticsinsight.net

Frequently Asked Questions

Why is it important to check for plagiarism?

Tools for conducting a plagiarism check between two documents online are important because they help ensure the originality and authenticity of written work. Plagiarism undermines the value of professional and educational institutions, as well as the integrity of the authors who write articles. By checking for plagiarism, you can ensure the work you produce is original or properly attributed to the original author. This helps prevent the distribution of copied and misrepresented information.

What is Text Comparison?

Text comparison is the process of taking two or more pieces of text and comparing them to see if there are any similarities, differences and/or plagiarism. The objective of a text comparison is to see if one of the texts has been copied or paraphrased from another text. This text compare tool for plagiarism check between two documents has been built to help you streamline that process by finding the discrepancies with ease.

How do Text Comparison Tools Work?

Text comparison tools work by analyzing and comparing the contents of two or more text documents to find similarities and differences between them. This is typically done by breaking the texts down into smaller units such as sentences or phrases, and then calculating a similarity score based on the number of identical or nearly identical units. The comparison may be based on the exact wording of the text, or it may take into account synonyms and other variations in language. The results of the comparison are usually presented in the form of a report or visual representation, highlighting the similarities and differences between the texts.
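As a concrete illustration of this general approach, here is a minimal sketch using Python’s standard-library difflib (an example of the technique, not the code behind our tool):

```python
import difflib

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox leaps over a lazy dog."

# Similarity score in [0, 1], based on the longest matching blocks
# shared by the two texts.
ratio = difflib.SequenceMatcher(None, a, b).ratio()

# Word-level diff: lines starting with "- " or "+ " mark the
# differences a compare tool would highlight visually.
diff = list(difflib.ndiff(a.split(), b.split()))
```

Breaking the texts into words (rather than comparing raw characters) is what lets the report point at the specific units that changed.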

String comparison is a fundamental operation in text comparison tools that involves comparing two sequences of characters to determine if they are identical or not. This comparison can be done at the character level or at a higher level, such as the word or sentence level.

The most basic form of string comparison is the equality test, where the two strings are compared character by character and a Boolean result indicating whether they are equal or not is returned. More sophisticated string comparison algorithms use heuristics and statistical models to determine the similarity between two strings, even if they are not exactly the same. These algorithms often use techniques such as edit distance, which measures the minimum number of operations (such as insertions, deletions, and substitutions) required to transform one string into another.
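The edit-distance idea mentioned above can be sketched with the classic dynamic-programming Levenshtein algorithm (a textbook illustration, not our production implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))      # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion from a
                cur[j - 1] + 1,             # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

edit_distance("kitten", "sitting")  # classic example: distance 3
```

A distance of 0 means the strings are identical; small distances relative to the string lengths indicate near matches.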

Another common technique for string comparison is n-gram analysis, where the strings are divided into overlapping sequences of characters (n-grams) and the frequency of each n-gram is compared between the two strings. This allows for a more nuanced comparison that takes into account partial similarities, rather than just exact matches.
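A minimal illustration of n-gram comparison, using Jaccard similarity over character trigrams (again, a sketch of the technique rather than any particular tool’s code):

```python
def ngrams(text: str, n: int = 3) -> set:
    """Overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-gram sets:
    1.0 means identical n-gram sets, 0.0 means no shared n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Because partially overlapping texts still share many n-grams, this score degrades gracefully, which is what lets it catch near matches that an exact equality test would miss.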

String comparison is a crucial component of text comparison tools, as it forms the basis for determining the similarities and differences between texts. The results of the string comparison can then be used to generate a report or visual representation of the similarities and differences between the texts.

What is Syntax Highlighting?

Syntax highlighting is a feature of text editors and integrated development environments (IDEs) that helps to visually distinguish different elements of a code or markup language. It does this by coloring different elements of the code, such as keywords, variables, functions, and operators, based on a predefined set of rules.

The purpose of syntax highlighting is to make the code easier to read and understand, by drawing attention to the different elements and their structure. For example, keywords may be colored in a different hue to emphasize their importance, while comments or strings may be colored differently to distinguish them from the code itself. This helps to make the code more readable, reducing the cognitive load of the reader and making it easier to identify potential syntax errors.
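The rule-based idea behind syntax highlighting can be sketched in a few lines. This toy example (not how any real editor implements it) wraps Python keywords in an ANSI terminal color code:

```python
import keyword
import re

ANSI_BLUE, ANSI_RESET = "\033[34m", "\033[0m"

def highlight_keywords(code: str) -> str:
    """Toy syntax highlighter: color every Python keyword blue.
    Real highlighters apply rules per token class (keywords,
    strings, comments, operators) in the same spirit."""
    pattern = r"\b(" + "|".join(keyword.kwlist) + r")\b"
    return re.sub(pattern, ANSI_BLUE + r"\1" + ANSI_RESET, code)

highlight_keywords("def f(x): return x")
```

Real editors go further by tokenizing the source properly (so that a keyword inside a string is not colored), but the principle of matching token classes against predefined rules is the same.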

How Can I Conduct a Plagiarism Check between Two Documents Online?

With our tool it’s easy: just enter or upload your text, click the “Compare text” button, and the tool will automatically display the differences between the two texts.

What Are the Benefits of Using a Text Compare Tool?

Using text comparison tools is much easier, more efficient, and more reliable than proofreading a piece of text by hand. Eliminate the risk of human error by using a tool to detect and display the text difference within seconds.

What Files Can You Inspect with This Text Compare Tool?

We have support for the file extensions .pdf, .docx, .odt, .doc and .txt. You can also enter your text or copy and paste text to compare.

Will My Data Be Shared?

The tool never saves any data. When you hit “Upload,” we simply scan the text and paste it into the text area, so no data ever enters our servers.

Software License Agreement

Copyright © 2023, Originality.ai

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


The table below shows a heat map of features on other sites compared to ours; as you can see, we have greens almost across the board!

More From The Blog

AI Content Detector & Plagiarism Checker for Marketers and Writers

Use our leading tools to ensure you can hit publish with integrity!