AI Detection Accuracy Studies — Meta-Analysis of 6 Studies

A comprehensive overview and meta-analysis of academic research and studies that demonstrate the exceptional performance of Originality.ai in detecting AI-generated text.

Across the studies below, each of which compares the accuracy of AI detectors, Originality.ai has consistently emerged as the most accurate AI text detector, outperforming a wide range of other tools.

This article provides a meta-analysis of multiple research studies that showcase Originality.ai’s superior detection capabilities. These findings validate Originality.ai’s own AI detector accuracy study and provide reliable third-party evidence that Originality.ai can distinguish AI-generated content from human-written text.

Key Findings (TL;DR)

The Originality.ai AI Detector was identified as the most effective tool in all six of the published third-party studies below.

Originality.ai stands out as the most accurate tool for AI-generated text detection across multiple studies with high precision, recall, and overall accuracy. Originality.ai’s AI Content Checker has consistently outperformed other tools in detecting AI content and ensuring the authenticity of human-written text.

The following studies have been analyzed to assess the accuracy of AI-generated Text Detection Tools.

An Empirical Study of AI-Generated Text Detection Tools

The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

The Great Detectives: Humans vs. AI Detectors in Catching Large Language Model-Generated Medical Writing

Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023

Students are using large language models and AI detectors can often detect their use

Rankings

Study Title | Originality.ai’s Accuracy | Performance Highlights | Key Competitors
An Empirical Study of AI-Generated Text Detection Tools | 97% | Highest true positives, lowest false negatives | GPTZero, Writer
The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors | 97% | 100% accuracy on GPT-3.5 and GPT-4 papers | Copyleaks, TurnItIn
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors | 85% | Most accurate across base and adversarial datasets; exceptional performance on paraphrased content | Binoculars, FastDetectGPT
The Great Detectives: Humans vs. AI Detectors in Catching Large Language Model-Generated Medical Writing | 100% | 100% accuracy on ChatGPT-generated and AI-rephrased articles | ZeroGPT, GPT-2 Output Detector
Characterizing the Increase in AI Content Detection in Oncology Scientific Abstracts | 96% | 96% accuracy for AI-generated (GPT-3.5, GPT-4) abstracts with over 95% sensitivity | GPTZero, Sapling
Students are using large language models and AI detectors can often detect their use | 91% | Highest accuracy of 91% for human vs. AI and 82% for human vs. disguised text | GPTZero, ZeroGPT, Winston

Study Summaries

Study 1: An Empirical Study of AI-Generated Text Detection Tools

Based on An Empirical Study of AI-Generated Text Detection Tools, Originality.ai is the leading tool for detecting AI-generated text, achieving the highest accuracy of 97% and outperforming five other tools at distinguishing AI-generated from human-written content.

(Accuracy Comparison of AI Text Detection Tools on AH&AITD)

Key Findings

  • Accuracy: 97%
  • Precision: 98%
  • Recall: 96%
  • F1-score: 97%

Study Details

  • Tools Evaluated: Originality.ai, Zylalab, GPTKIT, GPTZero, Sapling, Writer
  • Dataset: 11,580 samples from AH&AITD dataset
  • Evaluation Criteria: Accuracy, Precision, Recall, F1 score, ROC curve, Confusion Matrix

Performance Highlights

  • Highest True Positives: 5,547
  • Lowest False Negatives: 243
  • Second Lowest False Positives: 94
  • Second Highest True Negatives: 5,696
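
These counts cover all 11,580 samples in the dataset, and the study’s headline metrics follow directly from them. The short sketch below (plain Python; the variable names are ours, not the study’s) reproduces the reported accuracy, precision, recall, and F1 score from the counts above:

```python
# Confusion-matrix counts reported for Originality.ai on the 11,580-sample AH&AITD dataset
tp = 5547  # AI-generated texts correctly flagged as AI
fn = 243   # AI-generated texts missed (labelled human)
fp = 94    # human-written texts wrongly flagged as AI
tn = 5696  # human-written texts correctly labelled human

accuracy = (tp + tn) / (tp + tn + fp + fn)                 # ~0.97
precision = tp / (tp + fp)                                 # ~0.98
recall = tp / (tp + fn)                                    # ~0.96
f1 = 2 * precision * recall / (precision + recall)         # ~0.97

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 score:  {f1:.2%}")
```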

Source

https://www.opastpublishers.com/peer-review/an-empirical-study-of-aigenerated-text-detection-tools-6354.html

Study 2: The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors

According to this comprehensive study on “The Effectiveness of Software Designed to Detect AI-Generated Writing,” in which 16 AI text detectors were evaluated, Originality.ai demonstrated remarkable accuracy in identifying AI-generated content. It ranked as a top performer across GPT-3.5, GPT-4, and human-written papers with an overall accuracy of 97%.

(% of all 126 documents for which each detector gave correct, uncertain, or incorrect responses)

Key Findings

  • Overall Accuracy: 97%
  • GPT-3.5 Accuracy: 100%
  • GPT-4 Accuracy: 100%
  • Human Papers Accuracy: 95%

Study Details

  • Tools Evaluated: Originality.ai, Copyleaks, TurnItIn, Scribbr, ZeroGPT, Grammica, GPTZero, Crossplag, OpenAI, IvyPanda, GPT Radar, SEO.ai, Content at Scale, Writer, Sapling, ContentDetector.ai
  • Top Performers: Originality.ai, Copyleaks, TurnItIn
  • Dataset: 126 short papers/essays that were generated by AI or first-year college students.
  • Evaluation Criteria: Overall accuracy, accuracy with each type of document, decisiveness, the number of false positives, and the number of false negatives.

Performance Highlights

  • Overall Accuracy: 97% (among the highest of the 16 detectors)
  • GPT-3.5 Accuracy: 100%
  • GPT-4 Accuracy: 100%
  • Decisiveness: High (few “uncertain” responses)
  • False Positives: Few
  • False Negatives: Few

Source

https://www.degruyter.com/document/doi/10.1515/opis-2022-0158/html

Study 3: RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

In the largest and most comprehensive study to date, RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors, Originality.ai outperformed 11 other leading AI detectors at identifying AI-generated content, achieving an accuracy of 85% on the base dataset and 96.7% on paraphrased content.

Key Findings

  • Base Dataset Accuracy: 85%
  • Adversarial Techniques: 1st in 9 out of 11 tests
  • Content Domains: 1st in 5 out of 8 domains
  • Paraphrased Content Accuracy: 96.7%

Study Details

  • Tools Evaluated:
    • Commercial: Originality.ai, GPTZero, Winston, ZeroGPT 
    • Metric-Based: GLTR, Binoculars, Fast DetectGPT, LLMDet 
    • Neural: RoBERTa-Base (GPT2), RoBERTaLarge (GPT2), RoBERTa-Base (ChatGPT), RADAR 
  • Dataset: 6,287,820 texts
  • Evaluation Criteria:
    • 11 Types of Adversarial attacks (strategies to make text undetectable)
    • Accuracy at 5% False Positive Threshold for all tests 
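
As the evaluation criteria above indicate, RAID measures accuracy at a 5% false positive threshold: a score cutoff is chosen so that no more than 5% of human-written texts are flagged, and accuracy is then the share of machine-generated texts caught at that cutoff. The sketch below (plain Python with NumPy; the score arrays are placeholders, not RAID data) illustrates the calibration step:

```python
import numpy as np

def accuracy_at_fpr(human_scores, machine_scores, target_fpr=0.05):
    """Calibrate a detection threshold on human-written texts so the false
    positive rate is at most `target_fpr`, then report the share of
    machine-generated texts detected at that threshold."""
    human_scores = np.sort(np.asarray(human_scores))
    # Threshold = the (1 - target_fpr) quantile of human scores:
    # only ~5% of human-written texts score above it.
    threshold = np.quantile(human_scores, 1 - target_fpr)
    detection_rate = float(np.mean(np.asarray(machine_scores) > threshold))
    return threshold, detection_rate

# Placeholder scores (0 = confidently human, 1 = confidently AI)
human_scores = np.random.beta(2, 8, size=1000)    # mostly low scores
machine_scores = np.random.beta(8, 2, size=1000)  # mostly high scores
thr, acc = accuracy_at_fpr(human_scores, machine_scores)
print(f"Threshold at 5% FPR: {thr:.3f}, detection accuracy: {acc:.1%}")
```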

Performance Highlights

  • Most Accurate AI Detector on the Base Dataset
  • Most Accurate AI Detector on 9 of 11 Adversarial Attack Types
  • Most Accurate AI Detector in 5 of 8 Content Domains
  • Exceptional Performance on Paraphrased Content (96.7%)

Source

https://arxiv.org/abs/2405.07940

Study 4: The Great Detectives: Humans versus AI Detectors in Catching Large Language Model-generated Medical Writing

The study, The Great Detectives: Humans versus AI Detectors in Catching Large Language Model-generated Medical Writing, directly compares the accuracy of advanced AI detectors and human reviewers in detecting AI-generated medical writing after paraphrasing.

Six common AI content detectors and four human reviewers were employed to differentiate between the original and AI-generated articles. Originality.ai emerged as the most sensitive and accurate platform for detecting AI-generated (including paraphrased) content.

(Accuracy of six AI content detectors in identifying AI-generated articles)

Key Findings

  • ChatGPT-Generated Articles Accuracy: 100%
  • AI-Rephrased Articles Accuracy: 100% 
  • Human evaluators performed worse than AI detectors

Study Details

  • Tools Evaluated:
    • Six AI detectors: Originality.ai, TurnItIn, GPTZero, ZeroGPT, Content at Scale, GPT-2 Output Detector
    • Four human reviewers: two student reviewers and two professorial reviewers
  • Dataset: 150 texts (academic papers) 
  • Evaluation Criteria: AI score or Perplexity score
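
Several detectors in this space score text by its perplexity: how predictable a passage is to a language model, with unusually low perplexity treated as a signal of machine generation. The sketch below is a minimal, hypothetical illustration of computing such a score with the Hugging Face transformers library and GPT-2; it is not the scoring method of any specific tool in the study.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token-level loss."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The mitochondria is the powerhouse of the cell."))
```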

Performance Highlights

  • Only AI detector to identify 100% of AI-generated content
  • Only AI detector to identify 100% of AI-rephrased content

Source

https://link.springer.com/article/10.1007/s40979-024-00155-6

Study 5: Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023

The study Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023 examines the effectiveness of three AI-content detectors (Originality.ai, GPTZero, and Sapling) in identifying AI-generated content in scientific abstracts submitted to the ASCO Annual Meetings from 2021 to 2023.

(Accuracy of AI content detectors in classifying human-written and AI-generated content)

Key Findings

  • Perfect AUROC scores of 1.00 for GPT-3.5 and nearly perfect for GPT-4
  • High AUPRC for distinguishing AI-generated from human-written abstracts

Study Details

  • Three Tools Evaluated: Originality.ai, GPTZero, Sapling
  • Dataset: 15,553 oncology scientific abstracts from ASCO Annual Meetings (2021-2023) 
  • Evaluation Criteria: AUPRC, AUROC, Brier Score
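
For readers unfamiliar with these metrics, the sketch below shows one standard way to compute them with scikit-learn; the labels and scores are placeholders rather than the study’s data.

```python
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

# Placeholder data: 1 = AI-generated abstract, 0 = human-written abstract;
# scores = a detector's probability that each abstract is AI-generated.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.96, 0.88, 0.91, 0.12, 0.35, 0.07, 0.79, 0.22]

auroc = roc_auc_score(y_true, scores)            # area under the ROC curve
auprc = average_precision_score(y_true, scores)  # area under the precision-recall curve
brier = brier_score_loss(y_true, scores)         # mean squared error of the probabilities

print(f"AUROC: {auroc:.2f}  AUPRC: {auprc:.2f}  Brier: {brier:.3f}")
```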

Performance Highlights

  • GPT-3.5 vs. Human: 99.7%
  • GPT-4 vs. Human: 98.7%
  • Mixed GPT-3.5 vs. Human: 87.8%
  • Mixed GPT-4 vs. Human: 81.5%

Source

https://ascopubs.org/doi/pdfdirect/10.1200/CCI.24.00077

Study 6: Students Are Using Large Language Models and AI Detectors Can Often Detect Their Use

The study Students are using large language models and AI detectors can often detect their use explored how students use LLMs in their college work at the University of Wisconsin-Madison and evaluated the effectiveness of AI detectors at identifying AI-generated text.

The researchers evaluated five AI detectors (Content at Scale, GPTZero, ZeroGPT, Winston, and Originality.ai); however, due to poor performance, Content at Scale was excluded from further analysis.

(Accuracy of AI content detectors)

Key Findings

  • Highest Accuracy of 91% for Human vs. AI and 82% for Human vs. Disguised Text
  • Top F1 Score of 92% for Human vs. AI and a near-top score of 80% for Human vs. Disguised Text

Study Details

  • Tools Analyzed: Originality.ai, GPTZero, Winston, ZeroGPT
  • Dataset: 459 unique essays on the regulation of the tryptophan operon (human-written, AI-generated, disguised AI-generated) 
  • Evaluation Criteria: Accuracy, Precision, Recall, F1 score

Performance Highlights

  • Accuracy (Human vs. AI): 0.91
  • Precision (Human vs. AI): 0.85
  • Recall (Human vs. AI): 1.0
  • F1 Score (Human vs. AI): 0.92
  • F1 Score (Human vs. Disguised): 0.80
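
As a quick consistency check, the reported F1 score follows from the precision and recall above: F1 = 2 × (0.85 × 1.0) / (0.85 + 1.0) ≈ 0.92.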

Source

https://www.frontiersin.org/articles/10.3389/feduc.2024.1374889/full

Jonathan Gillham

Founder / CEO of Originality.ai. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; recently, I sold two content marketing agencies, and I am the co-founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone’s content strategy. However, I believe you, as the publisher, should be the one deciding when to use AI content. Our originality checking tool has been built with serious web publishers in mind!
