The study by William H. Walters, "The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors," provides a comprehensive evaluation of 16 publicly available AI text detectors. According to its findings, the top-performing detectors showed remarkable accuracy in identifying AI-generated text.

Key Findings (TL;DR)

  • Three AI text detectors, among them Copyleaks and TurnItIn, demonstrated exceptional accuracy across all three sets of documents examined in this study: GPT-3.5 papers, GPT-4 papers, and human-generated papers.
  • Although most of the other detectors in the study could distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy, they were less effective at distinguishing between GPT-4 papers and student-written papers.

Study Details

This study evaluated the accuracy of 16 publicly available AI text detectors in distinguishing AI-generated writing from human-written writing. The evaluated documents comprised 126 essays, some generated by AI and some written by humans, across various domains.

Each detector’s performance was assessed for its overall accuracy, its accuracy with each type of document, its decisiveness, the number of false positives, and the number of false negatives. The featured detector was one of the three highest-performing AI detectors, demonstrating high accuracy with all three sets of documents.

AI Text Detection Tools

  • 16 AI detection tools, including Copyleaks, TurnItIn, Scribbr, ZeroGPT, Grammica, GPTZero, Crossplag, OpenAI, IvyPanda, GPT Radar, Content at Scale, Writer, and Sapling
  • Top performers: three detectors, including Copyleaks and TurnItIn


GPT-3.5 and GPT-4 were each used to generate 42 short papers (literature reviews) similar to the papers typically assigned to students in first-year composition courses at US universities. The topics covered the social sciences, natural sciences, and humanities.

The 42 human-written documents were gathered from a set of 178 papers submitted by Manhattan College English 110 (First Year Composition) students during the 2014 to 2015 academic year. 

The 2014 to 2015 timeframe was selected because it predates the widespread availability of AI tools such as ChatGPT, which ensures the papers were written without AI.

This dataset consists of 126 documents divided into the following:

  • 42 undergraduate-style essays generated by GPT-3.5
  • 42 undergraduate-style essays generated by GPT-4
  • 42 human-written essays from a first-year composition course

Evaluation Criteria

Each detector’s performance was assessed for:

  • Its overall accuracy across all 126 documents.
  • Its accuracy when tested against each of the three sets of documents.
  • Its decisiveness (the relative number of uncertain responses).
  • The number of false positives (human-generated papers designated as AI by the detector).
  • The number of false negatives (AI-generated papers designated as human by the detector). 
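
The criteria above can be sketched as a small calculation over coded responses. This is only an illustration, not the study's own procedure: the function name `summarize`, the label strings, and the sample data are hypothetical, and reporting decisiveness as one minus the share of uncertain responses is an assumption about how "relative number of uncertain responses" might be expressed.

```python
def summarize(results):
    """Summarize one detector's coded responses.

    results: list of (truth, response) pairs, where truth is "ai" or
    "human" and response is "ai", "human", or "uncertain".
    """
    total = len(results)
    correct = sum(1 for truth, resp in results if resp == truth)
    uncertain = sum(1 for _, resp in results if resp == "uncertain")
    # False positive: a human-written paper labeled as AI.
    false_pos = sum(1 for truth, resp in results
                    if truth == "human" and resp == "ai")
    # False negative: an AI-generated paper labeled as human.
    false_neg = sum(1 for truth, resp in results
                    if truth == "ai" and resp == "human")
    return {
        "accuracy": correct / total,
        # Assumed definition: share of responses that were not "uncertain".
        "decisiveness": 1 - uncertain / total,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


# Hypothetical sample of ten coded responses (not data from the study).
sample = (
    [("ai", "ai")] * 6
    + [("ai", "human")]
    + [("human", "human")] * 2
    + [("human", "uncertain")]
)
print(summarize(sample))
```

Note that an uncertain response counts as neither correct nor incorrect here, so accuracy and the error counts do not have to sum to the total.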

The analysis involved four steps:

  1. Preparing the three sets of documents.
  2. Selecting the 16 AI text detectors to include in the study.
  3. Using each detector to evaluate the 126 documents and code the responses as AI, human, or uncertain.
  4. Evaluating the accuracy of each detector and its ability to identify AI-generated and human-generated text.

Detector Performance

The featured detector excelled at detecting both AI-generated and human-written documents with high accuracy. The study positioned it as a top-tier AI-generated text detection tool with a 97% overall accuracy rate.

All 126 documents

  • Correct: 98%
  • Incorrect: 0%
  • Uncertain: 2%

Figure: Percentage of all 126 documents for which each detector gave correct or incorrect responses

GPT-3.5 Papers

  • Correct: 100%
  • Incorrect: 0%
  • Uncertain: 0%

Figure: Percentage of the 42 GPT-3.5 documents for which each detector gave correct, uncertain, or incorrect responses

GPT-4 Papers

  • Correct: 100%
  • Incorrect: 0%
  • Uncertain: 0%

Figure: Percentage of the 42 GPT-4 documents for which each detector gave correct, uncertain, or incorrect responses

Human Papers

  • Correct: 95%
  • Incorrect: 0%
  • Uncertain: 5%

Figure: Percentage of the 42 human-generated documents for which each detector gave correct, uncertain, or incorrect responses

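
The per-set percentages above can be reconciled with the figures for all 126 documents by converting them back to document counts. The sketch below assumes that 95% and 5% of the 42 human papers correspond to 40 correct and 2 uncertain responses; those counts are inferred by rounding and are not stated in the text, and the names `counts` and `overall_percentages` are hypothetical.

```python
# Counts per document set, inferred from the reported percentages
# (assumption: 95% of 42 is 40 papers, 5% of 42 is 2 papers).
counts = {
    "GPT-3.5": {"correct": 42, "incorrect": 0, "uncertain": 0},
    "GPT-4": {"correct": 42, "incorrect": 0, "uncertain": 0},
    "Human": {"correct": 40, "incorrect": 0, "uncertain": 2},
}


def overall_percentages(counts):
    """Pool the per-set counts and return whole-number percentages."""
    total = sum(sum(per_set.values()) for per_set in counts.values())
    pooled = {}
    for outcome in ("correct", "incorrect", "uncertain"):
        n = sum(per_set[outcome] for per_set in counts.values())
        pooled[outcome] = round(100 * n / total)
    return pooled


print(overall_percentages(counts))
# {'correct': 98, 'incorrect': 0, 'uncertain': 2}
```

Under that rounding assumption, 124 of 126 documents are classified correctly, which matches the 98% correct and 2% uncertain figures listed for all 126 documents.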
Effectiveness Ratings

  • Overall Accuracy: Very high
  • Accuracy, GPT-3.5: Very high
  • Accuracy, GPT-4: Very high
  • Decisiveness: High
  • False Positives: Few
  • False Negatives: Few

Table: Effectiveness of the 16 AI text detectors

Final Thoughts

The featured detector is positioned as a top-tier AI-generated text detection tool with exceptional accuracy. Its performance is on par with, and sometimes even better than, other leading tools like Copyleaks and TurnItIn, which helps ensure the authenticity and integrity of written content.

