2024 Study on Large Language Model-Generated Text in Medical Writing Concludes That Originality.ai Excels at Detecting AI-Generated and AI-Paraphrased Text

Originality.ai is exceptional at identifying AI-generated medical articles, according to the 2024 study “The Great Detectives: Humans vs. AI Detectors in Catching Large Language Model-Generated Medical Writing.”

Key Findings (TL;DR)

  • Originality.ai emerged as the most sensitive and accurate platform, detecting 100% of ChatGPT-generated and AI-paraphrased content.
  • Human reviewers were less accurate. Student reviewers correctly identified only 76% of AI-paraphrased articles, and professors misclassified 12% of human-written content as AI-generated.

Study Details

The study evaluated six common AI content detectors and four human reviewers (two students and two professors). Both the detectors and the reviewers were tasked with distinguishing among 150 academic papers made up of original, ChatGPT-generated, and AI-rephrased content.

(An outline of the study)

AI Text Detectors and Human Reviewers

  • Six AI Text Detectors: Originality.ai, Turnitin, GPTZero, ZeroGPT, Content at Scale, and GPT-2 Output Detector.
  • Four Human Reviewers: Two students and two professors.

The dataset consisted of 150 academic papers.

  • 50 Original Papers: Rehabilitation-related articles from four peer-reviewed journals.
  • 50 AI-Generated Papers: ChatGPT generated the introduction, discussion, and conclusion sections based on the original titles, methods, and results. 
  • 50 AI-Rephrased Papers: Wordtune rephrased the ChatGPT-generated articles. 

Evaluation Criteria

  • Accuracy, misclassification rate, receiver operating characteristic (ROC) curve, sensitivity, specificity, time taken (human reviewers only), and reasons for classification (human reviewers only).

Originality.ai’s AI Detector Results

Finding 1: Originality.ai detected 100% of both ChatGPT-generated and AI-rephrased articles

(The accuracy of six AI content detectors in identifying AI-generated articles)

Finding 2: Originality.ai scored the highest mean AI score

  • For ChatGPT-generated articles: 98.74%
  • For AI-rephrased articles: 99.74% 

(The mean AI scores of 50 ChatGPT-generated articles before and after rephrasing)
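Mean AI scores summarize how strongly a detector leans toward "AI" on average, while the ROC curve listed among the evaluation criteria asks how well a detector's scores separate AI-written from human-written articles across every possible cut-off. As a rough illustration only (not the study's analysis code), the Python sketch below computes the area under the ROC curve using the Mann-Whitney formulation. The function name roc_auc and all score values are hypothetical.

```python
from typing import Sequence


def roc_auc(human_scores: Sequence[float], ai_scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen AI-written article receives
    a higher AI score than a randomly chosen human-written one (ties count 0.5).
    """
    if not human_scores or not ai_scores:
        raise ValueError("need at least one score per class")
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Hypothetical AI scores (in percent); these are NOT data from the study.
human = [2.0, 5.0, 40.0, 10.0]
ai = [99.0, 97.5, 35.0, 100.0]
print(roc_auc(human, ai))  # 0.9375
```

With these made-up scores, three AI articles outscore all four human articles and one outscores three of them, so the AUC is 15/16 = 0.9375. The pairwise loop is quadratic, which is fine for 50 articles per class; a rank-based version scales better for larger sets.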

Finding 3: Originality.ai ranked third for the lowest percentage of misclassification or uncertainty on human-written articles

(The percentage of human-written articles misclassified as AI-generated by the detectors)

  • Turnitin and GPT-2 Output Detector were excellent at identifying human-written texts but less effective with AI-rephrased content.

Finding 4: Human reviewers were less accurate

  • Student reviewers identified only 76% of AI-rephrased articles.
  • Professors misclassified 12% of human-written texts as AI-generated.

In the graphs below, Reviewer 1 and Reviewer 2 are college students, while Reviewer 3 and Reviewer 4 are professors.

(The accuracy of four human reviewers in identifying AI-rephrased articles)

(The percentage of human-written articles misclassified as AI-rephrased by the reviewers)
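The human-reviewer percentages in Finding 4 map onto the sensitivity and specificity criteria: sensitivity is the share of AI-written articles flagged as AI, and specificity is the share of human-written articles correctly left unflagged. The Python sketch below is a minimal illustration, assuming 50 articles per class as in the dataset. The Confusion class and its counts are hypothetical; the counts echo the 76% and 12% figures, but in the study those came from different reviewers, so this is not a reconstruction of any real reviewer's results.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical classification counts, treating "AI-generated" as positive."""

    tp: int  # AI-written articles correctly flagged as AI
    fn: int  # AI-written articles missed (labelled human)
    tn: int  # human-written articles correctly labelled human
    fp: int  # human-written articles wrongly flagged as AI

    def sensitivity(self) -> float:
        # Share of AI-written articles that were caught.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of human-written articles left unflagged.
        return self.tn / (self.tn + self.fp)

    def misclassification_rate(self) -> float:
        # In this sketch: share of human-written articles wrongly flagged
        # as AI, i.e. 1 - specificity.
        return self.fp / (self.tn + self.fp)

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


# Hypothetical reviewer: flags 38 of 50 AI-rephrased articles (76%) and
# wrongly flags 6 of 50 human-written articles (12%).
reviewer = Confusion(tp=38, fn=12, tn=44, fp=6)
print(reviewer.sensitivity())             # 0.76
print(reviewer.specificity())             # 0.88
print(reviewer.misclassification_rate())  # 0.12
print(reviewer.accuracy())                # 0.82
```

Keeping the two error types separate matters here: the student reviewers' weakness shows up as lower sensitivity, while the professors' shows up as lower specificity.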

Final Thoughts

Of the AI detectors and human reviewers involved in the study, Originality.ai stood out as the most effective tool for identifying AI-generated medical writing, including paraphrased content. Using Originality.ai can significantly enhance the peer-review process and help uphold academic integrity in scientific publishing.

Jonathan Gillham

Founder / CEO of Originality.ai

I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; more recently, I sold two content marketing agencies, and I co-founded a marketplace that is the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone's content strategy. However, I believe that you, as the publisher, should be the one deciding when to use AI content. Our originality-checking tool has been built with serious web publishers in mind!

