
According to ASCO Research, AI-Generated Text in Scientific Abstracts Is Increasing, and Originality.ai’s High Accuracy Is Key to Preserving the Integrity of Scientific Literature

AI-generated text in scientific literature is becoming increasingly common, and according to research by the American Society of Clinical Oncology, Originality.ai excels at detecting AI-generated content in scientific abstracts.

AI-generated content is becoming increasingly common in every sector, and scientific abstracts are no exception. In 2024, the American Society of Clinical Oncology (ASCO) noted a significant increase in the use of large language models (LLMs) for writing scientific abstracts in its study, “Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023.”

The ASCO study evaluated the performance of three AI content detectors (Originality.ai, GPTZero, and Sapling) in identifying AI-generated content in scientific abstracts submitted to the ASCO Annual Meetings from 2021 to 2023.

Key Findings (TL;DR)

  • Originality.ai accurately identified 96% of AI-generated abstracts written by GPT-3.5 and GPT-4, with a sensitivity of over 95%.
  • Originality.ai showed a low-to-moderate Spearman correlation with GPTZero (ρ = 0.284) and a lower correlation with Sapling (ρ = 0.143), indicating its predictions are largely independent of the other detectors.

Study Details

This study analyzed 15,553 oncology scientific abstracts from ASCO Annual Meetings between 2021 and 2023. AI-generated content in the abstracts increased significantly from 2021 to 2023. Logistic regression models were used to evaluate the association of predicted AI content with submission years and abstract characteristics. 
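The study’s logistic-regression step models how the odds that an abstract is flagged as containing AI-generated text change with submission year. Below is a minimal pure-Python sketch of that kind of fit, using hypothetical data — the per-year rates are illustrative assumptions, not ASCO’s figures:

```python
import math
import random

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by plain gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n
            g1 += (p - y) * x / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Hypothetical data: x is the year offset (0 = 2021 ... 2 = 2023) and y is a
# binary "flagged as AI-containing" label whose rate rises with the year.
random.seed(0)
years = [random.choice([0, 1, 2]) for _ in range(600)]
rates = {0: 0.10, 1: 0.20, 2: 0.35}  # assumed prevalences, not ASCO's numbers
flags = [1 if random.random() < rates[y] else 0 for y in years]

b0, b1 = fit_logistic(years, flags)
odds_ratio = math.exp(b1)  # odds multiplier per additional year
```

A fitted slope above zero (equivalently, an odds ratio above 1) corresponds to the study’s finding that predicted AI content rose with submission year.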

AI Text Detection Tools

  • Three AI-Content Detection Tools: GPTZero, Originality.ai, and Sapling.

The table below clarifies why the researchers chose GPTZero, Originality.ai, and Sapling for this study and why they excluded other AI-generated text detection tools.


  • The dataset comprised 15,553 abstracts from ASCO Annual Meetings from 2021 to 2023, accessed through ASCO’s Data Library. 
  • Key characteristics were tabulated for each abstract, including abstract track, venue of presentation, inclusion of a clinical trial number, and the countries and regions of the first author’s affiliated institutions.

The table below shows characteristics of ASCO Annual Meeting abstracts and authors, from 2021 to 2023.

Evaluation Criteria

  • AUROC (Area Under the Receiver Operating Characteristic Curve) and AUPRC (Area Under the Precision-Recall Curve) — for accuracy.
  • Brier Score — for evaluating prediction error.
  • Logistic Regression — for analyzing the association of AI content with abstract characteristics.
  • Spearman Correlation — for comparing predictions between different detectors.

Originality.ai’s AI Detector Results

Finding 1: Perfect AUROC score of 1.00 for GPT-3.5 and nearly perfect for GPT-4

(Accuracy of AI content detectors in classifying human-written and AI-generated content)
  • AUROC for GPT-3.5 vs. Human: Perfect scores of 1.000 for all years.
  • AUROC for GPT-4 vs. Human: Slight improvement over the years, reaching up to 0.997.
  • AUROC for Mixed GPT-3.5 vs. Human: 0.782 in 2021, improving slightly to 0.788 in 2023.
  • AUROC for Mixed GPT-4 vs. Human: 0.706 in 2021, improving to 0.715 in 2023.
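A perfect AUROC of 1.000 means every AI-generated abstract received a higher detector score than every human-written one. The metric has a direct rank interpretation, sketched below on toy detector scores (not the study’s data):

```python
def auroc(labels, scores):
    """AUROC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive (AI) example outscores a randomly chosen
    negative (human) one, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: the detector assigns high AI-likelihood to AI-written abstracts.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.99, 0.95, 0.90, 0.10, 0.05, 0.40]
print(auroc(labels, scores))  # 1.0: every AI score beats every human score
```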

Finding 2: High AUPRC to differentiate between AI-generated and human-written abstracts

  • AUPRC for GPT-3.5 vs. Human: High performance, showing slight improvement over the years.
  • AUPRC for GPT-4 vs. Human: Strong performance with a slight improvement over the years.
  • AUPRC for Mixed GPT-3.5 vs. Human: High performance, improving steadily.
  • AUPRC for Mixed GPT-4 vs. Human: Strong performance, improving over the years.
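AUPRC complements AUROC by focusing on precision at each level of recall, which matters when AI-generated abstracts are a small minority of submissions. A minimal sketch using rectangular interpolation, again on toy data rather than the study’s:

```python
def auprc(labels, scores):
    """Area under the precision-recall curve: lower the decision threshold one
    score at a time and accumulate precision over each gain in recall."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    total_pos = sum(labels)
    tp = fp = 0
    area = prev_recall = 0.0
    for _, label in pairs:
        tp += label
        fp += 1 - label
        recall = tp / total_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

print(auprc([1, 0, 1], [0.9, 0.8, 0.3]))  # one false positive drags the area below 1.0
```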

Finding 3: Low-to-moderate Spearman correlation with other detectors, indicating largely independent predictions

  • Moderate correlation with GPTZero (ρ = 0.284) 
  • Lower correlation with Sapling (ρ = 0.143)
(Correlation between outputs across pairs of AI content detectors)
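Spearman’s ρ compares detectors on rank order alone, so it measures whether two tools sort the same abstracts toward the top, not whether their raw scores agree. A small pure-Python sketch on hypothetical scores:

```python
def ranks(values):
    """1-based ranks, averaging the rank across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Two detectors scoring the same four abstracts (hypothetical values).
print(spearman([0.9, 0.2, 0.7, 0.4], [0.8, 0.1, 0.9, 0.3]))  # 0.8
```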

Finding 4: Second-lowest Brier score, indicating accurate predictions with minimal error and a low rate of false positives

  • Brier Score for GPT-3.5 vs. Human: Low scores, around 0.013 to 0.015.
  • Brier Score for GPT-4 vs. Human: Improvement from 0.027 in 2021 to 0.025 in 2023.
  • Brier Score for Mixed GPT-3.5 vs. Human: Around 0.400, showing high prediction error.
  • Brier Score for Mixed GPT-4 vs. Human: Around 0.426, showing high prediction error.
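The Brier score is the mean squared error between the predicted AI-probability and the 0/1 truth: 0.0 is perfect, and always guessing 0.5 yields 0.25. The sketch below, on made-up predictions, shows why confident correct predictions earn low scores while hedged ones do not:

```python
def brier(labels, probs):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

labels = [1, 1, 0, 0]
confident = [0.95, 0.90, 0.05, 0.10]  # sharp, correct predictions
hedged = [0.60, 0.55, 0.45, 0.40]     # right direction, low confidence

print(brier(labels, confident))  # near zero: confidently correct
print(brier(labels, hedged))     # far higher, despite being directionally right
```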

Final Thoughts

Originality.ai excels at detecting AI-generated content in scientific abstracts. Its high accuracy, low false-positive rate, and adaptability to different abstract characteristics make it a valuable tool for researchers, publishers, and academic institutions committed to preserving the integrity of scientific literature. AI-generated text detection tools are particularly important for maintaining trust in scientific research and publications.

Jonathan Gillham

Founder / CEO of Originality.ai. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; I recently sold two content marketing agencies, and I am the co-founder of a leading marketplace for buying and selling content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am neither for nor against AI content; I think it has a place in everyone's content strategy. However, I believe you, as the publisher, should be the one deciding when to use AI content. Our Originality checking tool has been built with serious web publishers in mind!
