Is GPT-o1 Content Detectable?

In September 2024, OpenAI released a new model — GPT-o1. Review a brief study of Originality.ai’s accuracy at detecting GPT-o1-generated text.

OpenAI released a new series of AI models designed to spend more time thinking before they respond — GPT-o1-preview. In the release announcement, OpenAI states that the model can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

Because the new model was trained with reasoning traces and can deliberate before answering, it outperforms earlier models in some domains. To maintain the authenticity and integrity of written content available online, AI content detectors need to keep pace.

This brief study looks at 1000 GPT-o1-preview generated text results to find out whether the Originality.ai AI Detector can detect GPT-o1-preview.

TL;DR — Is GPT-o1 AI-Generated Content Detectable?

  • Yes — GPT-o1-generated text is detectable with a high degree of accuracy across the Originality.ai AI detection models.
  • Results Overview — 3.0.0 Turbo demonstrated 93.47% accuracy, 1.0.0 Lite had 91.66% accuracy, and 2.0.1 Standard had 94.47% accuracy.
  • Originality.ai showed strong performance in contrast to GPTZero — While GPTZero struggled to detect 'rewrite human-written text' samples, Originality.ai continued to show robust AI detection capabilities.
  • Accuracy Will Quickly Improve — As with most model releases by OpenAI, Originality.ai will quickly work to improve its accuracy to 99%+.

Try the Originality.ai AI Detector. Then, learn about AI content detection accuracy and Originality’s strong performance in a meta-analysis of six third-party studies.

Dataset

To evaluate the detectability of GPT-o1, we prepared a dataset of 1000 GPT-o1-preview generated text samples.

The Method: gathering AI-generated text data

For AI-text generation, we used GPT-o1-preview based on three approaches:

  1. Rewrite prompts: The first approach generated content by providing the model with a custom prompt and articles (probably generated by LLMs) as a reference to rewrite from (450 Samples).
  2. Rewrite human-written text: The second approach involved generating content with the aim of finding out whether AI rewrites of human-written text could bypass AI detection. The samples for this approach were fetched from an open-source dataset (325 Samples).
    • Source: One-Class Learning for AI-Generated Essay Detection
      • Paper: https://www.mdpi.com/2076-3417/13/13/7901
      • Dataset: https://github.com/rcorizzo/one-class-essay-detection
  3. Write articles from scratch: The third approach generated articles from scratch on topics ranging from fiction to nonfiction, such as history, medicine, mental health, content marketing, social media, literature, robots, and the future (225 Samples).

Evaluation

To evaluate detection efficacy, we used the open-source AI Detection Efficacy tool that we released.

A brief overview of the Originality.ai AI detection models

Originality.ai has three AI text detection models: 3.0.0 Turbo, 2.0.1 Standard, and 1.0.0 Lite.

  • 3.0.0 Turbo — For when your risk tolerance for AI is zero. It is designed to identify any use of AI, even light AI editing.
  • 2.0.1 Standard — A balanced model that is a great option if you are okay with slight use of AI (i.e., AI editing).
  • 1.0.0 Lite — If you permit light AI editing (such as Grammarly’s spelling or grammar suggestions).

For additional information on each of these models, check out our AI detector and read our AI detection accuracy guide.

Evaluating the AI detection models

The open-source testing tool returns a variety of metrics for each detector you test, each of which reports on a different aspect of that detector’s performance, including:

  • Sensitivity (True Positive Rate): The percentage of time the detector identifies AI correctly.
  • Specificity (True Negative Rate): The percentage of time the detector identifies human-written content correctly.
  • Accuracy: The percentage of the detector’s predictions that were correct.
  • F1: The harmonic mean of Precision and Sensitivity (recall), often used as an aggregate metric when ranking the performance of multiple detectors.

For a detailed discussion of these metrics, what they mean, how they're calculated, and why we chose them, check out our blog post on AI detector evaluation. For a succinct snapshot, the confusion matrix is an excellent representation of a model's performance.
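As an illustration of how these metrics relate to a confusion matrix, here is a generic sketch (not the open-source tool's actual code, and the counts used below are hypothetical):

```python
# Generic sketch: computing the metrics described above from raw
# confusion-matrix counts (hypothetical numbers, not the study's data).

def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive standard detection metrics from confusion-matrix counts.

    tp: AI samples flagged as AI        fn: AI samples missed
    tn: human samples passed as human   fp: human samples flagged as AI
    """
    sensitivity = tp / (tp + fn)                # True Positive Rate
    specificity = tn / (tn + fp)                # True Negative Rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall correctness
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts for a mixed set of 1000 AI and 500 human samples:
metrics = detection_metrics(tp=920, fp=20, tn=480, fn=80)
print({name: round(value, 4) for name, value in metrics.items()})
```

Note that on an AI-only dataset (as in this study) there are no human samples, so specificity is undefined and recall is the headline number.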

Below is an evaluation of all these models on the above dataset. 

Confusion Matrix

Figure 1. Confusion Matrix on AI only dataset with Model 1.0.0 Lite
Figure 2. Confusion Matrix on AI only dataset with Model 2.0.1 Standard
Figure 3. Confusion Matrix on AI only dataset with Model 3.0.0 Turbo

Results of the Evaluation:

For this smaller test of the ability of Originality.ai’s AI detectors to detect GPT-o1-preview content, we reviewed the True Positive Rate: the percentage of the 1000 GPT-o1-preview samples that each model correctly identified as AI.

1.0.0 Lite:

  • Recall (True Positive Rate) = 91.66%

2.0.1 Standard:

  • Recall (True Positive Rate) = 94.47%

3.0.0 Turbo:

  • Recall (True Positive Rate) = 93.47%
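Since every sample in this dataset is AI-generated, recall reduces to a simple fraction of samples flagged as AI. A minimal sketch (with hypothetical counts, not the study's actual figures):

```python
# Hypothetical sketch: on an AI-only dataset there are no human samples,
# so recall (True Positive Rate) is just the fraction flagged as AI.

def recall(flagged_as_ai: int, total_samples: int) -> float:
    """True Positive Rate when every sample is AI-generated."""
    return flagged_as_ai / total_samples

# Illustrative counts: a detector flagging 945 of 1000 AI samples.
print(f"{recall(945, 1000):.2%}")  # → 94.50%
```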

Can GPTZero Detect GPT-o1?

To compare the efficacy of AI detectors, we also evaluated the same dataset with the GPTZero AI detector. The results are included below.

GPTZero Performance:

  • Recall (True Positive Rate) = 56.88%
Figure 4. Confusion Matrix on AI only dataset with GPTZero

Comparison Table

Tool                         F1     Accuracy   TPR
Originality 3.0.0 Turbo      97%    93%        93%
Originality 2.0.1 Standard   97%    94%        94%
Originality 1.0.0 Lite       96%    92%        92%
GPTZero                      73%    57%        57%

Based on the results, GPTZero significantly struggled to detect the 'rewrite human-written text' samples, whereas Originality.ai continued to demonstrate strong performance.

Final Thoughts

Overall, Originality.ai continues to demonstrate an outstanding capability to identify AI-generated content, including the latest releases of AI models such as OpenAI’s GPT-o1, GPT-4o, and GPT-4o-mini.

Each of Originality.ai’s AI detection models detected GPT-o1 with a high degree of accuracy: 94.47% for 2.0.1 Standard, 93.47% for 3.0.0 Turbo, and 91.66% for 1.0.0 Lite. As with most new models released by OpenAI, our machine learning engineers are continuing to work toward 99%+ accuracy.

Jonathan Gillham

Founder / CEO of Originality.ai. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; recently, I sold two content marketing agencies, and I am the Co-Founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone's content strategy. However, I believe you, as the publisher, should be the one making the decision on when to use AI content. Our originality checking tool has been built with serious web publishers in mind!
