by Jonathan Gillham
Share

In this detailed post, our team looks at the accuracy of our AI detection and its ability to accurately detect Chat GPT, GPT-4, paraphrased and other popular NLP models.
Additionally, we compare the accuracy of Originality.AI to other AI detection tools including Open AI’s classifier, GPTZero, Writer.AI and CopyLeaks.
We try and provide a lot of detail on the source of the testing data so that you can replicate this test and come to your own conclusion.
Release Notes:
The AI at Originality.AI has been being developed since before ChatGPT took over the world. It has been updated several times and below is a summary of the major releases.
- 1.1 – Nov 2022 – GPT-2 and GPT-3 accurate detection. But was able to be “tricked” with Paraphrasing
- 1.4 – Apr 2023
- Improved Chat GPT detection
- Trained on and Acceptable Accuracy on GPT-4 Generated Content
- Only tool capable of detecting Paraphrased content. The most common way to bypass Originality.AI detection was to use ChatGPT to generate content and then use a paraphraser like Quillbot to paraphrase the content. With this release, this is no longer a viable way to pass off AI content as human.
- Reduced the number of false positives with increased training on human-generated content
- See a companion blog post addressing how clients should think about false positives and what writers can do if wrongly accused of using AI – https://originality.ai/ai-content-detector-false-positives/
Summary:
Originality.AI is by far the most accurate AI detection tool on the market across all NLP models.
It especially outperforms the alternative detection tools on the latest and most challenging NLP models (ChatGPT, GPT-4)
Additionally, it is the only AI detection tool trained to be able to detect Paraphrased content (such as Quillbot).
Confusion Matrix for Each Tool Across the 1200 Samples Tests (same data as the graph above but presented in a “confusion matrix”)
Why is Originality.AI So Much More Accurate?
Our AI team is wildly smart and have been working on this problem since BEFORE ChatGPT. However, many of these companies have VERY smart machine learning engineers or PhDs.
A significant difference in Originality.AI compared to the other tools listed above is that Originality.AI is a premium tool first (does not have a significant free option) and therefore we are willing to use A LOT more computing power when running our detection model.
That is why other tools can offer free AI detection tools. A free AI detection tool for Originality.AI would be financially unsustainable.
I would compare it to being in a car race where we are able to use a monster 12-cylinder engine with hundreds of horsepower while the competition is trying to compete with a go-kart engine.
AI Detection Scores Will Never Be Enough – Visualize the Creation Process!
Although Originality.AI has the most accurate AI detection score we don’t believe that they should be relied on 100%. False positives (although low) will never get to ZERO.
That is the reason we built the Free Google Chrome extension that allows you to visualize the creation process of a document.
AI Content Detector Chrome Extension
Originality.AI AI Content Detection Accuracy Test
For the most recent release, Originality.AI 1.4 released April 9, 2023 here are the details
Prepare data
- Our training data consists of millions of data samples, half of which are labeled as man-written, the rest are generated from more than 10 models including important ones like GPT-2 /J/ 3/3/3.5/4 ChatGPT, Paraphrase.
- Our model is also analyzed and evaluated on millions of data.
- To generate data generated from chat models like ChatGPT or GPT-4, we created short conversations with 2 to 5 messages back and forth prompts and models like ChatGPT or GPT-4. Then, the final response data from these models will be fed into training.
Based on millions of documents, used by us to evaluate our model. Therefore, the results presented below will represent reliable scores.
The results show the change from the current version of Originality.AI to the new version.
The tables below used to understand the accuracy of the AI Detection Models is called a “confusion matrix” and the image below showing Originality.AI’s accuracy for GPT-4 detection has comments helping you understand how you should read these images…
GPT 3 / 3.5 Content Detection Accuracy – Previous Model vs New Release
- This and our previous version worked very well for content generated from AI models from GPT-3/3.5 and earlier with a True Positive rate of up to 97.52% as previously announced.
ChatGPT Content Detection Accuracy – Previous Model vs New Release
- The content from ChatGPT has changed the game, it has a tendency to “human” inside the ability to communicate and improve the content in the conversation making content discovery extremely challenging.
- That is also the reason why our model, although improved a lot, can distinguish most of the content from ChatGPT, but there are still challenges to improve.
GPT-4 Content Detection Accuracy – Previous Model vs New Release
Paraphrased (ie Quillbot) Content Detection Accuracy – Previous Model vs New Release
- The most common “loophole” for using AI and then bypassing detectors has been closed with the latest release from Originality.AI. Now content that has been run through a paraphrasing tool like Quillbot will be identified as AI-generated.
- Here is an in depth article exploring the ability to bypass AI and plagiarism detection on all platforms except Originality.AI using Quillbot – https://originality.ai/paraphrase-plagiarism-checker/
Improved Accuracy on the Most Challenging AI Models
The latest release was focused heavily on training with the most challenging AI models to improve its detection ability.
Testing on millions of documents from the MOST difficult AI models namely GPT-4, ChatGPT and Paraphrasing tools here is our accuracy…
New Chat Model April 25, 2023 – HuggingChat AI Content Detection
HuggingChat has been released by HuggingFace and an initial test was completed checking the AI detection capabilities of Originality and 3 other AI detectors. The results and complete study plus video of the test being completed is shown below.
Full Study – HuggingChat AI Detector
Video – HuggingChat AI Detector Testing
Key Takeaways:
- It is easy to see here that the version 1.1 model was unable to accurately predict the content generated from the Paraphrase, ChatGPT, and GPT-4 models, making its True Positive score (0.6102) very poor. Especially with paraphrased content, is extremely difficult to detect and it greatly affects the score of the model.
- The current 1.4 model, improves both the accurate prediction of the data generated from the Paraphrase, ChatGPT, and GPT-4 models and is also more accurate in reducing the False Positive rate.
- Our improvements mainly come from data construction, using more data generated from multiple models and more quality has markedly improved model performance.
- In the past, our model had a bit of trouble detecting content generated from the latest advanced models like ChatGPT, and GPT-3.5/4. This was completely overcome when we generated data from these models and comprehensively added it to our training dataset.
- Now, the error rate has dropped dramatically as the model can fully detect content generated from ChatGPT, GPT-4, or Bard.
- Moreover, with paraphrased content, the model has worked better and is the best on the market today.
The next section will be some comparisons with other similar tools.
Accuracy of ChatGPT GPT-4 AI Detection Tools: GPTZero.me vs Writer.com vs CopyLeaks vs Originality.AI
Our model comparison tests with models below such as GPTZero, Copyleaks, and Writer were all evaluated against a small set of 1200 data samples. Which for each model ChatGPT, GPT-2/3/4/J, Paraphrase includes 100 samples written by humans and 100 samples created from the above models.
Originality.AI vs GPTZero.me
For similar settings, we compare our model with GPTZero.
ChatGPT Detection Accuracy – Originality.AI vs GPTZero.me
- True Positive for Originality.AI was 83% while only 24% for GPTZero.me
- False Positives were 3% for Originality.AI and 6% for GPTZero.me
GPT-2 Detection Accuracy – Originality.AI vs GPTZero.me
- True Positive for Originality.Ai was 100% while for GPTZero.me it was 42%
- False Positives for Originality.AI was 1% while for GPTZero.me it was 11%
GPT-3 Detection Accuracy – Originality.AI vs GPTZero.me
- True Positive for Originality.Ai was 100% while for GPTZero.me it was 44%
- False Positives for Originality.AI was 1% while for GPTZero.me it was 8%
GPT-4 Detection Accuracy – Originality.AI vs GPTZero.me
- True Positive for Originality.Ai was 99% while for GPTZero.me it was 29%
- False Positives for Originality.AI was 1% while for GPTZero.me it was 12%
Paraphrase (Quillbot) Detection Accuracy:
Note – GPTZero.me (or any AI detector other than Originality.AI) make no claims of being able to detect paraphrased content.
- True Positive for Originality.Ai was 96% while for GPTZero.me it was 14%
- False Positives for Originality.AI was 3% while for GPTZero.me it was 17%
Winner = Originality.AI
Indeed, the evaluation results speak volumes about the superiority of our model over the GPT Zero model in detecting AI-generated text. With consistently higher true positive rates and true negative rates across all the evaluation sets, our model is a clear winner. The differences in performance are so significant that they can be easily visualized, and the results are self-explanatory. There is little left to say when the superiority of our model is so evident from the evaluation results.
If you want to try and “trick” GPTZero then clearly ChatGPT, GPT-4 or Paraphrasing of any content makes it “undetectable” by GPTZero.me
Originality.AI vs OpenAI Text Classifier
Since we have 2 classes AI and Human, while OpenAI has 5 classes according to degrees, we will therefore include the ambiguous classes unclear if it is, possibly, or likely to be AI. The remaining classes are very unlikely, unlikely is Human.
You will note the scores for Originality.AI in the images below remain unchanged from the same tests that were run above looking at the GPTZero.me accuracy.
These results are similar to what Content Marketing Agencies like TopContent produced when they tested OpenAI’s text classifier and recommend Originality.AI
ChatGPT Detection Accuracy – Originality.AI vs OpenAI Text Classifier
- True Positive for Originality.AI was 83% while only 17% for Open AI Text Classifier
- False Positives were 3% for Originality.AI and 10% for Open AI Text Classifier
GPT-2 Detection Accuracy – Originality.AI vs OpenAI Text Classifier
- True Positive for Originality.AI was 100% while only 26% for Open AI Text Classifier
- False Positives were 1% for Originality.AI and 11% for Open AI Text Classifier
GPT-3 Detection Accuracy – Originality.AI vs OpenAI Text Classifier
- True Positive for Originality.AI was 100% while only 81% for Open AI Text Classifier
- False Positives were 1% for Originality.AI and 12% for Open AI Text Classifier
GPT-4 Detection Accuracy – Originality.AI vs OpenAI Text Classifier
- True Positive for Originality.AI was 100% while only 14% for Open AI Text Classifier
- False Positives were 1% for Originality.AI and 7% for Open AI Text Classifier
Winner = Originality.AI
Again the evaluation results speak to the superiority of our model compared to the OpenAI Text Classifier. Originality.AI achieved consistently higher true positve and true negative rates across all evaluation sets.
Originality.AI vs CopyLeaks vs Writer.com
GPTZero.me and OpenAI Text Classifier were the strongest alternatives to Originality.AI but other tools were also tested.
The confusion matrix’s performing the same test as completed above were done for the 2 tool… CopyLeaks AI Detector and Writer.com AI Detector.
To enhance confidence in our experimental results, we decided to compare our model with two other popular tools in the market – Copyleaks and Writer. The findings of our comparison reveal that our model outperforms both Copyleaks and Writer in accurately detecting AI-generated text.
During the experiment, we utilized a dataset containing an equal number of AI-generated and human-written text samples. However, both Copyleaks and Writer produced outputs skewed towards human-written text, indicating that they lack the ability to accurately detect AI-generated text. In contrast, our model demonstrated superior performance in detecting AI-generated text with high true positive and true negative rates.
ChatGPT
GPT-2
GPT-3
GPT-4
Experiments on other datasets
To be more objective we tested on a dataset that was completely outside our training set. The Google Blogger Corpus dataset is not part of our training dataset and may have a completely different distribution. We wanted to compare false positives for content written by humans. With a threshold of 0.5, based on 50000 data samples over 256 tokens in length. The error rate for the v1.4 model is 0.28% compared to 3.07% for the v1.1 model. That’s a big improvement in this updated version.
This is a link to the 50k highly likely human-written data that we evaluate. You can verify our results from it.
Conclusion
The results above show that the latest model I trained has overcome the previous weaknesses. Reduced false positives and predictive data generated from the latest models like ChatGPT and GPT-4.
We can see that our latest model is fully capable of detecting content generated from the latest models like ChatGPT, GPT-4
Moreover, the data generated from the more powerful model has contributed to making our model much stronger than before.
HuggingFace has launched a ChatGPT “clone” called HuggingChat based on Open Assistant a 30-billion-parameter LLaMa model. In this article, we do a quick study to see if any of the existing AI detectors on the market are capable of detecting Hugging Chat content. For a more in-depth AI detection accuracy study across multiple NLP models […]
We recently examined the top 20 webpages for 1,000 popular keywords to find out if AI-generated content influences their position on Google Search. We used Originality.AI's scanning tool to determine if the content was likely produced by AI or humans. Our study revealed some interesting findings, and I'm excited to share these insights with you today.
MotionInvest.com is the leading website brokerage for Content Websites. Learn how it uses Originality.AI to do the most in depth originality check within the industry for all the websites it lists for sale.