AI Content Detection Algorithms

Since the introduction of ChatGPT, there has been a growing concern surrounding the use of Artificial Intelligence (AI) for creating content. To combat cheating and fraud, AI content detection tools were developed using algorithms specifically designed to detect and flag AI-generated content.

Although still in the early stages of development, these apps have already shown promise as an effective way to deter students from manipulating AI writing tools to fabricate essays, papers, or even entire dissertations.

To understand AI content detection tools, we need to look at the algorithms that drive them. In this article, let’s examine several of the language models they rely on.

What is a Language Model?

A language model is an AI algorithm trained to predict the next word in a sequence. It accomplishes this by analyzing huge amounts of text data and leveraging probability to determine the most likely outcome.
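To make next-word prediction concrete, here is a minimal sketch using a toy bigram model: it only counts which word follows which in a tiny corpus. Real language models use neural networks and far more data; the corpus and function names below are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn raw follow-up counts into a probability distribution."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in followers.items()}

def predict_next(counts, word):
    """Return the most likely next word, or None if the word was never seen."""
    probs = next_word_probabilities(counts, word)
    return max(probs, key=probs.get) if probs else None

# Toy corpus (illustrative only)
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # prints: cat
print(predict_next(model, "sat"))  # prints: on
```

In this corpus "the" is followed by "cat" in 2 of 6 cases, so "cat" is the most likely outcome.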

Language models are essential components of all AI writing tools, enabling them to generate content. This same capability also makes them invaluable for detecting AI-generated content – after all, it takes one to know one!

What Are the Different AI Detection Models?

BERT

BERT (Bidirectional Encoder Representations from Transformers) is an AI language model released in 2018 by Google researchers. It takes a bidirectional approach to language modeling, meaning it looks at the context of both the preceding and succeeding words to make predictions.

BERT is pre-trained on two tasks. The first, Masked Language Modeling (MLM), is where the bidirectional capability comes from: a word in a sentence is masked, and the model has to read the words on either side of the mask in order to predict it.
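The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing: BERT selects about 15% of tokens and replaces only most of them with the mask token (some are swapped for random words or left unchanged), while this sketch always inserts `[MASK]`. The function name and whitespace tokenization are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens; return the masked input and the labels.

    labels[i] holds the original token where a mask was placed and None
    elsewhere, so a model would only be scored on the masked positions.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the student wrote the essay late at night".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(42))
print(masked)
print(labels)
```

To fill in a masked position, the model can use the tokens on both its left and its right, which is what makes the approach bidirectional.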

The second task BERT is pre-trained on is Next Sentence Prediction (NSP). This allows the model to learn the relationships between sentences by predicting if the given sentence follows the previous one.
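Here is a rough sketch of how NSP training examples can be built. Half of the pairs use the true next sentence and half use a different one. In BERT the negative sentence is drawn from elsewhere in the corpus; this sketch samples from the same short document for simplicity, and the function name is hypothetical.

```python
import random

def make_nsp_examples(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples from an ordered document."""
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a, true_next = sentences[i], sentences[i + 1]
        # Candidates for a negative example: anything but a itself or its real successor
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        if candidates and rng.random() < 0.5:
            examples.append((a, rng.choice(candidates), False))
        else:
            examples.append((a, true_next, True))
    return examples

document = [
    "The essay opens with a question.",
    "It then reviews earlier studies.",
    "A short experiment follows.",
    "The conclusion returns to the question.",
]
for a, b, is_next in make_nsp_examples(document, rng=random.Random(7)):
    print(is_next, "|", a, "->", b)
```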

Regarding training data, BERT is pre-trained on English Wikipedia and the BooksCorpus dataset.

RoBERTa

RoBERTa (Robustly Optimized BERT Pretraining Approach) is an optimized version of BERT developed by Facebook’s AI research team. It is trained on a much larger dataset than BERT, over 160GB of text, and uses a modified training procedure to further improve accuracy.

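One of RoBERTa’s key changes is dynamic masking. Instead of choosing the masked words once during preprocessing, as the original BERT did, RoBERTa picks a new set of masked words each time a sentence is fed to the model, so it sees many different masked versions of the same text. RoBERTa also drops the Next Sentence Prediction task.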
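As a minimal sketch of dynamic masking as described in the RoBERTa paper, compare it with static masking: a static mask is sampled once and reused every epoch, while a dynamic mask is re-sampled on every pass over the data. The function names and the 15% default are illustrative assumptions.

```python
import random

MASK = "<mask>"

def sample_mask(tokens, mask_prob, rng):
    """Return a copy of tokens with a random subset replaced by the mask token."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def static_masking_epochs(tokens, epochs, mask_prob=0.15, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    fixed = sample_mask(tokens, mask_prob, random.Random(seed))
    return [list(fixed) for _ in range(epochs)]

def dynamic_masking_epochs(tokens, epochs, mask_prob=0.15, seed=0):
    """Re-sample the mask every time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [sample_mask(tokens, mask_prob, rng) for _ in range(epochs)]

tokens = "the model sees a different masked version on every pass".split()
for epoch, seq in enumerate(dynamic_masking_epochs(tokens, epochs=3, mask_prob=0.3)):
    print(epoch, " ".join(seq))
```

With dynamic masking, the same sentence yields a different prediction problem in different epochs, which matters more as training runs get longer.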

RoBERTa is also trained on full-length sequences of up to 512 tokens and with much larger batches, whereas BERT spent most of its pre-training on shorter sequences. This helps performance when dealing with longer text passages such as essays, papers, and dissertations.

GPT-2

OpenAI released GPT-2 (Generative Pre-trained Transformer 2) in 2019. It is an AI language model trained on a dataset of 8 million web pages. It is also based on a transformer model architecture and is trained with the Causal Language Modeling (CLM) objective.

CLM is the task of predicting the next word in a sequence given only the words that come before it. Because the model conditions on everything written so far, the text it generates stays coherent and related to the preceding words.
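The CLM setup can be sketched as follows: every position in a text becomes a training example whose input is the preceding words and whose target is the next word. The score of a whole passage is then the sum of the log-probabilities of each next word. The `next_token_prob` callable is a hypothetical stand-in for a trained model such as GPT-2, not a real API.

```python
import math

def causal_lm_examples(tokens):
    """Build (context, target) pairs; each target only sees the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def sequence_log_prob(tokens, next_token_prob):
    """Sum log P(target | context) over the sequence (the chain rule).

    next_token_prob(context, target) is an assumed callable that returns
    a probability in (0, 1]; a real system would query a trained model.
    """
    return sum(
        math.log(next_token_prob(context, target))
        for context, target in causal_lm_examples(tokens)
    )

tokens = "the cat sat down".split()
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Toy model (assumption): every next word is equally likely among 4 options
uniform = lambda context, target: 0.25
print(sequence_log_prob(tokens, uniform))
```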

A few AI content detectors are built around the GPT-2 model, such as GPTZero and the GPT-2 Output Detector. The latter is a RoBERTa classifier fine-tuned to recognize GPT-2 output.

GPT-3

GPT-3 (Generative Pre-trained Transformer 3) is one of the most powerful AI algorithms today. It was released in 2020 by OpenAI as a successor to GPT-2.

GPT-3 has 175 billion parameters, and its training data was filtered down from about 45 TB of raw web text, making it one of the largest natural language processing (NLP) systems built at the time of its release.

It is trained with unsupervised learning, meaning it learns from raw text without requiring human-provided labels. Combined with its scale, this makes it suitable for tasks such as machine translation and question-answering.

OpenAI initially withheld the full GPT-2 model over concerns about misuse and the biased or harmful language it sometimes produced. GPT-3 was built with strategies intended to reduce toxic language, so it produces fewer such results than its predecessor.

Originality.AI is based on the GPT-3 language model, but it can also detect generated content from ChatGPT, OpenAI’s latest model.

AI Text Classifier

AI Text Classifier is OpenAI’s own AI content detection tool. It is trained on both human-written text and AI-generated text. The human-written text comes from three sources: Wikipedia, WebText, and human demonstrations collected while training InstructGPT.

Each sample of AI-written text is then paired with a similar sample of human-written text. For instance, portions of random Wikipedia articles were used as prompts to generate 1,000 tokens of AI text, and each result was paired with the original, human-written continuation.
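The pairing idea can be sketched like this: each human-written document is split into a prompt and its original continuation, and an AI continuation is generated from the same prompt. This is only an illustration of the idea, not OpenAI's pipeline; `generate_continuation` is a hypothetical stand-in for a call to a language model, and the split point is an arbitrary assumption.

```python
def build_paired_examples(documents, generate_continuation, prompt_chars=200):
    """Create matched human/AI training examples from human-written documents."""
    examples = []
    for doc in documents:
        prompt, human_continuation = doc[:prompt_chars], doc[prompt_chars:]
        if not human_continuation:
            continue  # document too short to split into prompt + continuation
        ai_continuation = generate_continuation(prompt)
        # Both continuations share a prompt, so they differ mainly in authorship
        examples.append({"prompt": prompt, "text": human_continuation, "label": "human"})
        examples.append({"prompt": prompt, "text": ai_continuation, "label": "ai"})
    return examples

# Stub generator (assumption): a real system would call a language model here
fake_generator = lambda prompt: "generated text continuing: " + prompt[-20:]

docs = ["Human-written article text. " * 20, "too short"]
pairs = build_paired_examples(docs, fake_generator, prompt_chars=100)
print(len(pairs), [p["label"] for p in pairs])
```

Pairing on the same prompt keeps topic and length comparable, so a classifier has to learn differences in how the text is written rather than what it is about.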

OpenAI admits that the classifier still has plenty of limitations. It is recommended only for English text, and it can be unreliable on anything below 1,000 characters. So, as with any other AI content detector, it should only be used as one of several assessment methods when checking for AI-written work.
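One way to respect these limitations in practice is to wrap any detector in a small guard. The sketch below refuses to score text under 1,000 characters and treats a high score as a prompt for human review, not a verdict. The `classifier` callable, the 0.5 threshold, and the verdict strings are all assumptions for illustration.

```python
MIN_CHARS = 1000  # below this length, the source says results can be unreliable

def screen_text(text, classifier, threshold=0.5):
    """Run a detector only when the text is long enough to assess.

    classifier(text) is an assumed callable returning a score in [0, 1],
    where higher means "more likely AI-generated".
    """
    if len(text) < MIN_CHARS:
        return {"verdict": "too short to assess", "score": None}
    score = classifier(text)
    verdict = "flag for human review" if score >= threshold else "no flag"
    return {"verdict": verdict, "score": score}

# Stub detector (assumption): always returns the same score
stub = lambda text: 0.8
print(screen_text("A short paragraph.", stub))
print(screen_text("x" * 1200, stub))
```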

Conclusion

AI content detection algorithms are built on the same kinds of language models that power AI content generation tools. To identify AI-generated material, they learn its style and syntax by being trained on large datasets of publicly accessible text, as well as samples of AI-generated content.

AI content detection tools, such as OpenAI’s AI Text Classifier and detectors built on GPT-3, can be a powerful way to combat cheating and fraud in academic settings.

Although no other apps have been specifically trained on ChatGPT yet, those that are trained on GPT-3 can still detect its output.

However, AI content detection tools should not be used as the sole method for assessing AI plagiarism. Even when their reported accuracy is high, additional methods are needed to determine whether content is plagiarized or written by AI.

Jonathan Gillham

Founder / CEO of Originality.AI. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; recently I sold two content marketing agencies, and I am the co-founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone’s content strategy. However, I believe you as the publisher should be the one deciding when to use AI content. Our originality checking tool has been built with serious web publishers in mind!
