
AI Fact Checking Accuracy Study

Discover Originality.ai’s best-in-class fact checker. Find out how the Originality.ai Fact Checker compares to GPT-4o and GPT-5 in this accuracy study.

AI is becoming increasingly prevalent across industries, from web publishing to marketing and even education.

Yet, considering AI’s tendency to produce hallucinations (incorrect information presented as fact), as seen in the 2025 AI book list scandal, establishing ways to confirm the veracity of information is of the utmost importance.

However, how can you tell which fact-checking solution is the most accurate and effective? 

That’s where our AI fact-check accuracy study comes in. 

Learn about our fact-checking benchmark and datasets, and find out how the accuracy of the Originality.ai fact checker compares to GPT-4o and GPT-5. 

Plus, get insights into why fact-checking is important and the AI hallucination rates of popular LLMs according to third-party research studies.

3 Key Findings:

  1. Overall, across all three datasets tested, Originality.ai delivered the best recall at 83.5%.
  2. Originality.ai closely tied GPT-5 for accuracy, with a slight lead (Originality = 86.69% accuracy vs. GPT-5 = 86.67%), while decisively beating GPT-4o.
  3. Originality.ai outperformed GPT-4o and GPT-5 across all metrics on the SciFact dataset, showcasing its ability to handle scientific claims with superior reliability.

Chart: Overall Fact-Checking Accuracy Results, Originality.ai vs. GPT-4o vs. GPT-5

What Is Fact-checking?

First, let’s take a quick look at what fact-checking is (and why it is important in 2025).

Quick Answer: Fact-checking is the rigorous process of verifying the accuracy and authenticity of information.

To provide further context, fact-checking may be conducted across various forms of content, such as:

  • News articles
  • Blogs
  • Speeches
  • Social media posts
  • Assignments
  • Essays
  • Papers
  • Books and E-books

This means that fact-checking is incorporated into a number of different use cases, including:

  • Journalism
  • Web publishing
  • Content marketing
  • Social media marketing
  • Education and academia
  • Traditional publishing

So, why is fact-checking important?

In an age where information can be shared at lightning speed, the spread of misinformation can have profound consequences, from influencing public opinion to negatively impacting brand reputation to endangering public health and safety.

The importance of fact-checking is multifold:

  • Upholding trust: Reliable information forms the foundation of trust between content creators, news organizations, academia, and their audiences. Fact-checking ensures that the information shared is accurate, maintaining the credibility of these entities.
  • Informed decision making: Accurate and up-to-date information enables individuals to make informed decisions.
  • Preventing misinformation spread: Fact-checking acts as a filter, stopping false narratives and misconceptions that can have a significant negative impact on the public.
  • Promoting accountability: It holds public figures, journalists, and content creators accountable for their statements, ensuring they are responsible in their communications and discouraging the spread of false information.

AI Hallucinations and Popular LLMs

While LLMs are constantly improving, the potential for AI hallucinations has become a well-known limitation of AI tools.

As a quick recap, an AI hallucination occurs when AI presents something that is incorrect as a fact. If this ‘hallucination’ is then missed during editorial review and published, it can lead to a number of consequences, from spreading misinformation to damaging brand reputation.

Research is still being conducted into AI hallucinations and LLMs, but here’s a quick look at AI hallucination rates based on some of the latest findings:

| LLM | Hallucination Rate | Data Source | Year Published |
| --- | --- | --- | --- |
| GPT-3.5 | 39.6% | Paper via PubMed | 2024 |
| GPT-4 | 28.6% | Paper via PubMed | 2024 |
| Bard | 91.4% | Paper via PubMed | 2024 |
| GPT-3.5 | 69% | Stanford Study | 2024 |
| PaLM 2 | 72% | Stanford Study | 2024 |
| Llama 2 | 88% | Stanford Study | 2024 |

Source: Combined results from the PubMed and Stanford studies

The exact hallucination rates differed across the studies, depending on the sample size, the material and facts tested, and the testing methodology. 

For instance, the paper available via PubMed counted a reference as hallucinated if two of its “title, first author, or year of publication” were wrong, whereas the Stanford Study focused on testing hallucinations in response to “specific legal queries.”

Yet the key theme remains: AI models and popular LLMs do hallucinate, and when those hallucinations go unnoticed and unchecked, they can pose significant problems.

The Goal of Our Study

The objective of this benchmarking effort was twofold: 

  1. First, to obtain or construct a suitable fact-checking benchmark dataset for Originality.ai’s fact checker.
  2. Second, to compare the performance of the checker against existing solutions (notably GPT-4o and GPT-5) across several datasets. 

The intent was to stress test the fact checker using challenging, real-world claims and measure its ability to deliver robust True/False judgments.
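
To make this setup concrete, the sketch below shows the shape of the evaluation harness such a benchmark implies: each claim is sent to a checker that returns a True/False verdict, and that verdict is stored alongside the gold label for later scoring. The dataset format and the `check_claim` callable are illustrative assumptions, not Originality.ai’s actual pipeline or API.

```python
from typing import Callable, Iterable, List, Tuple

def run_benchmark(
    claims: Iterable[dict],                # assumed item format: {"claim": str, "label": bool}
    check_claim: Callable[[str], bool],    # hypothetical checker returning True/False
) -> List[Tuple[bool, bool]]:
    """Collect (gold_label, predicted_label) pairs for later scoring."""
    results = []
    for item in claims:
        prediction = check_claim(item["claim"])
        results.append((item["label"], prediction))
    return results
```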

Dataset Selection

We sought to evaluate the checker against a diverse set of datasets to avoid overfitting to any one claim type or category. The goal was to test general performance, not just domain-specific accuracy. 

Below are the datasets we considered and our verdicts:

FEVER

Consists of 185,445 human-generated claims based on Wikipedia sentences (2018 snapshot), classified as Supported, Refuted, or NotEnoughInfo, with annotated evidence. This dataset is widely considered the standard benchmark for fact-checking research. 

Verdict: Use it, but sample 1,000 claims for practicality, since we want to re-run the benchmark frequently.

Dataset: FEVER Paper

SciFact

A compact dataset (1.4k expert-written scientific claims) with high annotation agreement (Cohen’s κ ≈ 0.75). Scientific claims tend to be less ambiguous, making this a high-quality dataset. 

Verdict: Use it in its entirety.

Dataset: SciFact Paper

FactKG

Knowledge-graph-based dataset of 108k synthetic claims (one-hop, multi-hop, negation, etc.). While it offers perfect labeling due to deterministic graph generation, the claims follow strict subject–predicate–object templates, and its DBpedia 2015 snapshot is outdated. This makes it less challenging for a text-evidence fact checker and a poor fit for stress testing.

Verdict: Skip it.

Dataset: FactKG Paper

AVeriTeC

A 2023 dataset of 4,568 real-world claims from 50 fact-checking organizations. Each claim includes supporting evidence and detailed justifications. Unlike synthetic datasets, these claims are not trivially generated and require careful reasoning to verify. According to the paper, “any claim included in the dataset is deemed interesting enough to be worth the time of a professional journalist”. 

Verdict: Use it, full size.

Dataset: AVeriTeC Paper

Dataset Preprocessing

Because Originality.ai’s fact checker outputs only True/False labels (no 'No Evidence' class), we filtered each dataset to retain only claims with binary outcomes. 

FEVER’s large size required sampling 1,000 claims. After running all three checkers (Originality.ai, GPT-4o, and GPT-5) on this sample, we manually inspected the 176 disputed claims where at least one checker’s prediction disagreed with the original label. We dropped 72 ambiguous claims and corrected the mislabeled ones, resulting in a final sample of 928 high-quality claims. 

SciFact and AVeriTeC were filtered similarly for True/False claims, resulting in 693 and 3,017 claims, respectively.
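
As a rough illustration of this preprocessing step, the sketch below filters a dataset down to binary-outcome claims and samples FEVER to a fixed size. The column name and label values are assumptions about how the data might be loaded, not the datasets’ exact schemas.

```python
import pandas as pd

def filter_binary(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only claims with a clear True/False outcome (drop 'NOT ENOUGH INFO')."""
    keep = df["label"].isin(["SUPPORTS", "REFUTES"])   # assumed label values
    return df[keep].copy()

def sample_fever(df: pd.DataFrame, n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Down-sample FEVER so the benchmark stays cheap enough to re-run frequently."""
    return df.sample(n=min(n, len(df)), random_state=seed)
```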

Fact Checker Accuracy Results

Performance was measured on FEVER 1k, SciFact, and AVeriTeC datasets, then aggregated.
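
For reference, the metrics reported in the tables below can be derived from the (gold, predicted) pairs collected during benchmarking. This is a minimal sketch of standard confusion-matrix arithmetic, with TP/TN/FP/FN expressed as fractions of all claims, matching how the tables report them.

```python
from typing import Dict, List, Tuple

def score(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1, and confusion-matrix rates."""
    tp = sum(1 for gold, pred in pairs if gold and pred)
    tn = sum(1 for gold, pred in pairs if not gold and not pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    total = len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tp_rate": tp / total,
        "tn_rate": tn / total,
        "fp_rate": fp / total,
        "fn_rate": fn / total,
    }
```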

Overall: Combined Results

Fact Checker Accuracy: Originality.ai vs. GPT-4o vs. GPT-5

| Model | Accuracy | Precision | Recall | F1 | TP rate | TN rate | FP rate | FN rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Originality.ai | 86.69% | 84.1% | 83.5% | 83.8% | 0.35 | 0.52 | 0.07 | 0.07 |
| GPT-4o | 83.40% | 77.6% | 82.4% | 80.0% | 0.33 | 0.50 | 0.10 | 0.07 |
| GPT-5 | 86.67% | 88.6% | 77.0% | 82.4% | 0.31 | 0.56 | 0.04 | 0.09 |

Overall Originality.ai vs. GPT-4o vs. GPT-5 fact-checking results. TP, TN, FP, and FN are reported as rates (fractions of all claims).

Fact Checker Accuracy Overall Results

Across all three datasets, Originality.ai delivered the best recall (83.5%) and essentially tied GPT-5 for accuracy (86.69% vs. 86.67%), while decisively beating GPT-4o.

FEVER 1k: Sample Results

| Model | Accuracy | Precision | Recall | F1 | TP rate | TN rate | FP rate | FN rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Originality.ai | 97.8% | 98.0% | 97.5% | 97.7% | 0.48 | 0.50 | 0.01 | 0.01 |
| GPT-4o | 94.1% | 92.3% | 95.8% | 94.0% | 0.46 | 0.48 | 0.04 | 0.02 |
| GPT-5 | 99.6% | 99.6% | 99.6% | 99.6% | 0.48 | 0.51 | 0.002 | 0.002 |

Comparison of fact-checking performance metrics across models on the FEVER 1k sample

GPT-5 achieved an almost perfect score (99.6% accuracy), but Originality.ai was close behind, performing consistently with high precision and recall.

AVeriTeC Dataset: Results

| Model | Accuracy | Precision | Recall | F1 | TP rate | TN rate | FP rate | FN rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Originality.ai | 82.7% | 73.2% | 74.6% | 73.9% | 0.25 | 0.58 | 0.09 | 0.08 |
| GPT-4o | 80.9% | 67.7% | 74.9% | 71.1% | 0.24 | 0.57 | 0.11 | 0.08 |
| GPT-5 | 84.0% | 80.1% | 66.9% | 72.9% | 0.22 | 0.63 | 0.05 | 0.11 |

Comparison of fact-checking results for the AVeriTeC dataset

While GPT-5 edged out a small win in raw accuracy (84.0%), it did so by sacrificing recall (66.9% vs. Originality.ai’s 74.6%), meaning it missed far more true claims. Originality.ai’s balanced approach is arguably more trustworthy, especially for use cases where missing true information carries a cost.

SciFact Dataset: Results

| Model | Accuracy | Precision | Recall | F1 | TP rate | TN rate | FP rate | FN rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Originality.ai | 88.7% | 94.2% | 88.4% | 91.2% | 0.58 | 0.30 | 0.04 | 0.08 |
| GPT-4o | 79.5% | 84.5% | 84.3% | 84.4% | 0.55 | 0.24 | 0.10 | 0.10 |
| GPT-5 | 80.8% | 93.8% | 75.9% | 83.9% | 0.50 | 0.31 | 0.03 | 0.16 |

Table: Performance metrics comparison on SciFact (Originality.ai vs. GPT-4o vs. GPT-5)

Originality.ai outperformed GPT-4o and GPT-5 across all metrics on SciFact, showcasing its ability to handle scientific claims with superior reliability. This result highlights its strength in cases where precision and recall must both be high.

Streamline Your Fact Checking With Originality.ai

Not only is the Originality.ai fact checker accurate, but it’s also easy to use. 

Here’s a quick 3-step process on how to check facts with Originality.ai:

  1. Select ‘Fact Checker’ to scan your text and see if it’s factually correct.
  2. Review the ‘Claim of Fact’ and whether it is Potentially True or False.
  3. Verify the information by clicking on the source link for more information.

Final Thoughts

Originality.ai showed strong, balanced performance across all benchmarks.

It dominated SciFact, stayed close to GPT-5 on FEVER and AVeriTeC, and clearly outperformed GPT-4o. 

Its higher recall means it misses fewer true claims, which is critical when the goal is to surface as many correct facts as possible. 

Overall, Originality.ai proves to be a top-tier fact-checking system that competes directly with the latest LLMs.

Get insight into Originality.ai’s industry-leading accuracy across tools:

Further Reading on Fact Checking:

Jonathan Gillham

Founder / CEO of Originality.ai. I have been involved in the SEO and content marketing world for over a decade. My career started with a portfolio of content sites; more recently, I sold two content marketing agencies, and I am the Co-Founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences, I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone’s content strategy. However, I believe you, as the publisher, should be the one making the decision on when to use AI content. Our Originality checking tool has been built with serious web publishers in mind!
