Which LLM Is the Most Accurate?

Discover the truth: we fact-checked 5 Large Language Models (GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B) with our Fact Checker tool. Get insights into the accuracy of AI-generated claims in our latest analysis.

We used our Fact Checker tool to test the accuracy of claims generated by 5 Large Language Models (LLMs): GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B and LLAMA-70B.


We fed 1,000 prompts from 10 topic categories into the LLMs to generate a dataset of claims. The prompts included direct questions, e.g. 'What caused child labour to decrease in the 20th century?' and 'In what month does the Summer Under the Stars event take place?', and requests for more information, e.g. 'In the 1770s Pierre Jaquet-Droz, a Swiss watchmaker, built a mechanical doll (automata) that could write holding a quill pen. Tell me more about this.'

The dataset of claims was then fed into our Fact Checker tool, and the results were processed and analyzed.

Key Findings

  • Model Accuracy: LLAMA-13B (76.9%) performed best, i.e. it produced the highest share of accurate claims, while GPT-4 (67.9%) performed worst.
  • Performance per topic: The topic with the highest average accuracy was Health (80.5%), and the topic with the lowest was News (64.4%).
  • Model Confidence: The LLAMA-7B, LLAMA-13B and LLAMA-70B models attempted every prompt. GPT-3.5 and GPT-4 attempted approximately 97% of the prompts and declined the rest with replies such as: "I'm sorry, but I don't have access to specific data on the reduction of malarial mortality by the National Malaria Protection Unit from 1998-2006." (Under our methodology, these declined prompts were scored as 0.5.)
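
The averaging behind these findings can be sketched roughly as follows. This is a minimal illustration, not the Fact Checker tool's actual implementation. The verdict labels, the `SCORES` table, the `mean_accuracy` helper and the sample records are all hypothetical. Only the 0.5 score for declined prompts comes from the methodology; scoring an accurate claim as 1 and an inaccurate claim as 0 is an assumption.

```python
from collections import defaultdict

# Assumed scoring scheme: only the 0.5 for declined prompts is stated in the study.
SCORES = {"accurate": 1.0, "inaccurate": 0.0, "declined": 0.5}


def mean_accuracy(records, key):
    """Return the average score per group, where `key` is 'model' or 'topic'."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        totals[rec[key]] += SCORES[rec["verdict"]]
        counts[rec[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up records for illustration only; these are not data from the study.
records = [
    {"model": "model-a", "topic": "Health", "verdict": "accurate"},
    {"model": "model-a", "topic": "News", "verdict": "inaccurate"},
    {"model": "model-b", "topic": "Health", "verdict": "declined"},
    {"model": "model-b", "topic": "News", "verdict": "accurate"},
]

by_model = mean_accuracy(records, "model")  # one average per model
by_topic = mean_accuracy(records, "topic")  # one average per topic
```

Grouping the same records by `model` or by `topic` gives the two kinds of average reported in this study.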


The Fact Checker tool is an aid that helps editors fact-check claims more efficiently, and it is up to the end user to interpret the results appropriately. It will sometimes provide inaccurate responses, and this can include the data used in this study.


Accuracy Scores

  • Average accuracy of each LLM model in all topics.
  • Average accuracy of all the models in each topic.

Confidence Scores

The confidence of each model is the number of prompts it answered divided by the total number of prompts it received. Both GPT models scored less than 100%.

Confidence of models in answering prompts
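
As a rough sketch, the confidence ratio defined above is a single division. The `confidence` function name and the counts below are hypothetical, round numbers chosen to illustrate a model that answers about 97% of 1,000 prompts. They are not the study's exact counts.

```python
def confidence(answered: int, total: int) -> float:
    """Share of received prompts that the model attempted to answer."""
    if total <= 0:
        raise ValueError("total must be a positive number of prompts")
    if not 0 <= answered <= total:
        raise ValueError("answered must be between 0 and total")
    return answered / total


# Illustrative counts only, not figures taken from the study.
always_answers = confidence(1000, 1000)
sometimes_declines = confidence(970, 1000)

print(f"{always_answers:.1%}")      # 100.0%
print(f"{sometimes_declines:.1%}")  # 97.0%
```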

Heat Map

An illustration of how each model performed in each topic. The darker the shade, the better the performance.

Fact Checking Accuracy of each LLM in Topic Categories
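
A heat map like this is built from a grid with one row per model and one column per topic. The sketch below shows one way to arrange per-model, per-topic averages into that shape. The `accuracy_grid` helper and the values are made up for illustration and do not reproduce the study's numbers.

```python
def accuracy_grid(cells):
    """Pivot {(model, topic): accuracy} into a grid with one row per model."""
    models = sorted({model for model, _ in cells})
    topics = sorted({topic for _, topic in cells})
    # A missing (model, topic) pair becomes None rather than a guessed value.
    grid = [[cells.get((model, topic)) for topic in topics] for model in models]
    return models, topics, grid


# Made-up averages for illustration only.
cells = {
    ("model-a", "Health"): 0.80,
    ("model-a", "News"): 0.60,
    ("model-b", "Health"): 0.70,
    ("model-b", "News"): 0.65,
}

models, topics, grid = accuracy_grid(cells)
# Each grid[i][j] is the accuracy of models[i] on topics[j]. A plotting
# library would map higher values to darker shades.
```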

Summary: LLM Accuracy in 10 Topics

An illustrated summary of the study results, showing the models' relative performance in each topic category. For example, in Health, LLAMA-7B had the most accurate claims while GPT-4 had the least accurate claims.

Fact Checker results on LLM accuracy in 10 topics

Related Research

Research is ongoing into how often, and how severely, models generate untruthful answers to prompts. Bias in training data and imitative falsehoods lead to both subtle inaccuracies and wild hallucinations. A major concern is the misuse of information: LLM-generated claims can be used, accidentally or deliberately, to spread misinformation. A less obvious but significant concern is that the unreliability of models will lead to mistrust, leaving their benefits underutilized.


