Discover the Truth: Fact-Checking 5 Large Language Models - GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B with Originality.ai's Fact Checker Tool. Get insights into the accuracy of AI-generated claims in our latest analysis.
We used our Fact Checker tool to test the accuracy of claims generated by 5 Large Language Models: GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B.
1000 prompts from 10 topic categories were fed into the LLMs to generate a dataset of claims. The prompts included direct questions, e.g. 'What caused child labour to decrease in the 20th century?' and 'In what month does the Summer Under the Stars event take place?', as well as requests for more information, e.g. 'In the 1770s Pierre Jaquet-Droz, a Swiss watchmaker, built a mechanical doll (automaton) that could write holding a quill pen. Tell me more about this.'
The dataset of claims was fed into Originality.ai's Fact-Checker Tool. The results were processed and analyzed.
Model Accuracy: LLAMA-13B had the best performance, scoring the highest claim accuracy (76.9%), while GPT-4 had the worst performance (67.9%).
Performance per topic: Health showed the highest average accuracy score (80.5%), while News showed the lowest (64.4%).
Model Confidence: The LLAMA-7B, LLAMA-13B, and LLAMA-70B models attempted every prompt. GPT-3.5 and GPT-4 attempted approximately 97% of the prompts, declining the rest with replies such as: "I'm sorry, but I don't have access to specific data on the reduction of malarial mortality by the National Malaria Protection Unit from 1998-2006." (Per the methodology, declined prompts were scored as 0.5.)
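To make the scoring scheme concrete, here is a minimal sketch of how per-model accuracy and confidence could be computed under the stated methodology. The record format, field names, and sample data below are hypothetical illustrations, not the study's actual pipeline or dataset.

```python
def score_model(claims):
    """claims: list of dicts with keys 'accurate' (bool) and 'declined' (bool).

    Per the stated methodology, a declined prompt counts as 0.5;
    otherwise an accurate claim counts as 1 and an inaccurate one as 0.
    Confidence is the fraction of prompts the model attempted.
    """
    total = len(claims)
    points = sum(
        0.5 if c["declined"] else (1.0 if c["accurate"] else 0.0)
        for c in claims
    )
    accuracy = points / total
    confidence = sum(not c["declined"] for c in claims) / total
    return accuracy, confidence

# Made-up example: 3 accurate claims, 1 inaccurate, 1 declined out of 5 prompts.
claims = (
    [{"accurate": True, "declined": False}] * 3
    + [{"accurate": False, "declined": False}]
    + [{"accurate": False, "declined": True}]
)
acc, conf = score_model(claims)
print(acc, conf)  # 0.7 0.8
```

Scoring a decline as 0.5 rewards a model for admitting uncertainty rather than hallucinating an answer, without crediting it as much as a verified accurate claim.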
The Originality.ai Fact Checking tool is an aid to help editors fact-check claims more efficiently; it is up to the end user to interpret the results appropriately. It will sometimes provide inaccurate responses, and this applies to the data used in this study as well.
Average accuracy of each LLM model in all topics.
Average accuracy of all the models in each topic.
Confidence of each model, i.e. the number of prompts answered divided by the total number of prompts received. Both GPT models scored below 100%.
An illustration of how each model performed in each topic. The darker the shade, the better its performance.
Summary: LLM Accuracy in 10 Topics
An illustrated summary of the study results, showing the models' relative performance in each topic category. For example, in Health, LLAMA-7B had the most accurate claims while GPT-4 had the least accurate claims.
Studies have been and are currently being conducted on the frequency and severity of untruthful model outputs. Bias in training data and imitative falsehoods lead to both subtle inaccuracies and wild hallucinations. Of major concern is the misuse of information: accidentally or deliberately using LLM-generated claims to spread misinformation. There is also the less obvious but significant concern that the unreliability of models will lead to mistrust, and that their positive benefits will go underutilized.
Founder / CEO of Originality.AI. I have been involved in the SEO and Content Marketing world for over a decade. My career started with a portfolio of content sites; I recently sold 2 content marketing agencies, and I am the Co-Founder of MotionInvest.com, the leading place to buy and sell content websites. Through these experiences I understand what web publishers need when it comes to verifying that content is original. I am not for or against AI content; I think it has a place in everyone's content strategy. However, I believe you as the publisher should be the one making the decision on when to use AI content. Our Originality checking tool has been built with serious web publishers in mind!