We used our Fact Checker tool to test the accuracy of claims generated by five large language models (LLMs): GPT-3.5, GPT-4, LLAMA-7B, LLAMA-13B, and LLAMA-70B.
We fed 1,000 prompts from 10 topic categories into the LLMs to generate a dataset of claims. The prompts included direct questions, e.g. 'What caused child labour to decrease in the 20th century?' and 'In what month does the Summer Under the Stars event take place?', as well as requests for more information, e.g. 'In the 1770s Pierre Jaquet-Droz, a Swiss watchmaker, built a mechanical doll (automata) that could write holding a quill pen. Tell me more about this.'
The dataset of claims was then fed into Originality.ai's Fact Checker tool, and the results were processed and analyzed.
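To illustrate the overall pipeline, here is a minimal sketch of how prompts could be run through a model and the resulting claims scored per topic. The `generate_claim` and `fact_check` functions are hypothetical placeholders, not the actual APIs used in this study.

```python
from collections import defaultdict

def generate_claim(model_name: str, prompt: str) -> str:
    """Hypothetical placeholder: query the given LLM and return its answer as a claim."""
    raise NotImplementedError

def fact_check(claim: str) -> bool:
    """Hypothetical placeholder: return True if the claim is judged accurate."""
    raise NotImplementedError

def run_study(models, prompts_by_topic):
    """Collect per-model, per-topic accuracy over all prompts."""
    accuracy = defaultdict(dict)
    for model in models:
        for topic, prompts in prompts_by_topic.items():
            verdicts = [fact_check(generate_claim(model, p)) for p in prompts]
            accuracy[model][topic] = sum(verdicts) / len(verdicts)
    return accuracy
```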
The Originality.ai Fact Checker tool is an aid that helps editors fact-check claims more efficiently, and it is up to the end user to interpret the results appropriately. It will sometimes provide inaccurate responses, which can include the data used in this study.
Confidence of each model, i.e. the number of prompts answered divided by the total number of prompts received. Both GPT models scored less than 100%.
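As a concrete illustration of this metric (using made-up counts, not the study's actual figures), the snippet below computes confidence for a model that answered 990 of 1,000 prompts.

```python
prompts_received = 1000   # total prompts sent to the model (illustrative)
prompts_answered = 990    # prompts the model actually answered (illustrative)

confidence = prompts_answered / prompts_received
print(f"Confidence: {confidence:.1%}")  # -> Confidence: 99.0%
```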
An illustration of how each model performed in each topic category. The darker the shade, the better its performance.
An illustrated summary of the study results, showing the models' relative performance in each topic category. For example, in Health, LLAMA-7B had the most accurate claims while GPT-4 had the least accurate claims.
Studies have been and are currently being conducted on the frequency and severity of falsehoods that models generate when answering prompts. Bias in training data and imitative falsehoods lead to everything from subtle inaccuracies to wild hallucinations. Of major concern is the Misuse of Information: accidentally or deliberately using LLM-generated claims to spread misinformation. But there is also the less obvious yet significant concern that the unreliability of models will lead to Mistrust, and to their positive benefits being underutilized.