We analyzed 100 articles for their sentiment, or how positive or negative they were, and then had them rewritten by three Large Language Models (LLMs): OpenAI’s ChatGPT, Anthropic’s Claude2, and Meta AI’s Llama2. We then analyzed the sentiment scores of the rewritten texts for any changes.
Sentiment analysis is the process of categorizing a text as positive, neutral, or negative and measuring to what degree. It is often used to assess the opinions and feelings expressed in reviews or in open-ended survey questions.
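The scoring tooling behind this study is not described here. As a rough sketch of how a 1-to-5 sentiment score could be produced, the snippet below uses the Hugging Face transformers pipeline with the publicly available nlptown star-rating model; both the library and the model are illustrative assumptions, not the study’s actual pipeline.

```python
# Illustrative only: one way to obtain a 1-5 sentiment score for an article.
# The specific library and model here are assumptions, not the study's tooling.
from transformers import pipeline

scorer = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

def sentiment_score(text: str) -> int:
    """Return a score from 1 (highly negative) to 5 (highly positive)."""
    result = scorer(text, truncation=True)[0]  # truncate long articles to fit the model
    return int(result["label"].split()[0])     # labels look like "4 stars"

print(sentiment_score("The launch was a resounding success."))  # e.g. 5
```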
Many of the stories studied here had their sentiment made more neutral after generative AI rewrote them. On the sentiment analysis scale used here, 1 is highly negative, 5 is highly positive, and 3 is neutral. The LLMs tended to move a story’s sentiment closer to 3, whether the original writing was more negative or more positive. In the aggregate, the rewritten articles had their sentiment flattened.
Overall, the analysis showed differences of roughly half a point or less between the original articles’ average sentiment score of 2.54 (slightly more negative than neutral) and the rewrite averages of 2.72 (Claude2), 2.95 (ChatGPT), and 3.08 (Llama2). Those differences became pronounced, however, for articles that originally scored 1 or 5. In those cases, the rewrites differed by more than a point, and by nearly 1.5 points on average, pulling toward a neutral 3. When the original scored 1, the rewrites averaged 2.35; when the original scored 5, the rewrites averaged 3.56.
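As a minimal sketch of how these comparisons could be computed, assume the per-article scores are collected in a pandas DataFrame; the column names below are invented for illustration.

```python
# Sketch of the aggregate comparison, assuming a DataFrame with one row per
# article and invented columns "original", "claude2", "chatgpt", "llama2",
# each holding a 1-5 sentiment score.
import pandas as pd

def summarize(df: pd.DataFrame) -> None:
    models = ["claude2", "chatgpt", "llama2"]

    # Overall averages, comparable to 2.54 vs. 2.72 / 2.95 / 3.08 in the study.
    print(df[["original"] + models].mean().round(2))

    # Articles at the extremes show a stronger pull toward neutral (3).
    for extreme in (1, 5):
        subset = df[df["original"] == extreme]
        rewrite_mean = subset[models].stack().mean()
        print(f"originals scored {extreme}: rewrite average = {round(rewrite_mean, 2)}")
```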
Fewer Words in LLM Rewrites
One possible explanation for the neutralized sentiment is that all three LLMs reduced the word count when they rewrote articles. Claude2 cut words by a notable 43.5%, compared with 13.5% for ChatGPT and 15.6% for Llama2. While shortening an article can be desirable for some purposes, the reduction may eliminate details or potent phrases that signal how negative or positive a story is. Losing those details and descriptive words could account for part of the movement toward a neutral score of 3 in stories with the most positive or most negative sentiment.
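Measuring that reduction is straightforward. The sketch below uses naive whitespace word counting, which is an assumption rather than the study’s exact method.

```python
# Percent reduction in word count between an original article and its rewrite.
# Whitespace splitting is a simplification; the study's counting may differ.
def word_reduction(original: str, rewrite: str) -> float:
    original_words = len(original.split())
    rewrite_words = len(rewrite.split())
    return 100.0 * (original_words - rewrite_words) / original_words

# Example: a 2,000-word article cut to 1,130 words is a 43.5% reduction,
# comparable to the average observed for Claude2.
print(round(word_reduction(" ".join(["w"] * 2000), " ".join(["w"] * 1130)), 1))
```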
This study was small, but the data suggest a slightly positive correlation between sentiment scores and word counts, with longer texts receiving higher scores. The trend was highlighted by comparing the three LLMs with each other: across all levels of original sentiment, Claude2 consistently had both the lowest sentiment scores and the lowest word counts, while Llama2 had the highest sentiment scores and the highest word counts.
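A correlation like this could be checked directly. The sketch below computes a Pearson correlation over hypothetical placeholder lists of rewrite scores and word counts, not the study’s actual data.

```python
# Sketch: Pearson correlation between rewrite sentiment scores and word counts.
# The input lists are placeholders; real values would come from the study data.
import numpy as np

sentiment_scores = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]   # hypothetical rewrite scores
word_counts      = [450, 520, 610, 640, 700, 790]   # hypothetical word counts

r = np.corrcoef(sentiment_scores, word_counts)[0, 1]
print(f"Pearson r = {r:.2f}")  # positive r means longer texts tend to score higher
```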
Summary
Employing LLMs to rewrite or paraphrase a text can offer speed and ease in content production, but it comes with caveats. There might be a sound reason for coverage of a news event to carry highly negative or positive sentiment, and dampening those qualities could keep readers from perceiving how troublesome or heartening an event may be. Outside of news content, publishers might want to convey a particular sentiment to evoke feelings in readers, and a neutral-scoring story might struggle to do so. On the other hand, there could be uses for more neutral texts that read more like “just the facts.” Publishers should consider the tone and purpose of a piece and know that LLMs may modify texts in ways that affect those goals.