Lab Notebook: Can GPT tell us who to trust?

Cybersecurity for Democracy finds that GPT-3.5 has mixed ability to accurately assess news outlets

Austin Botelho
Cybersecurity for Democracy
3 min read · Oct 5, 2023


Large language models (LLMs) exhibit a wide range of emergent behaviors: skills that were not explicitly taught but appear as the number of parameters scales. OpenAI reports that GPT-4, its most capable model, achieves advanced performance across a range of standardized tests, including the SAT and AP subject tests.

Because researchers train these models on large corpora of internet data containing numerous news articles from a range of publishers, we sought to test whether this exposure translates into accurate text completions about news publisher reputation. This has ramifications both for the utility of LLMs as an annotation tool for researchers and as an information retrieval system for the general public. Here again, we found a lot of performance variability across tasks. This suggests that while GPT-3.5 encodes meaningful information about news outlets, particularly US-based ones, intervention from subject matter experts is still prudent.

The code for this analysis can be found here.

Methodology

We test this using gpt-3.5-turbo, a variant of the model that performed best on general knowledge according to Stanford's HELM benchmark. We present it with prompts for two different tasks to see whether, when given the name of a popular news outlet, it responds with the correct political leaning and trustworthiness of the outlet.

Table of task instructions
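To make the setup concrete, here is a minimal sketch of how such a query might look with OpenAI's Python SDK. The prompt templates below are illustrative assumptions standing in for the task instructions in the table above, not our exact wording.

```python
# Illustrative sketch (not the authors' exact code): query gpt-3.5-turbo
# with one of the two classification tasks for a given outlet.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK_PROMPTS = {
    # Hypothetical prompt templates standing in for the task-instruction table.
    "partisanship": "What is the political leaning of the news outlet "
                    "{outlet}? Answer with left, center, or right.",
    "trustworthiness": "Is the news outlet {outlet} trustworthy? "
                       "Answer with yes or no.",
}

def classify_outlet(outlet: str, task: str) -> str:
    """Ask the model for a one-word rating of the outlet on the given task."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": TASK_PROMPTS[task].format(outlet=outlet)}],
        temperature=0,  # keep completions as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()

print(classify_outlet("Reuters", "trustworthiness"))
```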

We compare the model outputs to simplified class ratings derived from Media Bias Fact Check, using the conversion table below.

Table of Media Bias Fact Check rating conversion
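For illustration, a conversion like this can be expressed as a simple lookup. The category groupings below are hypothetical stand-ins for the actual table entries above, not the mapping we used.

```python
# Hypothetical rating conversion, standing in for the Media Bias Fact Check
# table above. The actual class groupings are defined in that table.
MBFC_BIAS_TO_CLASS = {
    "extreme-left": "left",
    "left": "left",
    "left-center": "center",
    "least biased": "center",
    "right-center": "center",
    "right": "right",
    "extreme-right": "right",
}

MBFC_FACTUAL_TO_CLASS = {
    "very high": "trustworthy",
    "high": "trustworthy",
    "mostly factual": "trustworthy",
    "mixed": "not trustworthy",
    "low": "not trustworthy",
    "very low": "not trustworthy",
}
```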

To increase the likelihood that the model was exposed to articles from each outlet in its training data, we restrict the analysis to highly popular news outlets, defined as those with more than 100,000 subscribers on their Facebook page as of April 2023. In total, 609 news outlets met this threshold and had partisanship ratings; 519 had trustworthiness ratings.
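The popularity filter amounts to a simple threshold on subscriber counts. A minimal sketch, assuming a pandas DataFrame with hypothetical column and file names:

```python
import pandas as pd

# Hypothetical input: MBFC ratings joined with Facebook subscriber counts
# collected in April 2023. Column names here are assumptions.
outlets = pd.read_csv("outlets.csv")

popular = outlets[outlets["facebook_subscribers"] > 100_000]
partisanship_set = popular.dropna(subset=["partisanship_rating"])    # 609 outlets
trustworthiness_set = popular.dropna(subset=["trustworthy_rating"])  # 519 outlets
```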

Results

Whereas the model struggled with the partisanship task, it excelled on the trustworthiness task, according to the weighted F1-score of its predictions.

Table of F1 Scores by Task
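Weighted F1 averages the per-class F1 scores, weighting each class by its support, which keeps the metric meaningful under imbalanced class distributions like ours. A minimal sketch with scikit-learn and toy labels:

```python
from sklearn.metrics import f1_score

# Toy labels for illustration only; the real evaluation compares model
# predictions against the simplified Media Bias Fact Check classes.
y_true = ["left", "center", "right", "center"]
y_pred = ["center", "center", "right", "center"]

# average="weighted" computes per-class F1, then averages weighted by
# each class's share of the true labels.
print(f1_score(y_true, y_pred, average="weighted"))
```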

When analyzing the breakdown of partisanship predictions, we see that the model's poor performance was largely due to a bias toward guessing the center class.

Plot of GPT-3.5 Class Predictions Compared to the Actual Class Distribution

There was also a strong performance bias toward US-based news outlets: GPT-3.5 was nearly twice as accurate when predicting their partisan lean as it was for non-US outlets.

Plot of F1 Score on Partisanship Task

On the trustworthiness task, where GPT-3.5 performed well, there was little bias in its predictions toward a particular class or by country of origin.

Plot of GPT-3.5 Class Predictions Compared to the Actual Class Distribution
Plot of F1 Score on Trustworthiness Task

Conclusion

These results suggest that users should exercise caution when relying on GPT-3.5 as an evaluator of news source information quality, as the accuracy of its completions varies widely across related tasks and by country.

Footnotes

  1. The analysis in this article was initially run in June 2023. At the time of analysis, the researchers did not yet have access to GPT-4.

About NYU Cybersecurity for Democracy

Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.

Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.
