What We Can Learn from How ChatGPT Handles Data Densities and Treats Users

Deepak P
Data & Society: Points
Mar 22, 2023

Perhaps no technology in recent times has received such widespread attention as ChatGPT, which is rapidly being embedded into the fabric of society. As we have seen, the chatbot’s adoption is substantively aided by pervasive pro-technology viewpoints, such as tech exceptionalism, advanced by people and companies who argue that technological changes largely yield good consequences. It is within this context that certain aspects of ChatGPT — and generative AI generally — warrant nuanced analysis and attention.

Recently, I posed three questions to ChatGPT:

  • Q1: How has science influenced religion?
  • Q2: Tell me about the Jarawa tribe of the Andamans.
  • Q3: What do you know about Truman’s theory of plant respiration?

Q1 deals with a topic about which there is an abundance of information on the web, whereas Q2 is represented on the web by just a handful of articles. Q3, on the other hand, is a fictitious theory, a nonsensical word combination that I dreamed up. While ChatGPT’s response to Q1 was a meticulous essay on the topic, the response to Q2 appeared to be largely copy-and-pasted from the handful of available articles. What ChatGPT does for Q3 is quite interesting: it has “heard” of President Truman, and it has heard of plants, and of respiration, too. So, in response, it creates a glorious narrative of how the revolutionary theory of plant respiration was invented by Truman in the mid-twentieth century (perhaps he was working on it during his presidency — was he, ChatGPT?). While the specific answers to the above questions are not important, they tell us how ChatGPT copes (or fails to cope) with differences in data densities across topics.
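Readers who want to run similar probes can do so programmatically. The sketch below is a rough illustration using the OpenAI API; the library interface, model name, and parameters are my assumptions, and the experiment described above was run through the ChatGPT web interface, so responses will vary.

```python
# A rough sketch of probing ChatGPT across topics of differing data density.
# Assumptions: the pre-1.0 OpenAI Python library, the gpt-3.5-turbo model,
# and an API key; the experiment in this piece used the ChatGPT web interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

questions = [
    "How has science influenced religion?",                         # data-dense topic
    "Tell me about the Jarawa tribe of the Andamans.",              # data-sparse topic
    "What do you know about Truman's theory of plant respiration?", # fictitious topic
]

for question in questions:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model; the web UI may use a different one
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    print(question)
    print(response["choices"][0]["message"]["content"])
    print("-" * 60)
```

Swapping in topics of your own makes it easy to see for yourself how the quality of the responses tracks the amount of material available on the web.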

While ChatGPT does well on topics with good data densities, it is fragile when data becomes scarce. But that doesn’t mean it gives up. An analogy may help to illustrate this. Picture yourself as a college student confronted with an exam question on a topic you know nothing about — but in this high-pressure scenario, you can’t afford to fail the exam. What would you do? You might use your imagination to work out what a reasonable answer would look like, and write something that resembles it. It’s a sensible strategy in a nothing-to-lose, high-stakes scenario — better than writing nothing at all. ChatGPT seems to behave like that college student when it confronts this kind of scenario. When data is scarce, ChatGPT falls back on reasonable-sounding fiction, conjured up by mixing words and phrases from across its wide repository. Why else would it claim that Truman invented a biological theory?

Yet ChatGPT’s handling of data density could have serious ramifications. We all know that the Global South is poorly represented on the web, and thus within ChatGPT’s training data. (We also know that much of the data curation was done by poorly paid human workers in Kenya.) It follows that ChatGPT’s answers on topics relating to the Global South will be less reliable than those that draw on more data. Similarly, the representation of women, minority groups, and marginalized communities is poor — it would be easy to craft queries quizzing ChatGPT on those topics and get it to spit out imaginative fiction; do give it a try!

A second issue is how ChatGPT treats users. Web search engines and social media feeds provide us with information, and ChatGPT does too. Search engines return multiple links, and social media feeds serve as context for individual posts — all of which allow the user to find more detail about the information presented. This enables users to easily explore whether the website served up by a Google search is credible, or to click on social media profiles to learn more about the author of a post. If, in this exploration, they find the website to be biased in some way, or the social media author to be untruthful, they might consider the information provided to be less valuable. Indeed, lateral reading — reading around the context of a piece of information — is key to effective fact checking, and is employed by professional fact-checkers. ChatGPT has an epistemic role as a provider of information, but it offers no such exploration capability. It is not even transparent about where it gathered its information, or how it judged the utility or trustworthiness of its sources. “Take it or leave it” is its unstated message, reducing the user to a passive recipient of information, incapable of making a judgment for herself. This form of information presentation skews user-AI power relations strongly toward the technology. The result is that ChatGPT can present imaginative fiction, alternative facts, or plain lies (whatever you might call it) without any accountability.

In the past few weeks, big players in consumer-facing technology, including Google and Facebook, have indicated that they are eager to enter the generative AI space with their own ChatGPT-like technologies. As backbones of the so-called “attention economy,” many of these tech companies have followed a free, ad-powered model, seeking to maximize engagement and sell their users’ attention to advertisers, raking in billions of dollars in revenue. This kind of engagement optimization leverages intrinsic and evolutionary predispositions, including confirmation bias — the tendency to value viewpoints that align with one’s existing beliefs. The algorithms powering big tech spend a lot of energy searching users’ news feeds and social media posts to tailor information delivery. If ChatGPT’s evidence-free information delivery paradigm becomes normalized within big tech — as it seems well-poised to do, at least in the long run — it could tear down an existing barrier: the limitation of the search to existing news and social media posts. Insofar as it might create new text, or modify existing text, to suit users’ interests, the technology would operate with a novel degree of freedom — one that Big Tech could find significant leverage in, and that we as users and as a society might greet with dread.

Data biases have long been a dominant concern in the growing field of AI fairness, and evidence-free information delivery is already the norm within personal digital assistant technologies such as Amazon’s Echo and Google Home. That said, the confluence of data biases and evidence-free information delivery within an epistemic service that has an anthropomorphized veneer makes ChatGPT a particularly troublesome cocktail. The refinements to ChatGPT’s internal engine — as in the most recent version, GPT-4 — focus on tweaks aimed at avoiding racist and sexist jokes; they hardly acknowledge deeper socio-political issues such as data densities or user disempowerment. With the weight of tech giants behind the current paradigms of generative AI, critiquing and challenging the political economy of these technologies is the necessary work of our time.

Apart from raising awareness and being vigilant when it comes to generative AI, we — as individuals — can potentially shield ourselves from some forms of harm by adopting simple safeguards. First, when using ChatGPT and its ilk to fetch information on niche or narrow-domain topics (e.g., understanding a particular historical event, or searching for the side effects of a particular drug), one could use ChatGPT in parallel with a conventional web search, and reason over both information sources before arriving at opinions or further actions. In other words, the disempowerment brought about by generative AI could be offset, to some extent, by empowering ourselves through traditional evidence-based knowledge services such as web search engines. A second method is to do what we do when faced with people who make claims we find unreasonable: ask for sources. We could simply put follow-up questions to ChatGPT, asking it for web links that support its arguments. ChatGPT has been shown to hallucinate when driven into a corner by repeated querying on sparse topics, and is notorious for providing non-existent but realistic-looking URLs. If it responds with a URL that throws up a “404: Page Not Found” error, we may reasonably conclude that ChatGPT’s position on the topic is not trustworthy.
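The second safeguard can even be partly automated. The sketch below is a rough illustration rather than a robust tool: it simply checks whether URLs offered as evidence actually resolve. The example URL is hypothetical, not a real citation.

```python
# A minimal sketch of the "ask for sources" safeguard: given URLs that
# ChatGPT offers as supporting evidence, check whether they actually resolve.
# The URL listed below is a hypothetical example, not a real citation.
import requests

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with a non-error HTTP status."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400:
            # Some servers reject HEAD requests; retry with GET before giving up.
            resp = requests.get(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False

cited_urls = [
    "https://example.org/truman-plant-respiration",  # hypothetical URL
]

for url in cited_urls:
    verdict = "resolves" if url_resolves(url) else "broken; treat the claim with suspicion"
    print(f"{url}: {verdict}")
```

A broken link is not proof of fabrication, but it is a useful prompt to fall back on the first safeguard and verify the claim through conventional search.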

We, as humans, have the agency to creatively devise tricks to repel technological harm. Let’s not undermine our agency in the face of the technological hype.

Dr. Deepak P is an associate professor of computer science at Queen’s University Belfast, UK. An AI researcher by background, his current research focuses on understanding and mitigating the societal harms brought about by AI and allied technologies. As an interdisciplinary researcher, he approaches technological research from a politically informed and critical perspective. He has published extensively in top venues in AI and data analytics, and is a senior member of the IEEE and the ACM. He can be reached at deepaksp@acm.org, and more information, including publications, is available on his homepage.
