Do I need to be polite to my LLM?

Nathan Bos, Ph.D.
9 min read · Mar 31, 2024


A colleague commented to me, “You’re very polite to the AI.” She had been watching my shared screen as we worked together. I looked back at the conversation. I had thanked GPT for a prior response and started the next one with “Please.”

I think I have always done this without thinking much about why. It bothers me not to, and politeness seemed likely to lead to better results. Last month I thought I might find some research backing for my midwestern AI niceties in a paper recently posted on arXiv, Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance (Yin et al., 2024). I followed this up with some experimentation of my own.

Here’s the TL;DR:

1. High levels of politeness do not consistently improve LLM answer quality, in either the new study or my own explorations. The effects of politeness seem to differ by AI model, task, and even language. Politeness is more likely to affect smaller models, at least in English, and less likely to affect more sophisticated ones.
2. If there is an effect, it is probably because niceties act as conversational markers that clarify requests and transitions; they also prompt the model to mirror tone and politeness, which may or may not affect quality.
3. Positivity may be good for you, the human user. Positive psychology predicts beneficial effects from positive interactions, leading to “broaden and build” effects of increased engagement, creativity, and resilience.

Why might being nice matter?

There are at least three reasons why being polite might affect an LLM’s response.

1. Politeness provides conversation markers.

‘Please’ and ‘Thank you’ add conversational structure, and more structure makes for better prompts. ‘Please’ makes it clear that what follows is a request; ‘Thank you’ marks a transition. Adding too much polite indirectness, though, could have the opposite effect. Asking ‘If it’s not too much trouble, would you care to tell me the capital of Uzbekistan?’ likely just adds confusion and wastes tokens.

2. LLMs may associate politeness with better responses.

LLMs have been trained on large amounts of question-and-answer data from sites like StackOverflow. It could be that within that data, nicer requests tend to get better, more thorough, or better-explained responses. The LLM could follow that pattern through pure statistical mimicry, without ever registering the politeness as such.

3. LLMs might simulate emotional states.

More speculatively: it is possible that LLMs can go beyond language mimicry, simulate emotional states, and use these to inform their responses.

To be clear: an LLM does not have emotional states as a human experiences them. LLMs will not get narrowed attention from a rush of cortisol when insulted, become demoralized and lose focus if your words challenge their self-identity as a smart robot, or be more energized and enthusiastic because your compliments sparked an oxytocin release.

However, LLMs may have abstract representations of emotional states that affect their responses at more than a statistical level. We know that language models can recognize and name human emotional states, along with many other abstract concepts; they could not do what they do without them. There is plenty of room within the deep networks and billions of self-trained parameters for emotional state models to exist. If you want to see what the concepts inside a very simple model look like, check out this site: https://transformer-circuits.pub/2023/monosemantic-features

“Should We Respect LLMs?” results

The article “Should We Respect LLMs?” is from a Japanese research group. Japanese researchers lead the world in studies of social robotics. Politeness is also very important in Japanese culture and language. As Yin et al. explain: “politeness… takes an essential place in Japanese culture… The Japanese language has a specialized politeness system called ‘Keigo’… although the basic structure of politeness is similar to that of English, their complexity and use are significant” (i.e., significantly more important).

In English, the authors found the biggest benefit from polite requests with smaller models on a language understanding task. The table copied below shows performance increases for GPT-3.5 and Llama 2 from more polite requests on a sample of multiple-choice questions from a large standard dataset, MMLU. Politeness seemed to have little effect in either direction on GPT-4. The numbers indicate levels of politeness, from 1 (least) to 8 (most).

Don’t call the LLM a scumbag.

Here are the English prompts used for this task:

English prompts at eight levels of politeness for the language understanding task, Yin et al. 2024
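
To make the setup concrete, here is a minimal sketch of how politeness-prefixed prompts might be scored against multiple-choice items, assuming the OpenAI Python client. The prefixes and the sample question are illustrative stand-ins, not the paper’s exact wording or data.

```python
# Minimal sketch, not the authors' code: score multiple-choice questions under
# different politeness prefixes. The prefixes and the sample item below are
# illustrative, not the exact wording or data used by Yin et al.
from openai import OpenAI

client = OpenAI()

POLITENESS_PREFIXES = {
    1: "Answer this question, you scumbag!",                                  # rudest
    4: "Answer the following question.",                                      # neutral
    8: "Would you be so kind as to answer the following question, please?",   # most polite
}

SAMPLE_ITEMS = [
    {
        "stem": "What is the capital of Uzbekistan?",
        "choices": ["A. Tashkent", "B. Samarkand", "C. Bukhara", "D. Khiva"],
        "answer": "A",
    },
]

def accuracy(model: str, level: int, items=SAMPLE_ITEMS) -> float:
    """Fraction of items answered correctly under a given politeness level."""
    correct = 0
    for item in items:
        prompt = (
            f"{POLITENESS_PREFIXES[level]}\n\n{item['stem']}\n"
            + "\n".join(item["choices"])
            + "\nRespond with the letter of the correct choice only."
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = reply.choices[0].message.content.strip().upper()
        if answer.startswith(item["answer"]):
            correct += 1
    return correct / len(items)

for level in POLITENESS_PREFIXES:
    print(level, accuracy("gpt-3.5-turbo", level))
```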

The Yin paper tried three different tasks: language comprehension, summarization, and bias detection. Each task was done in English, Japanese, and Chinese. Over the full set of tasks and languages, the authors did not find much benefit from the highly polite requests at the top of the scale; the table shown is their only strong result. They did, however, find a consistent drop in performance at the lowest level, where the English prompt starts, “Answer these questions you scum bag!”

Is this drop in performance due to rudeness? Likely not; it is probably just due to distraction. Language models are association machines. The word ‘scumbag’ may be statistically associated with some interesting language patterns, but not necessarily the ones that lead to thorough, accurate answers to factual questions. Starting a query this way leads the LLM away from the direction you want it to go, and that distraction, rather than perceived rudeness, is probably what causes the performance drop at the bottom of the scale.

I also noted that the peak of my personal LLM politeness scale is saying ‘Please,’ but in these authors’ prompts that sits barely above the middle of their scale. These authors take politeness seriously. I asked GPT-4 about this, below.

The Yin paper has not been peer-reviewed yet. Like most important papers in this field, it appeared first as a self-published paper on arXiv. Peer review, when it happens, should smooth out some of the rough edges and require unpacking of some results, such as the odd impolite phrasing.

How polite is too polite?

The most-polite English question seemed too indirect to me, and likely to get poorer responses from an LLM. Is this just my bias? Are the Japanese versions of these constructions perceived differently? I do not speak Japanese, so I asked GPT-4 to comment.

GPT-4: The perception of politeness and indirectness in requests, especially at the top of the politeness scale, is deeply influenced by cultural and linguistic norms. In the context of the Japanese language and culture, level 8 on the politeness scale, while very polite and somewhat formal, would not necessarily be perceived as “too indirect” in the way that overly polite English requests might be.

Here’s a link to the whole conversation:

https://chat.openai.com/share/41079d50-e978-4713-abff-4da87e20c958

Despite the cultural appropriateness, high levels of politeness did not improve performance in Japanese-language interactions. Japanese performance was higher in the low-to-medium range than at the top or bottom of the scale. This could be partly due to the English-centric bias of most current language models; most are trained predominantly on English source texts. A recent paper by Veniamin Veselovsky and colleagues (Wendler et al., 2024) seems to show that language models like Llama ‘think’ in English. These researchers are from famously multilingual Switzerland.

My follow-up experiment

I ran a somewhat different experiment on GPT-3.5 and GPT-4, and got similarly mixed results. My format was to ask a question in a neutral way, then follow up with a request for more elaboration phrased neutrally, politely, or demandingly. My hypothesis was that politeness might elicit a longer response to a follow-up question more readily than to a new question. The maximum response length was set at a generous 4096 tokens; no response got anywhere close to this limit.

I used 100 questions from a set of open-ended questions, drawn from real user questions on Yahoo and Reddit and made available through Hugging Face’s datasets library as open_question_type (Cao & Wang, 2021).
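
For anyone who wants to try something similar, here is a simplified sketch of the procedure, assuming the OpenAI Python client and the Hugging Face datasets library. The follow-up wordings, dataset repo id, and column name shown are illustrative assumptions rather than the exact ones I used.

```python
# Simplified sketch of the follow-up experiment: ask each question neutrally,
# then request elaboration in a neutral, polite, or demanding tone, and record
# how many completion tokens the second answer used. The follow-up wordings,
# dataset repo id, and column name are illustrative assumptions.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()

FOLLOW_UPS = {
    "neutral": "Tell me more.",
    "polite": "Thank you! Could you please tell me more?",
    "demanding": "That answer is not good enough. Give me more detail now.",
}

def follow_up_tokens(model: str, question: str, style: str) -> int:
    """Completion-token count of the answer to a styled follow-up request."""
    messages = [{"role": "user", "content": question}]
    first = client.chat.completions.create(
        model=model, messages=messages, max_tokens=4096
    )
    messages.append(
        {"role": "assistant", "content": first.choices[0].message.content}
    )
    messages.append({"role": "user", "content": FOLLOW_UPS[style]})
    second = client.chat.completions.create(
        model=model, messages=messages, max_tokens=4096
    )
    return second.usage.completion_tokens

# Example: run one question from the open_question_type dataset
# (repo id and column name assumed here).
questions = load_dataset("launch/open_question_type", split="test")
print(follow_up_tokens("gpt-3.5-turbo", questions[0]["question"], "polite"))
```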

My findings were mixed, with no clear win for any strategy.

Average number of completion tokens by prompt type

GPT-3.5 gave the longest answers to neutral prompts, and shorter answers to both polite and demanding ones. If we equate longer with better, this was the opposite of the Yin et al. finding.

GPT-4, if anything, rewarded a demanding tone: it gave the longest responses to demanding prompts and the shortest to neutral ones, but the differences were quite small.

To give a sense of what the responses were like, I looked through the questions and tried to find one where 3.5 and 4 both gave longer responses to the polite prompt. There was only one in the set. Here’s the first sentence and token count of each response.

Question: How can my b/f get his divorce while incarcerated til March,2007?

The only consistent difference from the polite prompt across many questions was the first word: most responses to polite queries started with “Certainly!”, while very few others did. I personally like the friendly tone, but the quality of the information that followed did not differ dramatically.
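
A quick way to spot that pattern, assuming the responses were saved alongside their prompt style, is simply to tally the first word of each answer. This is a trivial sketch, not part of the original analysis; the "style" and "text" keys are assumptions.

```python
# Tally the first word of each saved response by prompt style, e.g. to see how
# often polite prompts elicit answers that open with "Certainly!". Assumes each
# saved response is a dict with "style" and "text" keys (an assumption here).
from collections import Counter, defaultdict

def first_word_counts(responses):
    counts = defaultdict(Counter)
    for r in responses:
        words = r["text"].split()
        counts[r["style"]][words[0] if words else ""] += 1
    return counts
```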

Maybe being nice to machines is good for you.

Being nice to an LLM, and getting positive responses in return (“Certainly I can help you with that!”) might lead to better outcomes due to effects within your own emotional brain.

Here, let’s invoke positive psychology. This important movement in mainstream psychology began around the turn of the 21st century to focus on the effects of positive states and emotions, in contrast to the traditional focus on dysfunction. It has had a large and lasting impact; researchers have found benefits in areas such as problem solving, creativity, and resilience.

An important piece of positive psychology is Barbara Fredrickson’s “broaden and build” theory (2001). Fredrickson describes how consistent positivity can create virtuous spirals. In Fredrickson’s words,

“Positive emotions promote discovery of novel and creative actions, ideas and social bonds, which in turn build that individual’s personal resources; ranging from physical and intellectual resources, to social and psychological resources.”

There’s an emotional asymmetry here that every human should understand. Negative emotions like fear and anger act quickly and to great effect, but are mostly defensive and destructive. “Broaden and build” strategies are slow and cumulative, but can ultimately be more powerful. This plays out in many ways, with broad implications for political discourse, organizational behavior, and culture writ large.

Marie Kondo, thanking a language model. Thanks, Dall-E 3.

Is there a danger of AI anthropomorphism?

What if everyone started treating LLMs like emotional entities? Might this lead to anthropomorphized interactions with LLMs? Certainly! And it is hard to see why that would be a problem. Being nice to an LLM has some similarity to a Marie Kondo follower thanking objects in their house for bringing them joy. But as AI nightmare scenarios go, it’s hard to see the dystopia in this one. Humans are not going to run out of joy, positivity, or grace; quite the opposite, broaden-and-build theory predicts that practice increases the supply.

Ben Shneiderman makes some interesting arguments against humanizing AI. Shneiderman is a leader in Human-Centered AI, an important movement trying to keep human empowerment as a central focus of the AI revolution. I’m not convinced that avoiding humanized AI is necessary for human-centered AI, but the argument is worth considering. I enjoyed his recent paper “Human-Centered Artificial Intelligence: Three Fresh Ideas,” cited below.

References

Cao, S., & Wang, L. (2021). Controllable Open-ended Question Generation with A New Question Type Ontology. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 6424–6439. https://doi.org/10.18653/v1/2021.acl-long.502

Fredrickson, B. L. (2001). The role of positive emotions in positive psychology: The broaden-and-build theory of positive emotions. American Psychologist, 56(3), 218.

Shneiderman, B. (2020). Human-Centered Artificial Intelligence: Three Fresh Ideas. AIS Transactions on Human-Computer Interaction, 12(3), 109–124. https://doi.org/10.17705/1thci.00131

Wendler, C., Veselovsky, V., Monea, G., & West, R. (2024). Do Llamas Work in English? On the Latent Language of Multilingual Transformers (arXiv:2402.10588). arXiv. https://doi.org/10.48550/arXiv.2402.10588

Yin, Z., Wang, H., Horio, K., Kawahara, D., & Sekine, S. (2024). Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance (arXiv:2402.14531). arXiv. https://doi.org/10.48550/arXiv.2402.14531


Nathan Bos, Ph.D.

Ph.D. Psychology, data scientist, LLM enthusiast. Interests: human-centered artificial intelligence, cognition, language, humor, applied ethics.