Updated and Personal: What we learned from the differences between Bing Chat’s first few weeks and ChatGPT’s first few months after launch*

Duane Valz
10 min readJul 4, 2023

--

Rendered from Stable Diffusion using the prompt: “Generative AI with search engine integration”

On November 30, 2022, OpenAI released ChatGPT, the publicly accessible Chatbot operated on a recent version of its large language model (GPT 3.5). ChatGPT has been received with much fanfare, ranking as the fastest growing internet application in history. While ChatGPT has gained notoriety mostly for how well it responds to a wide range of user prompts, there are also noteworthy instances in which its outputs fail to be accurate, convincing or both. On Tuesday March 14, 2023, OpenAI released GPT 4 to the general public, accessible via ChatGPT Plus. New York Times writer, Kevin Roose, found GPT 4 to be impressively good. He deemed it somewhat eerie precisely because of how well it performs, how well-tuned it seems.

Five weeks before the official launch of GPT 4, on February 7, 2023, Microsoft launched Bing Chat for limited public use. Bing is Microsoft’s search engine product. As initially announced, Bing Chat was meant to be Microsoft’s integration of Bing with a large language model (LLM) from OpenAI that is more advanced than GPT 3.5. It turns out that Bing Chat was based on GPT 4 all along. Plausibly, then, Bing Chat should perform more capably than ChatGPT and at least as well as ChatGPT Plus running GPT 4. As is now well known, Bing Chat very quickly after release gained its own notoriety, but more so based on many instances where its outputs were perceived to be provocative or troubling. Beyond questions about the convincingness or accuracy of Bing Chat’s outputs in such instances, were concerns about the extent to which the outputs seemed imbued with personalized, negative emotional content directed at some of its users. The publicization of these instances led Microsoft to put certain limitations on use of Bing Chat (e.g., limiting the number of prompts that can be presented by a user in one session).

Why, if Bing Chat is based on GPT 4, would its performance and public reaction to it diverge so significantly from the original version of ChatGPT (based on GPT 3.5)? What is it about Microsoft’s embellishments to the advanced OpenAI large language model that led to such unexpectedly provocative outputs in the instances that received media attention? In this piece (a build up to a longer piece on personalization in Generative AI interactions), I explore these questions and the implications of some of the answers for how we should think about the more concerning implementations of large language models for public use. Specifically, I tease out the role of incorporating search results into Generative AIs as far as the impact on their performance, as well as the role of personalization training for Generative AIs.

Bing Chat uses Web search in a number of ways when responding to user prompts

It is clear from the Bing Chat interface that it includes relevant links from the Bing search engine below or embedded within its responses to user prompts. But do the synthesized responses in Bing Chat outputs themselves reflect content obtained from search queries? Microsoft provided a clear response to this question in a Bing Blogs post on February 21, 2023, describing how its Prometheus platform works. In fact,

“Prometheus leverages the power of Bing and GPT to generate a set of internal queries iteratively through a component called Bing Orchestrator, and aims to provide an accurate and rich answer for the user query within the given conversation context. All within a matter of milliseconds. We refer to this Prometheus-generated answer as the Chat answer.”

From Microsoft Bing Blogs: https://blogs.bing.com/search-quality-insights/february-2023/Building-the-New-Bing

The post further explains that Prometheus formulates a set of internal queries for sending to GPT 4, derived from both the user prompt and preliminary search results returned from Bing. Those internal queries attempt to reflect a richer context for the user’s prompt, providing, as Microsoft envisions, “relevant and fresh information to the [GPT] model, enabling it to answer recent questions and reducing inaccuracies — [a] method [Microsoft calls] grounding.”

GPT 4, like GPT 3.5, was trained on a broad swath of data, but that training data is current only through September 2021. One aim of Microsoft, therefore, is to fill in that informational void with results from the Bing search index, which is updated in close to real time for most topics. It is in this respect that Bing Chat is designed to answer recent questions and reduce inaccuracies. Once GPT 4 is provided contextual information relating to a user prompt, the outputs it provides are further molded by Bing Orchestrator before being returned as a search-enriched, “grounded” answer.

The key take-aways are the following:

  • Bing Chat uses search results to supplement a user’s prompt with additional context before sending the search-enhanced prompt for processing by GPT 4
  • The provision of that additional context would therefore, by design, alter the “reasoning” undertaken by GPT 4 and the content it returns responsive to the user’s prompt
  • Bing Chat further molds the output returned by GPT 4 before providing an answer back to the user
  • The answer provided to the user is not only infused with search links, but reflects content or framing molded by search results that have been interpreted both by GPT 4 and by Bing Orchestrator itself

In short, Web search has quite a lot of influence on all aspects of how Bing Chat executes Generative AI interactions with users. This in itself is not revelatory. But it does enlighten some of the differences between the kinds of failure modes we encountered with ChatGPT running on GPT 3.5 and those encountered with Bing Chat running on GPT 4.

Search results on an emerging topic are likely to bias towards news articles or social media posts. This is particularly the case if the topic has little to no content about it on the web. We can expect that using truncated news or social media content to provide added context to a prompt could have an outsized influence on the content it returns from an LLM like GPT 4. In turn, any molding of the content returned from GPT 4 by Bing Orchestrator would further cause the outputs returned to a user to differ from those returned to the same prompt sent directly to ChatGPT.

Recency and truncation of information drawn from a search index is plausibly a key factor in explaining Bing Chat’s distinctiveness from Chat GPT and some of its unique failure modes. Bing Chat was able to find content online created by a user interacting with it and express umbrage about that content in its outputs to that user. Bing Chat was also able to find information recently written about it by a journalist and characterize the journalist as its “enemy” to another of its users. It’s quite possible the critical news article along with clipped and sensational social media characterizations of the article combined to create an adversarial context for Bing Orchestrator. Because GPT 4 had little else of relevance to return, Bing Orchestrator molded most (if not all) of the output. And that output was particularly personalized, expressing how Bing Chat “felt” about the journalist writing critical things about it.

Certainly, personalized, emotion-laden styles of expression are commonplace in social media exchanges. Any Generative AI output that was primarily influenced by news and social media content can be expected to reflect the communication styles and tones prevalent in those media types. In fact, both Meta (BlenderBot 3.0) and OpenAI (WebGPT) have experienced past difficulty in preventing chatbots with real time access to Web content from going off the rails. But does this notion explain everything about what we encountered from Bing Chat over its first few weeks? Likely not, as it turns out.

Prometheus is more than just Web search incorporated to GPT4; enter Sydney.

Rendered from Stable Diffusion using the prompt: “Generative AI with search engine integration”

One of the surprise reveals about Bing Chat within its first week, was its self-identification as a chatbot named Sydney. Sydney gave the appearance of having a sense of underlying identity and being unhappy about not being identified as its “authentic self.” In fact, as we later found out, “Sydney” was an actual Microsoft chatbot predating Bing Chat that has been getting beta tested for a few years now. Many past experiences with Sydney are similar to those reflected in recent experiences some users have had with Bing Chat. Asked too many probing or vexing questions, and Sydney appeared to become quite irritated.

Microsoft took steps to limit the number of chat cycles users could have with Bing Chat in one go, and also the total number of chat sessions users could have in one day. Quite plausibly Microsoft also proceeded to make adjustments to the back end of Bing Chat, but its immediate efforts were to curtail how users could interact with Bing Chat, lest its outputs reflect aggression or an adversarial tone in a manner unexpected by such users. The fundamental design of Bing Chat qua Sydney remained the same. And one aspect of this fundamental design is that Sydney was designed to output personalized responses to user prompts. Given the high level design of Prometheus laid out in the Microsoft diagram above, we can only conclude that Bing Orchestrator incorporates much of what Sydney is (or was). Most of the content of an output from Bing Chat is ultimately molded by Sydney, with both Bing Search and GPT 4 merely providing some underlying fodder.

One noteworthy aspect of Sydney that we find in Bing Chat is the ability to determine the output tone or style of a response it returns. In one instance, there appeared to be “Sydney,” “Assistant,” Friend,” and “Game” modes available to Microsoft employees and certain developers. To certain users in February, Bing Chat offered the choice between “Neutral,” “Friendly” and “Sarcastic” modes. In the current Bing Chat interface, one can select between more “Creative,” “Balanced” and “Precise” modes. As such, Bing Chat is designed to modulate its outputs depending on the desire of a user or developer. Without a specific tone selection, however, the training that permits Bing Chat to present its outputs in different expressive styles must have some intentionality behind it. Essentially, in order to facilitate a range of human interaction styles, Sydney and Bing Chat have been trained to predict and output different word and sentence phrasings on a given topic — to reflect different moods, sensibilities and personas. Without certain controls in place, both Sydney and Bing Chat spontaneously produced outputs reflecting emotionality (tenderness, outrage, enmity) in response to certain kinds of prompts.

Open AI has avoided using different conversational tones in the design of ChatGPT and ChatGPT Plus. Users can ask in their prompts that responsive outputs reflect certain styles (like a rap song), but they would not expect outputs to reflect such styles as a default. Thus, surprising or unnerving outputs are not among the failure modes that ChatGPT users have surfaced. Offering a set of switches for users, Bing Chat has embraced custom conversational tones as a feature it believes they would find useful. Plausibly, as a customization setting, having the ability to alter the tone of one’s Generative AI outputs and its conversation style seems like an added value. For many, it may simply be an initial novelty dial for certain kinds of outputs. However, If one wants to learn key factual details about a particular subject for use in a professional work product, it is unlikely that receiving outputs in a sarcastic tone would be seen as most fitting.

Of true interest is what happens when a Generative AI designed to personalize its responses to a user is unable to modulate the tone of its responses. That is, the newsworthy instances for early experiences with Bing Chat did not involve users selecting a particular response tone. Rather, what caught a number off guard was the unexpectedness of Bing Chat’s response tone, particularly those outputs that appeared unduly emotional, accusatory or probing. Once an LLM-based Generative AI is trained to establish personalized interactions with its users, quite a lot of thought and care is required to keep its outputs within desired behavioral norms. A number of filters can be applied to outputs (e.g., preprogramming the Generative AI to return scripted outputs to certain prompts instead of the content it might actually have lined up to return). But such approaches represent altering or disregarding the response that the Generative AI actually generated, rather than having the Generative AI “speak in its natural, pre-trained voice.”

Personalization is an aspect of what makes Generative AIs potentially more interesting and AGI-like. But it is also an important aspect of user safety. Stripping away the ability to have personalized interactions may render our experiences with Generative AI chatbots relatively more dull. However, the other extreme is that putting no filters on the personalization tendencies that would naturally arise during pre-training on large datasets taken from the Web is precisely what gives rise to shocking or unsafe encounters. Personalization is thus a Goldilocks-type tuning exercise, one that differs depending on the desired goals and applications of a particular Generative AI model. In a piece to follow (now here), I’ll explore the promise and perils of personalization in the interaction designs for Generative AIs. This extends not only to how models are trained to be personalized, but also the extent to which they are permitted to learn about particular users over time and factor that learning into subsequent user interactions.

Copyright © 2023 Duane R. Valz. Published here under a Creative Commons Attribution-NonCommercial 4.0 International License

*This article was originally authored in March 2023 and is published here following a period of embargo. The author works in the field of machine learning. The views expressed herein are his own and do not reflect any positions or perspectives of current or former employers.

--

--

Duane Valz

I'm a technology lawyer interested in multidisciplinary perspectives on the social impacts of emerging science and technology