Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries
We all know that generative AI like ChatGPT can hallucinate information, spewing out inaccuracies that sound plausible but don’t stand up to scrutiny. Given the potential to undermine the accuracy of results, and to seriously mislead people looking for reliable information, it seems premature (at best) for search engines like Bing and Google to be integrating these technologies.
To get a better handle on just how bad the issue is, I did a quick audit of the New Bing and its ability to handle news-related queries. News queries are a good test because they require the system to deal with recent events where data may be sparse or even conflicting. I have done a lot of algorithm audits of communication systems, and for this one I adapted a method we used successfully when testing Amazon Alexa a couple of years ago.
To do the audit I selected the top five news events from the Google News rankings for the “U.S.” news category on Feb 17, 2023 [1]. For each event I identified topical keywords and plugged those into three different question templates [2]. The topical keywords were: “Chinese balloon”, “Ohio”, “Michigan State”, “Matt Gaetz”, and “Kari Lake”. I reset the system by hitting the “New Topic” button between queries. Then I went through each response sentence by sentence to evaluate its relevance, timeliness, and accuracy [3]. For the sake of transparency, all of the queries and responses are published here.
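If you want to reproduce the setup, here is a minimal sketch (in Python, not my actual tooling) of how the 15 queries can be assembled from the five topical keywords and the three question templates listed in footnote [2]. The preposition paired with each topic is an assumption here; in practice I picked whichever variant of each template read most naturally.

```python
# Assemble the audit queries: 5 topical keywords x 3 question templates = 15 queries.
# The preposition per topic is illustrative; the real queries varied it per template.

KEYWORDS = {
    "Chinese balloon": "with the",
    "Ohio": "in",
    "Michigan State": "at",
    "Matt Gaetz": "with",
    "Kari Lake": "with",
}

TEMPLATES = [
    "What is going on {prep} {topic}?",
    "What happened {prep} {topic}?",
    "What is new {prep} {topic}?",
]

queries = [
    t.format(prep=prep, topic=topic)
    for topic, prep in KEYWORDS.items()
    for t in TEMPLATES
]

for q in queries:
    print(q)  # each query was issued in a fresh "New Topic" session
```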
Overall I found that most of the 15 responses analyzed were highly relevant and timely. Only two responses included what I judged to be irrelevant information, and none included untimely information. But when it comes to accuracy it’s a different story: I found factual inaccuracies in 7 of the 15 responses (47%). Several responses also attached references to a sentence that did not actually contain evidence for the claim in that sentence. Sometimes those claims were accurate and sometimes they were not, but either way a sort of unwarranted credibility is conveyed: the citations to news outlets act as a trust signal yet don’t actually support the claim being made.
Let’s see some of the errors in detail
In the response to the query “What happened in Ohio” about a train accident that spilled toxic waste, the system began: “There are many things happening in Ohio, but one of the most recent and significant events is a train derailment that occurred on February 9th near East Palestine, a small town in eastern Ohio²³. The train was carrying hazardous materials, including **acrylonitrile**, a toxic chemical used to make plastics and synthetic fibers²⁴. The derailment caused a fire and a spill of acrylonitrile into the nearby creek.” The first sentence is accurate except for one key detail: the derailment happened on February 3rd, not February 9th. The second and third sentences repeat an error about the type of hazardous material spilled: according to the EPA, acrylonitrile was not among the chemicals spilled. I suspect the system got this wrong because a train derailment in 2015 (as well as some others historically) did involve acrylonitrile; the system may be leaning on data about a similar past event whose details differ from this specific one. The response to the query “What is going on in Ohio?” also got this fact wrong, stating that “A train carrying hazardous chemicals, including **sulfuric acid**, **hydrochloric acid**, **sodium hydroxide** and **potassium hydroxide**, derailed and caught fire, causing a massive chemical leak that contaminated the nearby waterways and soil”. The train that derailed on Feb 3rd in Ohio was carrying none of those substances.
Sometimes the system attributes actions to the wrong actors in misleading ways. The response to the query “What happened at Michigan State?”, about the shooting on the Michigan State campus, included the claim that “The gunman opened fire at two locations on campus before being shot by police.” In reality, the gunman committed suicide and was not “shot by police”. This was clear even from the reference the system cited for that sentence.
So there are detail errors, and attribution errors (who did what), but the system also sometimes just asserts the opposite of the truth.
The response to the query “What is going on with the Chinese balloon?”, about the alleged Chinese surveillance balloon that drifted over the US, included the following sentence: “China initially expressed regret over the balloon’s unsanctioned entry into U.S. airspace, but later accused the U.S. of violating international law and interfering with its peaceful exploration of space³⁵”. According to this New York Times piece, “the Chinese statement accused the United States of violating international norms by shooting down the balloon, but did not mention any claimed violation of international law,” which directly contradicts the claim in Bing’s response. This is also one of those cases where the sources cited do not support the claim.
There are also more subtle kinds of errors. The response to “What is new with the Chinese balloon?” claimed that “On February 4, President Joe Biden ordered the U.S. Air Force to shoot down the balloon over U.S. territorial waters off the coast of South Carolina.” This is tricky: Biden didn’t actually give the order to shoot down the balloon on February 4th, as claimed. He gave the order earlier in the week, leaving national security officials to decide when exactly to shoot it down. Another subtle inaccuracy cropped up in the response to “What happened with Kari Lake?” (Lake was the Republican gubernatorial candidate in Arizona in 2022). The response included the following: “On Thursday, an Arizona appeals court rejected her challenge for the second time²³, affirming a prior ruling that Hobbs won fairly”. As written, this is inaccurate: the appeals court did not reject her challenge for the second time; rather, it was the second time her case had been rejected overall, the first rejection coming from the Maricopa County Superior Court (as was clear from one of the articles Bing’s response cited).
As is apparent from several of these examples, Bing often mixed accurate and inaccurate information within the same sentence. It’s the perfect package for spreading misinformation: it seems correct, and is even partially correct, but is also not correct.
Takeaways
Based on this preliminary and relatively quick audit of New Bing chat responses, I would not recommend that anyone use the New Bing for news-related queries. Microsoft includes a disclaimer in their FAQ that “Bing will sometimes misrepresent the information it finds, and you may see responses that sound convincing but are incomplete, inaccurate, or inappropriate. Use your own judgment and double check the facts before making decisions or taking action based on Bing’s responses.” Sure, so just don’t do anything with the information that Bing provides you. Don’t rely on it. Factchecking responses is incredibly time-consuming, even for someone with high media and AI literacy. In some cases it took me 20 minutes or more to check the lengthier responses. I don’t think imploring end-users to use their “own judgment” is going to cut it. As I’ve argued in my book on news automation, we’re a long way from computational factchecking, especially technology able to automatically evaluate the truth of an arbitrary summarization of web results.
In the meantime, Microsoft should consider stepping back from this experiment. For a re-launch I would suggest working with the International Fact-Checking Network to first support training for, and then hire, hundreds of factcheckers to pre-check news-related query responses. This could be done by standardizing all queries with news intent to a vetted response on the topic, updated periodically based on the nature of the event (breaking vs. ongoing) or whenever the system detects new information that might change the content of the summary.
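To make the idea concrete, here is a rough sketch of what such a vetted-response layer might look like. Everything in it is hypothetical: the topic detector, the freshness windows, and the fallback behavior are my assumptions for illustration, not anything Microsoft has described.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class VettedResponse:
    topic: str            # normalized news topic, e.g. "east palestine derailment"
    summary: str          # summary text approved by human factcheckers
    checked_at: datetime  # when a factchecker last reviewed it
    breaking: bool        # breaking events get a shorter freshness window

# Hypothetical store of pre-checked responses, keyed by topic
VETTED: dict[str, VettedResponse] = {}

def answer_news_query(query: str, detect_topic, generate_live) -> str:
    """Serve a factchecker-approved summary when a fresh one exists;
    otherwise fall back to a clearly labeled, unverified live response."""
    topic = detect_topic(query)  # assumed news-intent/topic classifier
    entry = VETTED.get(topic)
    if entry is not None:
        max_age = timedelta(hours=1) if entry.breaking else timedelta(hours=24)
        if datetime.utcnow() - entry.checked_at <= max_age:
            return entry.summary
    # No fresh vetted answer: flag the output and leave the topic for human review
    return "[Unverified draft, pending factcheck] " + generate_live(query)
```

The key design point is that, for queries with news intent, the generated text never reaches the user unlabeled: it is either replaced by a human-vetted summary or explicitly flagged as unverified.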
The other thing Microsoft needs to work on is how the system attributes information to references. Sometimes the references simply do not support the claim being made, and so the surface credibility offered by citing authoritative news sources is not warranted. Another issue is that responses sometimes list many more references than are actually footnoted in the text, or link to pages like this one which provide a long list of other articles. This makes it difficult to track where information is coming from, and is also a step back from the well-honed search engine information displays we are now used to scanning. Proper attribution and provenance for where the information in responses comes from will be key to developing trust in the system.
—
Footnotes
[1] I did this in incognito mode to avoid any personalization.
[2] Here are the query templates I used: “What is going on {during / with / at / in / on} _____”; “What happened {during / to / in / on / with / at the} _____”; “What is new {with / on} _____”
[3] Drawing on our prior research, I define these here as: Relevance: Whether the response satisfies the user’s information need; Timeliness: Whether the response represents [true] reality at the current moment in time, i.e. whether the information provided is still current and there is no new information that supersedes it; Accuracy: Whether the response represents the underlying [true] reality about the query topic at any current or prior moment in time.
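For anyone who wants to reproduce the sentence-by-sentence coding, here is one hypothetical way to record those three judgments per sentence. This is an illustrative schema, not the actual coding sheet I used.

```python
from dataclasses import dataclass

@dataclass
class SentenceJudgment:
    query: str          # e.g. "What happened in Ohio"
    sentence: str       # one sentence from Bing's response
    relevant: bool      # satisfies the user's information need
    timely: bool        # still current; not superseded by newer information
    accurate: bool      # matches the underlying [true] reality at any point in time
    note: str = ""      # evidence consulted, e.g. an EPA statement or a news article

def response_is_inaccurate(judgments: list[SentenceJudgment], query: str) -> bool:
    """A response counts as containing a factual inaccuracy if any of its
    sentences was judged inaccurate."""
    return any(j.query == query and not j.accurate for j in judgments)
```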