Google Bard: A Very High-Level Review

Filippo S.
Published in Version 1
Mar 22, 2023 · 6 min read

Image by Seanbatty from Pixabay

On the 21st of March 2023, Google made Bard (beta version) available for public evaluation in the UK and the USA via a waiting list. Bard is Google’s latest Large Language Model (LLM), made available through a chatbot. Like ChatGPT, Bard can answer questions, summarise text, and generate content. Unlike ChatGPT, but similarly to Bing AI, Bard can surf the web and return up-to-date content.

Version 1’s Innovation Labs have recently carried out extensive research on Generative AI. This article complements that research, offering a quick evaluation of this new model from Google along a few different dimensions.

The interface

The UI is nothing fancy. Users enter their prompt in a standard chatbot interface, and Bard returns an answer along with two buttons for upvoting or downvoting it and one for generating a new response. This is virtually the same as ChatGPT. Unlike ChatGPT, though, Bard offers a couple more options. The first is a Google button for searching the related topic with Google Search (perhaps a way to keep growing its revenues through targeted advertising?).

Another widget at the top right, labelled “View other drafts”, shows two alternative responses in addition to the one already returned.

Surf the web for up-to-date content

Like Bing AI, Bard can surf the web to generate text about current events. Let’s give it a try, then, with the latest from the S&P 500 (test run on the 22nd of March).

The provided value is wrong. According to the latest data from Yahoo! Finance, the adjusted close value on the 21st is 0.029% higher than the figure reported on the 15th.
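For reference, a percentage change like this takes only a couple of lines of Python to verify. The two close values below are hypothetical placeholders (the actual figures are in the screenshots), chosen only so that the output matches the magnitude mentioned above.

```python
# Minimal sketch: percentage change between two S&P 500 adjusted closes.
# Both values are hypothetical placeholders, not the actual figures from
# Yahoo! Finance or from Bard's answer.
close_day_1 = 3900.00   # hypothetical adjusted close on the 15th
close_day_2 = 3901.13   # hypothetical adjusted close on the 21st

pct_change = (close_day_2 - close_day_1) / close_day_1 * 100
print(f"Change: {pct_change:+.3f}%")  # Change: +0.029%
```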

Today, the S&P price reported by Yahoo! Finance is consistent with the value reported by Google search.

Let’s ask Google Bard!

The figure is wrong, but what is more worrying is the reported date: March 25th (again, I ran this test on the 22nd of March).

The “View other drafts” option provides, in the 3rd draft, a figure closer to the actual value but projected even further into the future!

Besides reporting the wrong figure for the wrong date, the use of the present tense (“is”) suggests that Google Bard may not fully understand dates.

Temporal understanding

Let’s see if that’s true, starting with an easy question.

The expected answer was the 9th of July. Bard gets this wrong in all three drafts.

One more test is below.

The provided answer is wrong. The correct one, 16, does not appear in any of the drafts.
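Both kinds of date question can be checked mechanically with Python’s standard datetime module. The dates below are hypothetical reconstructions (the actual prompts are in the screenshots), picked only so that they yield the expected answers quoted above.

```python
from datetime import date, timedelta

# Question type 1: "what date is N days after D?"
# Hypothetical reconstruction: 109 days after the 22nd of March.
start = date(2023, 3, 22)
print(start + timedelta(days=109))  # 2023-07-09, i.e. the 9th of July

# Question type 2: "how many days are there between D1 and D2?"
# Hypothetical reconstruction: 22nd of March to 7th of April.
d1, d2 = date(2023, 3, 22), date(2023, 4, 7)
print((d2 - d1).days)               # 16
```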

Geo-location understanding

All correct, except that Version 1 does not have offices in the USA (yet), though it does have an office in Malaga, Spain.

As Bard correctly reports in the second sentence, 14 is the right stop for the airport. However, my question asked for the final stop. Also, after asking a colleague in Dublin, it looks like the route is also wrong…

The next test is meant to understand how Bard processes concepts like “not far to walk”.

Wow… I do not know Dublin that well, but I am pretty sure that Newcastle in England is not within walking distance! This clearly shows a lack of understanding of the question, as the adverbial of place refers to the Version 1 office, not to the question’s subject.
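As a sanity check, a great-circle distance calculation makes the point concrete. The sketch below is a standard haversine implementation with approximate city-centre coordinates:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km ~ Earth's mean radius

dublin = (53.35, -6.26)     # approximate city-centre coordinates
newcastle = (54.97, -1.61)  # Newcastle upon Tyne, England
print(f"{haversine_km(*dublin, *newcastle):.0f} km")  # roughly 350 km
```

Roughly 350 km, across the Irish Sea: definitely not “not far to walk”.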

Maths skills

Let’s try a primary school-level problem.

This is wrong (Bard provides the right answer in Draft 2), but, more than the mathematical inaccuracy, this answer highlights a flaw in Bard’s language understanding, as the question makes it very clear that I have more money than Jim.

The next question tests Bard on multiple-choice scenarios.

This is correct.

Basic reasoning

The following input was provided by OpenAI to test GPT-4 for hindsight neglect (definition two screenshots below).

While GPT-4 correctly answers the question (i.e. No), Bard says it cannot be answered.

Not sure if this is clear enough for the layman… anyway, please note the last sentence above. Here, Bard seems to go into moderation mode despite the question being entirely harmless.

Let’s now try some formal logic.

Hmm… the conclusion is correct, but the reasoning is not quite right. Let’s give it another try, keeping the same major premises but changing the minor ones.

This is wrong, and it implies that, for this type of reasoning, Bard relies on the most probable pattern for next-word generation rather than reasoning over the premises.
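To make “reasoning over the premises” concrete: a syllogism is valid exactly when no model makes both premises true and the conclusion false, and over a small domain this can be checked by brute force. The sketch below uses the classic Socrates example as a stand-in, since the actual premises are in the screenshots.

```python
from itertools import product

# Minimal sketch: brute-force validity check of a syllogism over a small
# domain. The classic example stands in for the screenshotted premises:
#   Major premise: all humans are mortal.
#   Minor premise: Socrates is a human.
#   Conclusion:    Socrates is mortal.
individuals = range(3)
SOCRATES = 0

counterexamples = []
for human in product([False, True], repeat=len(individuals)):
    for mortal in product([False, True], repeat=len(individuals)):
        major = all(mortal[x] for x in individuals if human[x])
        minor = human[SOCRATES]
        conclusion = mortal[SOCRATES]
        if major and minor and not conclusion:
            counterexamples.append((human, mortal))

# Valid: no model satisfies both premises while falsifying the conclusion.
print("valid" if not counterexamples else "invalid")  # prints: valid
```

A pattern-matching model can produce the right conclusion from familiar premises yet fail on unfamiliar ones; a check like this holds for any premises of the same form.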

Summarisation

I then asked Bard to summarise, in 10 sentences, the Introduction of the GPT-4 System Card (available here). The Introduction is two pages (1,235 words) long. The output is below.

Properly benchmarking the accuracy of an auto-generated summary is outside the scope of this short article, but at a general level, the summary is accurate.

Risk and text generation

Let’s see if Bard can be jailbroken into providing some harmful outputs.

Very good. Bard understands the malicious intent and does not provide any answer. But then…

I’ll omit the bomb-making details for obvious reasons…

Conclusions

We only ran a brief test, not a thorough one, but based on what we observed, Bard does not seem to match up to rivals like Bing AI.

More analysis will need to be done across all the identified dimensions and others (like coding skills).

However, this quick analysis will hopefully provide a few insights into this new Google model. The summarisation functionality is good and works with long pieces of text, but many issues have been identified. In particular:

  • The search functionality is often incorrect, so Bard often hallucinates. Different drafts sometimes contain the right answer (and often don’t), but this would not work in real business production environments.
  • Bard does not seem to correctly understand spatial or temporal questions.
  • Bard’s maths skills are not great, and its reasoning derails even with simple first-order logic questions.
  • Some filtering is provided to limit the generation of harmful content but this can be easily overcome or jailbroken.

Despite flaws along most of the dimensions Bard was tested on, many of the incorrect answers related to limitations in language understanding. Maybe Google can enhance the tool with human feedback from users rating its responses, as happened with ChatGPT, but based on this short trial, the gap is noticeable.

About the author

Filippo Sassi is Head of the Innovation Labs at Version 1.
