Google enters the race: can PaLM 2 beat GPT-4 in SQL?

Wangda Tan
Waii
May 16, 2023

PaLM 2 has arrived and everyone is eager to see its capabilities! We have just gained access to the PaLM 2 API, and PaLM 2 now powers Bard as well. There are four PaLM 2 model sizes: Gecko, Otter, Bison, and Unicorn. Unicorn is not yet available, so we will be using Bison (bison-01) for the testing in this blog post.

To understand and compare these systems on SQL generation, we're diving deep into the performance of three well-known models: GPT-3.5-Turbo, GPT-4, and PaLM 2 (bison-01). We hope this is helpful to others in their evaluations and decision-making.

Test environment

Our test environment consisted of the dev set of Spider, a comprehensive and well-regarded benchmark dataset for text-to-SQL tasks, which contains 1,034 queries and 15 databases across various domains.
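For readers reproducing the setup: the public Spider release ships its dev set as a JSON file in which each example carries the natural-language question, the gold SQL query, and the target database id. A minimal sketch of iterating over it (the file path is ours; the field names come from the Spider release):

```python
import json

# Load the Spider dev set (path is hypothetical; adjust to your checkout).
with open("spider/dev.json") as f:
    examples = json.load(f)

for ex in examples:
    question = ex["question"]  # natural-language question
    gold_sql = ex["query"]     # reference SQL from the Spider authors
    db_id = ex["db_id"]        # which database the question targets
    # ... generate a query for `question` against `db_id`, then compare results
```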

We used our own query generation tool, Tweakit, as the backend for our study. It handles query generation end to end and integrates with the Snowflake database. It also takes care of prompt creation, inserting join information and query history into the prompt.
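Tweakit's prompt format is not public, but a prompt along the lines described above might be assembled like this (a hypothetical sketch; the section headers and helper name are ours, not Tweakit's actual format):

```python
def build_prompt(question: str, schema_ddl: str, join_info: str, history: list[str]) -> str:
    """Assemble a text-to-SQL prompt from schema, join hints, and query history.

    A hypothetical sketch of the kind of prompt described above; Tweakit's
    real prompt construction may differ.
    """
    parts = [
        "You are an expert Snowflake SQL writer.",
        "### Schema\n" + schema_ddl,
        "### Known joins\n" + join_info,
    ]
    if history:
        parts.append("### Recent queries\n" + "\n".join(history))
    parts.append(f"### Question\n{question}\n### SQL")
    return "\n\n".join(parts)
```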

We used the default model parameters (except setting the temperature to 0) for GPT-3.5-turbo, GPT-4, and PaLM2, aiming for an unbiased comparison that reflects the ‘out-of-the-box’ capabilities of each model.

Our evaluation method consisted of generating queries with each model and running them on Snowflake.
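Concretely, one evaluation step looks roughly like the following (a sketch using the OpenAI Python client and the Snowflake connector as they existed at the time of writing; credentials and connection parameters are placeholders):

```python
import openai                  # requires openai.api_key to be set
import snowflake.connector

def generate_sql(prompt: str, model: str = "gpt-4") -> str:
    # Default parameters except temperature=0, as described above.
    resp = openai.ChatCompletion.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def run_on_snowflake(sql: str):
    # Connection parameters are placeholders; use your own account details.
    conn = snowflake.connector.connect(
        account="your_account", user="your_user", password="...",
        database="SPIDER", warehouse="COMPUTE_WH",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```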

Please note that we have adjusted the prompts for each model to achieve the best possible results. This includes adding or omitting information based on the most successful outcomes.

Results

Blocked by safety

PaLM 2 blocked some of the queries (77 out of 1,034) due to “safety concerns”. We excluded these 77 queries from the comparison and did not penalize PaLM 2 for them.

An example of a blocked question is “Find the average age of losers and winners of all matches.” from the “wta_1” (Women’s Tennis Association) database. We don’t see any safety concern in the question or in the content of the prompt.
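For reference, with the Vertex AI preview SDK of the time, a blocked generation can be detected roughly like this (a sketch; we assume the public API name "text-bison@001" for the Bison model and the SDK's is_blocked flag):

```python
from vertexai.preview.language_models import TextGenerationModel

# "text-bison@001" was the public API name for the Bison-sized model.
model = TextGenerationModel.from_pretrained("text-bison@001")

def generate_or_none(prompt: str):
    """Return generated SQL, or None if the safety filter blocked the call."""
    response = model.predict(prompt, temperature=0.0, max_output_tokens=1024)
    if response.is_blocked:  # safety filter fired; no usable text came back
        return None          # such queries were excluded from the comparison
    return response.text
```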

Accuracy of generated query

When assessing the accuracy of a query, we compare the generated query’s result to the expected query’s output. What matters most to us is that the query produces a semantically correct answer to the question that was posed, since that is how we envision these systems being used.

We thus made sure that the result set contained the correct information (allowing for different possible row orderings, extraneous columns, different column aliases, and different answers for ambiguous questions).
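A simplified sketch of such a result-set check might look like the following; it ignores row order and extra columns by comparing per-column value multisets (this is an illustration of the idea, not the exact checker we used):

```python
from collections import Counter

def results_match(expected_rows, generated_rows) -> bool:
    """Check that the generated result semantically contains the expected one.

    Rows are tuples. Row order is ignored (each column is compared as a
    multiset of values), extra columns in the generated result are allowed,
    and column aliases are irrelevant because only values are compared.
    A simplified sketch, not the study's exact checker.
    """
    if not expected_rows:
        return not generated_rows
    expected_cols = list(zip(*expected_rows))  # column-major view
    generated_cols = list(zip(*generated_rows)) if generated_rows else []
    # Every expected column must appear (as a multiset of values) somewhere
    # among the generated columns.
    return all(
        any(Counter(exp) == Counter(gen) for gen in generated_cols)
        for exp in expected_cols
    )
```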

Compilation errors

We assessed the number of compilation errors that occurred when the models’ output was run on Snowflake. Please note that Tweakit includes a compiler that repairs syntactically incorrect queries as best it can. We applied that step to all models; the numbers quoted below are queries that remained incorrect after that step.
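One lightweight way to count compilation errors is to ask Snowflake to plan a query without executing it, via EXPLAIN (a sketch of the general approach; our harness may differ in the details):

```python
import snowflake.connector

def compiles_on_snowflake(conn, sql: str) -> bool:
    """Return True if Snowflake can compile the query.

    EXPLAIN makes Snowflake plan the query without running it, so syntax
    and semantic errors surface without touching any data.
    """
    try:
        with conn.cursor() as cur:
            cur.execute(f"EXPLAIN {sql}")
        return True
    except snowflake.connector.errors.ProgrammingError:
        return False
```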

Responsiveness to chain-of-thought (CoT)

One way to improve the accuracy of a model’s output is to have the model “think” before answering: for example, we can ask it to explain its thought process before generating the final output. Different models react differently to CoT.
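As an illustration, a CoT-style instruction for SQL generation might look like the following (a hypothetical sketch, not the exact prompt used in the study):

```python
# A hypothetical CoT-style instruction prepended to the SQL-generation prompt.
COT_INSTRUCTIONS = """\
Before writing the final query, think step by step:
1. List the tables and columns needed to answer the question.
2. Describe the joins and filters required.
3. Point out possible mistakes (wrong join keys, missing GROUP BY, ...).
Then output the final Snowflake SQL query after the marker 'SQL:'.
"""

# Prepend to the prompt built earlier (using the build_prompt sketch above).
prompt = COT_INSTRUCTIONS + build_prompt(question, schema_ddl, join_info, history=[])
```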

For our study:

  • GPT-4 pays close attention to what it writes in the output and uses it for future token predictions.
  • GPT-3.5-Turbo does not pay as close attention to its own output. For example, if we ask it to list possible errors before fixing them, GPT-3.5-Turbo may point out the problems but then ignore them when generating the final program. Overall accuracy improves somewhat with CoT, but less significantly than with GPT-4.
  • Using CoT with PaLM 2 has a negative impact on accuracy: once CoT is added, PaLM 2’s accuracy drops below 50%. Therefore, we modified the prompt to exclude CoT-related sections for PaLM 2.

Generation speed

We measured the speed of each model in generating queries. Speed is a critical factor, especially in interactive applications where fast responses are required.

| Metric | GPT-3.5-Turbo | GPT-4 | PaLM 2 |
| --- | --- | --- | --- |
| Average generation time per query (lower is better) | 3.30 sec | 26.48 sec | 4.52 sec |

As you can see, GPT-3.5-Turbo and PaLM 2 have similar query generation times, while GPT-4 is quite slow at the moment.
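Measuring this is straightforward: wall-clock each model call and average over the benchmark (a minimal sketch; generate_fn stands for any of the model wrappers sketched earlier):

```python
import statistics
import time

def average_generation_time(generate_fn, questions) -> float:
    """Wall-clock each model call and return the mean duration in seconds.

    generate_fn is any callable that takes a question and returns SQL,
    e.g. a wrapper around the OpenAI or Vertex AI calls sketched above.
    """
    durations = []
    for q in questions:
        start = time.perf_counter()
        generate_fn(q)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```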

Conclusions

Based on our study, PaLM 2 is quickly becoming comparable to GPT-3.5-Turbo (ChatGPT). The accuracy gap is closing fast (for context, PaLM 1 still struggled with basic programming), and PaLM 2 is on par with GPT-3.5-Turbo in terms of speed.

It’s noteworthy that bison-01 is not the largest PaLM 2 model; that distinction belongs to Unicorn 🦄. We look forward to updating these numbers when 🦄 becomes available.

We’re encouraged to see Google take safety seriously, but we suggest that the safety filter might need review and tweaking.

At the moment, GPT-4 is still the clear winner in all aspects: it generates code with far fewer hallucinations, follows instructions well, and has a 14.1% accuracy margin over GPT-3.5-Turbo and an 18.9% margin over PaLM 2 on generated queries.

If you’re interested in seeing this in action for yourself, please join our Slack channel, the Waii community, or request access at waii.ai.

This post was written by Gunther Hagleitner and Wangda Tan.
