On August 22, OpenAI unveiled a fine-tuning API for GPT-3.5-turbo. Here’s a quote from the announcement:
Early tests have shown a fine-tuned version of GPT-3.5 Turbo can match, or even outperform, base GPT-4-level capabilities on certain narrow tasks.
So, how narrow of a task is SQL generation anyway?
Data Preparation & Experiment
To test OpenAI’s fine-tuning capabilities, we prepared the data and experiment as follows:
- From all ~1,000 “dev set” queries in the Spider benchmark, we randomly sampled 50% of the queries into the training set and used the remaining 50% as the validation set.
- We used a basic prompt for testing purposes, which included the table schema, PK/FK relationships, and the question. Note: we have previously posted Spider results with a much more feature-rich system, which yields higher accuracy overall. Here, we are keeping it simple to focus on the delta between the fine-tuned and larger models.
- We used Spider’s golden queries as the desired fine-tuning output.
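The steps above can be sketched in code. This is a minimal illustration, not our production pipeline; `build_example`, the system prompt, and the file name are our own placeholder choices. The chat-format JSONL structure (one `messages` object per line) is what OpenAI’s fine-tuning API expects for GPT-3.5-turbo.

```python
import json

# Placeholder system prompt; the real prompt includes schema and PK/FK details.
SYSTEM_PROMPT = "You are a text-to-SQL assistant. Answer with a single SQL query."

def build_example(schema: str, question: str, golden_sql: str) -> dict:
    """Build one chat-format fine-tuning record: the prompt (schema +
    question) as the user turn, the Spider golden query as the target."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{schema}\n\nQuestion: {question}"},
            {"role": "assistant", "content": golden_sql},
        ]
    }

def write_jsonl(examples, path="spider_train.jsonl"):
    # OpenAI's fine-tuning API expects one JSON object per line (JSONL).
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```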
The steps to fine-tune GPT-3.5 are documented in OpenAI’s fine-tuning guide, and the process is delightfully simple: upload the training file, call the API, and wait for the results.
The job we ran took about 1.5 hours to finish. You know it’s done when you receive an email containing the name of the fine-tuned model, which you can then use in OpenAI API calls.
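The upload-and-wait flow looks roughly like this with the OpenAI Python SDK (v1+). A sketch only: the helper name and file path are our illustration, and in practice you would poll `client.fine_tuning.jobs.retrieve(job_id)` or wait for the completion email.

```python
def launch_fine_tune(client, training_path: str) -> str:
    """Upload the JSONL training file and start a fine-tuning job.

    `client` is an openai.OpenAI() instance (SDK v1+); the calls used are
    client.files.create(...) and client.fine_tuning.jobs.create(...).
    Returns the job id, which you can poll until the job finishes and a
    fine-tuned model name is assigned.
    """
    # 1. Upload the training file for fine-tuning.
    uploaded = client.files.create(
        file=open(training_path, "rb"), purpose="fine-tune"
    )
    # 2. Create the fine-tuning job on the base model.
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id, model="gpt-3.5-turbo"
    )
    return job.id
```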
On to the interesting bit. How well does our new model do? We compared three configurations:
- A. GPT-3.5-turbo
- B. GPT-4
- C. Fine-tuned GPT-3.5-turbo
In past experiments, we have shown how GPT-4 typically beats GPT-3.5 by ~10% in accuracy. But what’s the outcome with fine-tuning?
We generated all 500 validation queries with each of the three configurations, using the same prompt. None of these queries appeared in the fine-tuning set.
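A rough sketch of how the generated queries can be scored: Spider-style execution accuracy counts a generated query as correct when it returns the same rows as the golden query on the benchmark database. This sorted-rows comparison is a simplification of Spider’s official evaluator, and the function names are ours.

```python
import sqlite3

def execution_match(db_path: str, generated_sql: str, golden_sql: str) -> bool:
    """Execution accuracy for one query pair: a match when both queries
    return the same multiset of rows on the benchmark SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        got = sorted(con.execute(generated_sql).fetchall())
        want = sorted(con.execute(golden_sql).fetchall())
    finally:
        con.close()
    return got == want

def accuracy(db_path: str, pairs) -> float:
    # `pairs` is a list of (generated_sql, golden_sql) tuples.
    hits = sum(execution_match(db_path, gen, gold) for gen, gold in pairs)
    return hits / len(pairs)
```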
And without further ado, here are the results:
The result is remarkable: the fine-tuned GPT-3.5 beat GPT-4 by 7.2%, and it beat the GPT-3.5 base model by 13%. That’s quite an achievement. Remember that using the fine-tuned model is also about 2x cheaper and significantly faster than GPT-4.
How much do queries change after fine-tuning?
Although the accuracy number looks promising, we examined the query details to understand what changed after fine-tuning.
Usage of SQL statements
The following data shows, for each SQL keyword, the number of validation-set queries that use it, comparing GPT-3.5-turbo, our fine-tuned model, and Spider’s golden queries.
This is quite a significant departure from the original structure. CTEs, which GPT commonly uses out of the box, have been dropped completely from the fine-tuned model’s generated queries. Since CTEs often improve readability and can enable early predicate pushdown, this isn’t optimal from a SQL perspective, but it certainly follows the training set.
This is interesting: Even though Spider doesn't have many INNER JOINs for the golden queries, the fine-tuned model still uses INNER JOIN (instead of JOIN). This suggests that at least some of the knowledge learned from training and alignment is still being applied.
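Keyword usage like the above can be tallied with a small helper. A sketch, with an illustrative keyword list; note that a plain `JOIN` pattern also matches `INNER JOIN` occurrences, so distinguishing the two requires subtracting the counts.

```python
import re

# Illustrative keyword list; extend as needed.
KEYWORDS = ["WITH", "JOIN", "INNER JOIN", "EXCEPT", "INTERSECT"]

def keyword_counts(queries, keywords=KEYWORDS):
    """Count how many queries use each keyword (at most once per query),
    e.g. to compare base, fine-tuned, and golden query sets."""
    counts = {kw: 0 for kw in keywords}
    for q in queries:
        q_upper = q.upper()
        for kw in keywords:
            # Word boundaries avoid matching inside identifiers;
            # multi-word keywords tolerate any whitespace between words.
            pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
            if re.search(pattern, q_upper):
                counts[kw] += 1
    return counts
```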
Set operations (EXCEPT/INTERSECT)
A common pattern in Spider queries is the extensive use of EXCEPT/INTERSECT operations. During our initial Spider test, we noticed that generated queries mostly used join and filter operations instead of set operations. After fine-tuning, set operations such as INTERSECT were used more frequently.
Table alias naming
Finally, the generated query from the fine-tuned model looks similar to Spider’s golden queries:
SELECT t1.name FROM people AS t1
INNER JOIN poker_player AS t2
ON t1.people_id = t2.people_id
We’re not fans of this style of writing queries: it works, but it makes the queries harder to read. The aliasing style before fine-tuning was much easier to work with.
The fine-tuned version of GPT-3.5-turbo outperformed GPT-4 on this specific task at a fraction of the cost, and the experiment gave us a better understanding of how fine-tuning shapes the model’s behavior. However, there were setbacks in readability and query performance. Overall, these findings demonstrate the potential of tailoring models to a specific dataset, but they also highlight the specialized knowledge needed to do so effectively.
Given the promising results and intriguing discoveries, we plan to continue with the following next steps:
- Continuous tuning: How much can we improve workloads over time with more fine-tuning or improved fine-tuning mechanisms?
- Refine Query Readability: Develop and test strategies to improve the query readability of the fine-tuned model by maintaining or reintroducing well-named Common Table Expressions (CTEs) and aliases. We plan to use our own query set, which adheres to better readability and performance practices, instead of relying on Spider dev set golden queries.
- Integration with Existing Systems: We only tested fine-tuning for query generation, but we plan to leverage fine-tuned models for different parts of the system (such as query description, auto-completion, etc.) to improve efficiency and accuracy.
- Generalizability and Capability to Handle Complex Queries: Evaluate the fine-tuned model's ability to perform effectively across diverse domains and tasks without losing its fine-tuned specializations. This involves testing the model on datasets that are very different from the original training or fine-tuning data, as well as on complex queries (e.g., the 100+ line queries in TPC-DS), to ensure the model's adaptability.
- Community Engagement: Share the findings and methodology with the broader community, and collaborate with other researchers and practitioners to exchange knowledge and promote best practices in fine-tuning.
At Waii, we will continue to provide the best text-to-SQL APIs to the data community. We are excited about the potential of this technology and will continue to work towards refining its capabilities and promoting best practices in fine-tuning and query generation.