Generative AI with SQL — First Impressions

The importance of writing powerful prompts

Josh Berry
Learning SQL
5 min read · Jul 25, 2023


Introduction

I have been a longtime advocate for using Jinja templates as a powerful tool to help writing SQL. Recently, however, I’ve been exploring the rabbit holes of generative AI to see how well it works.

What I quickly discovered is that AI writes relatively poor SQL when asked to do so blindly. Sure, once in a while it would hallucinate its way into writing a pretty complex SQL statement, and I’d sit there in amazement for a brief moment. I quickly realized, however, that these occurrences are few and far between. The AI was far more likely to do something hilariously wrong than to get it right.

Prompting Helps

I discovered that the secret was in the art of the prompt. If you prime the conversation with information about your table, for example, then the AI can make better SQL decisions.

I usually start all prompts with a prefix:

You are an AI agent whose job is to write the correct SQL for my analysis. The table dbo.SALES is a table that contains customerid, orderid, qty, sales, customername, date.

Next, I add the real prompt:

Bad — sales for Acme Corp
Better — total sales for customername Acme Corp by month

  • “Total” tells the AI to sum the sales, instead of just selecting rows.
  • “customername” helps the AI find the correct column.
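To make the payoff concrete, here is the kind of query the better prompt tends to produce. The sketch below runs it against a tiny invented stand-in for dbo.SALES (every row of data here is made up for illustration):

```python
import sqlite3

# Build a tiny stand-in for dbo.SALES; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerid, orderid, qty, sales, customername, date)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, 2, 50.0, "Acme Corp", "2023-01-15"),
        (1, 102, 1, 25.0, "Acme Corp", "2023-01-20"),
        (1, 103, 4, 80.0, "Acme Corp", "2023-02-03"),
        (2, 104, 3, 99.0, "Other Co", "2023-01-09"),
    ],
)

# "Total" -> SUM(sales); "customername Acme Corp" -> WHERE filter; "by month" -> GROUP BY.
query = """
    SELECT strftime('%Y-%m', date) AS month, SUM(sales) AS total_sales
    FROM sales
    WHERE customername = 'Acme Corp'
    GROUP BY month
    ORDER BY month
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('2023-01', 75.0), ('2023-02', 80.0)]
```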

Bad — top products based on growth
Better — Top 10 products by total sales based on year over year change for current year vs previous year

  • “by total sales” tells AI what data to use to rank the products
  • “current year vs previous year” indicates the time ranges for calculating the growth rate
  • using terms like “current” and “previous” will result in analysis that is relative to the current date
  • if you prefer hard coded dates, you would instead say “for 2023 vs 2022”
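The dbo.SALES schema above has no product column, so the sketch below invents one (“product”) purely to illustrate the hard-coded-dates variant of this prompt; the data is likewise made up:

```python
import sqlite3

# Hypothetical table: the real dbo.SALES has no "product" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product, sales, date)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Widget", 100.0, "2022-06-01"), ("Widget", 300.0, "2023-06-01"),
    ("Gadget", 200.0, "2022-06-01"), ("Gadget", 220.0, "2023-06-01"),
])

# "by total sales" -> SUM; "for 2023 vs 2022" -> hard-coded year ranges;
# "Top 10" -> ORDER BY the change, LIMIT 10.
query = """
    SELECT product,
           SUM(CASE WHEN date LIKE '2023%' THEN sales ELSE 0 END) -
           SUM(CASE WHEN date LIKE '2022%' THEN sales ELSE 0 END) AS yoy_change
    FROM sales
    GROUP BY product
    ORDER BY yoy_change DESC
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Widget', 200.0), ('Gadget', 20.0)]
```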

Bad — what color products are popular
Better — What are the top 10 most popular product colors based on number of orders for the current year to date

  • including a date range will help limit the data queried
  • “based on number of orders” tells AI what metric to use to approximate popularity

Step-by-Step Instructions Help Even More

I also found that giving the AI step-by-step instructions is even more helpful. It helps to know the style of SQL writing that uses Common Table Expressions, or CTEs, because the AI responds especially well to that structure.
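For reference, the CTE style in question chains small, named steps, each selecting from the previous one. A toy example on an invented table (run through sqlite3 so it is self-contained):

```python
import sqlite3

# Invented sample data, just to have something for the CTEs to chew on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customername, sales)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Acme Corp", 75.0), ("Acme Corp", 80.0), ("Other Co", 99.0)])

# Each CTE is one named, self-contained step; later steps build on earlier ones.
query = """
    WITH cte_totals AS (
        SELECT customername, SUM(sales) AS total_sales
        FROM sales
        GROUP BY customername
    ),
    cte_top AS (
        SELECT MAX(total_sales) AS max_sales FROM cte_totals
    )
    SELECT customername
    FROM cte_totals, cte_top
    WHERE total_sales = max_sales
"""
top = conn.execute(query).fetchone()[0]
print(top)  # Acme Corp
```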

Here’s an example of asking the AI to do anomaly detection on stocks data.

A basic example

You are an AI agent whose job is to write the correct SQL for my analysis. The table dbo.STOCKS is a table that contains date, ticker, industry, close, open, volume.

Write SQL to tell me: detect anomalies in close price in the past 12 months

The problem here is that the AI has used a basic standard deviation calculation on the entire dataset to define anomalies. It didn’t even group by ticker!
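In sketch form, the naive logic the AI produced amounts to something like this (pure Python on invented data): one mean and standard deviation over every row, with no per-ticker grouping, so a spike that is obvious within one ticker goes unnoticed:

```python
from statistics import mean, pstdev

# Invented sample: two tickers on very different price scales, plus a
# spike to 15 that is clearly anomalous for ticker AAA.
closes = [
    ("AAA", 10.0), ("AAA", 11.0), ("AAA", 10.5), ("AAA", 15.0),
    ("BBB", 500.0), ("BBB", 510.0), ("BBB", 505.0), ("BBB", 498.0),
]

# The naive approach: one mean/stddev over ALL rows, regardless of ticker.
values = [c for _, c in closes]
mu, sigma = mean(values), pstdev(values)

anomalies = [(t, c) for t, c in closes if abs(c - mu) > 2 * sigma]
print(anomalies)  # [] ... the AAA spike to 15 goes undetected
```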

What I really want is a rolling comparison over time, within each ticker, to define an anomaly.

A better example

Instead of just a simple prompt, I’ll engineer a detailed prompt that properly teaches the AI how to define anomalies.

You are an AI agent whose job is to write the correct SQL for my analysis. The table dbo.STOCKS is a table that contains date, ticker, sector, close, open, volume. User input to apply to the template:

where sector is Industrials, use kpi CLOSE by DATE, dimension TICKER

Template contents:
Use the following template and substitute dimensions, kpi, and date filtering based on the input.
Step 1: CTE_HISTORY_AGG which selects the dimensions and the kpi from the dataset. In CTE_HISTORY_AGG, ignore input references to a date and use WHERE {date} > '2020-01-01' and {date} is not null.

NOTE: Aggregate the kpi and group by the {dimensions} and the {date}. Choose the aggregation yourself, but if you’re not sure then use SUM.

Step 2: CTE_STATS which selects from CTE_HISTORY_AGG and adds the following columns:
1. column moving_average defined as AVG({kpi}) OVER(PARTITION BY {dimensions} ORDER BY {date} ROWS BETWEEN 19 PRECEDING AND CURRENT ROW)
2. column normal_spread defined as STDDEV({kpi}) OVER(PARTITION BY {dimensions} ORDER BY {date} ROWS BETWEEN 19 PRECEDING AND CURRENT ROW)*3
3. column lower_limit defined as moving_average - normal_spread, and column upper_limit defined as moving_average + normal_spread

Step 3: CTE_RESULTS which selects from CTE_STATS, and flags any rows where the kpi exceeds either the upper or lower limit, and call it either “positive outlier” or “negative outlier”.

Also add any date filter mentioned in the input. If not specified in the input, use WHERE {date} is in the last 3 months. Only report rows flagged as outliers, order by {date} DESC, and show the most recent 1000 results.

The result here is rock solid!

[Image: the AI-generated SQL]
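To see why the template works, here is its three-step logic sketched in plain Python on invented data: a 20-row rolling window per ticker, limits at the moving average plus or minus three standard deviations, then flagging breaches:

```python
from statistics import mean, pstdev
from collections import defaultdict

# Invented sample: a flat series with one spike at the end for ticker "AAA".
rows = [("AAA", f"2023-01-{d:02d}", 10.0) for d in range(1, 20)]
rows.append(("AAA", "2023-01-20", 50.0))

# Step 1 (CTE_HISTORY_AGG): collect the kpi per dimension (already one row per date here).
by_ticker = defaultdict(list)
for ticker, date, close in rows:
    by_ticker[ticker].append((date, close))

# Steps 2-3 (CTE_STATS / CTE_RESULTS): rolling 20-row window per ticker
# (19 preceding rows plus the current one), limits at moving average
# +/- 3 standard deviations, then flag any breach.
outliers = []
for ticker, series in by_ticker.items():
    for i, (date, close) in enumerate(series):
        window = [c for _, c in series[max(0, i - 19):i + 1]]
        ma, spread = mean(window), pstdev(window) * 3
        if close > ma + spread:
            outliers.append((ticker, date, close, "positive outlier"))
        elif close < ma - spread:
            outliers.append((ticker, date, close, "negative outlier"))

print(outliers)  # [('AAA', '2023-01-20', 50.0, 'positive outlier')]
```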

Templates (again!)

If you read my prompt closely, you’ll notice that I spent extra time breaking it into sections. First, I gave the AI my input. Then, I told the AI that there is a logic template I want it to follow, which effectively merges my input with the template.

where sector is Industrials, use kpi CLOSE by DATE, dimension TICKER

Then separately, I wrote a template for the AI to follow:

User input to apply to the template:

{{ my input }}

Template: Use the following template and substitute dimensions, kpi, and date filtering based on the input.
Step 1…
Step 2…

Final Instructions

The reason for this is that my prompt is now a template which I can use Jinja to populate. This makes it easy to re-use my template on any dataset I want, even if it isn’t the dbo.STOCKS table.
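As a minimal sketch of that workflow: the prompt becomes a template with placeholders that get filled per dataset. Real Jinja uses {{ ... }} syntax via the jinja2 package; stdlib str.format stands in here to keep the example dependency-free, and every name below is invented:

```python
# Hypothetical prompt template; in practice this would be a Jinja template
# rendered with jinja2, but str.format shows the same populate-and-reuse idea.
prompt_template = (
    "You are an AI agent whose job is to write the correct SQL for my analysis. "
    "The table {table} contains {columns}. "
    "User input to apply to the template: "
    "where {filter}, use kpi {kpi} by {date}, dimension {dimension}"
)

# Populate the same template for the dbo.STOCKS example; swapping these
# arguments retargets the prompt at any other table.
prompt = prompt_template.format(
    table="dbo.STOCKS",
    columns="date, ticker, sector, close, open, volume",
    filter="sector is Industrials",
    kpi="CLOSE",
    date="DATE",
    dimension="TICKER",
)
print(prompt)
```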

Conclusion

I have only begun to scratch the surface of AI and how it can make us more productive. My goal was just to show you the basics of what I’ve learned so far. I will be sure to create more posts as I uncover new ways that AI can be useful in the world of SQL.

If you have suggestions, questions, or comments, you can find me hanging out in Locally Optimistic and DataTalks.Club.



Data scientist @ Rasgo, DataRobot, Comcast. Passionate about teaching and helping others.