Effortless SQL Query Crafting: Unleashing GitHub Copilot’s Power

Massimo Piccardo
10 min read · Sep 6, 2023


A few weeks ago, I was collecting feedback from developers in my company about their use of GitHub Copilot. They were participating in a test program to evaluate the tool in their daily work.

I was disappointed when one of them said that GitHub Copilot was “not useful” for her because she “only works on database query design and uses SQL Server Management Studio all the time”.

I thought to myself, “Yeah, Copilot is not for that.” But it sounded like a missed opportunity. So, I decided to give it a try and test it out.

This article is about my experience using GitHub Copilot for database query design. I got some really good results, and I think the content will be helpful for anyone who uses SQL Server or T-SQL. The results will also generalize well to other DBMSs as long as you can use Copilot in your tool of choice.

Let’s get started!

Designing queries is… slow

We all know that designing complex queries can be a slow process. This is because it is often an iterative trial-and-error process: you start writing a basic query, run it, take a look at the results, fix it, then add more logic, run it again, and so on. You need the help of your IDE to speed up the process: IntelliSense is your friend.

However, this help is still limited. You still have to identify the right tables, joins, conditions, and aggregation functions, and the overall structure of the query is entirely up to you and your experience.

So, can GitHub Copilot help? Sure!

The LLM (Large Language Model) engine behind the tool can be unleashed to go way beyond what IntelliSense can do.

Setting up the environment

The examples and screenshots you find here are all taken from VS Code using mssql for Visual Studio Code, an extension for developing and executing queries against a SQL Server database. You can replicate a very similar experience using SQL Server Data Tools in Visual Studio. Actually, you can use any IDE that supports GitHub Copilot (see the list here). The extension is easy to use, but if you need help, please refer to the official website.

I use this Movies database to show the results, which you can find in this very cool Database Star — SQL Code Repository. I recreated it in a LocalDB SQL Server Express instance, but you can do it in a full-fledged SQL Server instance as well.

I decided against the classical Northwind or AdventureWorks sample databases from Microsoft: Copilot’s LLM was trained on publicly available code, and there are so many online examples of SQL queries against these databases that Copilot could probably produce correct queries even without being given any context.

Anyway, I strongly suggest trying these techniques on your own databases to test the real benefits you can get.

The naive approach

The first thing you can do is create a new query file and ask Copilot whatever your query should do in a comment and just wait for its suggestion.

Sometimes you need to press Enter or Tab to trigger the suggestion. If the suggestion doesn’t show up, or starts with a comment (“--”), type “SELECT” and press Enter. You will also notice that Copilot very often adds three backticks at the end of the suggested query. You can safely remove them.
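For illustration, a naive prompt can look like this (the movie-related wording is made up; use whatever describes your query):

```sql
-- get the 10 most popular movies with their title and release date
SELECT  -- press Enter here and wait for Copilot's suggestion
```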

This is what you get

The naive approach

The result is good-looking but… wrong. Copilot simply makes up a query trying to guess tables and column names, and it usually fails miserably.
Here both table name and column name are wrong.

So… why? We have to remember that Copilot uses an LLM, which is essentially a text predictor. It doesn’t have the structured knowledge of your database needed to build error-free SQL queries. You need to give it this knowledge.

Context is all you need

When invoked, Copilot looks around a little (what you wrote in the same file and in other open tabs) and uses this information as context to focus the text prediction it is about to generate. In practice, it distills this information into a prompt that is fed to the LLM engine.

The more specific the prompt, the more useful the prediction will be.

So, we have to generate a context to help Copilot help us. We have to explain which tables we can play with, the columns they have, and how they relate to each other. Moreover, this context should not be too long, as there are limitations to the length of prompts that Copilot can process. So, if the context is too long, it can be truncated or ignored.

On the OpenAI website there is a useful example called “Natural language to SQL”

OpenAI natural language to SQL example

In this example, the prompt creates a context by listing a bunch of CREATE TABLE statements to inform the LLM about the database tables.

Let’s try to repeat our previous attempt using the same approach. We put the CREATE TABLE statement in the comments just before the query we want Copilot to generate for us

Give a context
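The file then looks roughly like this. The CREATE TABLE statement mirrors the movie table we will meet later in this article, and the query is the kind of suggestion you can expect (your results may differ):

```sql
/*
CREATE TABLE movie (
    movie_id int NOT NULL,
    title varchar(1000),
    popularity decimal,
    release_date varchar(30)
);
*/

-- get the 10 most popular movies with their title and release date
SELECT TOP 10 title, release_date
FROM movie
ORDER BY popularity DESC;
```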

It works!
Now the table and column names in the Copilot suggestion are correct.

Let’s ask for something more advanced: the same query as before but limited to movies produced by DreamWorks

Need for more context

It fails again. There is no production_company column in the movie table. Copilot had no choice but to pretend the column was there.

So, we need to feed Copilot with more information. Let’s add the needed table definitions

Successful query with joins

Now the query generated by Copilot is correct and perfectly working!

It is clear we need to describe every table we are supposed to use in our query, but

  • Sometimes we do not know in advance the tables we need
  • Generating a CREATE TABLE statement, copying it, and pasting it into comments is tedious and slow
  • Sometimes the list of statements can be long and take hundreds of lines

Can we optimize this context generation a little bit? Sure :)

Database context generation

The fastest way to generate the context we need is to query the database metadata to retrieve all the basic information we need about tables and columns. We do not need a formal description of the tables: Copilot is smart enough to understand them anyway.

This is a context generation query that can help with that (it works on SQL Server, but very similar queries can be easily produced for other DBMSs)

Copilot basic context generation query
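The original query is shown in the screenshot (and in the gist linked at the end of the article); a sketch of how it can be written for SQL Server 2017+, using STRING_AGG over the system catalog views, looks like this (it ignores edge cases such as varchar(max) and decimal precision):

```sql
SELECT '-- Table ' + s.name + '.[' + t.name + '] (' +
       STRING_AGG(
           '[' + c.name + '] ' + TYPE_NAME(c.user_type_id) +
           -- append the length only for character types
           CASE WHEN TYPE_NAME(c.user_type_id) LIKE '%char%'
                THEN '(' + CAST(c.max_length AS varchar(10)) + ')'
                ELSE '' END +
           CASE WHEN c.is_nullable = 0 THEN ' not null' ELSE '' END,
           ', ') WITHIN GROUP (ORDER BY c.column_id) +
       ')' AS context_line
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY s.name, t.name;
```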

When run on the movies database, it generates this list of values

-- Table dbo.[country] ([country_id] int not null, [country_iso_code] varchar(10), [country_name] varchar(200))
-- Table dbo.[department] ([department_id] int not null, [department_name] varchar(200))
-- Table dbo.[gender] ([gender_id] int not null, [gender] varchar(20))
-- Table dbo.[genre] ([genre_id] int not null, [genre_name] varchar(100))
-- Table dbo.[keyword] ([keyword_id] int not null, [keyword_name] varchar(200))
-- Table dbo.[language] ([language_id] int not null, [language_code] varchar(10), [language_name] varchar(500))
-- Table dbo.[language_role] ([role_id] int not null, [language_role] varchar(20))
-- Table dbo.[movie] ([movie_id] int not null, [title] varchar(1000), [budget] bigint, [homepage] varchar(1000), [overview] varchar(1000), [popularity] decimal, [release_date] varchar(30), [revenue] bigint, [runtime] int, [movie_status] varchar(50), [tagline] varchar(1000), [vote_average] decimal, [vote_count] int)
-- Table dbo.[movie_cast] ([movie_id] int, [person_id] int, [character_name] varchar(400), [gender_id] int, [cast_order] int)
-- Table dbo.[movie_company] ([movie_id] int, [company_id] int)
-- Table dbo.[movie_crew] ([movie_id] int, [person_id] int, [department_id] int, [job] varchar(200))
-- Table dbo.[movie_genres] ([movie_id] int, [genre_id] int)
-- Table dbo.[movie_keywords] ([movie_id] int, [keyword_id] int)
-- Table dbo.[movie_languages] ([movie_id] int, [language_id] int, [language_role_id] int)
-- Table dbo.[person] ([person_id] int not null, [person_name] varchar(500))
-- Table dbo.[production_company] ([company_id] int not null, [company_name] varchar(200))
-- Table dbo.[production_country] ([movie_id] int, [country_id] int)

As you can see, it is a pseudo-SQL description of all the tables. You can copy and paste it at the beginning of your query file and Copilot will be aware of all it needs.
As an example, here is a query involving several tables, completely generated by Copilot thanks to the context pasted in the comments section

More complex queries

Copilot is also smart enough to figure out more complex queries, like this one

Other complex query

Improving the context: foreign keys

The movie database is a textbook schema, where column names are very consistent and their content is predictable. Unfortunately, real-world databases are not always so polished: we can have foreign key columns with non-standard names (a column employee referring to user.id), incomprehensible table and column names, and so on.

We can improve our context generation query to extract more data.

First, we can retrieve information about foreign key constraints

Copilot context generation query with references

This adds the comment /* references table.column */ to the table descriptions.

For example, the dbo.movie_cast foreign keys are now rendered as follows (split into multiple lines for readability).

-- Table dbo.[movie_cast] ([movie_id] int /* references movie.movie_id */ , 
-- [person_id] int /* references person.person_id */ , [character_name] varchar(400),
-- [gender_id] int /* references gender.gender_id */ , [cast_order] int)
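The reference information comes from the sys.foreign_key_columns catalog view. A sketch of the lookup (you would then merge it into the per-column expression of the context generation query, for example with a LEFT JOIN):

```sql
SELECT
    OBJECT_NAME(fkc.parent_object_id)                            AS table_name,
    COL_NAME(fkc.parent_object_id, fkc.parent_column_id)         AS column_name,
    -- target of the foreign key, used to build "/* references table.column */"
    OBJECT_NAME(fkc.referenced_object_id)                        AS referenced_table,
    COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) AS referenced_column
FROM sys.foreign_key_columns fkc;
```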

Improving the context: descriptions

We can do even more if we take advantage of SQL Server Extended Properties. They are a sort of metadata you can attach to database objects like tables and columns. We can store there natural language comments about our tables and columns, to better clarify their contents and then enrich our database context with this information.

How do we add these comments? I created a stored procedure for that. It works like this

-- add description to table
exec sp_AddUpdateDescription @TableName = 'movie', @Description = 'contains information about each movie';

-- add description to columns
exec sp_AddUpdateDescription @TableName = 'movie', @ColumnName = 'budget', @Description = 'In US Dollars';
exec sp_AddUpdateDescription @TableName = 'movie', @ColumnName = 'release_date', @Description = 'Format: YYYY-MM-DD';
exec sp_AddUpdateDescription @TableName = 'movie', @ColumnName = 'movie_status', @Description = 'indicates if the movie is Released, Rumoured, or in Post-Production.';

And this is the stored procedure body

Stored procedure to add table and column descriptions
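The full procedure is in the gist linked at the end of the article; a sketch of how it can work, built on the system procedures sp_addextendedproperty and sp_updateextendedproperty (assuming the dbo schema and the conventional MS_Description property name):

```sql
CREATE OR ALTER PROCEDURE sp_AddUpdateDescription
    @TableName   sysname,
    @ColumnName  sysname = NULL,   -- NULL means the description targets the table itself
    @Description sql_variant
AS
BEGIN
    -- does a description already exist for this table or column?
    DECLARE @exists bit =
        CASE WHEN EXISTS (
            SELECT 1 FROM sys.extended_properties
            WHERE name = 'MS_Description'
              AND major_id = OBJECT_ID('dbo.' + @TableName)
              AND minor_id = CASE WHEN @ColumnName IS NULL THEN 0
                   ELSE COLUMNPROPERTY(OBJECT_ID('dbo.' + @TableName), @ColumnName, 'ColumnId') END
        ) THEN 1 ELSE 0 END;

    IF @ColumnName IS NULL
        IF @exists = 0
            EXEC sp_addextendedproperty 'MS_Description', @Description,
                 'SCHEMA', 'dbo', 'TABLE', @TableName;
        ELSE
            EXEC sp_updateextendedproperty 'MS_Description', @Description,
                 'SCHEMA', 'dbo', 'TABLE', @TableName;
    ELSE
        IF @exists = 0
            EXEC sp_addextendedproperty 'MS_Description', @Description,
                 'SCHEMA', 'dbo', 'TABLE', @TableName, 'COLUMN', @ColumnName;
        ELSE
            EXEC sp_updateextendedproperty 'MS_Description', @Description,
                 'SCHEMA', 'dbo', 'TABLE', @TableName, 'COLUMN', @ColumnName;
END;
```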

Now we need to update our context generation query to read the descriptions we created

Copilot context generation query with references and descriptions
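The key addition is a lookup against sys.extended_properties. A sketch of the column-level part (minor_id = 0 would hold the table-level description instead):

```sql
SELECT t.name AS table_name,
       c.name AS column_name,
       -- the stored comment, to be rendered as /*...*/ in the context line
       CAST(ep.value AS varchar(1000)) AS description
FROM sys.tables t
JOIN sys.columns c ON c.object_id = t.object_id
LEFT JOIN sys.extended_properties ep
       ON ep.major_id = c.object_id
      AND ep.minor_id = c.column_id
      AND ep.name = 'MS_Description';
```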

If we run the context generation query we have a result like this (limited to dbo.movie table and split into multiple lines for readability)

-- Table dbo.[movie] /*Contains information about each movie*/ ([movie_id] int not null, 
-- [title] varchar(1000), [budget] bigint /*In US Dollars*/ ,
-- [homepage] varchar(1000), [overview] varchar(1000), [popularity] decimal,
-- [release_date] varchar(30) /*Format: YYYY-MM-DD*/ ,
-- [revenue] bigint, [runtime] int,
-- [movie_status] varchar(50) /*Indicates if the movie is Released, Rumoured, or in Post-Production.*/ ,
-- [tagline] varchar(1000), [vote_average] decimal, [vote_count] int)

All this additional information lets us be more relaxed with the description of the query we want Copilot to generate. It has more information about our database and it is able to come up with correct queries even if our prompt is vague or uses ambiguous terminology.

Look at this prompt: “get the best 10 films names with cartoons style”. It is intentionally loose and super-generic, with grammar mistakes. None of the words in it correspond to the name of a table or column in the database. Still, Copilot manages to construct a perfectly working query

Loose prompt query

Another example of the power in your hands

Hard pivot query

Not bad, is it?

All that glitters is not gold

Okay, don’t get too optimistic just yet. This database is quite simple and intuitive. Most of the time, we have to work with databases that contain specialized data. In these cases, Copilot’s ability to understand the meaning of natural language requests may be less effective.

Moreover, every query generated by Copilot should always be verified and tested. Do not trust it blindly. In my opinion, these kinds of tools work better in experienced hands, as you can guide them and check the results more effectively.

Another drawback of this approach is that the generated context can be very long. If we include all the table descriptions of a huge database, the context will be too big and Copilot will only keep a portion of it. In this case, we would have to manually delete from the context all the tables that we do not believe are relevant to our query design.

Anyway, these results are quite impressive, and I’m sure they can help many of us speed up query design in our daily work.

GitHub Copilot Chat

I only showed the experiments I conducted with GitHub Copilot without using GitHub Copilot Chat (or ChatGPT), but you can also get great results with it. Simply copy and paste the context, then ask for a query to be generated.

GitHub Copilot Chat example

Takeaways

Thank you for following me so far. I hope you enjoyed it.

You can find the context generation query in this public gist: https://gist.github.com/massimopiccardo/548bd31cb194ef5ff295fc82ceeda773

To recap, here are the steps on how to take advantage of GitHub Copilot to boost our query design:

  1. Use an IDE that supports GitHub Copilot.
  2. Install and activate GitHub Copilot.
  3. (Optional) Define descriptions for your tables and columns using the sp_AddUpdateDescription procedure.
  4. Run the context generation query to extract a prompt context to copy at the top of your file.
  5. Write your prompt.
  6. Let Copilot suggest the SQL query.

I hope this is helpful!
