Natural Language to SQL using an Open Source LLM

Indira KriGan
Brillio Data Science
4 min read · Apr 25, 2023

Among the many explorations I've been running with my fellow LLM experimenters (a.k.a. my team), here is an interesting one: an LLM-based approach for translating natural language questions into SQL queries.

As a use case, this was being attempted well before LLMs took over. A typical scenario where it helps is when someone who doesn't know SQL needs to query a database. The same idea extends to other underlying data sources, such as graphs, where we could generate Cypher queries, for example, to query the graph and return results.

A very common ask from our customers around LLMs has been the use of open-source models in place of paid APIs (OpenAI, most of the time), the pros and cons of each approach, and the architectural and design considerations that need to be worked out before the gigantic question of scaling the solution to production!

One step at a time: this first blog is about using an open-source model to generate SQL queries at inference time with zero-shot prompting. A little bit of prompt engineering and, I must say, I was pretty impressed with the results!

The main tools, packages, and LLM that I used (a quick environment setup sketch follows the list):

  1. chavinlo/alpaca-native from Hugging Face, a replica of Stanford's Alpaca
  2. LangChain, a framework for developing applications powered by language models
  3. bitsandbytes, a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication, and quantization functions
  4. An AWS SageMaker instance (ml.g5.8xlarge)
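
For reference, this is roughly what the environment setup looked like in the SageMaker notebook. The exact package list is my assumption (accelerate and sentencepiece are needed for device_map='auto' and the Llama tokenizer); pin versions to match your setup.

# Rough environment setup (package list is illustrative; adjust versions as needed)
!pip install torch transformers accelerate bitsandbytes sentencepiece langchain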

I used the above-mentioned model as it has been giving me good results for similar experiments on in-context learning for healthcare NER, assertion models, etc.

Well, time to take a quick peek at the code (please find the Git link at the bottom):

Step 1: Import everything

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

Step 2: Let’s load the model and the tokenizer

base_model = LlamaForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=True,   # quantize the weights to 8-bit via bitsandbytes
    device_map='auto',   # let accelerate place the layers on the available GPU
)

tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")

Step 3: Define the pipeline and the prompt template. (I experimented a bit with the temperature parameter as well as the prompt here; have fun.)

pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=500,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2
)

local_llm = HuggingFacePipeline(pipeline=pipe)
# the LLMChain is created inside get_llm_response below, once the prompt is defined


template = """
Write a SQL Query given the table name {Table} and columns as a list {Columns} for the given question :
{question}.
"""

prompt = PromptTemplate(template=template, input_variables=["Table","question","Columns"])

def get_llm_response(tble, question, cols):
    llm_chain = LLMChain(prompt=prompt, llm=local_llm)
    response = llm_chain.run({"Table": tble, "question": question, "Columns": cols})
    print(response)
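
Before firing questions at it, it can help to peek at what the model actually sees. PromptTemplate.format is plain string substitution, so the columns list is rendered with its Python repr:

print(prompt.format(
    Table="employee",
    Columns=["id", "name", "date_of_birth", "band", "manager_id"],
    question="Query the count of employees in band L6 with 239045 as the manager ID",
))
# prints the template with the table name, columns list, and question filled in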

Step 4: Is this working …

tble = "employee"
cols = ["id","name","date_of_birth","band","manager_id"]
question = "Query the count of employees in band L6 with 239045 as the manager ID"
get_llm_response(tble,question,cols)

## Answer: SELECT COUNT(*) FROM employee WHERE band='L6' AND manager_id=239045;

Another one —

tble = "employee"
cols = ["id","name","date_of_birth","band","manager_id"]
question = "Query the count of employees in band L6 and over 40 years of age"
get_llm_response(tble,question,cols)

## Answer: SELECT COUNT(*) FROM employee WHERE band='L6' AND date_of_birth>=(CURDATE() - INTERVAL 40 YEAR);
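
A quick way to sanity-check the generated SQL is to run it against a throwaway table. Here is a minimal sketch with SQLite (note the second query uses MySQL's CURDATE(), so it would need a MySQL-compatible engine to execute as-is):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER, name TEXT, date_of_birth TEXT, band TEXT, manager_id INTEGER)"
)
conn.execute("INSERT INTO employee VALUES (1, 'Asha', '1990-01-01', 'L6', 239045)")

# run the first generated query against the toy table
print(conn.execute(
    "SELECT COUNT(*) FROM employee WHERE band='L6' AND manager_id=239045;"
).fetchone())   # -> (1,)
conn.close()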

The quantization is definitely working —

Tue Apr 25 19:31:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 24C P0 57W / 300W | 9845MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1659 C ...3/envs/python3/bin/python 9843MiB |
+-----------------------------------------------------------------------------+
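
For an in-Python check of the same thing, transformers also exposes a memory footprint helper and the device map that accelerate used (availability depends on your transformers version):

# approximate memory taken by the 8-bit weights, in GiB
print(base_model.get_memory_footprint() / 1024**3)

# how the layers were placed by device_map='auto'
print(base_model.hf_device_map)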

A few questions need to be answered as we move this from a PoC to a bigger solution:

  1. How do we accommodate more than one table, which will be the case in most scenarios? A few answers come to mind: bigger, smarter prompts, few-shot learning, or fine-tuning (a rough few-shot sketch follows this list).
  2. What about more complex queries?
  3. How would you evaluate the performance of an approach like this? Or compare with other LLM options?
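
On the first question, here is a rough sketch of what a multi-table, few-shot prompt could look like. The schema description, the example pairs, and the variable names are made up for illustration, and the template would certainly need tuning and evaluation:

multi_table_template = """
You are given the following tables and their columns:
{schema}

Here are some example questions and the SQL queries that answer them:
{examples}

Write a SQL query for the question:
{question}
"""

multi_table_prompt = PromptTemplate(
    template=multi_table_template,
    input_variables=["schema", "examples", "question"],
)

schema = (
    "employee(id, name, date_of_birth, band, manager_id)\n"
    "department(id, name, head_id)"
)
examples = (
    "Q: Query the count of employees in band L6\n"
    "A: SELECT COUNT(*) FROM employee WHERE band='L6';"
)

# reuses local_llm from above; an untested sketch, not a benchmarked prompt
few_shot_chain = LLMChain(prompt=multi_table_prompt, llm=local_llm)
print(few_shot_chain.run({
    "schema": schema,
    "examples": examples,
    "question": "Query the names of employees in the department headed by employee 239045",
}))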

Many things to ponder, read, and try! I will come back with more and try to answer some of these, along with any other fun stuff I get to work on.

If you are working on something interesting in the field of LLMs or other generative AI models and would like to share and discuss it, please feel free to reach out.

And here is the Git link.
