Transforming Natural Language Text to SQL: Harnessing RAG and LLMs for Precision Querying

Manan Shah
11 min read · Apr 22, 2024


Frame executable SQL queries with the help of RAG and LLMs (Gemini Pro, PaLM 2), using embeddings and prompt engineering to retrieve data from your own custom data source.

In the rapidly evolving landscape of artificial intelligence (AI), the integration of Generative AI and Large Language Models (LLMs) has emerged as a game-changer, offering transformative capabilities across various domains. Recent advancements have sparked discussions around the potential of Generative AI to redefine the essence of Artificial Intelligence.

Generative AI encompasses a category of AI capable of generating original content and concepts across diverse mediums, from text and conversations to visuals and music. These AI systems emulate human-like intelligence in tasks such as image identification, natural language comprehension, and translation. Large Language Models (LLMs), on the other hand, represent a specific form of artificial intelligence designed to understand and produce text, among other functions. These models are trained on vast datasets using machine learning techniques, particularly transformer model architectures.

However, a critical question arises: how can we harness the power of Generative AI and LLMs to tailor solutions to specific data sources or datasets? One potential application lies in leveraging LLMs to frame SQL queries from simple text and retrieve relevant data from your data source. This approach, known as Retrieval Augmented Generation (RAG), holds promise for streamlining processes like Text-to-SQL conversion and data retrieval.

In this article, we delve into the intricacies of using RAG to convert text queries into SQL queries based on user input questions. By framing SQL queries dynamically in response to user queries, we aim to demonstrate the potential of this approach in facilitating seamless interaction with databases and enhancing data retrieval processes.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) stands as a transformative AI framework designed to enhance the efficacy of Large Language Models (LLMs) by integrating external sources of knowledge. By grounding the LLM’s responses in these external repositories of information, RAG aims to enrich the model’s understanding and augment its capacity to generate accurate and contextually relevant content. The implementation of RAG within an LLM-based question-answering system yields two pivotal advantages: firstly, it ensures access to the most up-to-date and trustworthy facts, thereby bolstering the reliability of the model’s responses. Secondly, by providing transparency regarding the model’s sources, RAG empowers users to validate the accuracy of its claims, fostering a sense of trust and confidence in the AI-generated output.

Whether you’re a researcher, student, or knowledge seeker, RAG offers intuitive and intelligent assistance in extracting valuable insights, conducting analyses, and making informed decisions. With RAG, users can unlock the full potential of their data resources, enabling innovation, productivity, and success across diverse domains.

How does RAG work?

RAG operates through a streamlined process that seamlessly combines information retrieval and generation techniques to provide accurate and contextually relevant responses to user queries.

RAG Architecture

Here’s a breakdown of how RAG works:

Step 1: User Query
The process begins when the user asks a question to the RAG system.

Step 2: Information Retrieval
Upon receiving the user query, the Retriever component of RAG swiftly searches and retrieves relevant information from various sources. This information could be sourced from the internet, internal databases, or other data repositories.

Step 3: Context Augmentation
The retrieved information is then augmented to the original query as additional context. This augmented context enriches the understanding of the query and provides valuable background information for generating a comprehensive response.

Step 4: Generation of Response
With the augmented context in place, the Language Model component of RAG is tasked with generating a response based on the combined input of the user query and the augmented context. Leveraging its vast knowledge base and natural language processing capabilities, the LLM formulates an answer that is both informative and contextually appropriate.
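
To make this flow concrete, here is a minimal sketch of the retrieve-augment-generate loop in Python. The retriever and the LLM are hypothetical placeholders here (search_index and llm.generate are not a specific library API):

def answer_with_rag(question, search_index, llm, top_k=3):
    # Step 2: retrieve the most relevant documents for the question
    documents = search_index(question, top_k=top_k)

    # Step 3: augment the original question with the retrieved context
    context = "\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Step 4: generate a grounded response from the combined input
    return llm.generate(prompt)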

Advantages of using RAG

One of the key advantages of RAG is its ability to access data from external sources through the Retriever component. Unlike traditional models, RAG can retrieve information from sources it has not been explicitly trained on, expanding the scope and versatility of its knowledge base. This capability enables RAG to provide more comprehensive and accurate responses, making it a powerful tool for information retrieval and generation in various domains.

Another advantage of using RAG is avoiding hallucinations. In the realm of LLMs, hallucinations denote situations where the model produces content that appears believable but lacks accuracy or factual basis. Say you ask an LLM a question about data that is specific to your own dataset. A model that has never been trained on that dataset may hallucinate: it will answer from whatever it learned during training rather than from the data you actually have. RAG has the potential to mitigate hallucinations in LLMs by integrating retrieval mechanisms from external knowledge sources.

Now that we’ve grasped the essence of RAG, let’s explore its real-time application.

Mastering the Art of Text-to-SQL Conversion with RAG

In this practical walkthrough of harnessing Generative AI for Text-to-SQL conversion, we transform natural language text into executable SQL queries with precision and efficiency by leveraging Large Language Models (LLMs) such as Gemini Pro and Text Bison (PaLM 2).

The methodology revolves around the embedding approach (representing words as numerical vectors): we find the closest matching entities for the question and then use them to construct our SQL query. Once we have the matched entities, we use prompt engineering to ask the model to create the SQL query. This approach not only facilitates efficient querying and data analysis but also opens up new possibilities for data-driven decision-making across diverse domains, from data analysis to conversational AI.

What are Embeddings and what is the Embedding approach?

Imagine you have a vast collection of product reviews from an online marketplace, containing valuable feedback from customers about various products. Each review is a lengthy piece of text expressing opinions, experiences, and sentiments about specific items.

Now, to make sense of this trove of information, you decide to break down each review into smaller, manageable chunks. You then index these chunks and convert them into numerical representations known as embeddings. These embeddings capture the semantic meaning and context of the text in a high-dimensional vector space.

For instance, the sentence

"The camera quality is excellent, but the battery life could be better"

might be represented as a numerical vector in this embedding space.

the: [ 0.035, -0.012, 0.018, ... ]  
camera: [ -0.024, 0.043, 0.017, ... ]
quality: [ 0.032, -0.021, 0.015, ... ]
is: [ 0.011, -0.027, 0.036, ... ]
excellent,: [ 0.038, -0.009, 0.025, ... ]
but: [ -0.014, 0.037, 0.019, ... ]
battery: [ 0.021, -0.032, 0.024, ... ]
life: [ -0.028, 0.015, 0.022, ... ]
could: [ 0.016, -0.035, 0.031, ... ]
be: [ -0.008, 0.041, 0.013, ... ]
better: [ 0.027, -0.018, 0.029, ... ]

Once you have these embeddings, you store them in a vector database, making them readily accessible for analysis.
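
As a rough sketch of this indexing step, assume a hypothetical embed() function that turns a piece of text into a vector (it could be backed by Word2Vec, a Vertex AI embedding model, or any of the methods listed later). The chunking logic and the in-memory "vector store" below are purely illustrative:

import numpy as np

def chunk_review(review, max_words=50):
    # Naive chunking: split a long review into fixed-size word windows
    words = review.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Build a tiny in-memory "vector store": parallel lists of chunks and their vectors
chunks, vectors = [], []
for review in reviews:                  # reviews: your list of review texts (assumed)
    for chunk in chunk_review(review):
        chunks.append(chunk)
        vectors.append(embed(chunk))    # embed(): hypothetical embedding function
vectors = np.array(vectors)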

Now, let’s say a customer wants to know about the performance of a particular camera model. Instead of manually sifting through hundreds or thousands of reviews, you can leverage the power of embeddings.

When the customer submits their query, such as “camera performance,” you convert this query into its corresponding embedding representation.

Next, you perform a similarity check between the query’s embedding and the embeddings of the chunks in your database. This process calculates the similarity scores between the query and each review chunk.

Finally, you retrieve the top ‘k’ most relevant review chunks based on their similarity scores. These chunks contain insights specifically related to the customer’s query, allowing you to provide targeted and personalized responses without the need for exhaustive manual searching.

In this way, embeddings serve as a powerful tool for efficiently navigating and extracting valuable information from large text datasets, enabling more effective decision-making and knowledge discovery.
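
On the query side, a minimal retrieval sketch (continuing the illustrative in-memory store and the hypothetical embed() function from the previous snippet) computes cosine similarity and keeps the top ‘k’ chunks:

import numpy as np

def top_k_chunks(query, chunks, vectors, k=5):
    # Embed the query with the same (hypothetical) embedding function
    q = np.array(embed(query))

    # Cosine similarity between the query vector and every chunk vector
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)

    # Indices of the k highest-scoring chunks, best match first
    best = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in best]

results = top_k_chunks("camera performance", chunks, vectors, k=5)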

Now, here’s where it gets really interesting. These top chunks aren’t just handed over to you as is. They’re fed into a Large Language Model (LLM), along with your original query. This step is crucial because it allows the LLM to consider not only the query itself but also the context provided by the most relevant document chunks.

Embeddings

To better understand this embedding approach, let's consider a simpler scenario. Imagine you have the task of categorizing words based on their sentiment: whether they convey a positive or negative meaning. For instance, the word “Sad” might be categorized as negative, while “Positive” obviously falls into the positive category.

Now let's introduce the word “Depressed”, represented here by an example vector:

Depressed = [62.1, -59.3, 12, -12.6, 10.3] (just an example vector)

It's not explicitly listed in our sentiment mapping, but using the embedding approach, we can still categorize it. How?

By finding its closest match in our database of word vectors. In this case, “Depressed” closely aligns with “Sad”, leading our model to categorize it as negative.

This ability to categorize words based on their embeddings offers immense flexibility. We can account for synonyms or variations of words that may not even exist in our database explicitly. This means our model isn’t limited by the words it knows; it can understand and categorize concepts beyond its initial training data.
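
A toy sketch of this nearest-neighbor idea, using made-up five-dimensional vectors (real embeddings would come from a trained model) and cosine similarity:

import numpy as np

# Made-up vectors for illustration only; real ones come from a trained embedding model
word_vectors = {
    "Sad":      np.array([60.0, -55.0, 10.0, -11.0, 9.0]),
    "Positive": np.array([-58.0, 61.0, -9.0, 12.0, -8.0]),
}
sentiment_map = {"Sad": "negative", "Positive": "positive"}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def categorize(vector):
    # Find the known word whose vector is closest to the new word's vector
    closest = max(word_vectors, key=lambda word: cosine(vector, word_vectors[word]))
    return closest, sentiment_map[closest]

depressed = np.array([62.1, -59.3, 12.0, -12.6, 10.3])   # the example vector above
print(categorize(depressed))                              # ('Sad', 'negative')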

How can we create embeddings?

Embeddings can be created in several ways.

  1. Word2Vec: Represents words as dense vectors in a continuous vector space, trained using CBOW or skip-gram architectures.
  2. FastText: Extends Word2Vec by considering subword information, allowing it to handle out-of-vocabulary words and capture morphological information.
  3. GloVe: Constructs word embeddings based on word-word co-occurrence statistics in a corpus, using matrix factorization.
  4. Doc2Vec: Extends Word2Vec to generate embeddings for entire documents or sentences, facilitating similarity comparisons and document clustering.
  5. BERT: Pre-trained language model that generates contextual word embeddings by considering sentence context, utilizing a transformer architecture for deep contextualized representations.
  6. Faiss: Library for efficient similarity search and clustering of dense vectors, commonly used with embedding methods for fast nearest neighbor search in high-dimensional spaces.
For example, here is how you could generate Word2Vec embeddings for the sample sentence, using spaCy for tokenization and gensim for training (a model trained on a single sentence is for illustration only):

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

sentence = "The camera quality is excellent, but the battery life could be better"

# Tokenize the sentence and lowercase the tokens
tokens = [token.text.lower() for token in nlp(sentence)]
data = [tokens]

# Train a tiny Word2Vec model on this single "document"
model = Word2Vec(data, min_count=1)

# Look up the learned vector for each token
embeddings = {token: model.wv[token] for token in tokens}
for token, embedding in embeddings.items():
    print(f"{token}: {embedding}")

Frame SQL Query using Embedding Approach + Prompt Engineering

Application RAG Architecture

The above diagram is the architecture for our solution. Let’s break it down and understand it step by step with an example.

Let’s assume we have a table named Company_Data with three columns, as shown below.

CREATE TABLE Company_Data (
    Company_Name STRING,
    Sentiment STRING,
    Category STRING
);

Step 1: User Query

The user asks a question

question="What are the wins for company XYZ in 2023"

Referring to the data in the table, the closest possible match would be row number 4, where the Company_Name is ‘XYZ’, the Category is ‘Success’, and the Sentiment is ‘Positive’, because of the word “wins” present in the question.

Step 2: Embeddings and LLM model with prompt

The next step is to pass our question to the embedding functions to retrieve the closest matching values that can be used to frame our SQL query.

Since we deal with three different columns, we create one embedding function per column.

import openai
import pandas as pd
from vertexai.language_models import TextGenerationModel
from vertexai.preview.generative_models import GenerativeModel
from google.cloud import bigquery

def company_embedding_value(question):
    [Code here]

def category_embedding_value(question):
    [Code here]

def sentiment_embedding_value(question):
    [Code here]

prompt = """
"""

company_name = company_embedding_value(question)
sentiment = sentiment_embedding_value(question)
category = category_embedding_value(question)

For example, company_embedding_value can run a vector search over the stored embeddings and return the closest company name:

def company_embedding_value(question):
    query_str = """
    SELECT query.query, base.Company_Name
    FROM VECTOR_SEARCH(
        [Your main code here along with ML/Embedding model]
        top_k => 10, options => '{"fraction_lists_to_search": 0.01}'
    )
    """

    job_config = ...  # Define your job config based on where your embeddings are stored.

    query_job = ...   # Execute the query to get the closest matches.

    # Take the company name from the closest (first) match
    first_closest = [row.Company_Name for row in query_job.result()][0]

    return first_closest
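
If your embeddings are not stored in BigQuery, the same idea works with a plain in-memory lookup. Below is a hypothetical, self-contained variant of company_embedding_value that compares the question against precomputed embeddings of the known company names; the embed() function and the company list are assumptions made for illustration:

import numpy as np

known_companies = ["XYZ", "ABC", "PQR"]   # assumed list of distinct Company_Name values
company_vectors = np.array([embed(name) for name in known_companies])   # embed(): hypothetical

def company_embedding_value(question):
    # Embed the question and return the company name with the highest cosine similarity
    q = np.array(embed(question))
    scores = company_vectors @ q / (
        np.linalg.norm(company_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return known_companies[int(np.argmax(scores))]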

From this example, these are the values the embedding functions will return:

company_name = 'XYZ'
sentiment = 'Positive'
category = 'Success'

Step 3: Pass the embedding values to the final LLM model with the SQL prompt

Now that we have our values, we can pass them directly to our final function which will help form our SQL query.

def form_sql(question, company_name, sentiment, category):

    args = """
    [INPUTS]: "[Company]: {}, [Sentiment]: {}, [Category]: {}"
    [OUTPUT]:
    """

    schema = """
    CREATE TABLE IF NOT EXISTS Company_Data (
        Company_Name STRING,
        Sentiment STRING,
        Category STRING
    );
    """

    prompt = f"""
    You are an expert SQL developer querying about company data.
    You have to write SQL code in a [Database-name] database based on a user's question.
    Use this basic template for every query: SELECT * FROM Company_Data WHERE ... .
    This is the schema you are working with: [SCHEMA]: {schema}.
    Only add a date filter if a date is referenced in the question.
    The question asked is: {question}. To answer this question, we need to add these columns to the WHERE condition:
    """

    if company_name:
        prompt += f"We will add Company_Name to our WHERE condition. From the question, the company name is: {company_name}\n"

    if sentiment:
        prompt += f"We will add Sentiment to our WHERE condition. From the question, the sentiment is: {sentiment}\n"

    if category:
        prompt += f"We will add Category to our WHERE condition. From the question, the category is: {category}\n"

    prompt += "Please do not use your own logic and do not add any other column to the WHERE condition apart from these."

    # Append the structured INPUT/OUTPUT template so the model completes [OUTPUT] with the SQL query
    prompt += args.format(company_name, sentiment, category)

    result = [Your llm model].predict(prompt)

    return result.text
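
Putting the pieces together, a minimal driver might look like the sketch below (the expected output shape is an assumption based on the prompt above, not a guaranteed model response):

question = "What are the wins for company XYZ in 2023"

# Resolve the entities in the question via the embedding functions
company_name = company_embedding_value(question)   # -> 'XYZ'
sentiment = sentiment_embedding_value(question)    # -> 'Positive'
category = category_embedding_value(question)      # -> 'Success'

# Ask the LLM to frame the final SQL query
sql_query = form_sql(question, company_name, sentiment, category)
print(sql_query)
# Expected shape:
# SELECT * FROM Company_Data
# WHERE Company_Name = 'XYZ' AND Sentiment = 'Positive' AND Category = 'Success'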

This is what the final result would look like:

Output Image

This is just a simple example of how you can frame SQL queries for precision querying using Generative AI. The same application can be scaled to much larger datasets, with questions phrased in many different ways: ingest your data into a data source, create and store embeddings, and then retrieve those embeddings and provide them to your LLM to fetch the data.

Future Applications

Now that we have our SQL query, there is so much more we can do with it:

  1. Text Summarization (see the sketch after this list)
  2. Data Analysis
  3. Create Dashboards and Visualizations
  4. Make Predictions
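
As a hedged sketch of the first item, you could run the generated SQL against BigQuery and ask the LLM to summarize the rows; the model name and prompt wording here are assumptions for illustration:

from google.cloud import bigquery
from vertexai.language_models import TextGenerationModel

def summarize_query_results(sql_query):
    # Execute the generated SQL and collect the matching rows
    client = bigquery.Client()
    rows = [dict(row) for row in client.query(sql_query).result()]

    # Ask the LLM for a plain-language summary of the retrieved data
    model = TextGenerationModel.from_pretrained("text-bison")   # assumed model name
    prompt = f"Summarize the following company records in two or three sentences:\n{rows}"
    return model.predict(prompt).text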

Conclusion

In conclusion, the convergence of Generative AI and Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) presents an unprecedented opportunity to revolutionize the realm of data querying and retrieval. By seamlessly integrating state-of-the-art techniques like Text-to-SQL conversion with advanced AI architectures, we can unlock the full potential of data sources while ensuring precision, efficiency, and reliability in information extraction.

The power of embeddings coupled with prompt engineering enables us to bridge the gap between natural language queries and SQL commands, facilitating dynamic interaction with databases and enhancing decision-making across diverse domains. With RAG, users can harness the collective intelligence of external knowledge repositories, ensuring access to up-to-date and contextually relevant information.

Moreover, RAG not only enhances the accuracy and comprehensiveness of AI-generated responses but also fosters trust and transparency by enabling users to validate the sources of information. Whether it’s conducting complex analyses, extracting valuable insights, or making data-driven decisions, RAG empowers users with intuitive and intelligent assistance, paving the way for innovation and success in the ever-evolving landscape of Artificial Intelligence.


Manan Shah

Software Engineer | Data Science | MCS Graduate from the University of California Irvine