How to connect LLM to SQL database with LlamaIndex

How to Tutorial for using LlamaIndex for text-to-SQL

Dishen Wang
Dataherald
5 min readJun 30, 2023

--

Photo by Liudmila Shuvalova on Unsplash

Introduction

Hello again! In our last two tutorials we explored using SQLChain and SQLAgent offered by LangChain to connect a Large Language Model (LLM) to a sql database. Although both tools offered powerful capabilities in converting natural languages into SQL queries related to a database, they presented shortcomings such as difficulties in interpreting complex questions, lack of information in understanding of the database, and misunderstanding the database schemas.

In this article we will try out a new data framework LlamaIndex that aims to outshine both of LangChain’s solutions. We will again use Dataherald’s real_estate database for our tutorial and will compare the performance of LlamaIndex with the previous two solutions.

1. Getting Started

In order to utilize LlamaIndex, we will be writing in Python. In this tutorial, we’ll be working with a PostgreSQL database, so we have already installed postgreSQL on our machine. Additionally, for our LLM, we will be utilizing OpenAI’s gpt-3.5-turbo model. If you happen to be using OpenAI's LLM too, make sure to have your API key ready. Now, let's proceed with the installation of the necessary packages:

pip install openai 
pip install langchain
pip install sqlalchemy
pip install llama-index
pip install psycopg2

Notice how we will also be using LangChain in our tutorial as well. LangChain and LlamaIndex are not mutually exclusive and we can use both tools to build text-to-SQL programs if desired.

Now let us create a new new file called main.py and import the following:

import os
from sqlalchemy import create_engine, MetaData
from llama_index import LLMPredictor, ServiceContext, SQLDatabase, VectorStoreIndex
from llama_index.indices.struct_store import SQLTableRetrieverQueryEngine
from llama_index.objects import SQLTableNodeMapping, ObjectIndex, SQLTableSchema
from langchain import OpenAI

Additionally, we need to define our OpenAI API key as an enviornment variable for LlamaIndex. Use the os module to defined the environment variable at runtime:

os.environ['OPENAI_API_KEY'] = "your_api_key"

Now let us set up our database.

2. Connect the database

For postgres databases, use the following format string for the database URI. Be sure to also declare all the necessary variables:

pg_uri = f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{mydatabase}"

Let us create a new SQLAlchemy engine to communicate with our database:

engine = create_engine(pg_uri)

Now we need to create an ObjectIndex object that allows users to use our Index data structures over arbitrary objects. An Index is a data structure that allows us to quickly retrieve relevant context for a user query. The arbitrary objects in our case is the table schemas in the database. The ObjectIndex will handle serialization to/from the object, and use an underying Index (e.g. VectorStoreIndex, ListIndex, KeywordTableIndex) as the storage mechanism.

# load all table definitions
metadata_obj = MetaData()
metadata_obj.reflect(engine)

sql_database = SQLDatabase(engine)

table_node_mapping = SQLTableNodeMapping(sql_database)

table_schema_objs = []
for table_name in metadata_obj.tables.keys():
table_schema_objs.append(SQLTableSchema(table_name=table_name))

# We dump the table schema information into a vector index. The vector index is stored within the context builder for future use.
obj_index = ObjectIndex.from_objects(
table_schema_objs,
table_node_mapping,
VectorStoreIndex,
)

3. Setup LLM

By default, LlamaIndex utilizes OpenAI’s text-davinci-003 model. To change the underlying model we first define our own LLM and create a ServiceContext object that will overwrite the LLM in LlamaIndex later.

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

4. Create the query engine

Now let us create our query engine. A query engine is a generic interface that allows you to ask questions over your data. We need to connect both our database and LLM to the engine:

# We construct a SQLTableRetrieverQueryEngine. 
# Note that we pass in the ObjectRetriever so that we can dynamically retrieve the table during query-time.
# ObjectRetriever: A retriever that retrieves a set of query engine tools.
query_engine = SQLTableRetrieverQueryEngine(
sql_database,
obj_index.as_retriever(similarity_top_k=1),
service_context=service_context,
)

5. Ask a query

Finally, let us run a query against the query engine. We can also print out the metadata from the response which includes the sql query and its result:

response = query_engine.query("What are the top 3 cities with the highest average rent in Jan 2023?")

print(response)
print(response.metadata['sql_query'])
print(response.metadata['result'])

Here is the output:

According to the data from January 2023, the top 3 cities with the highest average rent are Manhattan Beach with an average rent of $11,142.74, Indian Shores with an average rent of $8,482.35, and Paradise Valley with an average rent of $7,902.95.
SELECT location_name, AVG(metric_value) AS avg_rent
FROM renthub_average_rent
WHERE period_start = '2023-01-01' AND geo_type = 'city'
GROUP BY location_name
ORDER BY avg_rent DESC
LIMIT 3;
[('Manhattan Beach', 11142.744883040936), ('Indian Shores', 8482.35294117647), ('Paradise Valley', 7902.945833333333)]

Discussion

When using LlamaIndex, one noticeable difference from our previous LangChain solutions is that LlamaIndex uses an Index object that stores the relevant table schema information. With this object, LlamaIndex can quicky retrieve relevant context for a user query from the table schema and the resulting sql query should be more accurate than the other solutions.

In our tests, we ran multiple queries against the LlamaIndex solution and compared the results with the LangChain solutions as well. While the results show LlamaIndex was able to answer some questions with high accuracy, it seemed less flexible compared to SQLAgent as it only attempts to only query a single table and gives up if it doesn’t return anything. For example, if asked the same question from our How-to SQLAgent tutorial “Are home prices increasing in San Diego from Nov 2022 till Dec 2022?”, the result doesn’t provide a satisfactory answer:

There are no results for the query as the SQL response is empty. Therefore, it is not possible to determine if home prices are increasing in San Diego from November 2022 to December 2022 based on the available data.
SELECT location_name, metric_value
FROM redfin_homes_sold
WHERE location_name = 'San Diego' AND period_start = '2022-11-01' AND period_end = '2022-12-01' AND property_type = 'All Homes'
ORDER BY metric_value DESC
[]

In the end, a straight forward LlamaIndex solution shows similar performance to SQLChain with higher accuracy in particular cases. Perhaps a potential next step to improve the solution is to combine SQLAgent with LlamaIndex and see if it can get the best out of both worlds.

Conclusion

LangChain SQLChain, LangChain SQLAgent, and LlamaIndex present an array of exceptional tools and functionalities that revolutionize the way we interact with SQL databases. While it is already feasible to utilize natural language for querying SQL databases, there remain limitations with each out-of-the-box solution. Although the solutions explored in this series can handle simple questions like describing a table or retrieving the count or average of a value, they struggle when it comes to processing complex real-world queries. These solutions, at best, have shown inconsistencies with their results, leading to LLM hallucinations and the need for constant evaluations in order to process production-level tasks. Without addressing these issues, SQLChain, SQLAgent, and LlamaIndex still have a long way to go. Nevertheless, with the continuous advancements in text-to-SQL, it is certain that better solutions will emerge in the future, further bridging the gap between LLM and structured databases.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on Github.
  • Join our Discord server to learn more about the project.

--

--