How to connect LLM to SQL database with LangChain SQLChain

A how-to tutorial for using LangChain SQLChain

Dishen Wang
Dataherald
6 min read · Jun 15, 2023


Photo by Chris Ried on Unsplash

Introduction

In recent years, Large Language Models (LLMs) have become exponentially popular due to their remarkable capabilities in generating coherent and contextually relevant text across a wide range of domains. Although LLMs are used for answering questions, assisting in research, and helping engineers in software development, one of the major weaknesses LLMs have shown is their tendency to generate incorrect or nonsensical text, also known as “hallucination.” For example, if you ask OpenAI’s ChatGPT: “When did France gift Lithuania Vilnius TV Tower?”, ChatGPT might respond: “France gifted Lithuania Vilnius TV Tower in 1980”, which is factually untrue since France had nothing to do with the construction of the Vilnius TV Tower.

One reason why LLMs can return such confident falsehoods is that they conflate different sources of information from the internet to produce a response that is inaccurate or misleading. Another reason is that the source of information itself isn’t accurate and the LLM uses it without validation. To help reduce LLM hallucination for a specific domain, we can connect an LLM to a SQL database that holds accurate, structured information for the model to query. This focuses the LLM on a single source for its information extraction, allowing it to return the most accurate information the database can provide.

This article will demonstrate how to use an LLM with a SQL database by connecting OpenAI’s GPT-3.5 to a PostgreSQL database. We will be using LangChain as our framework and will be writing in Python.

1. Getting started

Let us install the required packages first. Make sure you have already installed PostgreSQL on your machine and have an OpenAI account as well. Create a new Python virtual environment if needed:

pip install langchain 
pip install openai
pip install psycopg2

Create a file called main.py and import the following:

from langchain import OpenAI, SQLDatabase
from langchain.chains import SQLDatabaseSequentialChain

SQLDatabaseSequentialChain is a sequential chain for querying a SQL database. According to the LangChain documentation, the chain works as follows:

1. Based on the query, determine which tables to use.
2. Based on those tables, call the normal SQL database chain.

This is useful in our case since the number of tables in our database is large.

For smaller databases, you can just use SQLDatabaseChain from LangChain.
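For reference, here is a minimal sketch of that simpler chain. It assumes the same llm and db objects we build in sections 2 and 3 below; since SQLDatabaseChain sends the full schema to the model on every call, it fits databases with only a handful of tables:

from langchain.chains import SQLDatabaseChain

# Sketch: a single-step chain that writes and runs the SQL in one pass.
# `llm` and `db` are constructed exactly as in sections 2 and 3 below.
simple_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
simple_chain.run("How many rows does the smallest table have?")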

2. Connect the database

Before we can connect the database to our LLM, let us first get a database to connect to. Since LangChain uses SQLAlchemy to connect to SQL databases, we can use any SQL dialect supported by SQLAlchemy, such as MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, Databricks, or SQLite. If you would like to know more about the requirements for connecting to databases, please refer to the SQLAlchemy documentation. In my example I will be using Dataherald’s postgres real_estate database. This database contains 16 tables covering rent, sales, inventory, and various other real estate information for locations across the United States over the last couple of years. We will connect our LLM to this database in an attempt to answer real estate questions about the United States.

The Postgres connection string for psycopg2 looks like the following:

"postgresql+psycopg2://username:password@host:port/mydatabase"

For simplicity, let us define the configuration variables in the main.py file. You can also set them up as environment variables in a .env file for better key and password management (see the sketch after this snippet).

username = "database username" 
password = "database password"
host = "local host or remote host address"
port = "host port"
mydatabase = "database name"
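If you prefer the .env route mentioned above, a minimal sketch using the python-dotenv package might look like this (the variable names here are hypothetical, not part of the original tutorial):

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load DB_USERNAME, DB_PASSWORD, etc. from a local .env file so that
# credentials stay out of source control. The key names are illustrative.
load_dotenv()
username = os.environ["DB_USERNAME"]
password = os.environ["DB_PASSWORD"]
host = os.environ["DB_HOST"]
port = os.environ["DB_PORT"]
mydatabase = os.environ["DB_NAME"]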

Now let us set up the database connection:

pg_uri = f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{mydatabase}"
db = SQLDatabase.from_uri(pg_uri)
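At this point it is worth sanity-checking the connection. SQLDatabase can report the tables it sees through SQLAlchemy, so a quick optional check could be:

# Optional sanity check: if the URI or credentials are wrong, from_uri
# raises; otherwise this prints the tables LangChain can query.
print(db.get_usable_table_names())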

3. Set up the LLM

Since we will be using GPT-3.5, we will use an OpenAI API key:

OPENAI_API_KEY = "your OpenAI key"

Using OpenAI from langchain, let us set up our LLM:

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-3.5-turbo')
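Note that gpt-3.5-turbo is a chat model. The completion-style OpenAI wrapper above worked with it at the time of writing, but LangChain also offers a dedicated chat wrapper; an equivalent sketch, assuming the same API key, would be:

from langchain.chat_models import ChatOpenAI

# Alternative sketch: the chat-model wrapper for gpt-3.5-turbo.
# Either wrapper can be passed into the SQL chains below.
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-3.5-turbo')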

(Optional) In addition to the question itself, let us also provide instructions on how we want the model to return the response:

PROMPT = """ 
Given an input question, first create a syntactically correct postgresql query to run,
then look at the results of the query and return the answer.
The question: {question}
"""

And finally, we can connect our LLM and database:

db_chain = SQLDatabaseSequentialChain.from_llm(llm, db, verbose=True, top_k=3)

4. Run a query

Now that we have our LLM and database connected, let us give the LLM a prompt to respond to:

question = "what is the average rent price in chicago in nov 2022 according to redfin?" 
# use db_chain.run(question) instead if you don't have a prompt
db_chain.run(PROMPT.format(question=question))

Finally we can run our program and observe the result:

> Entering new  chain... 
Table names to use:
['zillow_rent', 'redfin_median_sale_price', 'redfin_new_listings', 'renthub_median_rent']
> Entering new chain...

Given an input question, first create a syntactically correct postgresql query to run, then look at the results of the query and return the answer.

The question:
what is the average rent price in chicago in nov 2022 according to redfin?

SQLQuery:SELECT AVG(metric_value) AS average_rent_price
FROM renthub_median_rent
WHERE location_name = 'Chicago' AND period_start >= '2022-11-01' AND period_end <= '2022-11-30' AND property_type = 'Apartment Building';
SQLResult: [(2259.0,)]
Answer:The average rent price in Chicago in November 2022 according to Redfin is $2,259.
> Finished chain.

Conclusion

We have successfully connected an LLM to our database and were able to get answers from the LLM based on the information the database provides. Pretty amazing, right? Well, there are still a couple of shortcomings we should discuss when connecting an LLM to a database:

When we ran our query through the database chain, our sample question was full of information and basically checked off every box the LLM needed to construct a SQL query. In reality, however, humans often don’t provide the necessary details in their questions. As a result, LLMs will struggle to extract all the information needed to return a result from the database, especially if the database is large.

Similarly, if we try to ask the LLM a more complex question such as “How has the market trend in Chicago changed in the last three years?”, the current LLMs on the market will not be able to construct the complex SQL query needed to answer it.

Although we previously discussed connecting an LLM to a database as a way of reducing hallucinations, we still cannot avoid the fact that an LLM can produce a valid but incorrect SQL query. For example, when I asked the LLM: “What is the number of houses sold in March 2022 in Boston?”, it returned “The number of houses sold in March 2022 in Boston is 9”, which is incorrect compared to what the database actually tells us. This particular instance occurred because the LLM was running the SQL query below:

SELECT COUNT(*) 
FROM redfin_homes_sold
WHERE location_name = 'Boston'
AND period_start >= '2022-03-01'
AND period_start <= '2022-03-31';

Although the query makes sense, the problem is that the table “redfin_homes_sold” doesn’t record every individual home sale but rather monthly home-sale totals for each location and property type. The LLM was not given the information it needed to use the table correctly, which resulted in it running the incorrect SQL query.
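One possible mitigation, a sketch rather than something this tutorial’s setup included, is to pass a plain-language description of the table through the custom_table_info argument when creating the SQLDatabase, so the model knows the rows are monthly aggregates (the description text below is illustrative, not the table’s real DDL):

# Sketch: attach a note explaining how redfin_homes_sold is organized.
table_notes = {
    "redfin_homes_sold": (
        "Each row is a MONTHLY total of homes sold for one location and "
        "one property type, not an individual sale."
    )
}
db = SQLDatabase.from_uri(pg_uri, custom_table_info=table_notes)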

Lastly, the response time of the LLM should be considered as a factor as well. If you use a larger model such as GPT-4 for database queries, the response time will increase drastically as a trade-off for greater accuracy.

This blog is part of a series on different methods of connecting an LLM to a database. We will be discussing other methods such as LangChain SQLAgent and LlamaIndex in the coming weeks. If you’re interested in learning more about connecting LLMs to databases, please consider subscribing for our next installments.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on Github.
  • Join our Discord server to learn more about the project.
