High accuracy text-to-SQL with Langchain

Mo Pourreza
Dataherald
May 13, 2024

Background

Large Language Models (LLMs) are powerful, yet their fixed knowledge cutoff and inability to interact directly with the real world can limit their practical applications. Deploying LLMs as agents provides a solution to these constraints. LLM agents extend their capabilities by utilizing APIs, executing Python functions, and leveraging other models to access real-time information. The enhanced functionality allows them to perform a variety of tasks, from software engineering to writing novels.

A user-friendly library for developing and deploying such agents is Langchain, which simplifies the integration of various tools with your agent. A notable application of LLM agents is data analytics: agents can generate and execute SQL queries and even create visual representations of the results. In this blog post, we demonstrate how to connect an agent to a database using Dataherald’s text-to-SQL tool, enabling the agent to derive insights from the data effectively.

Dataherald

Before we begin developing our own agent, we first set up an account with Dataherald and obtain an API key. Dataherald offers a $50 credit, allowing you to explore and experiment with the platform for free. Once you create an account, you can easily connect your database through the user interface with just a few clicks.

In this post, we utilize a sample database containing U.S. real estate information, which includes 8 tables and millions of rows. Once we’ve added our database, we need to select the tables that our agents will use. For this demonstration, we will enable access to all tables.

Finally, we just have to obtain our own API key to use their text-to-SQL tool in our agent:

Data Analytics Agent

Now that we have everything needed to develop our tools and integrate them with the agent, the first step is to install the Python packages necessary for this tutorial:

pip install --upgrade --quiet dataherald langchain langchain-core langchain-community langchain-openai

Then we will import all of the required functions and classes:

from google.colab import userdata
from langchain_openai import ChatOpenAI
from langchain_community.utilities.dataherald import DataheraldAPIWrapper
from langchain_community.tools.dataherald.tool import DataheraldTextToSQL
from langchain_core.prompts.prompt import PromptTemplate
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import StructuredTool
import matplotlib.pyplot as plt
import numpy as np
import uuid
import ast
import json

Now we set our OpenAI and Dataherald API keys, together with the auxiliary functions that we need for this tutorial:

# Setting the API keys in Google Colab
openai_api_key = userdata.get('OPENAI_API_KEY')
dataherald_api_key = userdata.get('DATAHERALD_API_KEY')

api_wrapper = DataheraldAPIWrapper(
    dataherald_api_key=dataherald_api_key,
    db_connection_id="661fd247412d933d48439ebc",
)

def clean_string(s):
    # Remove all spaces and newlines, then convert to lowercase
    return s.replace(" ", "").replace("\n", "").lower()
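As a quick sanity check, `clean_string` treats two queries as identical when they differ only in whitespace, line breaks, and casing. The query text below is a hypothetical example, not one of the queries from this tutorial:

```python
def clean_string(s):
    # Remove all spaces and newlines, then convert to lowercase
    return s.replace(" ", "").replace("\n", "").lower()

# Two renderings of the same query, differing only in layout and case
a = "SELECT city, AVG(rent)\nFROM listings\nGROUP BY city"
b = "select city,   avg(rent) from listings group by city"

print(clean_string(a) == clean_string(b))  # → True
```

This normalization is what lets us later match the SQL string the agent passes in against the queries stored by the Dataherald API, even if formatting differs.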

Now, let’s define the tools essential for this tutorial. We’ll utilize three key tools:

  1. Text-to-SQL Tool: We’ll employ Dataherald’s Text-to-SQL agent for this purpose.
  2. SQL Query Execution Tool: To execute the SQL queries, we’ll rely on the Dataherald API wrapper.
  3. Plotting Tool: Lastly, for creating plots from a list of values, we’ll utilize a suitable plotting tool.

# The text-to-SQL engine
text_to_sql_tool = DataheraldTextToSQL(api_wrapper=api_wrapper)

# Execute SQL query tool
def execute_sql_query(sql_query: str) -> str:
    # Find the matching generated query by comparing normalized SQL text
    generated_queries = api_wrapper.dataherald_client.sql_generations.list(
        page=0,
        page_size=20,
        order='created_at',
        ascend=False,
    )
    query_id = ""
    for query in generated_queries:
        if clean_string(query.sql) == clean_string(sql_query):
            query_id = query.id
            break
    if not query_id:
        raise Exception("Query not found")
    query_results = api_wrapper.dataherald_client.sql_generations.execute(id=query_id)
    return str(query_results)

execute_query_tool = StructuredTool.from_function(
    func=execute_sql_query,
    name="Execute Query",
    description="Useful for executing a SQL query over the database; the input is a SQL query generated by the text-to-sql tool",
)

# A function to plot a dictionary of values
def plot_and_save_array(dict_of_values):
    dict_of_values = json.loads(dict_of_values.strip().replace("'", '"'))
    items = list(dict_of_values.keys())
    values = list(dict_of_values.values())
    plt.figure(figsize=(5, 3))
    plt.plot(items, values, marker='o')
    plt.title("Array Plot")
    plt.xlabel("Items")
    plt.ylabel("Values")
    plt.xticks(rotation=45)
    plt.grid(True)
    identifier = uuid.uuid4().hex[:6] + ".png"
    plt.savefig(identifier)
    plt.show()
    return "success"

plotting_tool = StructuredTool.from_function(
    func=plot_and_save_array,
    name="Plotting Results",
    description="A tool which receives a valid JSON object whose keys are the x-axis values, each mapped to a single numeric value",
)
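Note that the single-to-double-quote replacement in `plot_and_save_array` will break if any key or value contains an apostrophe. Since `ast` is already among our imports, a more robust way to parse the agent's input (which may arrive as either JSON or a Python-style dict string) is a sketch like the following; `parse_tool_input` is a hypothetical helper, not part of the original tutorial:

```python
import ast
import json

def parse_tool_input(raw: str) -> dict:
    # Hypothetical helper: parse a dict passed by the agent as either
    # JSON or a Python literal, without fragile quote swapping.
    raw = raw.strip()
    try:
        # Handles proper JSON, e.g. '{"Jun 2022": 2800}'
        return json.loads(raw)
    except json.JSONDecodeError:
        # Falls back to Python-style dicts, e.g. "{'Jun 2022': 2800}"
        return ast.literal_eval(raw)

# A Python-style dict (single quotes), as LLM agents often emit
print(parse_tool_input("{'Jun 2022': 2800, 'Jul 2022': 2850}"))

# Valid JSON containing an apostrophe, which a quote-swap approach would mangle
print(parse_tool_input('{"O\'Fallon": 2100}'))
```

You could swap `parse_tool_input` in for the `json.loads(...replace(...))` line in `plot_and_save_array` to make the tool less sensitive to how the model formats its input.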

We need to define the agent prompt to specify the task and explain how to utilize the tools for the agent:

template = """Assistant is a large language model trained by OpenAI.

Assistant is designed to be able to assist with question and analysis on the database values.

Assistant plan:
1) Use the text-to-sql tool to generate a SQL query for the user question
2) Execute the generated SQL query over the database using the Execute Query tool
3) Use the Plotting Results tool to plot the results returned by the Execute Query tool

TOOLS:
------

Assistant has access to the following tools:

{tools}

To use a tool, please use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
Final Answer: [your response here]
```

Begin!

New input: {input}
{agent_scratchpad}
"""

prompt = PromptTemplate.from_template(template)
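To see why the prompt insists on this exact `Action:`/`Action Input:` layout, here is a minimal sketch of how a ReAct loop extracts the tool call from one model turn. This uses the stdlib `re` module and is independent of Langchain's actual output parser; the tool name `dataherald` in the sample turn is illustrative:

```python
import re

def parse_react_step(text: str):
    # Return (action, action_input) for a tool call,
    # or (None, final_answer) when the model signals it is done.
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return None, final.group(1).strip()
    action = re.search(r"Action:\s*(.*?)\n", text)
    action_input = re.search(r"Action Input:\s*(.*)", text, re.DOTALL)
    return action.group(1).strip(), action_input.group(1).strip()

turn = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: dataherald\n"
    "Action Input: What was the rent price in LA for summer 2022?"
)
print(parse_react_step(turn))
# → ('dataherald', 'What was the rent price in LA for summer 2022?')
```

If the model deviates from the format (for example, omitting the `Action:` line), parsing fails, which is why the template spells the format out so explicitly.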

Now that we have all the necessary tools and the agent prompt, we can proceed to define our agent:

llm = ChatOpenAI(api_key=openai_api_key, model="gpt-4-turbo-preview", temperature=0)
tools = [text_to_sql_tool, execute_query_tool, plotting_tool]
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

Finally, let’s test the agent:

agent_executor.invoke(
    {"input": "What was the rent price in LA for the summer and fall 2022 for each month? Plot the results"}
)

The agent successfully used the Dataherald text-to-SQL tool to generate the SQL query, executed it over the database, and plotted the returned results. This agent is a basic implementation of an analytics tool; more sophisticated pipelines can be built by integrating additional tools and techniques.

Conclusion

This article explored using the Dataherald text-to-SQL tool to create analytics agents. We built a straightforward agent from three tools, but the same approach can be adapted to a variety of use cases involving relational databases. You can connect your own database to the Dataherald engine and build complex agent-based pipelines using the Dataherald Langchain tool together with other powerful Langchain tools.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on Github.
  • Join our Discord server to learn more about the project.
