Chat with your Database — Using Langchain + OpenAI LLMs ChatGPT 4 as an Agent to Query MS SQL Server with Natural Language

Keven
24 min readOct 12, 2023

A step-by-step tutorial to set up an NLP SQL agent for your MS SQL instance with sample QA analysis

Introduction

Natural language interactions with enterprise-level databases open up a realm of possibilities for simplifying data retrieval, particularly through utilizing advanced technologies like LangChain and OpenAI’s Large Language Models (LLMs) such as ChatGPT. In the context of interfacing with MS SQL Server, envision deploying a SQL Agent empowered by LLMs that interprets and translates user queries expressed in everyday language into executable SQL queries. This article embarks on a journey through the fusion of LangChain and LLMs, elucidating a pathway to implement a framework that permits users to engage in a dialogue with databases, posing queries in a natural, conversational manner. For a hands-on, visual explanation, we have also curated a YouTube video tutorial, seamlessly aligning with our step-by-step guide contained herein, aiding you in effortlessly integrating these technologies for the natural language querying of your MS SQL Server.

Background

SQL Server

SQL Server is a relational database management system (RDBMS) developed by Microsoft. It’s a scalable, enterprise-grade database platform that provides robust data management and analytics solutions. SQL Server is widely recognized for its capability to securely store, manage, and analyze vast amounts of data, which can be accessed and manipulated using the SQL (Structured Query Language) language. Organizations utilize SQL Server to house varied types of data, enabling developers, and analysts to interact with it through SQL queries to retrieve, update, and manage data, thereby aiding in insightful decision-making.

OpenAI LLM (Large Language Model)

OpenAI’s Large Language Models (LLMs), among which ChatGPT is a notable instance, embody a technological evolution in natural language understanding and generation. LLMs are trained on a diverse range of internet text, enabling them to comprehend and generate human-like text based on the inputs they receive. They can answer questions, create content, and even engage in a coherent and contextually relevant dialogue with users. Importantly, LLMs can be utilized to parse and understand natural language queries, and thus can act as an intermediary between human language and machine-readable instructions, such as SQL queries.

LangChain

LangChain presents itself as a powerful toolkit designed to efficiently integrate and build applications atop OpenAI’s LLMs. It simplifies the development process by providing a streamlined API to interact with models like ChatGPT, further enabling developers to focus on crafting applications without being encumbered by the intricacies of model interactions and management. LangChain facilitates the conversion of natural language inputs into structured data or commands, by leveraging the language understanding capabilities of LLMs.

Merging Technologies for Conversational Database Interactions

Combining the robust data management of SQL Server, the natural language comprehension and generation capabilities of OpenAI’s LLMs, and the application development efficiency provided by LangChain, we pave the way toward seamless and intuitive database interactions. Users can articulate queries in natural language, which is then parsed by the LLM. LangChain, serving as the developmental backbone, efficiently interfaces with the LLM, translating understood user inputs into executable SQL queries through the SQL Server Agent. The process thus bridges the gap between natural language communication and structured database querying, offering a user-friendly and intuitive path to data retrieval and management. The high-level system design illustration can be conceptualized as below:

Conceptual Design of LangChain SQL Agent

In the ensuing sections, we’ll delve deeper into the implementation nuances of utilizing these technologies conjointly, enabling you to facilitate natural language querying of databases, thereby making data interaction more accessible, intuitive, and user-friendly for a wider audience.

Environment Setup (in Visual Studio Code)

Step 1: Installing Python

  • Navigate to the official Python website.
  • Click on “Downloads” and select the latest version of Python suitable for your OS (Windows, macOS, Linux/UNIX).
  • Download and run the installer.
  • Ensure you check the box that says “Add Python to PATH” during installation to make the python and pip commands globally accessible.
  • Once installation is complete, verify it by opening your command line (cmd, terminal, etc.) and typing:
python --version

This should display the version number, confirming a successful installation.

Step 2: Setup a Virtual Environment

First, create a project folder (e.g. sqlserver-agent) and navigate into the project to setup the virtual environment.

For Windows:

  • Install virtualenv using pip :
pip install virtualenv
  • Create a new virtual environment named venv inside your project folder:
python -m venv venv
  • Activate the virtual environment:
.\venv\Scripts\activate

Your terminal should now indicate that you are using the venv environment.

For Linux:

  • Install virtualenv using pip:
pip install virtualenv
  • Create and activate the virtual environment:
python -m venv venv 
source venv/bin/activate

Your terminal should show (venv) before the cursor, indicating the virtual environment is active.

Step 3: Integrating with Visual Studio Code (VSCode)

  • Open VSCode.
  • Navigate to the “Extensions” view by clicking on the Extensions icon in the Activity Bar on the side of the window or pressing Ctrl+Shift+X.
  • Install the Python extension for Visual Studio Code provided by Microsoft.
  • Open your sqlserver-agent project folder in VSCode: Click on "File" > "Open Folder" and select your sqlserver-agent folder.
  • Once the folder is open in VSCode, select the “View” menu and click on “Terminal” to open a terminal inside VSCode.
  • In the terminal, make sure your virtual environment (venv) is active. If not, activate it using the activation command from Step 3.
  • Upon activating the venv, VSCode might automatically detect the virtual environment and select it for your workspace. If not, click on the interpreter version on the bottom left of the window and select your venv.

Step 4: Install the required libraries

Install required packages: langchain, sqlalchemy, pyodbc, cnos-connector, openai

!pip install langchain sqlalchemy pyodbc
!pip install cnos-connector
!pip install openai

Now, you should have the environment ready to build the SQL Server query agent.

Step 5: Configure the database connection in the dev environment

A crucial step is establishing a connection between the database and our development environment. The nuances of this configuration process hinge significantly upon the specific version of SQL Server in operation. To start with, we will guide you through the procedure to ascertain your database version. Subsequently, we will meticulously craft an appropriate connection string, before illustrating the Python code essential for initiating the database connection.

Check the current database server version

Connect to the target SQL Server instance with Management Studio and then right-click on the instance name. Click on the “Property” in the menu after right-click.

check the SQL Server version from Management studio

The database version can be found in the “General” group of the “Server Properties” window.

Check the ODBC drivers installed on the local computer

It is important to make sure the local ODBC driver is compatible with the target SQL Server. We can use the following code to print the ODBC version of the current environment.

# check the local ODBC driver and make sure it match with the traget database instance
import pyodbc
for driver in pyodbc.drivers():
print(driver)

In my case, I have SQL Server, ODBC Driver 17 for SQL Server, and ODBC Driver for SQL Server

Check local ODBC driver version by using pyodbc

Based on the Microsoft document, ODBC Driver 18 for SQL Server should be compatible with SQL Server 2016 to SQL Server 2022.

Generate a connection string

I recommended using connectionstings.com to quickly generate a connection based on the database and driver version collected from previous steps. In my case, I am using SQL Server 2016 and ODBC driver 17. The connection string is generated as:

Driver={ODBC Driver 17 for SQL Server};Server=[your server address];Database=[database name];UID=[Username];PWD=[Password];

Create a login for the application to use for the targeted database

It is more secure and trackable to create a dedicated application login for this application. It can be done by go to the “Security” tab → “Logins” and right-clicking to create a new login.

In the create new login dialog, make sure to choose SQL authentication and map the user to the target database. In my case, I am creating a new login as ‘test’ and mapping it as a ‘db owner’ for the test database.

Test SQL Server Connection with pyodbc

Now, we should have everything ready, and let’s test if we can connect the database to conduct some queries from the development environment. For this tutorial, I created a blank database named ‘test’ and randomly imported an open dataset, Asian countries — GDP growth, from Kaggle. The table contains 6 fields [RegionalMember], [Year], [GDPgrowth], [UnitOfMeasurement], [Subregion], and [CountryCode] which basically both record and predict the GDP growth from the year 2020 to the year 2023 in the Asia area. Since in this demonstration, we focus on the capabilities of conversational queries from ChatGPT model, we did not process any data cleanup. The raw data looks pretty challenging since special characters and mixed data types are common.

Raw data without cleanup

Let’s start building the Langchain SQL agent.

Import libraries

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI

Connect to the database

# connect to test database
db = SQLDatabase.from_uri(
database_uri="mssql+pyodbc://[uid]:[pwd]@[servername]/[db name]?driver=ODBC+Driver+17+for+SQL+Server"
)

Create a SQL agent using ‘gpt-4’ model with ZERO_SHOT_REACT_DESCRIPTION

toolkit = SQLDatabaseToolkit(db=db, llm=OpenAI(temperature=0))

agent_executor = create_sql_agent(
llm=ChatOpenAI(temperature=0, model="gpt-4"),
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

Test agent by asking questions directly

Question#1

agent_executor.run("can you describe the GDP growth in Asian area?")

Process#1

> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: ADOSep2023_A1-GDPgrowth, asianGDP2023, crime, test
Thought:The tables "ADOSep2023_A1-GDPgrowth" and "asianGDP2023" seem to be relevant to the question. I should check their schemas to understand what data they contain.
Action: sql_db_schema
Action Input: "ADOSep2023_A1-GDPgrowth, asianGDP2023"
Observation:
CREATE TABLE [ADOSep2023_A1-GDPgrowth] (
[RegionalMember] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[GDPgrowth] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UnitOfMeasurement] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Subregion] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CountryCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)

/*
3 rows from ADOSep2023_A1-GDPgrowth table:
RegionalMember Year GDPgrowth UnitOfMeasurement Subregion CountryCode
Developing Asia 2020 -0.6 %
Developing Asia 2021 7.2 %
Developing Asia 2022 4.3 %
*/


CREATE TABLE [asianGDP2023] (
[RegionalMember] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[GDPgrowth] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UnitOfMeasurement] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Subregion] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CountryCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)

/*
3 rows from asianGDP2023 table:
RegionalMember Year GDPgrowth UnitOfMeasurement Subregion CountryCode

*/
Thought:The tables "ADOSep2023_A1-GDPgrowth" and "asianGDP2023" both contain the columns "RegionalMember", "Year", "GDPgrowth", "UnitOfMeasurement", "Subregion", and "CountryCode". The "GDPgrowth" column in both tables seems to be the most relevant to the question. I will query these tables to get the GDP growth for the Asian region. I will limit the results to the most recent 10 years.
Action: sql_db_query_checker
Action Input: "SELECT Year, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE RegionalMember = 'Developing Asia' UNION SELECT Year, GDPgrowth FROM asianGDP2023 WHERE RegionalMember = 'Developing Asia' ORDER BY Year DESC LIMIT 10"
Observation:
SELECT Year, GDPgrowth
FROM ADOSep2023_A1-GDPgrowth
WHERE RegionalMember = 'Developing Asia'
UNION ALL
SELECT Year, GDPgrowth
FROM asianGDP2023
WHERE RegionalMember = 'Developing Asia'
ORDER BY Year DESC
LIMIT 10;
Thought:The query syntax is correct. Now I can execute the query to get the GDP growth for the Asian region.
Action: sql_db_query
Action Input: "SELECT Year, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE RegionalMember = 'Developing Asia' UNION SELECT Year, GDPgrowth FROM asianGDP2023 WHERE RegionalMember = 'Developing Asia' ORDER BY Year DESC LIMIT 10"
Observation: Error: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '-'. (102) (SQLExecDirectW); [42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near 'LIMIT'. (102)")
[SQL: SELECT Year, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE RegionalMember = 'Developing Asia' UNION SELECT Year, GDPgrowth FROM asianGDP2023 WHERE RegionalMember = 'Developing Asia' ORDER BY Year DESC LIMIT 10]
(Background on this error at: https://sqlalche.me/e/20/f405)
Thought:The error message indicates that there is an issue with the syntax of the query. The issue seems to be with the table name "ADOSep2023_A1-GDPgrowth" and the use of "LIMIT" which is not supported in MSSQL. I should rewrite the query using the correct syntax for MSSQL.
Action: sql_db_query_checker
Action Input: "SELECT TOP 10 Year, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE RegionalMember = 'Developing Asia' UNION SELECT TOP 10 Year, GDPgrowth FROM asianGDP2023 WHERE RegionalMember = 'Developing Asia' ORDER BY Year DESC"
Observation:
SELECT TOP 10 Year, GDPgrowth
FROM [ADOSep2023_A1-GDPgrowth]
WHERE RegionalMember = 'Developing Asia'
UNION ALL
SELECT TOP 10 Year, GDPgrowth
FROM asianGDP2023
WHERE RegionalMember = 'Developing Asia'
ORDER BY Year DESC
Thought:The query syntax is correct. Now I can execute the query to get the GDP growth for the Asian region.
Action: sql_db_query
Action Input: "SELECT TOP 10 Year, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE RegionalMember = 'Developing Asia' UNION SELECT TOP 10 Year, GDPgrowth FROM asianGDP2023 WHERE RegionalMember = 'Developing Asia' ORDER BY Year DESC"
Observation: [('2024 Forecast (ADO Sep 2023)', '4.8'), ('2024 Forecast (ADO Apr 2023)', '4.8'), ('2023 Forecast (ADO Sep 2023)', '4.7'), ('2023 Forecast (ADO Apr 2023)', '4.8'), ('2022', '4.3'), ('2021', '7.2'), ('2020', '-0.6')]
Thought:I now have the GDP growth data for the Asian region for the most recent years. The data includes both actual and forecasted GDP growth rates.
Final Answer: Here are the GDP growth rates for the Asian region for the most recent years:

- 2024 Forecast (ADO Sep 2023): 4.8%
- 2024 Forecast (ADO Apr 2023): 4.8%
- 2023 Forecast (ADO Sep 2023): 4.7%
- 2023 Forecast (ADO Apr 2023): 4.8%
- 2022: 4.3%
- 2021: 7.2%
- 2020: -0.6%

> Finished chain.

Response#1

‘Here are the GDP growth rates for the Asian region for the most recent years:\n\n- 2024 Forecast (ADO Sep 2023): 4.8%\n- 2024 Forecast (ADO Apr 2023): 4.8%\n- 2023 Forecast (ADO Sep 2023): 4.7%\n- 2023 Forecast (ADO Apr 2023): 4.8%\n- 2022: 4.3%\n- 2021: 7.2%\n- 2020: -0.6%’

From the process#1 output, we can see that the agent is able to conduct several sub-tasks with self-instruct logic.

1. Identifying Relevant Tables

  • Action: sql_db_list_tables
  • Input: None
  • Observation: Identified tables, such as “ADOSep2023_A1-GDPgrowth” and “asianGDP2023”, as potentially relevant to the query about GDP growth.
  • Thought: Noted that the two tables mentioned might contain the needed data and proceeded to check their schemas to comprehend the structure and data they hold.

2. Inspecting Table Schemas

  • Action: sql_db_schema
  • Input: Names of the two identified tables.
  • Observation: Viewed the schemas and a few rows of both tables, confirming they both have similar structures and relevant data, such as “Year” and “GDPgrowth”.
  • Thought: Decided to execute a query on these tables to retrieve GDP growth data for the Asian region, targeting the most recent 10 years.

3. Initial Query Formation and Validation

  • Action: sql_db_query_checker
  • Input: An SQL query aiming to extract the 10 most recent years’ GDP growth data from both tables, specifically for “Developing Asia”.
  • Observation: Reviewed the formed query’s syntax, using a UNION operation and limiting the results with LIMIT 10.
  • Thought: Acknowledged the syntax as logically valid and proceeded to execute the query.

4. Executing the Query

  • Action: sql_db_query
  • Input: The formulated SQL query.
  • Observation: Received an error regarding incorrect syntax near ‘-’ and ‘LIMIT’.
  • Thought: Identified two issues: handling the hyphen in the table name and the unsupported “LIMIT” clause in MSSQL. Decided to modify the query to adhere to MSSQL syntax.

5. Modifying and Revalidating the Query

  • Action: sql_db_query_checker
  • Input: A modified SQL query using square brackets to encapsulate the table name with a hyphen and utilizing TOP 10 instead of LIMIT.
  • Observation: Reviewed the modified query.
  • Thought: Acknowledged the amended syntax as valid and proceeded to reattempt query execution.

6. Re-executing the Modified Query

  • Action: sql_db_query
  • Input: The amended SQL query.
  • Observation: Successfully retrieved GDP growth data, both actual and forecasted, for various years.
  • Thought: With the data obtained, the agent prepared to provide the user with the requested GDP growth information for the Asian region.

7. Presenting the Data

Final Answer: The agent presented GDP growth rates, expressed as percentages, for several recent years and forecasts in a comprehensible format.

In summary, the agent effectively navigated through the process of identifying relevant tables, inspecting their schemas, crafting an SQL query, addressing a syntax error, reformatting and executing the query, and ultimately providing a clear answer based on the extracted data. This showcased adaptive problem-solving, syntactical error handling, and data presentation capabilities.

Question#2

agent_executor.run("which country has the fatest GDP grown in 2022")

Process#2

> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: ADOSep2023_A1-GDPgrowth, crime, test
Thought:The table "ADOSep2023_A1-GDPgrowth" seems to be the most relevant one for this question as it contains GDP growth data. I should check its schema to understand its structure and the fields it contains.
Action: sql_db_schema
Action Input: "ADOSep2023_A1-GDPgrowth"
Observation:
CREATE TABLE [ADOSep2023_A1-GDPgrowth] (
[RegionalMember] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[GDPgrowth] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UnitOfMeasurement] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Subregion] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CountryCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)

/*
3 rows from ADOSep2023_A1-GDPgrowth table:
RegionalMember Year GDPgrowth UnitOfMeasurement Subregion CountryCode
Developing Asia 2020 -0.6 %
Developing Asia 2021 7.2 %
Developing Asia 2022 4.3 %
*/
Thought:The table "ADOSep2023_A1-GDPgrowth" contains the columns "RegionalMember", "Year", "GDPgrowth", "UnitOfMeasurement", "Subregion", and "CountryCode". The "Year" and "GDPgrowth" columns are relevant to the question. I should write a query to find the country with the highest GDP growth in 2022.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' ORDER BY GDPgrowth DESC
Thought:The query syntax is correct. Now I can execute it to get the country with the highest GDP growth in 2022.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' ORDER BY GDPgrowth DESC"
Observation: Error: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '-'. (102) (SQLExecDirectW)")
[SQL: SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' ORDER BY GDPgrowth DESC]
(Background on this error at: https://sqlalche.me/e/20/f405)
Thought:The error message indicates that there is an issue with the table name. The hyphen in the table name "ADOSep2023_A1-GDPgrowth" is causing the error. I should rewrite the query and enclose the table name in square brackets to avoid this error.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' ORDER BY GDPgrowth DESC
Thought:The revised query syntax is correct. Now I can execute it to get the country with the highest GDP growth in 2022.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' ORDER BY GDPgrowth DESC"
Observation: [('Malaysia', '8.7')]
Thought:I now know the final answer
Final Answer: The country with the fastest GDP growth in 2022 is Malaysia with a growth rate of 8.7%.

> Finished chain.

Response#2

The country with the fastest GDP growth in 2022 is Malaysia with a growth rate of 8.7%.

The agent diligently worked through each action and thought process, properly adjusting its strategy upon encountering an error, ultimately successfully retrieving and presenting the desired data.

1. List Tables: First, it lists all available tables in the database to find the relevant one.

  • Action: sql_db_list_tables
  • Thought: Identified “ADOSep2023_A1-GDPgrowth” as potentially relevant due to its naming.

2. Schema Inspection: Then, it inspected the schema of the identified table to understand its structure and the types of data it holds.

  • Action: sql_db_schema
  • Input: “ADOSep2023_A1-GDPgrowth”
  • Thought: Identified relevant columns (“RegionalMember” and “GDPgrowth”) and realized “Year” and “GDPgrowth” columns could answer the query about the highest GDP growth in 2022.

3. Query Checker: The agent composed an SQL query to fetch the country with the highest GDP growth in 2022, first checking the syntax.

  • Action: sql_db_query_checker
  • Input: “SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = ‘2022’ ORDER BY GDPgrowth DESC”
  • Observation: SQL was syntactically correct but an error was later encountered during execution due to a hyphen in the table name.

4. Query Execution: Initially, the agent tried executing the SQL query but encountered a syntax error due to a hyphen in the table name.

  • Action: sql_db_query
  • Input: Initial query without brackets for the table name
  • Observation: Encountered an error due to a hyphen in the table name.
  • Thought: Decided to revise the query to avoid the error by enclosing the table name in square brackets.

5. Query Revision and Re-check: It revised and rechecked the query by enclosing the table name in square brackets.

  • Action: sql_db_query_checker
  • Input: “SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = ‘2022’ ORDER BY GDPgrowth DESC”
  • Thought: Validated that the revised query was syntactically correct.

6. Revised Query Execution: Finally, the agent successfully executed the revised query.

  • Action: sql_db_query
  • Input: Revised query with brackets around the table name
  • Observation: Obtained the desired data — [(‘Malaysia’, ‘8.7’)].

7. Answer Formulation: Presented the answer based on the obtained results.

  • Final Answer: Malaysia experienced the highest GDP growth in the Asian region in 2022, with a rate of 8.7%.

Question#3

agent_executor.run("which country has the fatest GDP grown in 2022 in Asia?")

Process#3

> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: ADOSep2023_A1-GDPgrowth, crime, test
Thought:The table "ADOSep2023_A1-GDPgrowth" seems to be the most relevant one for this question. I should check its schema to understand its structure and the data it contains.
Action: sql_db_schema
Action Input: "ADOSep2023_A1-GDPgrowth"
Observation:
CREATE TABLE [ADOSep2023_A1-GDPgrowth] (
[RegionalMember] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[GDPgrowth] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UnitOfMeasurement] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Subregion] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CountryCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)

/*
3 rows from ADOSep2023_A1-GDPgrowth table:
RegionalMember Year GDPgrowth UnitOfMeasurement Subregion CountryCode
Developing Asia 2020 -0.6 %
Developing Asia 2021 7.2 %
Developing Asia 2022 4.3 %
*/
Thought:The table "ADOSep2023_A1-GDPgrowth" contains the columns "RegionalMember", "Year", "GDPgrowth", "UnitOfMeasurement", "Subregion", and "CountryCode". The "Year" and "GDPgrowth" columns are relevant to the question. The "Subregion" column might contain the information about the region Asia. I should write a query to get the country with the highest GDP growth in 2022 in Asia.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC
Thought:The query syntax is correct. Now I can execute it to get the country with the highest GDP growth in 2022 in Asia.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC"
Observation: Error: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '-'. (102) (SQLExecDirectW)")
[SQL: SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC]
(Background on this error at: https://sqlalche.me/e/20/f405)
Thought:The error message indicates that there is an issue with the table name. The hyphen in the table name "ADOSep2023_A1-GDPgrowth" might be causing the problem. I should rewrite the query with the table name enclosed in square brackets.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC
Thought:The query syntax is correct. Now I can execute it to get the country with the highest GDP growth in 2022 in Asia.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion = 'Asia' ORDER BY GDPgrowth DESC"
Observation:
Thought:The query did not return any results. It's possible that there are no records for the year 2022 in the Asia subregion in the database. Therefore, I cannot provide the answer to the question.
Final Answer: I'm sorry, but I couldn't find the information you're looking for in the database.

> Finished chain.

Response#3

“I’m sorry, but I can’t provide the answer to your question due to lack of data.”

This time, the Langchain agent is not able to get the answer which should be the same answer as the previous question: Malaysia, 8.7%. By looking into the steps, the agent is actually getting very close and failed in the last step. Since the subregion is coded as ‘South Asia’, ‘Southeast Asia’, etc., the where clause ‘Subregion = ‘Asia’’ is the direct cause for no data returned because ‘Asia’ is not a subregion. Of cause, we can fix this by changing the SQLToolkit in Langchain, however, since we are working with LLMs, I want to see if changing the wording (prompt engineering) can fix the problem.

Question#4

agent_executor.run("which country has the fatest GDP grown in 2022 in general Asia area?")

Process#4

> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: ADOSep2023_A1-GDPgrowth, crime, test
Thought:The table "ADOSep2023_A1-GDPgrowth" seems to be the most relevant one for this question as it contains GDP growth data. I should check its schema to understand its structure and the data it contains.
Action: sql_db_schema
Action Input: "ADOSep2023_A1-GDPgrowth"
Observation:
CREATE TABLE [ADOSep2023_A1-GDPgrowth] (
[RegionalMember] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[GDPgrowth] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[UnitOfMeasurement] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Subregion] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CountryCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)

/*
3 rows from ADOSep2023_A1-GDPgrowth table:
RegionalMember Year GDPgrowth UnitOfMeasurement Subregion CountryCode
Developing Asia 2020 -0.6 %
Developing Asia 2021 7.2 %
Developing Asia 2022 4.3 %
*/
Thought:The table "ADOSep2023_A1-GDPgrowth" contains the columns "RegionalMember", "Year", "GDPgrowth", "UnitOfMeasurement", "Subregion", and "CountryCode". The "Year" and "GDPgrowth" columns are relevant to the question. The "Subregion" column might contain information about the general Asia area. I should write a query to find the country with the fastest GDP growth in 2022 in the general Asia area.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth
FROM ADOSep2023_A1-GDPgrowth
WHERE Year = '2022' AND Subregion LIKE '%Asia%'
ORDER BY GDPgrowth DESC
Thought:The query syntax is correct. Now I can execute it to get the country with the fastest GDP growth in 2022 in the general Asia area.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC"
Observation: Error: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '-'. (102) (SQLExecDirectW)")
[SQL: SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC]
(Background on this error at: https://sqlalche.me/e/20/f405)
Thought:The error message indicates that there is an issue with the table name. The hyphen in the table name "ADOSep2023_A1-GDPgrowth" is causing the error. I need to enclose the table name in square brackets to resolve this issue.
Action: sql_db_query_checker
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC"
Observation:
SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC
Thought:The query syntax is correct. Now I can execute it to get the country with the fastest GDP growth in 2022 in the general Asia area.
Action: sql_db_query
Action Input: "SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = '2022' AND Subregion LIKE '%Asia%' ORDER BY GDPgrowth DESC"
Observation: [('Malaysia', '8.7')]
Thought:I now know the final answer
Final Answer: The country in the general Asia area with the fastest GDP growth in 2022 is Malaysia with a GDP growth of 8.7%.

> Finished chain.

Response#4

‘The country in the general Asia area with the fastest GDP growth in 2022 is Malaysia with a GDP growth of 8.7%.’

Bingo!By changing the question and asking about the “general Asia area”, the agent successfully come up with the correct answer. Let’s walk through the revised sequence of actions and thoughts after adjusting the question.

Sequence

1. Table Selection:

  • Action: sql_db_list_tables
  • Thought: Recognized “ADOSep2023_A1-GDPgrowth” as the potential source of GDP data.

2. Schema Exploration:

  • Action: sql_db_schema
  • Input: “ADOSep2023_A1-GDPgrowth”
  • Thought: Identified relevant columns (“Year”, “GDPgrowth”, “Subregion”) and recognized the focus on 2022 data.

3. Query Design and Checking:

  • Action: sql_db_query_checker
  • Input: “SELECT TOP 1 RegionalMember, GDPgrowth FROM ADOSep2023_A1-GDPgrowth WHERE Year = ‘2022’ AND Subregion LIKE ‘%Asia%’ ORDER BY GDPgrowth DESC”
  • Thought: Initially missed the syntactical error related to the hyphen in the table name.

4. First Query Execution:

  • Action: sql_db_query
  • Input: The query without brackets surrounding the table name.
  • Thought: Encountered a syntax error due to the hyphen, necessitating a query revision.

5. Query Revision and Revalidation:

  • Action: sql_db_query_checker
  • Input: “SELECT TOP 1 RegionalMember, GDPgrowth FROM [ADOSep2023_A1-GDPgrowth] WHERE Year = ‘2022’ AND Subregion LIKE ‘%Asia%’ ORDER BY GDPgrowth DESC”
  • Thought: Syntax was validated, proceeding toward execution.

6. Revised Query Execution:

  • Action: sql_db_query
  • Input: The revised and corrected query.
  • Observation: Malaysia with a GDP growth of 8.7% in 2022.
  • Thought: Secured the necessary data to answer the question.

7. Concluding Answer:

  • Final Answer: Malaysia experienced the fastest GDP growth in the general Asia area in 2022, at a rate of 8.7%.

Upon shifting the focus to the ‘‘general Aisa area’’, the agent successfully navigated through database interactions to extract relevant data and derive the answer. A key point of note is the utilization of SQL wildcards (LIKE '%Asia%') to broaden the subregion scope, ensuring the inclusion of all areas within Asia without explicit "Asia" labeling. This was particularly astute given the potential for various Asian subregion categorizations. The agent adeptly managed an SQL syntax error (related to the hyphen), revising and revalidating the query to successfully retrieve the desired data, thus demonstrating a nimble problem-solving capability. Consequently, Malaysia was identified as the country with the highest GDP growth in the broader Asian region in 2022.

Conclusion

In this article, we covered environment setup, database preparation, Langchain SQL agent build, and a simple performance evaluation. Let’s evaluate the Langchain SQL agent across several dimensions for the conversational SQL query task.

Strengths

1. Understanding and Parsing the Question:

  • The agent showed a strong ability to comprehend user intent and formulate SQL queries that aim to extract relevant data from the database.

2. Schema Understanding:

  • The agent successfully pulled and interpreted the schema, identifying relevant columns for the query, which is crucial for constructing accurate and pertinent SQL statements.

3. Error Handling:

  • Demonstrated adept error identification and handling by recognizing syntax issues (e.g., the hyphen in the table name) and adjusting accordingly.

4. Query Optimization:

  • The usage of SQL wildcards to navigate and adapt to potential subregion categorization variations (‘%Asia%’) was smart and indicative of anticipatory problem-solving.

5. Structured Approach:

  • The agent operated in a structured manner, executing actions sequentially from understanding the schema to designing, validating, and executing the SQL query.

Weaknesses

1. Assumption of Data and Categories:

  • Initially, the agent made assumptions regarding subregion categorization (“Asia”) without validating them against the database, which could potentially lead to errors or inaccurate queries.

2. Data Type Consideration:

  • There was a lack of consideration regarding data types in columns. The GDP growth rates, typically considered numerical, were stored as VARCHAR and should ideally be converted for accurate sorting.

3. Preemptive Query Validation:

  • The initial query execution without adequate syntax validation (especially with known irregular characters like hyphens in table names) revealed a potential area for enhancement in preemptive syntax checking and correction.

4. Explanatory and Confirmatory Actions:

  • The agent might enhance user trust and alignment by providing brief summaries or previews of data, and offering confirmatory cues throughout the process.

5. Retrospective Data Validation:

  • After no data was returned for 2023, the agent could potentially perform a retrospective check to verify data existence in the specified year, ensuring complete user clarity.

Opportunities for Improvement

  • Enhanced Data Validation: Before constructing queries, ensuring the validation of user-specified categories or values against existing data could enhance accuracy and user experience.
  • Data-Type Adaptation: Implementing data type conversion where pertinent, ensuring accurate numerical sorting and comparisons in SQL queries.
  • Informative Interim Updates: Provide interim summaries, data previews, or validation checks during the interaction to keep the user informed and aligned.
  • Action Robustness: Enhancing the agent’s ability to handle varied table names and schema structures robustly, minimizing syntax or execution errors.
  • User Confirmation: Including occasional user confirmations or check-ins, especially when assumptions are made, to ensure user-agent alignment and accuracy.

Overall Evaluation

The Langchain SQL agent exhibited a commendable capability in understanding, constructing, and executing SQL queries in a conversational manner, reflecting a structured approach and adept problem-solving, particularly in error handling. The incorporation of logical and anticipatory query adjustments (like SQL wildcards) reflected a nuanced understanding of database interaction nuances. However, the process also revealed opportunities for enhancing the approach toward data and syntax validation, user engagement, and explanatory interaction to augment the accuracy, robustness, and user experience in future interactions. With a few refinements, the agent shows substantial promise as an effective and user-friendly interface for conversational SQL querying.

Implementation Suggestion

Other than performance, security, reliability, efficiency, and user satisfaction are left out but need to be considered for production implementation. Some key items suggested for consideration include but are not limited to:

1. Enhanced Security:

  • Ensure that the LangChain system has robust authentication and authorization mechanisms.
  • Implement data encryption both at rest (in the SQL Server) and in transit (when data is communicated between components).
  • Regularly audit and update security protocols to guard against vulnerabilities.

2. Rate Limiting and Throttling:

  • Implement rate limiting on the LangChain to prevent abuse or Denial-of-Service (DoS) attacks.

3. Monitoring and Logging:

  • Integrate comprehensive monitoring and logging mechanisms to track queries, anomalies, and potential security breaches.
  • This will assist in troubleshooting, performance monitoring, and ensuring system health.

4. Scalability:

  • Design the system with scalability in mind. The SQL Server and LangChain should be able to handle increased loads, possibly through load balancing or clustering solutions.

5. Backup and Disaster Recovery:

  • Regularly back up the SQL Server data and ensure there’s a disaster recovery plan in place. This will help in quickly restoring services in case of failures.

6. Natural Language Processing (NLP) Improvements:

  • As the system relies on natural language interpretation, continuous training and improvement of the NLP model in LangChain is crucial for accuracy and efficiency.
  • Consider incorporating feedback loops where users can correct or validate translations, enhancing the system’s learning over time.

7. User Interface & Experience:

  • For the Client Application, ensure that the interface is user-friendly and intuitive, promoting easy and efficient communication in natural language.
  • Provide users with feedback when their queries are being processed, and helpful error messages if something goes wrong.

8. Redundancy:

  • Implement redundancy for critical system components to ensure high availability. This is especially crucial for databases and the LangChain service.

--

--