Building SQL Validation Rules with LangGraph
In this blog post, we’ll introduce a simple tool created with LangGraph, designed to generate SQL validation rules that help detect errors in table columns on any relational database. This command line tool will internally use Large Language Models (LLMs) to analyse the relational database fields and generate SQL statements used to pinpoint faulty or anomalous data.
What is LangGraph?
LangGraph is a software library built on top of LangChain which allows you to build workflow-based or multi-actor applications with LLMs. It is part of the LangChain ecosystem, lets you model agents as graphs, and also supports multi-agent applications, similar to those orchestrated by frameworks like Autogen.
Below you can see the relationship between LangGraph and LangChain:
LangChain is a framework which helps you create LLM-based workflows and agents. It has a wide variety of agent tools that integrate with all types of systems: databases, search engines, data science libraries and much more. It also provides APIs to access LLMs in a standardised manner. Here you can find all LLM integrations available in LangChain:
LCEL, the LangChain Expression Language, allows the simplified creation of acyclic workflows. It supports features like parallelisation, streaming and asynchronous execution of LLM-based workflows with a simplified syntax. Check the following link for more information:
LangGraph supports cyclic graphs, making it well-suited for developing agents, which typically feature workflows that include loops.
When you use LangGraph you have to set up a graph by defining nodes and the edges that connect them. The nodes correspond to Python or Javascript functions, depending on which version of the library you are using.
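To make the node-and-edge model concrete, here is a library-free sketch of this execution model in plain Python. This is not the LangGraph API (which we will see later in the post); all names here are illustrative:

```python
# Library-free sketch of LangGraph's execution model (not the real API):
# nodes are functions that update a shared state dict, a normal edge sends
# the tool node back to the agent, and a conditional edge decides whether
# to loop again or to stop.
def agent(state):
    state["attempts"] += 1
    # Pretend the LLM is satisfied after three attempts.
    state["done"] = state["attempts"] >= 3
    return state

def action(state):
    state["tool_calls"] += 1  # pretend a tool was executed
    return state

nodes = {"agent": agent, "action": action}

state = {"attempts": 0, "tool_calls": 0, "done": False}
current = "agent"
while current is not None:
    state = nodes[current](state)
    if current == "agent":
        # Conditional edge: continue to the tool node or end the run.
        current = None if state["done"] else "action"
    else:
        current = "agent"  # normal edge back to the agent
```

The loop between the agent node and the action node is exactly the kind of cycle that LCEL alone cannot express, and which LangGraph adds.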
You can find the documentation of the Javascript LangGraph implementation here:
And the Python version of LangGraph is here:
A SQL Validation Rule Generator
But let us get back to our SQL validation rule generator. What does it do? It generates SQL statements which detect potential anomalies in the columns of a table in a relational database.
The rule generator is capable of generating multiple SQL validation queries, each with a simple explanation, for a single column:
# Table: call_center Column: cc_call_center_sk
Attempted generations: 4
```sql
-- Numeric Validation
SELECT cc_call_center_sk
FROM "TPCDS_SF10TCL".call_center
WHERE TRY_CAST(cc_call_center_sk AS NUMBER) IS NULL
LIMIT 1000;
```
```sql
-- Duplicate Check
SELECT
cc_call_center_sk,
COUNT(*)
FROM
call_center
GROUP BY
cc_call_center_sk
HAVING
COUNT(*) > 1
LIMIT 1000;
```
```sql
-- Check for non-numeric values
SELECT cc_call_center_sk FROM "TPCDS_SF10TCL".call_center WHERE TRY_TO_NUMBER(cc_call_center_sk) IS NULL LIMIT 1000;
```
```sql
-- Check for non-numeric values
SELECT cc_call_center_sk
FROM "TPCDS_SF10TCL".call_center
WHERE TRY_TO_NUMBER(cc_call_center_sk) IS NULL
LIMIT 1000;
```
At the time of writing, this tool has only a simple command line interface (CLI) that uses a LangGraph agent to connect to the database and generate the SQL statements.
The SQL Validation Rule Graph
The agent is the core part of this tool. It is embedded into an execution graph. This execution graph can be described using a BPMN workflow which you can see below:
These are the steps in this execution graph:
- Receive the initial input which has the table and column as mandatory arguments and an optional list of previously executed rules (see “Receive input”).
- The LLM powered agent receives an input (see “Ask LLM what to do”).
- After interacting with the LLM it decides which tool to use or whether it has gathered all information needed to generate a rule. It might also fail after too many retries. These are the potential outcomes of the LLM call:
- Get table info tool: this tool fetches the DDL (Data Definition Language) query of the table to which the column belongs and also fires a query which returns a sample of the data in that table.
- Check SQL query tool: this tool uses the LLM to verify if the generated SQL is syntactically correct.
- Run SQL query tool: this tool executes the generated SQL against the database, in order to see if it works.
- Success: in case the agent has generated and validated the SQL query, then it sends the output to the final extraction step.
- Error: too many retries or the LLM call triggered some kind of error.
- Tool execution gateway: after a tool is executed the result is either a success or an error. The result of the tool execution is fed back to the LLM (see “Tool replies” and “Tool error”).
- SQL query extraction: the SQL query gets extracted from a typically verbose LLM response (see “Extract validation type and SQL using LLM”). This is the last step before the workflow finishes successfully.
As you can see from the BPMN diagram, the agent is the main part of the graph. The agent also has a state that acts as an internal memory and accumulates knowledge during its execution (more on the agent below).
Source Code
You can find the SQL Validation Rule Generator in this Github repository:
Setup
The README file of the project describes how you can install this project locally. You will need to install Conda first, create your environment, install Poetry and then run these commands:
```shell
conda create -n sql_validation_rules python=3.12
conda activate sql_validation_rules
pip install poetry
poetry install
```
The configuration for this tool is stored in the .env file. All of the configuration variables can be found in the .env_local file:
```
OPENAI_API_KEY=<key>
OPENAI_API_MODEL=gpt-4-1106-preview
OPENAI_API_TEMPERATURE=0
VERBOSE_LLM=True
PROJECT_ROOT=/development/playground/agents/sql_validation_rules
SNOWFLAKE_ACCOUNT=<account_name>
SNOWFLAKE_USER=<user>
SNOWFLAKE_PASSWORD=<password>
SNOWFLAKE_DATABASE=SNOWFLAKE_SAMPLE_DATA
SNOWFLAKE_SCHEMA=<schema>
SNOWFLAKE_WAREHOUSE=DEMO_WH
SNOWFLAKE_HOST=<host>
```
As you can see there are configuration variables related to an LLM connection and a Snowflake database. We have tested this tool against a Snowflake instance, but it should work with any other database supported by sqlalchemy, a “Python SQL Toolkit and Object Relational Mapper”.
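As a hypothetical illustration of how those Snowflake variables come together (the real logic lives in the project's db_connection_factory module, and `snowflake_url` plus the literal values below are made up for this sketch), a snowflake-sqlalchemy style connection URL could be assembled like this:

```python
# Hypothetical sketch: assemble a Snowflake SQLAlchemy URL from the .env
# variables shown above. The function name and sample values are
# illustrative; the URL shape follows the snowflake-sqlalchemy convention
# of snowflake://user:password@account/database/schema?warehouse=...
def snowflake_url(env: dict) -> str:
    return (
        "snowflake://{user}:{password}@{account}/{database}/{schema}"
        "?warehouse={warehouse}"
    ).format(
        user=env["SNOWFLAKE_USER"],
        password=env["SNOWFLAKE_PASSWORD"],
        account=env["SNOWFLAKE_ACCOUNT"],
        database=env["SNOWFLAKE_DATABASE"],
        schema=env["SNOWFLAKE_SCHEMA"],
        warehouse=env["SNOWFLAKE_WAREHOUSE"],
    )

url = snowflake_url({
    "SNOWFLAKE_USER": "user",
    "SNOWFLAKE_PASSWORD": "secret",
    "SNOWFLAKE_ACCOUNT": "my_account",
    "SNOWFLAKE_DATABASE": "SNOWFLAKE_SAMPLE_DATA",
    "SNOWFLAKE_SCHEMA": "TPCDS_SF10TCL",
    "SNOWFLAKE_WAREHOUSE": "DEMO_WH",
})
```

Swapping in a Postgres or MySQL URL at this point is what makes the tool portable across sqlalchemy-supported databases.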
You will need access to the OpenAI API and to a SQL database, like Snowflake, Postgres, Oracle, MySQL, etc., to run this tool.
CLI
The command line application code can be found in the main.py file:
The CLI allows you to generate rules either for a full table or for a single column in a table. The main commands are:
- generate-multiple-column-rules: this command can be used to generate multiple rules for a single column.
- generate-rules: used to generate multiple rules for all columns of a specific table.
If you want to list all commands of this tool you can execute this command:
```shell
python .\sql_validation_rules\cli\main.py --help
```
Upon running this command you will see this:
```
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  generate-column-rule            Generate a single rule for a column
  generate-multiple-column-rules  Generate multiple rules for a column
  generate-rules                  Generates rules for all fields in a table
  list-columns                    Lists all columns in a table.
  list-tables                     Lists all tables in a database.
```
You can get more information on each command using the --help option:
```shell
python .\sql_validation_rules\cli\main.py generate-multiple-column-rules --help
```
```
Usage: main.py generate-multiple-column-rules [OPTIONS]

  Generate multiple rules for a column

Options:
  -t, --table TEXT   The table of the column for which you want to generate
                     rules.
  -c, --column TEXT  The column for which you want to generate rules.
  --count INTEGER    How many distinct rules to generate
  -f, --file TEXT    The output file into which the output is recorded.
  --help             Show this message and exit.
```
CLI Example Usage
If you want to generate 3 validation rules for column cc_city in table call_center you can execute this command:
```shell
python ./sql_validation_rules/cli/main.py generate-multiple-column-rules \
    --table call_center -c cc_city --count 3 -f call_center_cc_city.txt
```
This command will produce a file call_center_cc_city.txt with this content:
# Table: call_center Column: cc_city
Attempted generations: 3
```sql
-- Regex Pattern Validation
SELECT cc_city FROM "TPCDS_SF10TCL".call_center WHERE REGEXP_LIKE(cc_city, '^[a-zA-Z\s]+$') = FALSE LIMIT 1000;
```
```sql
-- Anomaly Detection
SELECT *
FROM "TPCDS_SF10TCL".call_center
WHERE LENGTH(TRIM(cc_city)) = 0 -- Check for empty strings
OR cc_city REGEXP '[^a-zA-Z .-]' -- Check for invalid characters using REGEXP
LIMIT 1000;
```
```sql
-- Anomaly Detection
SELECT cc_city FROM "TPCDS_SF10TCL".call_center WHERE REGEXP_LIKE(cc_city, '^[A-Za-z\s]+$') = FALSE LIMIT 1000;
```
Implementation
These are the main components of the implementation:
- configuration
- agent
- agent state
- agent initialization
- tools
- graph
- CLI
Configuration
The configuration is based on environment variables typically stored in the .env file. You can find the file which prepares the configuration here.
The graph relies on a state (see this file) to be able to operate:
```python
from typing import TypedDict, Annotated, List, Union

from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.messages import BaseMessage
import operator

from sql_validation_rules.chain.sql_commands import SQLCommand

FIELD_EXCLUSION_RULES = "exclusion_rules"


class AgentState(TypedDict):
    """The graph state."""

    table: str
    """The relational table to which the column is associated."""
    field: str
    """The name of the column for which the SQL validation rules are generated."""
    exclusion_rules: str
    """The rule types that were previously generated and we would like to exclude."""
    chat_history: List[BaseMessage]
    """The list with the messages up to a certain point in time."""
    agent_outcome: Union[AgentAction, AgentFinish, None]
    """The last agent outcome which could be an agent action or an object marking the end of the agent execution."""
    intermediate_steps: Annotated[list[tuple[AgentAction, str]], operator.add]
    """List of actions and corresponding observations."""
    extraction_content: SQLCommand
    """The extracted SQL command from the extraction LLMChain."""
```
The state contains the initial parameters, lists that accumulate messages and also the output object extraction_content.
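As an illustration of what the graph consumes at the start of a run, an initial state could look like this. This is a pared-down, hypothetical version of AgentState with only the input fields (the agent fills in the rest during execution), so that the example runs without the LangChain imports:

```python
from typing import TypedDict, List

# Illustrative, pared-down version of AgentState: only the fields the
# caller supplies up front. Values below are example inputs, mirroring
# the CLI run shown earlier in the post.
class InitialState(TypedDict):
    table: str
    field: str
    exclusion_rules: str
    chat_history: List        # message list, empty at the start
    intermediate_steps: List  # (action, observation) pairs accumulate here

initial_state: InitialState = {
    "table": "call_center",
    "field": "cc_city",
    "exclusion_rules": "Regex Pattern Validation",
    "chat_history": [],
    "intermediate_steps": [],
}
```

Because intermediate_steps is annotated with operator.add in the real AgentState, each graph step appends to it rather than overwriting it, which is how the agent accumulates its memory.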
Agent
The agent initialization is implemented in the agent_factory.py file:
```python
from typing import List

from langchain.agents import create_openai_functions_agent
from langchain_core.runnables.base import RunnableSequence

from sql_validation_rules.config.config import cfg
from sql_validation_rules.tools.sql_tools import (
    sql_query,
    sql_query_checker,
    sql_info_tables,
)
from sql_validation_rules.agent.prompt_factory import create_sql_validation_template
from sql_validation_rules.agent.agent_state import FIELD_EXCLUSION_RULES


def create_agent_runnable() -> RunnableSequence:
    """Construct an OpenAI functions agent."""
    return create_openai_functions_agent(
        llm=cfg.llm,
        prompt=create_sql_validation_template(),
        tools=[sql_info_tables, sql_query, sql_query_checker],
    )


agent_runnable = create_agent_runnable()
```
In this file we initialize an OpenAI functions agent with a call to create_openai_functions_agent. During its initialization the agent is associated with three tools (sql_info_tables, sql_query, sql_query_checker) and a prompt. The prompt is initialized in prompt_factory.py:
```python
from langchain_core.prompts import ChatPromptTemplate

from sql_validation_rules.config.toml_support import prompts


def create_sql_validation_template() -> ChatPromptTemplate:
    main_prompts = prompts["sql_validation"]["main"]
    return ChatPromptTemplate.from_messages(
        [
            ("system", main_prompts["system_message"]),
            # Means the template will receive an optional list of messages under
            # the "placeholder" key
            ("placeholder", "{chat_history}"),
            ("human", main_prompts["human_message"]),
            ("placeholder", "{agent_scratchpad}"),
        ]
    )
```
The prompt template refers to the prompt stored in file prompts.toml. The main prompt is the human prompt which is linked to three input keys:
- table
- field
- exclusion_rules
This is the prompt with the agent instructions:
Given an input query create syntactically correct Snowflake SQL queries which validate the data, then look at the results of the query and return the query.
Remember: a SQL validation query checks whether the data in a field is valid and looks for anomalies.
If the field is an email you need to generate SQL which checks whether the content of the field is an email.
If the field is a personal name, then it should not contain numbers. It should contain only alpha numeric characters with spaces and little punctuation.
If the field is a phone number, then the field should only contain characters allowed in phone numbers.
If the field is a zip code, then the field should only contain a specific alpha numeric pattern.
If the field is numeric, then the field should only contain numbers in a specific range that matches the purpose of the field (e.g. for a personal age field you should have no negative numbers).
You must double check your query before executing it. If you get an error while executing a query, rewrite the query and try it again.
Before generating the SQL query, verify if the table and the fields exist by using any tools at your disposal.
If you have generated a SQL query, verify if the syntax is correct by using any tools at your disposal.
If you execute SQL queries limit them to a maximum of 1000 rows.
DO NOT run any DML statements (INSERT, UPDATE, DELETE, DROP, etc.) against the database.
Can you NOT generate sql for these specific rules if specified (ignore if nothing specified):
```{exclusion_rules}```
Can you generate SQL validation queries for table {table} and field {field}?
As you can see, the prompt gives the main instruction about SQL query generation, then some detailed output instructions (“if this, do that”), and then some instructions on tool usage, as well as what to avoid in terms of SQL generation.
If you decide to use this prompt, perhaps you should remove the reference to “Snowflake”.
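To illustrate how the three input keys (table, field, exclusion_rules) end up in the prompt, here is a simplified stand-in for the substitution that ChatPromptTemplate performs; the prompt text is abridged and plain str.format is used instead of the template class:

```python
# Simplified stand-in for the substitution done by ChatPromptTemplate:
# the three input keys are injected into the human message. The prompt
# text here is abridged from the one shown above.
human_message = (
    "Do NOT generate SQL for these rule types: {exclusion_rules}. "
    "Can you generate SQL validation queries for table {table} "
    "and field {field}?"
)

rendered = human_message.format(
    exclusion_rules="Duplicate Check",
    table="call_center",
    field="cc_city",
)
```

Passing previously generated rule types through exclusion_rules is what lets the CLI produce several distinct rules for the same column.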
And finally you have the agent tools, which are instantiated in the file sql_tools.py and are just wrappers around tools provided by LangChain:
```python
import sys

from langchain.tools import tool
from langchain_community.tools.sql_database.tool import (
    InfoSQLDatabaseTool,
    ListSQLDatabaseTool,
    QuerySQLCheckerTool,
    QuerySQLDataBaseTool,
)
from langchain_core.tools import BaseTool

from sql_validation_rules.config.config import cfg
from sql_validation_rules.db_connection_factory import sql_db_factory
from sql_validation_rules.tools.list_columns_tool import ListIndicesSQLDatabaseTool

db = sql_db_factory()

list_tables_tool: BaseTool = ListSQLDatabaseTool(db=db)
info_tables_tool: BaseTool = InfoSQLDatabaseTool(db=db)
query_sql: BaseTool = QuerySQLDataBaseTool(db=db)
query_sql_checker: BaseTool = QuerySQLCheckerTool(db=db, llm=cfg.llm)
query_columns_tool: BaseTool = ListIndicesSQLDatabaseTool(db=db)

...

@tool("info_tables", return_direct=True)
def sql_info_tables(table_list_str: str) -> str:
    """Returns information about a list of tables."""
    return info_tables_tool(tool_input=table_list_str)


@tool("sql_query", return_direct=True)
def sql_query(sql: str) -> str:
    """Executes a SQL query."""
    return query_sql(tool_input=sql)


@tool("sql_query_checker", return_direct=True)
def sql_query_checker(sql: str) -> str:
    """Validates SQL queries to check if the syntax is correct."""
    return query_sql_checker(tool_input=sql)
```
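Conceptually, a tool decorator like the one used above registers a plain function under a name, with its docstring serving as the description the LLM sees when choosing a tool. Here is a library-free sketch of that idea; TOOL_REGISTRY and this decorator are illustrative stand-ins, not LangChain's internal mechanism:

```python
# Library-free sketch of what a @tool decorator does conceptually:
# register a function under a name with its docstring as description,
# so an agent can look tools up by name. Illustrative only.
TOOL_REGISTRY = {}

def tool(name):
    def decorator(fn):
        TOOL_REGISTRY[name] = {"fn": fn, "description": fn.__doc__}
        return fn
    return decorator

@tool("sql_query")
def sql_query(sql: str) -> str:
    """Executes a SQL query."""
    return f"executed: {sql}"  # a real tool would hit the database here

result = TOOL_REGISTRY["sql_query"]["fn"]("SELECT 1")
```

The descriptions matter: the OpenAI functions agent picks a tool by matching the task at hand against these docstrings, so keeping them short and precise directly improves tool selection.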
Graph
The graph is created in file graph_factory.py. This is the relevant code which defines the graph and then creates a runnable instance:
```python
from langgraph.graph import END, StateGraph

...

workflow = StateGraph(AgentState)
workflow.add_node(AGENT, run_agent)  # LLM
workflow.add_node(ACTION, execute_tools)  # SQL tools
workflow.add_node(EXTRACTION, run_extraction)  # Extraction
workflow.set_entry_point(AGENT)
workflow.add_edge(ACTION, AGENT)
workflow.add_conditional_edges(
    AGENT, should_continue, {"continue": ACTION, "end": EXTRACTION}
)
workflow.add_edge(EXTRACTION, END)

app = workflow.compile()
```
The workflow is instantiated and the nodes are added as function references. Then the entry point (“AGENT”) is defined. Finally the edges are added; note that there are two types of edges:
- normal edges
- conditional edges, where the edge is linked to a routing function, here should_continue:
```python
def should_continue(data):
    """Define logic that will be used to determine which conditional edge to go down."""
    if isinstance(data[AGENT_OUTCOME], AgentFinish):
        return "end"
    return "continue"
```
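With stub classes standing in for langchain_core's AgentAction and AgentFinish, the routing logic can be exercised in isolation (a self-contained sketch, not the original imports):

```python
# Self-contained sketch of the conditional edge: stub classes stand in
# for langchain_core's AgentAction and AgentFinish.
class AgentAction:
    pass

class AgentFinish:
    pass

AGENT_OUTCOME = "agent_outcome"

def should_continue(data):
    """Route to extraction once the agent signals it is finished."""
    if isinstance(data[AGENT_OUTCOME], AgentFinish):
        return "end"
    return "continue"

running = should_continue({AGENT_OUTCOME: AgentAction()})
finished = should_continue({AGENT_OUTCOME: AgentFinish()})
```

The returned string is looked up in the mapping passed to add_conditional_edges, so "continue" routes to the tool node and "end" routes to the extraction node.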
Conclusion
The SQL Rule Validator works well with GPT-4 (model: gpt-4-1106-preview). We have also tried GPT-3.5, but the generated SQL did not always make sense, even though the agent would not crash. It would be really interesting to try it out with other models, like the newly released Llama 3.
It is also quite interesting to see how the model interacts with the tools: it inspects the database, checks queries, fires them and then analyses errors. We have captured some of the model interactions in more detail in this file here.
The SQL Rule Validator barely scratches the surface of what LangGraph can do. You can build extremely complex workflows with it, combined with multi-agent interactions in all sorts of topologies, including supervisor-based topologies, which we are going to explore in the next blog. You should be able to produce similar tools with Autogen and CrewAI. One advantage of LangGraph over these two is its tight integration with the extensive LangChain ecosystem; another is its clearly defined execution graph.
Thank you for reading this story. See you on the next one.