Harnessing SQL Agents to Power Telecom Data Insights using LangChain

Awais Kaleem
8 min read · May 4, 2024


If you have worked in the telecommunications industry, you understand the struggle of keeping up with the thousands of business rules embedded in the data model. From billing systems, network and usage records to payment plans, it is a vast sea of intertwining rules. Onboarding onto such a system can take months for a new joiner, not to mention the daunting effort of making a change that doesn’t break the existing model. An LLM-powered LangChain SQL Agent can help with that. To stand one up in an established telco company, all it needs is access to the model schema, some commonly executed queries and permission to execute them.

For this blog, I also created a dummy telco SQL database, which is available in the GitHub repository for this project. The data size is small, but it serves the purpose of demonstration here. For setup, you will need the following libraries, along with a Python environment at version 3.10 or above.

langchain
langchain-community
langchain-openai
faiss-cpu
python-dotenv

LangChain (langchain) is a popular framework for simplifying the creation of LLM applications. LangChain Community (langchain-community) contains third-party integrations that implement the base interfaces defined in LangChain Core. LangChain-OpenAI (langchain-openai) is an integration package that connects OpenAI and LangChain. FAISS (faiss-cpu) is Meta's library for efficient similarity search and clustering of dense vectors. python-dotenv loads environment variables, such as API keys, from a .env file.

Database Connection

To load the data from the database, we use SQLDatabase from the langchain_community.utilities module. Even though we are using a SQLite database here, you can swap in any other database provider simply by changing the connection URI passed to the SQLDatabase instance.

from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///Telco.db")
print(db.dialect)
print(db.get_usable_table_names())
db.run("SELECT * FROM Customers LIMIT 10;")
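Since the connection string follows standard SQLAlchemy URI syntax, pointing the agent at another engine is just a matter of swapping the URI. The hostnames, credentials and service names below are placeholders, and the extra drivers (psycopg2, oracledb) would need to be installed separately:

```python
# Hypothetical connection URIs for other engines (placeholder credentials).

# The SQLite file used in this post:
sqlite_uri = "sqlite:///Telco.db"

# PostgreSQL via the psycopg2 driver:
postgres_uri = "postgresql+psycopg2://user:password@localhost:5432/telco"

# Oracle via the python-oracledb driver:
oracle_uri = "oracle+oracledb://user:password@localhost:1521/?service_name=TELCO"

# Any of these can be passed to SQLDatabase.from_uri(), e.g.:
# db = SQLDatabase.from_uri(postgres_uri)
```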

LLM Connection

Now that we know our database is accessible, let’s connect it with our LLM of choice. I am choosing OpenAI GPT-3.5 Turbo, but you can replace this with any LLM here as well. For OpenAI, note that you will need a paid API key at this stage, set in the OPENAI_API_KEY environment variable.

from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
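Rather than hardcoding the key, you can keep it in a `.env` file (e.g. `OPENAI_API_KEY=sk-...`) and load it before constructing the model; the python-dotenv package from our requirements does exactly this via `load_dotenv()`. For illustration, here is a minimal stdlib equivalent of what that call does:

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for dotenv.load_dotenv(): read KEY=VALUE lines
    from a file and export them as environment variables."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env_file()  # ChatOpenAI then picks up OPENAI_API_KEY from the environment
```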

SQL Agent Creation

Now we proceed to create our SQL Agent with connection to the database and the LLM.

agent_executor = create_sql_agent(llm, db=db, agent_type="openai-functions", verbose=True)

We are choosing `openai-functions` here, which uses OpenAI's function-calling interface. To understand the different types of agents available, refer to the LangChain agents documentation.

Let’s see if our SQL Agent can access our DB and return us some basic results:

agent_executor.invoke(
"List the total bill per customer. Which customer has the most in Payments?"
)
agent_executor.invoke("Which Plan is the most popular?")

Few-Shot Learning Using Example Queries

Conditioning an LLM on a small number of examples in the prompt is called few-shot learning. We will throw in a few example queries so the LLM can learn how a database user typically works with the underlying schema. Each example has a key called “input” which contains a description of what the query attached to it does.

examples = [
    {"input": "List all subscription plans.", "query": "SELECT * FROM Plans;"},
    {"input": "What is the name for plan_id 1?", "query": "SELECT plan_name FROM Plans WHERE plan_id=1;"},
    {
        "input": "Find all customers for the plan 'Basic'.",
        "query": """
            SELECT * FROM Customers
            WHERE customer_id IN (
                SELECT customer_id FROM Subscriptions
                WHERE plan_id = (
                    SELECT plan_id FROM Plans WHERE plan_name = 'Basic'
                )
            );
        """,
    },
    # .... [see list of examples in notebook here]
]

We embed these examples using OpenAIEmbeddings and a FAISS vector index (from Meta AI).

from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    FAISS,
    k=5,
    input_keys=["input"],
)

LangChain provides us with `FewShotPromptTemplate`, which can wrap the examples we created above into a template. It gives us options like `input_variables` (where we can pass params like `top_k`) as well as `prefix` (where we can specify the prompt the agent will use with each call). We are not providing any suffix in this case.

system_prefix = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If the question does not seem related to the database, just return "I don't know" as the answer.

Here are some examples of user inputs and their corresponding SQL queries:"""

from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"
    ),
    input_variables=["input", "dialect", "top_k"],
    prefix=system_prefix,
    suffix="",
)

Next we create our SystemMessagePromptTemplate, which is the message sent internally to the LLM and not to the user. According to the LangChain documentation, the prompt must have an agent_scratchpad key that is a MessagesPlaceholder. `agent_scratchpad` should be a sequence of messages containing the previous agent tool invocations and the corresponding tool outputs. With this, our ChatPromptTemplate is ready.

from langchain_core.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
)

full_prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate(prompt=few_shot_prompt),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

We now recreate our SQL Agent but this time with a prompt template specified.

agent = create_sql_agent(
    llm=llm,
    db=db,
    prompt=full_prompt,
    verbose=False,  # Change this to True if you want agent execution details
    agent_type="openai-functions",
)
agent.invoke({"input": "Can you name the customer with Basic Subscriptions?"})

So far so good…

Semantic Matching

If we want our agent to be resilient to variations in entity names like customer name or subscription plan name, we can create a vector index and ask the LLM to query this index whenever relevant. For this, we will make use of `create_retriever_tool` from `langchain.agents.agent_toolkits`, which creates a tool that retrieves the embedded docs/values. We also add a description of what we expect this retriever to do and how it should treat the incoming input in comparison to the available values.

import ast
import re


def query_as_list(db, query):
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return list(set(res))


customer_names = query_as_list(db, "SELECT name FROM Customers")
plan_details = query_as_list(db, "SELECT description FROM Plans")
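To see what this helper does, here is a quick illustration with a hypothetical stub standing in for the real db object: `db.run` returns the result rows as a string, which `query_as_list` parses, flattens, strips of standalone numbers, and de-duplicates.

```python
import ast
import re

def query_as_list(db, query):
    # Parse the stringified rows, flatten them, and strip standalone numbers.
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return list(set(res))

class FakeDB:
    """Stub standing in for SQLDatabase: run() returns rows as a string."""
    def run(self, query):
        return "[('Bob Johnson',), ('Alice Smith',), ('Basic 1',)]"

print(sorted(query_as_list(FakeDB(), "SELECT name FROM Customers")))
# ['Alice Smith', 'Basic', 'Bob Johnson']
```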


from langchain.agents.agent_toolkits import create_retriever_tool

vector_db = FAISS.from_texts(customer_names + plan_details, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 1})
description = """Use to look up values to filter on. Input is an approximate spelling of the proper noun, output is \
valid proper nouns. Use the noun most similar to the search."""
retriever_tool = create_retriever_tool(
retriever,
name="search_proper_nouns",
description=description,
)

Fine-tuning the system prompt a bit again:

table_names = db.get_usable_table_names()

system = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If you need to filter on a proper noun like customer name or a plan description, you must ALWAYS first look up the filter value using the "search_proper_nouns" tool!

You have access to the following tables: {table_names}.

If the question does not seem related to the database, just return "I don't know" as the answer."""

prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{input}"), MessagesPlaceholder("agent_scratchpad")]
)

agent = create_sql_agent(
    llm=llm,
    db=db,
    extra_tools=[retriever_tool],
    prompt=prompt,
    agent_type="openai-tools",
    verbose=True,
)

Let’s test now by searching for a customer and intentionally providing an incorrect name spelling:

agent.invoke({"input": "Is there a customer with name Bobb?"})

🎉🎉🎉 Indeed we were able to find Bob Johnson along with their relevant details.

Conclusion & Considerations

To conclude, SQL Agents can be a strong analyst tool that streamlines many reporting requirements as well as general schema maintenance. I have created a very simple version of an SQL Agent here. It could easily be extended into a RAG system that also understands the schema, helping data modellers scale the existing model.
Last but not least, from a security standpoint, you can take a couple of preemptive measures:

  • The database permissions can of course be limited to read rights, so all your agent is able to do is create SELECT statements and send them over to the database engine. Additionally, you can state in your system prompt, as we did above, that the agent must never create a DML statement.
  • If you are concerned about LLM API providers using your calls to retrain their systems, you can always use one of the many open-source LLMs available, which can perform comparably to an API provider like OpenAI.
  • Your final chat prompt is a make-or-break element of your whole RAG. Notice the system prompt written above. Parts like “You have access to the following tables: {table_names}.” and “If the question does not seem related to the database, just return “I don’t know” as the answer.” are critical to a successful result. For more on prompt strategies, refer to LangChain's prompting documentation.

I’d love to hear your comments in the section below, anything I missed or need to update. Looking forward! 🙂

Written by Awais Kaleem

Senior Data Scientist - Credit Modelling, Multimodal LLMs, Segmentation and Predictive Analytics