Revolutionize Your Data Exploration: Unveiling the Power of LangChain

Luc Nguyen
6 min readNov 26, 2023

--

In the ever-growing world of data science and machine learning, there’s a game-changer on the scene that promises to make talking to your data a breeze — enter LangChain. This dynamic tool is not just another player; it’s your buddy, making the tricky world of data analysis simpler. Join me on this journey as we set up, say a “Hello World” to our data, unravel LangChain’s magic, and troubleshoot a bit.

Prepare Environment

Let’s start by getting our environment ready to use LangChain. It doesn’t matter if you’re a data pro or a newbie; let’s make this journey fun. I’ll show you LangChain with Teradata, but you can use your favorite database.

1. Install Langchain

pip install langchain

2. Setup your Database (Teradata, in my case)
In this demonstration, the dataset comprises three main entities: Customer, Product, and Order. The Order table specifically stores information pertaining to the products customers have ordered. This information includes the product_id, quantity, and the total amount spent.

DB Schema

Hello World Example:

Let’s break the ice with a classic “Hello World” example. Open your Jupyter notebook, import the Langchain library, and let’s greet our data:

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents.agent_types import AgentType
from langchain.llms.openai import OpenAI
from langchain.sql_database import SQLDatabase
import os
# Setting up the SQL Database Connection
db = SQLDatabase.from_uri("teradatasql://dbc:dbc@192.168.11.7:1025/Sales")

# Configuring the OpenAI Language Model
model = OpenAI(temperature=0, openai_api_key=os.getenv("OPENAI_API"))

# Creating SQL Database Toolkit
toolkit = SQLDatabaseToolkit(db=db, llm=model)

# Creating and Running a SQL Agent
agent_executor = create_sql_agent(
llm=model,
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

agent_executor.run("Number of customer")

Result :

Thought: I should query the customer table to get the number of customers
Action: sql_db_query
Action Input: SELECT COUNT(*) FROM customer
Observation: [(10,)]
Thought: I now know the final answer
Final Answer: There are 10 customers.

Congratulations! You’ve just initiated a conversation with your dataset.

First Problem and Solution

In my Langchain and Teradata journey, things got a bit tangled when I asked my Agent,

agent_executor.run("Give me top 5 customer who have bill greater than 500")

The query Agent generated as following :

SELECT customer_name, total_price FROM customer, orderdetail 
WHERE customer.customer_id = orderdetail.customer_id
AND total_price > 500 ORDER BY total_price DESC LIMIT 5

The Issue

Error: (teradatasql.OperationalError) [Version 17.20.0.31] [Session 1116]
[Teradata Database] [Error 3706] Syntax error: expected something
between the 'DESC' keyword and the 'LIMIT' keyword.

The Agent’s SQL talk was more in line with databases like PostgreSQL, not Teradata.

The Fix

When creating the Agent, clarity was essential. The instruction included specifying Teradata and generating Teradata-friendly SQL.

Here is the prompt I used to define Agent :

# Define prompt 
prefix = """
You are an agent designed to interact with a Teradata database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain,
always limit your query to at most {top_k}
results by using SELECT TOP {top_k}, note that LIMIT function does not works
in Teradata DB.

You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If the question does not seem related to the database, just return "I don't know" as the answer.
"""

Re-define the agent executor with the updated prefix content.

agent_executor = create_sql_agent(
llm=model,
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
prefix=prefix
)

The result :

After redefining the agent with the new prefix content, it can now generate queries that run smoothly on Teradata DB.

SELECT TOP 5 customer_name, total_price 
FROM customer, orderdetail
WHERE customer.customer_id = orderdetail.customer_id AND total_price > 500

Assessing the Agent’s Capabilities

To evaluate the agent’s proficiency, I devised a range of questions, from simple queries to more complex ones involving multiple table joins.

sql_questions = [
"1. Retrieve all columns from the customer table.",
"2. Retrieve the names and email addresses of all customers.",
"3. Retrieve the unique customer names from the customer table.",
"4. List the customers who have placed an order in the last month.",
"5. Retrieve the total number of orders placed by each customer.",
"6. List the customers who have not placed any orders.",
"7. Calculate the total revenue generated by each customer.",
"8. Find the customer who has spent the most money on orders.",
"9. Retrieve the customer who has placed the most orders in the last quarter.",
"10. List the customers who have placed orders on consecutive days."
]

for i, question in enumerate(sql_questions):
try:
ans = agent_executor(question)
print("Question {} : {}".format(i, question))
print("Answer : ", ans.get("output"))
except Exception as e:
print("Question {} : {}".format(i, question))
print("Error : ", e)

The result :


Question: Retrieve all columns from the customer table.
Answer: The customer table contains customer_id, customer_name, customer_email, and customer_address columns. Examples of data in the table include customer_id 8 with name Grace Davis, customer_id 7 with name Frank Miller, and customer_id 5 with name Charlie Brown.

Question: Retrieve the names and email addresses of all customers.
Answer: The names and email addresses of all customers are:
Grace Davis (grace@example.com), Frank Miller (frank@example.com), Charlie Brown (charlie@example.com), Jane Smith (jane@example.com), Harry Lee (harry@example.com), Eva White (eva@example.com), John Doe (john@example.com), Ivy Smith (ivy@example.com), Alice Johnson (alice@example.com), and Bob Johnson (bob@example.com).

Question: Retrieve the unique customer names from the customer table.
Answer: The unique customer names from the customer table are Grace Davis, Frank Miller, Charlie Brown, Jane Smith, Harry Lee, Eva White, John Doe, Ivy Smith, Alice Johnson, and Bob Johnson.

Question: List the customers who have placed an order in the last month.
Error: This model's maximum context length is 4097 tokens, however, you requested 4267 tokens (4011 in your prompt; 256 for the completion). Please reduce your prompt or completion length.

Question: Retrieve the total number of orders placed by each customer.
Error: This model's maximum context length is 4097 tokens, however, you requested 4179 tokens (3923 in your prompt; 256 for the completion). Please reduce your prompt or completion length.

Question: List the customers who have not placed any orders.
Answer: John Doe, Harry Lee, Alice Johnson.

Question: Calculate the total revenue generated by each customer.
Answer: The total revenue generated by each customer is calculated by querying the customer, orderdetail, and product tables using the provided SQL query.

Question: Find the customer who has spent the most money on orders.
Answer: The customer with ID '1' has spent the most money on orders.

Question: Retrieve the customer who has placed the most orders in the last quarter.
Answer: The customer who has placed the most orders in the last quarter is Frank Miller with 9 orders.

Question: List the customers who have placed orders on consecutive days.
Answer: The customers who have placed orders on consecutive days are customer 10, customer 2, customer 3, customer 4, customer 10, customer 1, customer 6, and customer 9.

The agent exhibited commendable proficiency, effectively responding to a variety of questions. However, it’s crucial to note that the model encountered errors when the data used for feedback exceeded its limit.

Conclusion

In conclusion, LangChain stands as a powerful framework for developing language model-powered applications. Its context-aware and reasoning capabilities open doors to a new era of data interaction. As you embark on your LangChain journey, consider the nuances of agent instruction clarity and be mindful of potential limitations with extensive data.

Unlock the full potential of your data with LangChain — your gateway to a seamless fusion of language models and data exploration. Try it for yourself and witness the transformative impact it can have on your data science endeavors.

For those eager to dive into the code and experiment further, the full source code is available on [GitHub]

For a deeper understanding of how it works, please check the next post here.

--

--