Constructing Queries with Langchain

Published in

Tech Blog

10 min readJun 24, 2024

In today’s era, we’re all looking to converse with computers just as we do with people, where we can speak naturally, and they understand a wide range of data, whether it’s well-organized, messy, or somewhere in between. This prospect is becoming a reality with the rise of language user interfaces (LUIs). However, these interfaces come with their own set of challenges, depending on the type of data they interact with.

To overcome these challenges, Large Language Models (LLMs) possess a strong capability for query construction. This involves converting natural language into a query syntax tailored for each specific data type.

Let’s explore how LUIs are transforming the future of communication and the unique considerations they bring to structured, unstructured, and semi-structured data.

What is Query Construction?

In traditional retrieval-augmented generation (RAG), a user’s query is transformed into a vector format. This vector is then matched against the vectors of source documents to identify those that are most similar. (For more details, check out my blog post, “Exploring LangChain”.)

This approach is quite effective for unstructured data, but how does it perform with structured data?

While query construction for unstructured data often relies on transforming queries into vector representations, handling structured data requires more specialized techniques. Structured data is organized in defined formats, such as tables or databases.

Below, we will highlight several examples of query construction techniques.

Text-to-Metadata-Filter:

Vector stores with metadata filtering allow structured queries to filter embedded unstructured documents. The self-query retriever translates natural language queries into structured queries using these steps:

Data Source Definition: Identify relevant metadata files (e.g., director name, release year, and genre, for effective movie retrieval.).
User Query Interpretation: Convert a natural language question into a query and metadata filter (e.g., movies directed by Christopher Nolan or Steven Spielberg, released after 2010, in the sci-fi genre).
Logical Condition Extraction: Use vector store comparators and operators to create the filter.
Structured Request Formation: Create a structured request that separates the semantic search term from the logical conditions for document retrieval.

This type uses techniques such as self-querying from Langchain: A self-querying retriever is a technique that can generate its own queries. When provided with a natural language query, it employs a query-constructing LLM chain to create a structured query, which it then uses to search its underlying VectorStore.

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings,ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Define the documents with page content and associated metadata
docs = [
    Document(
        page_content="A young wizard embarks on a journey to defeat a dark lord with the help of his friends",
        metadata={"year": 2001, "rating": 7.6, "genre": "fantasy"},
    ),
    Document(
        page_content="A superhero team assembles to save the world from an alien invasion",
        metadata={"year": 2012, "director": "Joss Whedon", "rating": 8.0},
    ),
    Document(
        page_content="A group of astronauts travel through a wormhole in search of a new home for humanity",
        metadata={"year": 2014, "director": "Christopher Nolan", "rating": 8.6},
    ),
]

# Create a vector store from the documents using OpenAI embeddings
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# Define metadata field information
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['fantasy', 'action', 'science fiction']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", 
        description="A 1-10 rating for the movie", 
        type="float"
    ),
]

# Define a description for the document content
document_content_description = "Short plot synopsis of a film"

# Initialize the OpenAI language model with specific settings
llm = ChatOpenAI(temperature=0)

# Create a self-query retriever using the language model, vector store, document content description, and metadata field information
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Use the retriever to find and print films with a rating above 8.5
retriever.invoke("Show me films with a rating above 8.5")

Text-to-SQL

SQL and relational databases represent a significant part of structured data management. Efforts have been concentrated on converting natural language queries into SQL commands to interact with these databases effectively. To facilitate the translation of natural language into SQL queries, several steps are crucial:

Database Description: To help language models (LLMs) generate accurate SQL queries, it’s essential to give them a clear understanding of the database structure (PROMPT).
Few-shot Examples: Improving query accuracy involves providing the LLM with examples of how questions translate into SQL queries. By including standard examples directly in the prompt, the model learns how to formulate queries based on different types of questions. This method guides the model to generate precise SQL commands tailored to specific queries.

This type uses techniques such as QuerySQLDataBaseTool from Langchain:

import os
import getpass
from operator import itemgetter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

# Securely set the OpenAI API key using getpass to prompt for input
# You can read from .env
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Add your own URI here to connect to the SQL database
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
# Print the database dialect to ensure connection
print(db.dialect)
# Print usable table names to ensure connection and check available tables
print(db.get_usable_table_names())

# Run a simple query to ensure connection to the database
db.run("SELECT * FROM Artist LIMIT 10;")

# Initialize the OpenAI language model with specific settings
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create a tool to execute queries on the database
execute_query = QuerySQLDataBaseTool(db=db)

# Create an SQL query chain using the language model and database
write_query = create_sql_query_chain(llm, db)

# Create a prompt template for answering questions based on SQL query results
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)

# Chain to generate the answer based on the query and result
answer = answer_prompt | llm | StrOutputParser()

# Define the full chain of operations
chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        result=itemgetter("query") | execute_query
    )
    | answer
)
# Invoke the chain with a user question
chain.invoke({"question": "How many employees are there"})

Agents

For generating more complex queries, it can be helpful to create few-shot prompts or incorporate query-checking steps. LangChain offers an SQL Agent that allows for more flexible interactions with SQL databases. To set up this agent, we use the create_sql_agent function, which includes the SQLDatabaseToolkit.

from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

# Set OpenAI API key securely using getpass to prompt for input
# you can read it from .env
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Add your own URI here to connect to the SQL database (postgres/mysql/sqllite...)
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
# Initialize the OpenAI language model with specific settings
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an SQL agent executor using the language model and the database connection
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

Text-To-SQL with semantic search Using a dynamic few-shot prompt

To enhance the agent’s performance, we can supply it with a custom prompt that includes domain-specific knowledge. We’ll create a few-shot prompt using an example selector, which will dynamically generate the prompt based on user input. This approach helps the model formulate better queries by incorporating relevant examples into the prompt, providing a useful reference for the model.

from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

# Define a list of example queries and their corresponding SQL queries
examples = [
    {"input": "List all movies.", "query": "SELECT * FROM Movies;"},
    {
        "input": "Find all movies directed by 'Steven Spielberg'.",
        "query": "SELECT * FROM Movies WHERE Director = 'Steven Spielberg';",
    },
    {
        "input": "List all movies in the 'Action' genre.",
        "query": "SELECT * FROM Movies WHERE Genre = 'Action';",
    },
    {
        "input": "Find the average rating of all movies.",
        "query": "SELECT AVG(Rating) FROM Movies;",
    },
    {
        "input": "List all movies released in the year 2020.",
        "query": "SELECT * FROM Movies WHERE Year = 2020;",
    },
]

# Create a semantic similarity example selector from the provided examples
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    FAISS,
    k=5,
    input_keys=["input"],
)

Now, we can create our FewShotPromptTemplate. This template utilizes our example selector, an example prompt to format each example,

from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotPromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)

# Define the system prefix message for the agent
system_prefix = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If the question does not seem related to the database, just return "I don't know" as the answer.

Here are some examples of user inputs and their corresponding SQL queries:"""

# Create a few-shot prompt template using examples
few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"
    ),
    input_variables=["input", "dialect", "top_k"],
    prefix=system_prefix,
    suffix="",
)

# Create a full chat prompt template with system messages and placeholders
full_prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate(prompt=few_shot_prompt),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

# Create an SQL agent with the specified LLM, database, and prompt
agent = create_sql_agent(
    llm=llm,
    db=db,
    prompt=full_prompt,
    verbose=True,
    agent_type="openai-tools",
)
# Invoke the agent with a sample input query
agent.invoke({"input": "How many artists are there?"})

Dealing with High-Cardinality Columns

When filtering columns that contain proper nouns, like addresses, song names, or artists, it’s crucial to double-check the spelling to ensure accurate data filtering.

We can achieve this by creating a vector store that includes all distinct proper nouns in the database. Whenever a user query includes a proper noun, the agent can query this vector store to verify the correct spelling. This ensures the agent understands which entity the user is referring to before constructing the final query.

To begin, we need to gather the unique values for each relevant entity. This involves defining a function that parses the results into a list of elements.

import ast
import re
from langchain.agents.agent_toolkits import create_retriever_tool

# Function to execute a query and parse the results into a list of elements
def query_as_list(db, query):
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return list(set(res))

# Create a vector store from the combined list of artists and albums
vector_db = FAISS.from_texts(artists + albums, OpenAIEmbeddings())

# Create a retriever from the vector store, retrieving the top 5 similar items
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

# Description for the retriever tool
description = """Use to look up values to filter on. Input is an approximate spelling of the proper noun, output is \
valid proper nouns. Use the noun most similar to the search."""

# Create the retriever tool with the specified name and description
retriever_tool = create_retriever_tool(
    retriever,
    name="search_proper_nouns",
    description=description,
)

# System message template for the agent
system = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If you need to filter on a proper noun, you must ALWAYS first look up the filter value using the "search_proper_nouns" tool! 

You have access to the following tables: {table_names}

If the question does not seem related to the database, just return "I don't know" as the answer."""

# Create a prompt template for the agent
prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{input}"), MessagesPlaceholder("agent_scratchpad")]
)
# Create an SQL agent with the given LLM, database, tools, and prompt
agent = create_sql_agent(
    llm=llm,
    db=db,
    extra_tools=[retriever_tool],
    prompt=prompt,
    agent_type="openai-tools",
    verbose=True,
)
# Invoke the agent with a sample input query
agent.invoke({"input": "How many albums does alis in chain have?"})

Conclusion

In today’s world, we aim to communicate with computers naturally, just as we do with people, understanding a wide range of data complexities. Language user interfaces (LUIs) are making this goal attainable, but they encounter challenges depending on the data they process. Langchain powered by Large Language Models (LLMs) provides a solution by converting natural language into queries tailored to various data types. This capability promises to enhance how we interact with and extract insights from structured, unstructured, and semi-structured data.

References

Langchain. (n.d.). Retrieval in SQL. GitHub. Retrieved June 22, 2024, from https://github.com/langchain-ai/langchain/blob/master/cookbook/retrieval_in_sql.ipynb?ref=blog.langchain.dev

Langchain. (n.d.). SQL quickstart. Langchain Documentation. Retrieved June 22, 2024, from https://python.langchain.com/v0.1/docs/use_cases/sql/quickstart/

Langchain. (n.d.). Self-query retriever. Langchain Documentation. Retrieved June 22, 2024, from https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/self_query/?ref=blog.langchain.dev