Does your LLM model understand your entities?

Sirsh Amarteifio
14 min read · Jun 3, 2023


When applying LLMs to your own data, one interesting problem is merging context around known entities in a modular and deterministic way, for example allowing the LLM to know the difference between a purchase order number and a SKU. You are probably already using other databases in your agent ecosystem, but one simple idea I'll discuss here is resolving possible entities via a (possibly distributed) key-value store of entity keys and their descriptions, and then embedding the outputs in the agent context. In a clean solution, an agent should not only parse entities out of inputs but also check the logical consistency of all statements about entities throughout the conversation.

In this article I share some observations using LangChain examples. I'll add three tools: one key-value based, and two for more general questions over structured (SQL) and unstructured (vector store) data. It is the interplay between these that I want to explore.

Today I'm just going to set up the basics with naive defaults, and I will then use this as a running example in future posts to explore other aspects of agent construction.

Let's dig into the following thought process:

final thought process

Contrast a key-value store with the other data stores widely discussed in the context of LLMs, e.g. SQL, graph or vector. Key-value stores, i.e. dictionaries or distributed key-value stores like Redis, are deceptively simple: the query is a precise key or hash that is used to retrieve some data. That's all you can do. This constraint is its strength.
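To make the "possibly distributed" part concrete, here is a minimal sketch of the same lookup pattern against Redis using redis-py; the keys and attribute payloads are invented purely for illustration.

import json
import redis

# hypothetical local client and made-up keys, purely for illustration
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# entities are written once, keyed by their globally unique code
r.set("CAT00", json.dumps({"type": "animal", "legs": 4, "sound": "meow"}))
r.set("PO-12345", json.dumps({"type": "purchase_order", "status": "open"}))

# resolving is a single batched lookup - a key either exists or it does not
def resolve(keys):
    values = r.mget(keys)
    return {k: json.loads(v) if v else None for k, v in zip(keys, values)}

resolve(["CAT00", "PO-12345", "NOT-A-KEY"])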

A “poor-man’s entity resolver” can ask an LLM to parse anything that looks like an entity from the text (names, codes, numbers), pass a comma-separated list of these to the key-value based tool and see what comes back. Anything that comes back can be added to the context. For example, passing a product SKU could provide the name of the product that SKU refers to, as well as some semantic context about how the SKU is constructed. This can all be used to “expand out” the entity in the conversation.

You can try this yourself by creating a simple LangChain agent with a dictionary-based tool, as sketched below. The LLM can pass a comma-separated list of suspected entities to the tool; this much is straightforward. The assumption about keys is that they are globally unique and do not collide across entity types, but we can soften that requirement a little if we need to.
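For example, a minimal sketch of such a dictionary-backed tool might look like this (the entity keys and descriptions are made up for illustration):

from langchain.agents import Tool

# a toy in-memory "entity store" - keys are assumed globally unique
ENTITIES = {
    "SKU-123": "A product SKU. SKUs are built from a product family, a colour code and a size code.",
    "PO-987": "A purchase order number issued by the ordering system.",
}

def resolve_entities(comma_separated_keys: str) -> dict:
    """Take the comma-separated list the LLM passes in and return whatever we know."""
    keys = [k.strip() for k in comma_separated_keys.split(",")]
    return {k: ENTITIES.get(k) for k in keys}

entity_tool = Tool(
    name="Entity Resolution Tool",
    func=resolve_entities,
    description="Pass a comma separated list of suspected entity codes to look up what they mean.",
)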

Going beyond that, when you start chaining this “entity resolver” with other tools, you may not always get the results you want unless you guide the agent to use the tools properly, which brings me to the substance of this article. I'll dive a little deeper into my experience making this useful.

The experiment

To go beyond trivial cases we assume a chain-of-thought sequence with entity enrichment in a zero-shot agent. That is, we avoid any conversational setting, but we also try to go beyond situations where a single tool lookup is enough. We do this to show the gradual expansion of entities in the context of the thought process (or, in future, of a conversation).

The basic setup

The first thing I'll mention is that for any experiments with LLMs we should keep track of trial and error. For that there is Aim. We can use it locally (we can also self-host on K8s) in a few steps: first pip install aim, then run the aim up command to launch the UI in a browser. In my case I'm typically working in Jupyter, and it's convenient to load the extension with %load_ext aim and then, in another cell, run aim up to launch the tracking experience inline. After that, for a given agent, we define the callback and flush as and when we need to…

from langchain.callbacks import AimCallbackHandler

agent = get_agent_somehow()
# I also added .aim to my .gitignore
aim_callback = AimCallbackHandler(
    repo=".",
    experiment_name="My entity memory experiment",
)
aim_callback.flush_tracker(langchain_asset=agent, reset=False, finish=True)
aim UI for tracking experiments

In this setup I have an SQL lookup tool that takes a question, turns it into a DuckDB SQL query and returns some tabular data (in my setup these queries run over parquet files on S3). I will use that as one of the tools in the setup below. I also have a vector store for similarity searches, and the key-value tool which is my primary focus today.

I use the zero-shot agent below, which I would, for example, add into a Slack agent that questions can be asked of. Typically, answering these questions requires compiling context from multiple sources (which might otherwise be tedious to trawl through).

from langchain.agents import ZeroShotAgent, Tool, AgentExecutor
from langchain import LLMChain
from langchain.chat_models import ChatOpenAI

# the prefix and suffix start rather concisely but we can play with this
prefix = """
Answer the question in the context of any entities you observe in the question or in the context.
To answer the question expand any entity codes into their components, and pass all components to the entity resolution tools.
You have access to the following tools:"""

suffix = """Begin! Give terse answers.
Question: {input}
{agent_scratchpad}"""

# this is a good stubborn agent that works hard to get the right answer
# note: `tools` is the list of tools we build and load in the sections below
prompt = ZeroShotAgent.create_prompt(
    tools,
    prefix=prefix,
    suffix=suffix,
    input_variables=["input", "agent_scratchpad"],
)

llm_chain = LLMChain(llm=ChatOpenAI(model_name='gpt-4', temperature=0.0), prompt=prompt)
tool_names = [tool.name for tool in tools]
agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=tool_names)
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)

In the abstract you can then ask (about orders, say)

table = agent_executor.run("What is the [FACT1], [FACT2], [ATTRIBUTE 1], [ATTRIBUTE 2] and also the [ATTRIBUTE 3] of the two most recently cancelled orders?")

Above, so as not to distract you with my particular data (which would mean nothing to you), I use FACT N and ATTRIBUTE N to refer to information that comes from root SQL queries over “fact tables”, followed by entity “attribute” lookups. We will generate some data below to play with. Note that we may have a trivial question directly about an entity, in which case we want an answer of the form ATTRIBUTE X.

To generate sample data you can ask ChatGPT for something representative. Consider the example below; you can try it directly on the ChatGPT web page to get some quick samples, or wait for the call below to return an answer.


import json
import pandas as pd
from langchain.chat_models import ChatOpenAI

# low temperature is just as accurate, just as funny
llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)
# play with the requested number - experiment with the prompt for low numbers
data = llm.predict("""Please generate a dataset in JSON format with 10 items by combining animals, colors and shapes.
Each item should have a code determined by the composition of the animal name, color, and shape using upper case.
When generating the code ensure that each term in the code is exactly 5 letters by zero-padding or truncating the term in the code.
For example if the animal, color and shape are CAT, YELLOW, HEXAGON the code can be CAT00-YELLO-HEXAG.
For each row do the following:
- provide values for 7 "bizarre observations" (numerical, categorical and boolean) such as "number of times has flown to space" or "believes in aliens"
- Add 3 sample attributes for each of the animal, color and shape properties
- Use the values 0 and 1 for boolean columns""")
# depending on the return structure you may need to massage this
# e.g. it could return a list, or something called "dataset" instead of "data"
df = pd.DataFrame(json.loads(data)['data'])
# not shown here: I expand out the observations into facts - see screenshot for result
the full dataset before we take out the attributes and prune the table to just show facts

One of the sample files I generated is here.
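For the massaging and fact expansion mentioned in the comments above, something like the following sketch works, assuming the LLM nested the bizarre observations under an "observations" key per item (the exact key names depend on what the model actually returned). I call the expanded frame dff to match the name used further down.

# a sketch only: the keys here are assumptions about the generated structure
records = json.loads(data)
if isinstance(records, dict):
    records = records.get('data') or records.get('dataset')
df = pd.DataFrame(records)

# expand any nested "observations" dict into flat, prefixed fact columns
if 'observations' in df.columns:
    facts = pd.json_normalize(df['observations'].tolist()).add_prefix('fact_')
    dff = df.drop(columns=['observations']).join(facts)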

What we are trying to simulate here is the generation of a SKU that essentially spans different dimensions. The idea is that we might have facts about the SKU but the dimensions are distributed over different databases, data marts, etc. Or at least, we are pretending that we cannot make one large table and need a federation of tools to answer our questions; otherwise it wouldn't be fun.

Routing and parsing are two important parts of interfacing with the tool. Routing is my term for taking different types of entity keys and sending them to specific non-overlapping maps, for example a SKU goes to one map and an order number goes to another; I won't discuss it properly here, but a crude version is sketched below. Parsing is a standard requirement for dealing with multiple agents using the tool. I'll do some crude things here and consider more elegant approaches in future. For now we will just assume keys are unique, create one big map of codes like `BLUE0` mapped to their attributes, and do some basic text cleaning so the keys make sense on the way in.
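As a crude illustration of routing (the key patterns and maps below are invented), one option is to pattern-match the key format and pick the corresponding map:

import re

# hypothetical non-overlapping maps, one per entity type
SKU_MAP, ORDER_MAP = {}, {}

ROUTES = [
    (re.compile(r"^[A-Z0-9]{5}(-[A-Z0-9]{5}){2}$"), SKU_MAP),  # e.g. CAT00-YELLO-HEXAG
    (re.compile(r"^PO-\d+$"), ORDER_MAP),                      # e.g. PO-12345
]

def route(key: str):
    """Send a key to the first map whose pattern it matches."""
    for pattern, mapping in ROUTES:
        if pattern.match(key):
            return mapping.get(key)
    return None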

# the first tool

def ER_tool_from_data(df):

    mapping = {}
    # based on the format of the dataframe, build a map with values like
    # 'CAT00': {'legs': 4, 'sound': 'meow', 'size': 'small'},
    # 'YELLO': {'rgb': '255,255,0', 'hue': 60, 'complementary': 'BLUE'},
    for record in df.to_dict('records'):
        keys = record['code'].split('-')
        keys_typed = dict(zip(['animal', 'color', 'shape'], keys))
        mapping[record['code']] = {
            "code": record['code'],
            "description": "this is the code for our friend. The code is made of three components or identifiers: Animal, Color and Shape",
        }
        for key_type, key in keys_typed.items():
            attributes = record[f"{key_type}_attributes"]
            attributes['type'] = key_type
            mapping[key] = attributes

    def fetcher(keys):
        # the agent passes a comma-separated string of suspected entity keys
        keys = keys.split(',')
        return {k: mapping.get(k.strip()) for k in keys}

    return Tool(
        name="Entity Resolution Tool",
        func=fetcher,
        description="""use this tool when you need to look up entity attributes or find out what some code or identifier means.
        Do not use this tool to answer questions of a statistical nature.
        You should pass a comma-separated list of known or suspected entities to use this tool.""",
    )

ertool = ER_tool_from_data(dff)
ertool

# TEST IT: go back to the agent-construction code near the top, add this tool
# to the list of tools, rebuild the agent, and then ask it a question
agent_executor.run("How many legs do PIG00 and SNAKE have?")
#> Entering new AgentExecutor chain...
# Thought: I need to find out what PIG00 and SNAKE are and how many legs they have.
# Action: Entity Resolution Tool
# Action Input: PIG00, SNAKE
# Observation: {'PIG00': {'legs': 4, 'sound': 'oink', 'size': 'medium'}, ' SNAKE\n': None}
# Thought: I now know that PIG00 has 4 legs and SNAKE has no legs.
# Final Answer: PIG00 has 4 legs and SNAKE has 0 legs.

This shows the simple entity resolution tool in action. Simple, but, small steps. Now we create the table lookup tool for getting statistics. To do this I use DuckDB. If you are an S3 user you can point it at a parquet file on S3, or you can just use it locally. The reason I'm doing this is that I can get SQL queries in the DuckDB dialect from ChatGPT. You can pip install duckdb and connect with `cursor = duckdb.connect()`; if you are using S3 you also need to install the `httpfs` extension, as shown below.
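For reference, the connection setup is only a few lines; the S3 settings are the standard httpfs options, with placeholder credential values and a placeholder file name:

import duckdb

cursor = duckdb.connect()
# only needed if the parquet files live on S3
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='us-east-1';")
cursor.execute("SET s3_access_key_id='...';")
cursor.execute("SET s3_secret_access_key='...';")

# sanity check: DuckDB can query a parquet file directly by path
cursor.execute("SELECT COUNT(*) FROM 'animal_facts.parquet'").fetchdf()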

Let's save a parquet file from the dataframe above with only the facts retained and the dimensional attributes removed. Here we are careful to create a properly named entity code column and some facts, but we do not refer to the components or their attributes. Later, however, in the entity tool we add a generic lookup for each component code explaining that codes are broken into sub-components. Given the sample fact data below saved as a parquet file, we can use the tool; a short sketch of writing it out follows the screenshot.

the fact table — a subset of the data we got from the LLM
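A minimal sketch of writing that fact-only file, assuming the fact columns carry a fact_ prefix as in the generated sample and reusing the dff frame from earlier (the path is a placeholder):

# keep only the entity code column and the fact columns; drop the dimensional attributes
fact_columns = ['code'] + [c for c in dff.columns if c.startswith('fact_')]
table_path = "animal_facts.parquet"  # or an s3:// path if you set up httpfs
dff[fact_columns].to_parquet(table_path, index=False)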

Then, as shown below, we can make a tool that queries the parquet file via DuckDB.

def duck_tool_using_text_to_sql_for_df(duck_cursor, table_path, df, enums=None):
    # assume this for now
    llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)

    # wrap the question in the prompt - it's the tool's job to make a smart prompt
    # we can do other things like provide enum values etc. for context
    def ask(question):
        prompt = f"""For a table called TABLE with the columns {df.columns}, give me a SQL query for duckdb that answers the question: {question}"""
        query = llm.predict(prompt)
        print(query)
        query = query.replace("TABLE", f"'{table_path}'")
        return duck_cursor.execute(query).fetchdf()

    return Tool(
        name="Stats and data table tool",
        func=ask,
        description="""Use this tool to answer questions about aggregates or statistics or to get sample values or lists of values.
        Do not select any values that are not in the provided list of columns.""",
    )

duck_cursor = duckdb.connect()
# you need to provide a path to the parquet file, which will be swapped in for TABLE in the tool
stats_tool = duck_tool_using_text_to_sql_for_df(duck_cursor, table_path, df)
stats_tool

Test it…


agent_executor.run("How many animals have flown to space?")
#> Entering new AgentExecutor chain...
#Thought: I need to find the number of animals that have flown to space.
#Action: Stats and data table tool
#Action Input: number of animals flown to spaceSELECT COUNT(*) as number_of_animals_flown_to_space
#FROM TABLE
#WHERE fact_times_flown_to_space > 0;
#Observation: number_of_animals_flown_to_space
#...
# Final Answer: 2

What you want to be able to do next is combine these tools for answering more interesting questions….

agent_executor.run("Which animal has flown to space most often? 
What are its animal, color and shape identifiers? Using its shape identifier,
tell me the angle of the anaiml's shape. Lets think step by step.")

Here we are testing if and when the agent consults the right key-value store, and we will iterate on prompts etc. so that it can answer the user's question with minimal effort on their part.

The thought process — how tools are used

The question we gave was quite leading. I want to be able to ask a non-leading question…

agent_executor.run("What is the shape angle for the animal that flew to space most often?")
#The shape angle for the animal that flew to space most often is 120 degrees.

Let's add one more tool to this experiment. Suppose there is some unstructured data of interest, e.g. PDFs, Slack conversations, other documentation. We are going to use LanceDB as the basis of another tool, but you can use your vector store of choice. My reasons for using it are similar to why you would choose DuckDB: it's embedded and easy to use with an S3 or local file system.

First, get some text trivia from Wikipedia; we will use the matrix of types and values we have, e.g. animals, colors and shapes…

from tqdm import tqdm
from langchain.utilities import WikipediaAPIWrapper

wikipedia = WikipediaAPIWrapper()
trivia = []
for record in tqdm(df.to_dict('records')):
    for c in ['animal', 'color', 'shape']:
        trivia.append({"entity_type": c, "entity_key": record[c], 'text': wikipedia.run(record[c])})
trivia = pd.DataFrame(trivia)

I did the tool creation below in a slightly strange way, just to populate the data and generate the tool in one function; this is not how we would do things in real life.

from langchain.vectorstores import LanceDB
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
# pip install this one
import lancedb

def tool_from_trivia_df(df, table_name='trivia_store'):
    # some dataframe that includes a text column
    text_column = 'text'
    loader = DataFrameLoader(df, page_content_column=text_column)
    documents = RecursiveCharacterTextSplitter().split_documents(loader.load())
    embeddings = OpenAIEmbeddings()

    # seed the LanceDB table with a single embedded row to establish the schema
    def probe(_df):
        d = dict(_df.iloc[0])
        d["vector"] = embeddings.embed_query(d[text_column])
        return d

    db = lancedb.connect(LANCE_ROOT)
    table = db.create_table(table_name, data=[probe(df)], mode="overwrite")

    docsearch = LanceDB.from_documents(documents, embeddings, connection=table)
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-4", temperature=0.0),
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
    )

    return Tool(
        name="Trivia and further information tool",
        func=qa.run,
        description="""Use this tool to answer questions about entities when the other tools do not help""",
    )

trivia_tool = tool_from_trivia_df(trivia)
# summaries are slow but require less thought on our part - let's see what we have
trivia_tool("Summarize what you know about cats")

At this point I modified the prompt prefix to propose a clearer strategy, as the agent was over-using the trivia tool as soon as I added it. Of everything in the prompt below, the "Let's take this step by step" line seemed to be the thing that did the trick.

prefix = """
Answer the question in the context of any entities you observe in the question or in the context. Follow this strategy:
- You should typically get context by running the stats and data tool first if you can.
- To answer the question expand any entity codes into their components, and pass all components to the entity resolution tools.
- The Further details tool should only be used to augment the context when other tools do not provide an answer.
Let's take this step by step.
You have access to the following tools:"""

agent_executor.run("What is the shape angle for the animal that flew to space most often and what sound does it typically make and how many of these animals are there expected to be in the world?")

The answer, combining all tools, that I get is…

The shape angle is 120 degrees, the sound it typically 
makes includes meowing, purring, trilling, hissing, growling, and grunting,
and there are an estimated 220 million owned and 480 million stray cats in the world.

Closing thoughts

In-context entity expansion is explored here as a way to detect entities in the conversation context and expand them using a tool that routes keys to a (possibly distributed) key-value store. This approach was chosen for its simplicity: it's easy to test and reason about key-value lookups, and it's easy for even a complex system of microservices to read and write to a distributed key-value store.

One pattern that falls out of this is to keep a base fact table without joins. Simple “root questions” can be asked of it using SQL statements (generated by the LLM), and then the agent chain can make further lookups in the key-value store (sketched below). This is equivalent to having one large denormalized table (or joins) over fact and dimension tables, and you may prefer one approach over the other. I can think of pros and cons either way but I won't list them here; today I just wanted to explore key-value lookups because of their simplicity.
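Outside the agent loop, the same pattern can be composed by hand, which is a handy way to test it; the question and the assumption that the generated SQL returns a code column are illustrative only.

# step 1: a "root question" answered by SQL over the fact table
facts = stats_tool.func("list the codes of the animals that have flown to space")

# step 2: expand the returned entity codes via the key-value tool
codes = facts['code'].tolist()          # assumes the generated query selected a 'code' column
context = ertool.func(",".join(codes))  # same comma-separated contract the agent uses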

Everything I did today was merely to put down a basic foundation on which to build in future posts, where I would like to explore different aspects of the interface, from prompts to tool IO.

In future I want to track the agent's thoughts more systematically:

  • How do we debug the context and the total knowledge about entities?
  • Can we generate open questions in the context to keep pushing for self-consistent answers?

The notebook with all the snippets from above is here.

Links

What are the best practices for table structure and prompts?

Prompting
