LLMs, Lakes and RAG Stores

Sirsh Amarteifio
6 min readOct 6, 2023


I discuss a RAG setup that leans heavily on data lakes (S3 blobs) and Pydantic typing, along with some simple patterns I have been experimenting with.

some sort of metaphor

I'm working on an experimental tool to send log files to Retrieval Augmented Generation (RAG) stores. As part of that general experiment here, I need to ingest data into different types of stores that I can build LLM-supporting tools from. In the spirit of LLMOps, I'm thinking not just about solving the problem for a narrow use case but about providing generic tooling for storing data and routing agents to different tools. A big part of that is what I touched on in the article Does your LLM Understand your Entities. In this world we have different data modalities (vector, columnar, graph, key-value) and we have entities or topics of interest. This is the main position I'm taking here:

  • Lean on Pydantic to describe ingestion of data
  • Lean on blob-storage based solutions to store data

For experimenting, it is extremely convenient if your data are simply stored in blob storage — I’ll discuss why and when this might be useful.

When I first started messing around with LLMs there were many different vector stores (I've tried a whole bunch!), many patterns for tools connecting to many different types of data, and many agent patterns. One thing I can say for sure about the LLM space is that it's very noisy! So I wanted to cut through the noise and just pick some simple storage solutions and patterns for feeding my data in. In the LLMOps spirit, I wanted to make it easy for other developers to take advantage of what I was doing. The tools I'm using all rely on S3 or emphasize typed interfaces:

  • LanceDB for vector stores (S3 backend)
  • DuckDB for columnar store (S3 backend)
  • Polars for dataframes (parquet on S3)
  • Pydantic for all interfaces

I also use other databases like Redis and Neo4j for key-value and graph types.

In the monologue repo that I am setting up for a more general experiment (it's very much in a WIP state), I have created some stores. These are designed to be typed using, for example, these entity types — I show in the notebooks how I import data from open datasets. I will continue to add more datasets to explore different types of RAG ingestion and querying scenarios — you can think of monologue as a testbed for RAG systems.

All data stores follow the same pattern: they are typed using Pydantic entities, they can be used as a Langchain tool (so you can ask questions of them), and they have a method to ingest Pydantic objects into their underlying storage.
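To make the pattern concrete, here is a minimal sketch of that shared shape (illustrative only, not the repo's actual base class; the concrete stores differ in how add and __call__ hit their backend):

from typing import List, Type
from pydantic import BaseModel


class StoreSketch:
    """Illustrative only: constructed from a Pydantic type, ingests instances
    of that type, and can be called with a natural-language question."""

    def __init__(self, entity_type: Type[BaseModel]):
        self.entity_type = entity_type

    def add(self, records: List[BaseModel]):
        # each concrete store (vector, columnar, key-value) writes to its own backend
        raise NotImplementedError

    def __call__(self, question: str):
        # answer a question over the stored data, typically via an LLM
        raise NotImplementedError

    def as_tool(self):
        # wrap the store as a Langchain tool so an agent can route questions to it
        raise NotImplementedError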

The VectorDataStore wraps LanceDB to ingest data. I use the Pydantic type to indicate which embedding to use, and this is mapped to a PyArrow schema to create the table. You will see a theme emerge: everything is controlled via Pydantic types!

We can create a store and add data to it…

store = VectorDataStore(Places)
#use a pandas dataframe of an open dataset in this example (see notebooks)
records = [Places(**d) for d in data.to_dict('records')]
store.add(records)

We can add data to the columnar and other stores using the same interface.

It is worth taking a moment to talk about embeddings and vector lengths. It took me a while to understand how tools like LlamaIndex work with different embeddings and how LanceDB would represent them. As I toggled between OpenAI and Instruct embeddings, I realized that queries are pushed down from LlamaIndex to LanceDB (see here), and the LanceDB ANN query assumes a particular vector size, so it is important to initialize the tooling with the right embedding. You can see in the constructor of the VectorDataStore that I determine the embedding from the Pydantic type and pass it through. Then in the Pydantic types I define a type per embedding, using inheritance of fields where useful. I use this when generating a PyArrow schema from the Pydantic type (see the AbstractEntity base class). This is how the LanceDB table is created and subsequently updated.
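For concreteness, the PyArrow schema that the LanceDB table ends up with looks roughly like this (a sketch that hard-codes the OpenAI ada-002 length of 1536; the repo derives the length from the Pydantic type instead of hard-coding it):

import pyarrow as pa

OPEN_AI_EMBEDDING_VECTOR_LENGTH = 1536  # ada-002; Instruct embeddings use a different length

# the vector column must be a fixed-size list so LanceDB's ANN index knows its dimension
schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field("text", pa.string()),
        pa.field("doc_id", pa.string()),
        pa.field("id", pa.string()),
        pa.field("vector", pa.list_(pa.float32(), OPEN_AI_EMBEDDING_VECTOR_LENGTH)),
    ]
)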

There is a lot to think about relating to schema evolution here, which I defer.

When we ingest data into the store, we can ask questions of it directly or as a tool in an agent chain.

store("What can you tell me about the civil airport in East Elmhurst Queens?")

AWS S3 is the main protagonist of this story, with Pydantic as a faithful sidekick. It is extremely easy to read and write files such as parquet, generate and play with embeddings, and query columnar data. For this I have used LanceDB, DuckDB and Polars. Honestly, I love how easy these tools are to use. Take for example the following snippet that I use in the ColumnarStore:

import polars as pl

def get_query_context(uri, name):
    """
    Get a Polars SQL context with the entity's data registered under its name.
    (`read` is the repo's helper that lazily reads the parquet file from S3.)
    """
    ctx = pl.SQLContext()
    ctx.register(name, read(uri, lazy=True))
    return ctx

Suppose we have ingested data into our store. Ingestion just means saving (or merging) a parquet file to S3 (or wherever), with Pydantic controlling our types. The code above creates a SQL context that lets me write SQL against my entity, which here is linked to an S3 file. You can then write queries like SELECT * FROM ENTITY_NAME. DuckDB also allows you to do this, but those two lines of code are extremely neat.
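Querying the registered entity is then a one-liner (the URI and table name below are placeholders):

ctx = get_query_context("s3://my-bucket/places.parquet", "Places")
df = ctx.execute("SELECT * FROM Places LIMIT 5").collect()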

You will see in the as_tool method that the ColumnarDataStore uses DuckDB to answer questions. I ask the LLM for an SQL query to answer questions given a schema and then push down the queries to DuckDB.

store = ColumnarDataStore(NycTripEvent)
store("What is the least popular destination in New York City? Who has travelled there?")

Also with respect to the Pydantic theme, I played a little with Zero-Shot agents with Pydantic response formats, and it required a little prompt-fiddling. It's a WIP in the repo, under the agents module, called `BasicTypedResponseToolUsingAgent`.
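The core of the trick is just asking the agent for JSON that matches the model's schema and then validating it with Pydantic. A rough sketch of that step, assuming Pydantic v2 (the response type here is hypothetical, not from the repo):

import json
from pydantic import BaseModel

class PlaceAnswer(BaseModel):
    # hypothetical response type, purely for illustration
    name: str
    summary: str

# appended to the prompt so the agent knows what shape to answer in
format_instructions = (
    "Respond only with JSON matching this schema:\n"
    + json.dumps(PlaceAnswer.model_json_schema())
)

def parse_typed_response(raw: str) -> PlaceAnswer:
    # validate the raw LLM output against the Pydantic type
    return PlaceAnswer.model_validate_json(raw)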

Other examples

The repo contains a bunch of things you can check out in the code or in the notebooks (feel free to ping me with any questions). One pattern is the idea of dynamic types. Because everything is controlled via Pydantic types (that is how tables are named, how data is routed, and so on), I sometimes need to be able to take some "anonymous" data and send it to a particular store. For that we have

from monologue.core.data.clients import WikiWalker
from monologue.entities.examples import AbstractVectorStoreEntry

# dynamically create a typed entity for "anonymous" data
generic_topic = AbstractVectorStoreEntry.create_model("GeneralTopics")

store = VectorDataStore(generic_topic)
collection = [generic_topic(**record) for record in WikiWalker().iter_sections("Philosophy")]
store.add(collection)
store("Where did the word Philosophy come from?")

When we generate embeddings, we are choosing a particular flavour, and I use Pydantic types to say which. For example, the default uses an OpenAI embedding, and in the repo I have some examples that use the Instruct embeddings. Notice the `fixed_size_length` — this is important because the PyArrow schema is defined with it. You can see how I map Pydantic to PyArrow in the AbstractEntity type in the repo.

from typing import List, Optional
from pydantic import Field

# AbstractEntity and OPEN_AI_EMBEDDING_VECTOR_LENGTH come from the repo
class AbstractVectorStoreEntry(AbstractEntity):
    name: str = Field(is_key=True)
    text: str = Field(long_text=True)
    doc_id: Optional[str]
    vector: Optional[List[float]] = Field(
        fixed_size_length=OPEN_AI_EMBEDDING_VECTOR_LENGTH
    )
    id: Optional[str]
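Switching embeddings is then a subclass per embedding, overriding just the vector field. A sketch of what an Instruct-flavoured variant could look like (the class name and the 768 length are my assumptions, not copied from the repo):

INSTRUCT_EMBEDDING_VECTOR_LENGTH = 768  # typical Instructor embedding size

class InstructVectorStoreEntry(AbstractVectorStoreEntry):
    # override only the vector field so the generated PyArrow schema gets the right fixed size
    vector: Optional[List[float]] = Field(
        fixed_size_length=INSTRUCT_EMBEDDING_VECTOR_LENGTH
    )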

It is convenient to create Pydantic models from dataframes (CSV, parquet, Feather, etc.). What I normally do is ask OpenAI/ChatGPT to generate the Pydantic type from sample rows, save it, and then ingest into that type with:

AbstractStore.ingest_records(uri, MyType)
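To show the shape of that workflow end to end: say we have a trips-style CSV; the generated type and the ingest call might look like this (the class name, columns and URI are all illustrative):

from typing import Optional
from pydantic import Field

class TripRecord(AbstractEntity):
    # hypothetical fields a ChatGPT-generated model might contain
    trip_id: str = Field(is_key=True)
    pickup_zone: str
    dropoff_zone: str
    fare_amount: Optional[float]

AbstractStore.ingest_records("s3://my-bucket/trips.parquet", TripRecord)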

The EntityDataStore uses key-value lookups. This is useful if the query is a list of keys that an agent wants to learn more about: we can parse out a comma-separated list of values and pass them to the tool (GPT is good at spotting entities). In this example I fetch some samples from the columnar store and add them to the entity store. Pydantic tells the store that the key is the user_id field.

#fetch examples
examples = ColumnarDataStore(BookReviewers).fetch_entities()
#create an entity store for the same type
store = EntityDataStore(BookReviewers)
#add using the common interface for adding entities
store.add(examples)
#this tool accepts one or more keys
store("A12D7NI0VTFF3T,A12DH4IYORSCP6")
#returns some dictionaries with these records - this is an Entity Lookup
#these results are expected to be embedded in an agent chain context
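The tool side of this is little more than splitting the keys and doing lookups against the key-value backend. A rough sketch of the idea (the Redis usage and key format are my assumptions):

import json
import redis

def entity_lookup(keys_csv: str, entity_name: str, r: redis.Redis) -> list:
    """Sketch: split a comma-separated key list and return the stored records as dicts."""
    results = []
    for key in (k.strip() for k in keys_csv.split(",")):
        raw = r.get(f"{entity_name}:{key}")  # e.g. "BookReviewers:A12D7NI0VTFF3T"
        if raw:
            results.append(json.loads(raw))
    return results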
