Embeddings and Vector Databases

Vlad Rișcuția
13 min read · Aug 18, 2023

This is an excerpt from Chapter 5: Memory and Embeddings from my book Large Language Models at Work. The book is now available on Amazon: a.co/d/4MiwZvX.

What are embeddings

Underneath all machine learning, there are large arrays of numbers. That’s how neural networks represent inputs and weights. Large language models deal with text, so there needs to be some mapping of language to numbers. We would also expect some notion of similarity: the numerical representations of synonyms should be close in value. Large language models encode semantic meaning through this closeness of values.

To develop an intuitive understanding of this, let’s start by imagining a single dimension, meaning we associate a single number with each word. We would expect the distance between the word cat and the word dog to be smaller than the distance between the word cat and the word helicopter. If we plot these on a line, the points representing cat and dog would be closer together, and the word helicopter would be further away. That’s because cats and dogs are pets and are more likely to come up together in similar contexts than next to aircraft.

Of course, a single dimension is not enough to express how closely related words are. Kitten should be as close to cat as puppy would be to dog. And since kitten is to puppy what cat is to dog, the distance between kitten and puppy should be similar to the distance between cat and dog. All still relatively far away from helicopter. Let’s add a second dimension and plot these points, as in the following figure:

Word relationships represented in a 2-dimensional space.

While harder to visualize, large language models represent words as vectors (lists) of numbers. This would be our pets and helicopter example above but extended to a much larger number of dimensions.
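To make the picture concrete, here is a minimal sketch of the two-dimensional example. The coordinates are made up for illustration (real embeddings are learned during training and have far more dimensions), and the distance is simply the straight-line distance between points on the plot.

```python
import math

# Made-up 2-D coordinates, chosen only to illustrate the idea.
# Real embeddings are learned by a model and have many more dimensions.
toy_embeddings = {
    'cat': (1.0, 1.0),
    'dog': (2.0, 1.0),
    'kitten': (1.0, 2.0),
    'puppy': (2.0, 2.0),
    'helicopter': (9.0, 8.0),
}

def distance(word1, word2):
    """Straight-line distance between two words in the toy space."""
    return math.dist(toy_embeddings[word1], toy_embeddings[word2])

print(distance('cat', 'dog'))         # 1.0
print(distance('kitten', 'puppy'))    # 1.0, same as cat to dog
print(distance('cat', 'helicopter'))  # much larger than either
```

In this toy layout, kitten is to puppy what cat is to dog, and helicopter sits far away from all four.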

Definition: Word embeddings are numerical representations of words that capture their meanings and relationships based on how they appear in a text corpus. These embeddings are generated by training a machine learning model on a large amount of text data. The model learns to assign each word a unique vector of numbers, which serves as its embedding.

The values used to represent a word like cat are determined during model training. It’s very likely that cats and dogs show up together more often than cats and helicopters, so the model will capture this relationship by keeping the words closer together.

We call these representations “embeddings” because the model takes a word and embeds it into a vector space. Embedding is not limited to single words only — sequences of text can also be embedded, capturing their meaning in the same vector space.

OpenAI offers a set of models that output embeddings. The latest, text-embedding-ada-002, performs best and is cheapest ($0.0004 per 1000 tokens¹), so that’s the one we’ll be using in this chapter.

The embedding API is simpler than the completion APIs. It has 3 parameters:

  • model – The model we want to use for the embedding, in our case this will be text-embedding-ada-002.
  • input – A string or array of tokens – the text for which we want the embedding.
  • user – An end user ID. This is an optional parameter we can pass in to help OpenAI detect abuse. More on this in chapter 8, when we discuss safety and security. In short, imagine we offer a Q&A application, but a user starts asking inappropriate questions. These would get passed through our code to the OpenAI API. If OpenAI detects abuse, it can be tied to a user ID and help us identify the abuser.

Let’s implement a get_embedding() function.

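import openai

def get_embedding(text):
    # Uses the pre-1.0 openai package interface (openai.Embedding)
    return openai.Embedding.create(
        input=[text.replace('\n', ' ')],
        model='text-embedding-ada-002')['data'][0]['embedding']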

We use the openai library to call the Embedding.create() function. We replace newlines with spaces in the input text and omit the optional user parameter. We won’t cover the full response shape we get from the model. The embedding property of the first item in the returned data array contains our embedding.

The following code calls this function to get the embedding for the word cat:

print(get_embedding('cat'))

If you run this code, you will get a long list of numbers.

We talked about relationships being captured in embeddings and how words that are more closely related are closer together in the vector space. But we haven’t talked about how we measure distance.

There are several formulas to do this. OpenAI recommends using cosine similarity² to measure the distance between two embeddings (i.e., two lists of numbers).

Sidebar: Cosine similarity and cosine distance

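For vectors A and B, the cosine similarity is:

$$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$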

If you haven’t done math in a while the formula might look scary but it isn’t: the top of the fraction is the dot product of vectors A and B, which simply means multiplying the first element of A with the first element of B, the second element of A with the second element of B and so on, and finally adding up all these products.

At the bottom of the fraction, we multiply the length of the two vectors. To compute the length of vector A, we square each of the elements, sum the result, then get the square root of that. We multiply the length of A with the length of B.

Cosine similarity is the cosine of the angle between the vectors. The closer the vectors are in direction, the closer this value will be to 1. The less they have in common, the closer it gets to 0, and it turns negative (down to -1) for vectors that point in opposite directions.

Cosine distance is 1 – cosine similarity. Since we want a measure of distance, we subtract the similarity from 1. The closer the vectors, the closer the distance gets to 0. The further apart the vectors, the larger the distance: 1 for perpendicular vectors and up to 2 for opposite ones.

Here is the implementation of cosine distance:

def cosine_distance(a, b):
    return 1 - sum([a_i * b_i for a_i, b_i in zip(a, b)]) / (
        sum([a_i ** 2 for a_i in a]) ** 0.5 * sum([b_i ** 2 for b_i in b]) ** 0.5)

As described in the sidebar, the first sum is iterating over pairs of elements from a and b and multiplying them together. This is divided by the sum of the squares of all the elements in a raised to the 1/2 power (square root) times the sum of the squares of all the elements in b raised to the 1/2 power. All of this subtracted from 1.
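To get a feel for the values, here is a small sketch that runs cosine distance on three hand-picked pairs of two-dimensional vectors. The function is the same computation as above, split into named steps for readability; the vectors are toy values, not real embeddings.

```python
def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot_product = sum(a_i * b_i for a_i, b_i in zip(a, b))
    length_a = sum(a_i ** 2 for a_i in a) ** 0.5
    length_b = sum(b_i ** 2 for b_i in b) ** 0.5
    return 1 - dot_product / (length_a * length_b)

print(cosine_distance([1, 0], [2, 0]))   # same direction: 0.0
print(cosine_distance([1, 0], [0, 3]))   # perpendicular: 1.0
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction: 2.0
```

Note that the length of the vectors doesn’t matter, only the direction they point in.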

Cosine distance is not the only way of determining how close two embeddings are. Euclidean distance³ is another common option. For vectors normalized to length 1, like the OpenAI embeddings, the two measures rank neighbors identically: the squared Euclidean distance is exactly twice the cosine distance.
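The relationship between the two measures for unit-length vectors can be checked numerically. The sketch below normalizes a few arbitrary toy vectors to length 1 and compares the squared Euclidean distance with twice the cosine distance; the helper names are ours, chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector so its length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine_distance(a, b):
    dot_product = sum(a_i * b_i for a_i, b_i in zip(a, b))
    length_a = math.sqrt(sum(a_i ** 2 for a_i in a))
    length_b = math.sqrt(sum(b_i ** 2 for b_i in b))
    return 1 - dot_product / (length_a * length_b)

def euclidean_distance(a, b):
    return math.dist(a, b)

# Arbitrary toy vectors, normalized to unit length
a = normalize([3, 4])
b = normalize([1, 2])
c = normalize([-2, 1])

for other in (b, c):
    squared_euclidean = euclidean_distance(a, other) ** 2
    twice_cosine = 2 * cosine_distance(a, other)
    print(squared_euclidean, twice_cosine)  # the two values match
```

Because one measure grows whenever the other does, sorting by either gives the same nearest neighbors.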

With the math out of the way, let’s see why this is so important.

Memory based on embeddings

A key challenge with large language model memory is identifying which data we should add to the prompt. This is where embeddings become very useful. Instead of relying on some other indexing schema, we compute the embedding of each chunk of data we have. When the user provides some input, we compute the embedding of the input. We then use the cosine distance to see which of our pieces of data is close to what the user is asking and pull that data into the prompt.

We’ll use the Pod Racing dataset from here: llm-book/code/racing at main · vladris/llm-book (github.com).

First, we’ll get the embedding vectors for each of our text files, and store these in a JSON file as pairs of file path: embedding. Here’s how to do this:

import json
import os

embeddings = {}

# Assumes get_embedding() from the earlier listing is defined or imported
for name in os.listdir('./racing'):
    path = os.path.join('./racing', name)
    with open(path, 'r') as f:
        text = f.read()

    embeddings[path] = get_embedding(text)

with open('embeddings.json', 'w') as f:
    json.dump(embeddings, f)

We read each text file in our Pod Racing dataset (league.txt, race1.txt, race2.txt, race3.txt, race4.txt and race5.txt). We call get_embedding() to produce the embedding of the text, then store this in a dictionary keyed by the file path. Finally, we save the dictionary as embeddings.json.

Now let’s see how our Q&A solution works. Here is a simple Q&A chat with embeddings. To keep things simple, we use a chat without memory. That means we won’t be able to ask the large language model follow up questions.

import json

from llm_utils import ChatTemplate

# get_embedding() and cosine_distance() are the functions defined earlier

with open('embeddings.json', 'r') as f:
    embeddings = json.load(f)

def nearest_embedding(embedding):
    # Cosine distance can exceed 1, so start from infinity
    nearest, nearest_distance = None, float('inf')

    for path, embedding2 in embeddings.items():
        distance = cosine_distance(embedding, embedding2)
        if distance < nearest_distance:
            nearest, nearest_distance = path, distance

    return nearest

chat = ChatTemplate(
    {'messages': [{'role': 'system', 'content': 'You are a Q&A AI.'},
                  {'role': 'system', 'content': 'Here are some facts that can help you answer the following question: {{data}}'},
                  {'role': 'user', 'content': '{{prompt}}'}]})

while True:
    prompt = input('user: ')
    if prompt == 'exit':
        break

    context = nearest_embedding(get_embedding(prompt))
    with open(context, 'r') as f:
        data = f.read()

    message = chat.completion(
        {'data': data, 'prompt': prompt}).choices[0].message

    print(f'{message.role}: {message.content}')

The sample uses the ChatTemplate class introduced in the previous article N-shot Learning, which is a wrapper over OpenAI’s chat completion API.

First, we load the embeddings from embeddings.json. We then implement a nearest_embedding() function which, given an embedding, iterates over the embeddings in our dataset and returns the nearest key. In this case, the key is the path to the file.

Our chat template has two parameters: data and prompt. data is information we retrieve from memory. prompt is the user question.

Putting it all together, we take an input question from the user as prompt, we call get_embedding() to get the embedding of the question, and find the nearest embedding in our memory. That is the data that’s most closely related to the question. nearest_embedding() gives us the path to the file, so we open the file and read it into data.

Finally, we prompt the large language model, providing the relevant data and user question. We print the response.

An example interaction:

user: What happened to Senn Kava during the Genosis Challenge?
assistant: During the Genosis Challenge Pod Racing race, Senn Kava encountered
an unexpected technical glitch during a crucial stage of the race. A malfunction
in the pod's stabilization system caused Kava to lose control momentarily,
resulting in a brief deviation from the racing line. Despite this setback,
Kava's skillful recovery and determination enabled them to regain momentum and
finish in third place.
user: In which race did a pilot find an uncharted shortcut?
assistant: Kael Voss, the pilot of the Razor Blade, found an uncharted shortcut
during the Coruscant Circuit Pod Racing race. He veered off the traditional
track, navigating through a perilous maze of lower city alleys and weaving
through oncoming traffic in an attempt to gain an advantage over the other

Note we get correct answers to the questions, based on the facts in our text files, even though we decide which facts to add to the prompt at runtime based on the question.

For example, for the second question (in which race did a pilot find an uncharted shortcut?), the content of race2.txt was passed in as data to the prompt. That’s the only file that contains the information the model needed to produce the response.

This is the key takeaway for why we want to use embeddings: the second question didn’t name a specific pilot or race, so it would’ve gotten complicated if we wanted to devise some other mechanism to retrieve the right data. Using embeddings, race2.txt turned out to be the nearest to the user question.


We looked at Q&A, but this is not the only application for memory/embeddings. Going back to our first example in this chapter, a chat bot, we said that storing the conversation in a list and popping off old items might make the model lose context. We could instead store each line of the conversation as an embedding and be smart about which items we retrieve that are relevant to the current topic.
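A conversation memory along these lines could look like the sketch below. The ConversationMemory class is a hypothetical helper, not code from the book. It takes the embedding function as a parameter, so in a real application we would pass in a function that calls an embedding model; here we pass a crude stand-in (word counts over a tiny hand-picked vocabulary) so the example runs offline.

```python
def cosine_distance(a, b):
    dot_product = sum(a_i * b_i for a_i, b_i in zip(a, b))
    length_a = sum(a_i ** 2 for a_i in a) ** 0.5
    length_b = sum(b_i ** 2 for b_i in b) ** 0.5
    return 1 - dot_product / (length_a * length_b)

class ConversationMemory:
    """Hypothetical helper: stores conversation lines with their embeddings."""

    def __init__(self, embed):
        self.embed = embed  # function mapping text to a list of numbers
        self.items = []     # (text, embedding) pairs

    def add(self, text):
        self.items.append((text, self.embed(text)))

    def recall(self, query, n=2):
        """Return the n stored lines nearest to the query."""
        query_embedding = self.embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_distance(query_embedding, item[1]))
        return [text for text, _ in ranked[:n]]

# Stand-in for a real embedding model: counts of a few hand-picked words
VOCABULARY = ['race', 'pilot', 'engine', 'weather', 'rain']

def toy_embed(text):
    words = text.lower().replace('?', '').replace('.', '').split()
    # The trailing 1 keeps the vector non-zero when no vocabulary word matches
    return [words.count(word) for word in VOCABULARY] + [1]

memory = ConversationMemory(toy_embed)
memory.add('The pilot repaired the engine.')
memory.add('Rain is expected, bad weather ahead.')
memory.add('The race starts at noon.')

print(memory.recall('Which pilot fixed an engine?', n=1))
# ['The pilot repaired the engine.']
```

Only the recalled lines would then be added to the prompt, instead of the most recent ones.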

Another application is prompt selection: remember in chapter 3 we used a selection prompt, telling the large language model which prompts it has available and selecting between these (in our example we had a lawyer prompt and a writer prompt). We could instead use the distance between the embedding of the ask and the embeddings of the available prompts to select the best prompt. The embedding of the lawyer prompt should be closer to that of an NDA document than the embedding of the writer prompt.
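Here is a minimal sketch of embedding-based prompt selection. The select_prompt() function is a hypothetical helper, and the three-dimensional vectors are hand-written stand-ins; in practice both the ask and the prompts would be embedded with a real model.

```python
def cosine_distance(a, b):
    dot_product = sum(a_i * b_i for a_i, b_i in zip(a, b))
    length_a = sum(a_i ** 2 for a_i in a) ** 0.5
    length_b = sum(b_i ** 2 for b_i in b) ** 0.5
    return 1 - dot_product / (length_a * length_b)

def select_prompt(ask_embedding, prompt_embeddings):
    """Return the name of the prompt whose embedding is nearest to the ask."""
    return min(
        prompt_embeddings,
        key=lambda name: cosine_distance(ask_embedding, prompt_embeddings[name]))

# Hand-written vectors standing in for real embeddings
prompt_embeddings = {
    'lawyer': [0.9, 0.1, 0.2],
    'writer': [0.1, 0.9, 0.3],
}
nda_review_ask = [0.8, 0.2, 0.1]

print(select_prompt(nda_review_ask, prompt_embeddings))  # lawyer
```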

A few more scenarios:

  • Personalized recommendations: We can store user preferences and history and use it to generate custom-tailored recommendations.
  • Interactive storytelling: Use external memory to store the story context, character information etc. and generate a coherent story across multiple interactions.
  • Healthcare: Store medical knowledge and patient data as external memory and retrieve the relevant data to help with diagnosis and treatment.
  • Legal research: Store legal precedents, case law, documents etc. as external memory and retrieve the relevant data to help with case analysis and legal advice.
  • Education: Store course content, student preferences and learning history to provide customized e-learning powered by AI.

There are many applications that would be impossible to pull off without a solid memory story. Another very important scenario we’ll cover in depth in the next chapter is interacting with external systems, where again memory plays a crucial role in storing the available interactions.


To keep things simple, we’ve been dealing with toy examples. Let’s talk a bit about scaling. Our approach was to create an embedding for each of our text files, then pick the nearest one to the user question. We can refactor this to scale in several ways.

First, it’s up to us how large we want a unit of data to be. It can be a whole file, or a paragraph. Depending on our specific scenario and dataset, we can choose different ways to split things up.
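As one possible way to split things up, the sketch below breaks a text into paragraphs on blank lines and gives each chunk its own key, so each paragraph can be embedded separately. The function names and the path#index key format are assumptions made for this example.

```python
import re

def split_into_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    paragraphs = re.split(r'\n\s*\n', text)
    return [p.strip() for p in paragraphs if p.strip()]

def chunks_by_key(path, text):
    """Map a hypothetical 'path#index' key to each paragraph of a file."""
    return {
        f'{path}#{i}': paragraph
        for i, paragraph in enumerate(split_into_paragraphs(text))
    }

sample = ('First paragraph, line one.\nFirst paragraph, line two.\n\n'
          'Second paragraph.\n\n\nThird paragraph.')

for key, chunk in chunks_by_key('race1.txt', sample).items():
    print(key, repr(chunk))
```

Smaller chunks make retrieval more precise but carry less surrounding context, so the right size depends on the dataset.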

We are also not limited to just injecting one piece of data into the prompt. If we’re dealing with a much larger dataset, we can look for the top n nearest embeddings and pass all of them in.
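Retrieving the top n nearest embeddings is a small change to the nearest-one search. The sketch below is one way to do it, using heapq.nsmallest from the standard library; the nearest_embeddings() name and the two-dimensional toy vectors are ours, for illustration only.

```python
import heapq

def cosine_distance(a, b):
    dot_product = sum(a_i * b_i for a_i, b_i in zip(a, b))
    length_a = sum(a_i ** 2 for a_i in a) ** 0.5
    length_b = sum(b_i ** 2 for b_i in b) ** 0.5
    return 1 - dot_product / (length_a * length_b)

def nearest_embeddings(embedding, embeddings, n=3):
    """Return the keys of the n stored embeddings nearest to the given one."""
    return heapq.nsmallest(
        n, embeddings,
        key=lambda path: cosine_distance(embedding, embeddings[path]))

# Toy 2-D vectors standing in for real embeddings
toy_embeddings = {
    'a.txt': [1, 0],
    'b.txt': [0.9, 0.1],
    'c.txt': [0, 1],
    'd.txt': [-1, 0],
}

print(nearest_embeddings([1, 0.05], toy_embeddings, n=2))
# ['a.txt', 'b.txt']
```

The texts behind the returned keys can then be concatenated and passed into the prompt together, as long as they fit within the model’s token limit.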

Finally, finding the nearest embedding in a huge dataset can become a bottleneck. Let’s address this next.

Vector databases

Our toy dataset is tiny enough that we have no problem iterating over all 6 embeddings and comparing them with the question embedding, but if we have a million embeddings, things can get much slower.

Finding the nearest embeddings in a huge dataset can be done very efficiently using a vector database.

Definition: A vector database is a specialized type of database designed to store and efficiently retrieve vector data. Unlike traditional databases that primarily handle structured data (e.g., tables with rows and columns), vector databases are optimized for managing and querying high-dimensional vector representations.

Let’s throw a vector database in the mix to see how we can use that for storage and retrieval.

For this example, we’ll use Chroma⁴, an open-source vector database. First, let’s install it using the Python package manager:

pip install chromadb

We will create a new database, embed our Pod Racing facts there, then reimplement our Q&A chat to leverage Chroma:

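import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions
import os

# Persist as Parquet files using DuckDB, in the racingdb folder
client = chromadb.Client(Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory='racingdb'))

# Assumption: the API key is read from the OPENAI_API_KEY environment variable
collection = client.create_collection(
    'pod-racing',
    embedding_function=embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ['OPENAI_API_KEY'],
        model_name='text-embedding-ada-002'))

for name in os.listdir('../racing'):
    path = os.path.join('../racing', name)
    with open(path, 'r') as f:
        text = f.read()

    text = text.replace('\n', ' ')

    collection.add(documents=[text], ids=[path])

client.persist()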

We first create a new database client using chromadb.Client(). The API supports various settings, can run in client/server mode etc., but we’ll keep things simple and run an in-memory instance. We’ll want to save to disk, so we’re specifying the format in which we want the data to be persisted (in this case as Parquet files using DuckDB⁵) and the path to the folder where we want the data saved. That’s the racingdb folder.

Next, we create a collection of documents named pod-racing. Chroma comes with a bunch of different embedding functions, the default being a built-in mini-model. We’ll use the OpenAI embedding (from text-embedding-ada-002) to keep parity with our previous example.

I noticed that using the built-in embedding produces worse results. For example, it doesn’t retrieve the right documents for some queries (due to different embeddings and distances). This is expected: the better the model, the better the embedding, and a mini-model running locally can’t compete with a large language model running in the cloud.

Chroma provides several embedding functions, including OpenAIEmbeddingFunction(), to which we need to pass an API key.

We iterate over our dataset and add each document to the collection using collection.add(). We need to provide the documents and unique ids. The documents get automatically converted to vectors using the embedding function we configured.

Finally, we save the database to disk by calling client.persist(). This should create a racingdb folder inside the current folder, containing the persisted data.

Let’s now reimplement our chat using the Chroma API:

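import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings
from llm_utils import ChatTemplate
import os

# Load the data persisted in the racingdb folder
client = chromadb.Client(Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory='racingdb'))

# Assumption: the API key is read from the OPENAI_API_KEY environment variable
collection = client.get_collection(
    'pod-racing',
    embedding_function=embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ['OPENAI_API_KEY'],
        model_name='text-embedding-ada-002'))

chat = ChatTemplate(
    {'messages': [{'role': 'system', 'content': 'You are a Q&A AI.'},
                  {'role': 'system', 'content': 'Here are some facts that can help you answer the following question: {{data}}'},
                  {'role': 'user', 'content': '{{prompt}}'}]})

while True:
    prompt = input('user: ')
    if prompt == 'exit':
        break

    # The result holds one list of documents per query text
    data = collection.query(query_texts=[prompt], n_results=1)[
        'documents'][0][0]

    message = chat.completion(
        {'data': data, 'prompt': prompt}).choices[0].message

    print(f'{message.role}: {message.content}')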

Like in the listing we saw earlier, we’re using a chat template in which we inject relevant data and the user prompt.

The difference is we don’t use our embeddings.json and handcrafted nearest_embedding() function. Instead, we use chromadb. We create a client to use the racingdb subfolder and DuckDB plus Parquet for storage. This should load the data we persisted.

We then call get_collection(), again passing the OpenAI embedding function configured with our API key.

The chat template and the interactive loop are the same as before, except that instead of nearest_embedding() we now call collection.query() to retrieve the relevant data. The query text is the user prompt, since that’s the text we want to compute an embedding for and find the nearest stored embeddings. The n_results parameter tells Chroma how many nearest documents to retrieve; we only ask for one.

We won’t cover the API in depth. Suffice it to say it returns a dictionary which includes, among other things, the original text we embedded. We’ll save this as data.
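To illustrate the shape of that dictionary, here is a sketch using a hand-written stand-in rather than a live Chroma call. The query result holds one inner list per query text for each of its keys (such as ids, documents and distances, among others). The values below are invented placeholders, and documents_for_first_query() is a hypothetical helper.

```python
# Hand-written stand-in shaped like the dictionary returned by
# collection.query() for one query text and n_results=2.
# The values are invented placeholders.
sample_result = {
    'ids': [['../racing/race2.txt', '../racing/league.txt']],
    'documents': [['Text of race2.txt ...', 'Text of league.txt ...']],
    'distances': [[0.21, 0.34]],
}

def documents_for_first_query(result):
    """Return the documents retrieved for the first query text."""
    return result['documents'][0]

# With n_results greater than 1, the documents can be joined
# into a single string before being injected into the prompt
data = '\n\n'.join(documents_for_first_query(sample_result))
print(data)
```

With n_results=1 there is a single document in the inner list, which is why indexing with [0][0] yields the text directly.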

Vector databases

As we just saw, vector databases abstract away the embedding and retrieval of information. This allows us to scale when implementing complex solutions over large datasets. We used Chroma in our example. There are several other options to consider:

Some other, more established and well-known databases also have vector search capabilities:

The full chapter includes a discussion on simple list and key-value memories, embeddings and vector databases, and a novel approach covered in the paper [2304.03442] Generative Agents: Interactive Simulacra of Human Behavior (arxiv.org).