Vector Databases: A Beginner’s Guide

A no-nonsense introduction to vector databases and embeddings, plus a code example with Chroma

Raj Pulapakura
6 min read · Jan 9, 2024

Unless you’ve been living under a rock, you’ve probably heard of vector databases. The sheer speed of innovation in the AI world can make it hard to keep up with the latest buzzwords and technologies, such as embeddings, vectors, and vector databases. That’s why I’ve written this article, to bring you up to speed with these concepts in a beginner-friendly way.

Vectors? Embeddings?

Vectors are lists of numbers. Embeddings are vectors that have rich, machine-understandable information baked into them.

We can take data such as images, text and audio and convert them into embeddings.

These embeddings are learned by a model, so while we humans can’t read these numbers directly, baked into them are the features of the original content, which machines can understand. Embeddings are like the five senses for machines: a portal through which they can comprehend the real world, in a language they understand (numbers).

Embeddings are powerful because they are able to capture the meaning and core information of unstructured data in a way that machines can easily understand.

A cool property of embeddings is that they preserve the similarity of the original content. For example, the embeddings of two labrador images will be similar, while the embedding of a labrador and the embedding of a hammer will be wildly different.

Embeddings of similar content are physically close to each other in the embedding space.

There are many ways of measuring the “similarity” or “closeness” of embeddings, including cosine distance, Euclidean distance and the dot product. For the distance metrics, the smaller the distance between two embeddings, the more similar they are (for the dot product, it’s the reverse: a larger value means more similar).
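To make these metrics concrete, here’s a small sketch computing all three with NumPy on two made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Two toy "embeddings" (real ones have hundreds of dimensions)
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.5, 1.8, 3.2])

# Dot product: larger means more similar (for vectors of comparable magnitude)
dot = np.dot(a, b)  # 14.7

# Euclidean distance: smaller means more similar
euclidean = np.linalg.norm(a - b)  # ≈ 0.574

# Cosine distance = 1 - cosine similarity: smaller means more similar
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # ≈ 0.009

print(dot, euclidean, cosine)
```

Notice that by every metric these two vectors come out as very similar, which matches the intuition that they point in nearly the same direction.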

Embeddings are the reason convolutional neural networks are able to understand visual content and LLMs are able to understand natural language the way a human does.

How do we store these embeddings?

Embeddings are not your typical kind of data. When choosing a storage solution for embeddings, our main priority is fast retrieval and fast similarity search (we’ll talk more about use cases later). Our storage solution should be able to handle high-dimensional data, and should facilitate fast retrieval of these embeddings.

Hey, I heard that relational databases have fast retrieval! Yes, relational databases enable fast retrieval through indexes, but these indexes rely on human-understandable features, which embeddings lack.

What about document databases? Or graph databases? While these databases have their own advantages and use cases, they’re not built for embeddings.

We need a tailored solution.

Vector Databases!

A vector database is a structure specifically built to store embeddings. You can think of a vector database as a pool of embeddings.

Vector databases facilitate fast retrieval and similarity search.

Given a query vector, a vector database can quickly find similar embeddings.

But how do vector databases find similar embeddings? We talked about similarity metrics such as cosine distance, dot product and Euclidean distance. For a database with a few embeddings, we can calculate the distances between the query vector and all the embeddings in the database, and return the ones with the lowest distances. But for a vector database with millions of embeddings, each with hundreds or thousands of dimensions, doing a linear search through all of them is prohibitively slow.
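The linear (brute-force) approach is worth seeing once, because it’s exactly what the clever algorithms below are trying to avoid. This sketch uses random vectors as stand-in embeddings; the sizes are made up for illustration:

```python
import numpy as np

def brute_force_search(query, embeddings, k=3):
    # Exact nearest-neighbor search: compare the query against EVERY
    # stored embedding. Fine for thousands of vectors, far too slow
    # for millions -- the cost grows linearly with the database size.
    distances = np.linalg.norm(embeddings - query, axis=1)
    # Indices of the k smallest distances, in ascending order
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128))  # 10k embeddings, 128 dims
query = rng.normal(size=128)

ids, dists = brute_force_search(query, embeddings)
print(ids, dists)
```

Every query touches all 10,000 vectors; at millions of vectors and thousands of dimensions, this per-query cost is what vector databases are built to sidestep.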

For this reason, several algorithms exist that enable vector databases to perform quick retrieval and similarity search. This is an ongoing area of research, with more efficient algorithms regularly being proposed. Most vector databases use a combination of algorithms to enable ultra-fast retrieval; however, these algorithms often come with an accuracy-speed tradeoff.

Locality-Sensitive Hashing (LSH)

One of these algorithms, LSH (Locality-Sensitive Hashing), sorts the embeddings into buckets using a hashing function, where each bucket contains similar vectors. You can think of one bucket containing vectors for dogs, another bucket containing vectors for cats, etc.

LSH uses a hash function to group similar vectors into buckets

When a query comes in, the hashing function maps the query vector to one of the buckets. If the query vector was for a dog, the hashing function would route the query vector to the “dog bucket”. Once the bucket is found, a linear search is performed to find the candidates which are most similar to the query.

A query vector is mapped to a bucket of similar vectors.

The reason this works is because the hashing function is designed to group similar vectors into the same bucket. This method is much faster than searching the entire database as there are far fewer vectors in each bucket than in the entire database. However, LSH is an approximate search because it doesn’t look at all the vectors, so it might overlook potential candidates in other buckets (accuracy-speed tradeoff).
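Here’s a toy version of one common LSH family, random-hyperplane hashing: each hyperplane contributes one bit to the hash depending on which side of it a vector falls, so nearby vectors tend to land in the same bucket. The dimensions and counts are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_planes = 64, 8

# Each random hyperplane contributes one bit to the hash: nearby vectors
# tend to fall on the same side of most planes, so they share a bucket.
planes = rng.normal(size=(n_planes, dim))

def lsh_hash(v):
    bits = planes @ v > 0
    return "".join("1" if b else "0" for b in bits)

# Index: group 1,000 random vectors into buckets by their hash
vectors = rng.normal(size=(1000, dim))
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_hash(v), []).append(i)

# Query: hash the query, then linearly scan only its (much smaller) bucket
query = vectors[0]  # a vector we know is in the index
candidates = buckets[lsh_hash(query)]
print(len(candidates), "candidates instead of", len(vectors))
```

With 8 bits there are up to 256 buckets, so each bucket holds only a handful of the 1,000 vectors — that shrinkage is where the speedup comes from, and missing near-neighbors that hashed into other buckets is where the accuracy loss comes from.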

Checkpoint

  • An embedding/vector is a way to numerically represent data such as images, audio and text in a way that machines can understand and interpret.
  • A vector database is a pool of embeddings which allows for quick retrieval and similarity search through the use of special algorithms.

Code example

Now that we understand what vector databases are, let’s build a vector database using Python and Chroma, one of the many open source vector database libraries.

pip install chromadb

import chromadb

# create a client which manages our collections
client = chromadb.Client()

# a collection is a vector database
collection = client.create_collection("example")

# add some documents to our database
# chroma automatically transforms these documents into embeddings
collection.add(
    documents=["Pizza is good", "Pizza is bad", "Leaves are green"],
    ids=["id1", "id2", "id3"]
)

Chroma will automatically transform each document into an embedding using an embedding function. Now we can query our database!

collection.query(
    query_texts=["Pizza is amazing"],
    n_results=3
)

Output:

{'ids': [['id1', 'id2', 'id3']],
'distances': [[0.298539400100708, 0.5948793292045593, 1.728091835975647]],
'metadatas': [[None, None, None]],
'embeddings': None,
'documents': [['Pizza is good', 'Pizza is bad', 'Leaves are green']],
'uris': None,
'data': None}

The results are sorted by distance (lowest to highest). “Pizza is good” got the lowest distance score, which means it’s the most similar result to “Pizza is amazing”.

Chroma also enables us to add metadata to documents.

collection.add(
    documents=["Pizza was originally made in Italy", "Pizza is tasty", "Leaves are green"],
    metadatas=[{"source": "wiki"}, {"source": "blog"}, {"source": "wiki"}],
    ids=["id1", "id2", "id3"]
)

We can then refine our query by filtering for specific metadata.

collection.query(
    query_texts=["Pizza facts"],
    where={"source": "wiki"}
)

Output:

{'ids': [['id1', 'id3']],
'distances': [[0.7719230651855469, 1.8141719102859497]],
'metadatas': [[{'source': 'wiki'}, {'source': 'wiki'}]],
'embeddings': None,
'documents': [['Pizza was originally made in Italy', 'Leaves are green']],
'uris': None,
'data': None}

Popular vector databases

Chroma isn’t the only vector database; in fact, a huge number of vector databases have popped up recently, and some non-vector databases have even added support for vector search.

As a starting point, I recommend trying out Chroma, which we’ve already touched on in this article. I’m currently exploring different vector databases too, so if you find anything interesting, let me know at raj.pulapakura@gmail.com

Use cases

Vector databases have become popular with the rise of LLMs, but they also have more fundamental applications:

  • Long-term memory for LLMs: Retrieval-Augmented Generation (RAG)
  • Search engines: Google transforms your query into an embedding and performs a similarity search over its indexed web
  • Recommendation engines: find similar content in a vector database based on what you’ve already purchased/watched (Netflix, Amazon, Spotify)
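As a rough sketch of the RAG pattern from the first bullet: retrieve the stored documents most similar to the question, then feed them to an LLM as context. Everything here is a stand-in for illustration — `embed` returns hand-made 3-dimensional “embeddings” and `ask_llm` is a placeholder for whatever LLM client you use:

```python
import numpy as np

# A plain dict of documents with hand-made "embeddings" stands in for a
# real vector database; a real system would store model-generated vectors.
docs = {
    "Pizza was originally made in Italy": np.array([0.9, 0.1, 0.0]),
    "Leaves are green": np.array([0.0, 0.2, 0.9]),
}

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here
    if "pizza" in text.lower():
        return np.array([0.8, 0.2, 0.1])
    return np.array([0.1, 0.1, 0.9])

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client; echoing the prompt
    # lets us see what the model would receive
    return prompt

def rag_answer(question: str) -> str:
    q = embed(question)
    # Retrieve the document closest to the question (Euclidean distance)
    best = min(docs, key=lambda d: np.linalg.norm(docs[d] - q))
    # Ground the LLM's answer in the retrieved context
    prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
    return ask_llm(prompt)

print(rag_answer("Where was pizza invented?"))
```

The retrieval step is exactly a vector-database similarity search; swap the dict for a Chroma collection and the placeholders for real models, and you have the skeleton of a RAG system.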

I hope I was able to cut through some of the noise and give you a beginner-friendly intro to vector databases and embeddings. Of course, there’s a lot more to learn, so keep exploring resources for deepening your understanding of vector databases.

Also, play around with some vector database packages such as Chroma and Milvus. Most of these packages are open source, which means you can directly inspect the code behind these libraries.

Happy exploring!


Raj Pulapakura

Machine Learning Engineer and Full Stack Developer. Passionate about advancing human intelligence and solving problems.