Vector Databases: The Secret Sauce of the AI Revolution

Part 1

David Gutsch
7 min read · Jul 11, 2023

Introduction

Ever wondered how your favorite music streaming service seems to read your mind, suggesting songs that perfectly fit your mood? Or how your online shopping platform knows just what you need, even before you do? Ever marveled at the extensible models and applications that are being built on top of LLMs like ChatGPT? The secret behind these modern marvels is not magic, but a powerful tool in the realm of databases: vector databases. Let’s embark on a journey to unravel the mysteries of these unsung heroes of the AI revolution.

What are Vector Databases?

You’ve been around the block with traditional databases, haven’t you? Rows, columns, tables, foreign keys — the whole SQL shebang. But let’s add a twist to this tale, shall we?

A Vector Database is a high-dimensional twist on traditional databases. A Relational Database stores data in structured tables, and a NoSQL database handles semi-structured or unstructured data; a Vector Database instead stores data as vectors in a multi-dimensional space. Much as a Graph Database represents relationships between data, a Vector Database represents data similarity, with closer vectors indicating more similar data. It’s a powerful tool for machine learning and AI applications, enabling efficient similarity searches and clustering in high-dimensional data.

Imagine a database that doesn’t just store data but comprehends it. Intriguing, isn’t it? Welcome to the world of vector databases. These aren’t your run-of-the-mill data warehouses; they’re more like data transformers. They take your data, wave their magic wand, and voila — your data is now a vector, a point in a multi-dimensional space.

Think of it like this: you’re not just building a collection of data structures anymore. You’re crafting an entire cosmos where each data point (now a star) has its own unique position, determined by its features. The closer the stars, the more similar they are. It’s like navigating through a galaxy of data, where the constellations are your clusters of similar data points. How’s that for a database upgrade?

Understanding Vector Embeddings

Now, you might be wondering, why am I talking about constellations, and more importantly what’s a vector in this context? Well, imagine you’re at a party. You don’t know anyone, so you start introducing yourself. “Hi, I’m Alex, I’m a software engineer, I love rock music and hiking.” That can be encoded as a vector (i.e. a star in our metaphor). It’s a list of features that describes you. In a vector database, every piece of data gets a similar introduction, but instead of hobbies and professions, it might be color histograms for images, word frequencies for text, or user ratings for products.
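To make Alex's introduction concrete, here is a minimal sketch of turning a handful of traits into a vector. The feature names and the `encode` helper are made up for illustration; real embeddings are usually dense vectors produced by a trained model rather than hand-picked 0/1 flags.

```python
# Hypothetical feature axes: every guest is described along the same ones.
FEATURES = [
    "software_engineer", "data_scientist", "financial_advisor",
    "likes_rock", "likes_hiking", "likes_golf",
]

def encode(traits):
    """Turn a set of trait names into a 0/1 vector in FEATURES order."""
    return [1.0 if name in traits else 0.0 for name in FEATURES]

# "Hi, I'm Alex, I'm a software engineer, I love rock music and hiking."
alex = encode({"software_engineer", "likes_rock", "likes_hiking"})
print(alex)  # [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

Each position in the list is one dimension of the space, so Alex is now a single point (a star) with six coordinates.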

Now back to the constellations. When we encode these vector embeddings, we place them in a multi-dimensional space that behaves a lot like outer space: distance carries meaning. Alex the software engineer sits close to Sally the data scientist, and both are far from Derek the financial advisor. There are many different mechanisms for indexing and searching these embeddings, including hashing, quantization, and tree- or graph-based search. These algorithms are my favorite part, but they are beyond the scope of this article; tune in for part two if they interest you too.

But wait, there’s more. We have two types of vectors: input vectors and query vectors. Input vectors are like the guests at the party, each with their own set of features. Query vectors, on the other hand, are like a description of the person you want to meet. “I want to meet someone who loves rock music and hiking.” The database’s job is to find the input vectors that match the query vector. Our database will use some fancy data structures and algorithms to look through the cosmos of the database and find the input vectors most similar to the query vector.
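Here is a rough sketch of that matchmaking step, assuming a tiny party of three guests whose vectors are invented for the example. It scans every input vector and ranks them by straight-line distance to the query. A real vector database would typically use an approximate index instead of this exhaustive scan, but the idea is the same.

```python
import math

# Hypothetical input vectors. Axes: [tech job, finance job, rock, hiking, golf]
guests = {
    "Alex":  [1.0, 0.0, 1.0, 1.0, 0.0],
    "Sally": [1.0, 0.0, 0.0, 1.0, 0.0],
    "Derek": [0.0, 1.0, 0.0, 0.0, 1.0],
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, k=2):
    """Return the k names whose vectors lie closest to the query (exhaustive scan)."""
    ranked = sorted(vectors, key=lambda name: euclidean(query, vectors[name]))
    return ranked[:k]

# Query vector: "I want to meet someone who loves rock music and hiking."
query = [0.0, 0.0, 1.0, 1.0, 0.0]
print(nearest(query, guests))  # ['Alex', 'Sally']
```

Alex comes back first because he matches both interests, Sally second because she shares the hiking, and Derek is the farthest star from the query.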

Similarity Measures in Vector Databases

So how does the database decide which vectors are similar? It uses something called a similarity measure. It’s like a digital measuring tape that tells you how alike two vectors are, or more precisely, how far apart they are in multi-dimensional space. The most common similarity measures in vector databases are cosine similarity, Euclidean distance, and dot product. Cosine similarity gauges similarity via the angle between vectors, Euclidean distance uses the straight-line distance between them, and the dot product sums the products of corresponding vector values. Each measure is suited to specific data types and applications.

If that all made sense to you, great, feel free to skip to the next section! I found some of these measures tricky, so I’m providing an additional explanation for the other plebeians like myself. Euclidean distance is the simplest of the three: it charts a straight-line path between two stars in our cosmos. Cosine similarity is like gauging the angle between two stars as seen from the origin; the smaller the angle, the more similar they are, no matter how far out each star sits. Then there’s the dot product, which is loosely like the gravitational pull between celestial bodies, influenced by their individual masses as well as how they line up. It multiplies the two vectors feature by feature and adds up the results. It is like cosine similarity, except that it takes the magnitudes of the vectors into account as well as the angle between them. Each of these similarity measures guides us differently through the data universe, helping us navigate based on our destination and the cosmic conditions.

Figure: cosine similarity, Euclidean distance, and dot product, respectively.
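If the metaphors still feel slippery, the three measures fit in a few lines of plain Python. This is only a sketch with hand-picked two-dimensional vectors (`a`, `b`, and `c` are made up so the arithmetic comes out clean), not how a production database computes them.

```python
import math

def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by both lengths, leaving only the angle."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice as long
c = [4.0, -3.0]   # at a right angle to a

print(cosine_similarity(a, b))   # 1.0  (same direction, length ignored)
print(euclidean_distance(a, b))  # 5.0  (the points are still 5 units apart)
print(dot(a, b))                 # 50.0 (grows with the lengths of a and b)
print(cosine_similarity(a, c))   # 0.0  (perpendicular, nothing in common)
```

Notice that `a` and `b` are a perfect match by cosine similarity, yet still 5 units apart by Euclidean distance. When every vector is scaled to length 1, the dot product and cosine similarity give the same number.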

Applications and Use Cases of Vector Databases

Now, let’s talk about where vector databases shine. You know that friend who always knows the perfect movie for movie night? That’s what a vector database does for Netflix. Or that personal shopper who always knows what’s in style? That’s a vector database for your online shopping platform. From personalized recommendations to image recognition, vector databases are the secret sauce that makes modern applications so… well, modern.

Let’s not forget the exciting world of Language Models and Generative AI, where vector databases play a pivotal role. Applications built on top of models like GPT use vector databases, or at least vector indexes, to store and retrieve the embeddings of words, sentences, and documents. These embeddings capture semantic meaning, so the application can look up the text most relevant to a question and hand it to the model, which helps it work with language in a way that’s eerily similar to how we humans do.
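As a toy illustration of that store-and-retrieve loop, the sketch below keeps a few made-up sentences in an in-memory store and pulls out the one closest to a question. Everything here is an assumption for the sake of the example: `TinyVectorStore` is a hypothetical class, and `embed` just counts words, whereas a real application would call a learned embedding model and a real vector database.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("our refund policy allows returns within 30 days")
store.add("the office is closed on public holidays")
store.add("shipping takes three to five business days")

question = "what is the refund policy for returns"
context = store.search(question, k=1)[0]

# The retrieved text is placed in the prompt that would be sent to the model.
prompt = f"Answer using this context: {context}\nQuestion: {question}"
print(context)  # our refund policy allows returns within 30 days
```

The language model never has to memorize the refund policy; the application finds the nearest stored text and passes it along with the question.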

Recap

As we journeyed through the cosmos of vector databases, we’ve uncovered some of the hidden mechanisms that are rapidly reshaping our world. To recap: vector databases, unlike traditional ones, store data as vectors in a multi-dimensional space, enabling a nuanced understanding of data and the ability to compare almost anything. We’ve also explored vector embeddings, which encode data into vectors, and similarity measures, which let us compare the distances between those vectors.

Conclusion

The profound applications of vector databases in AI are truly awe-inspiring. From personalized recommendations on your favorite streaming service to the semantic understanding of language in models like GPT, vector databases are the unseen powerhouse behind these marvels. They are the secret sauce that makes modern applications so intuitive and responsive, transforming the way we interact with technology.

The rapid growth of vector embeddings in fields such as NLP, computer vision, LLMs, and other AI applications has led to the rise of vector databases. These databases are specialized to tackle the challenges that arise when managing vector embeddings in production. They have significant benefits over traditional databases for this workload and scale better than standalone vector indexes. Most importantly, these databases are enabling everyday application developers to create extensible applications built on top of LLMs or other generative AI models, thus allowing us to customize models for our customers.

As we push the boundaries of AI, the role of vector databases will only become more crucial. They are not just a tool for storing and retrieving data, but a fundamental component in our quest to make machines understand and interact with the world in a human-like way. So, stay tuned, keep exploring, and remember: in the world of data, there’s always more to learn.

If you enjoyed this, come back next week for part 2!
