Vector Databases and Search By Similarity for NLP
Learn about vector databases and how they can help your data science projects.
Introduction
When working with Natural Language Processing (NLP), you will certainly deal with vector databases. Since I started studying LLMs and how they work under the hood, vector DBs have kept popping up on my screen.
Stepping back a little, let's agree that there are other types of databases. Relational DBs, for example, store data structured in rows and columns (also known as rectangular form) and are queried with a query language such as SQL, using exact matches, logical conditions, and aggregations to return results.
There is also the NoSQL type, which stores semi-structured or unstructured data, where each observation is a document. These databases can be optimized for document formats such as JSON, or for graphs.
Now, getting back to our point: vector databases.
Imagine a library where books are organized by their meaning, not just by title or author. That's essentially how vector databases work. In these databases, the data is stored as a numerical representation (a vector) that captures the essence of the information. These vectors are called embeddings, and they are stored and organized based on their similarity.
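To make the idea of similarity concrete, here is a minimal Python sketch of a similarity search using cosine similarity, one common measure vector databases rely on. The sentences and the tiny four-dimensional vectors below are toy values invented for illustration; real embeddings are produced by a model and typically have hundreds of dimensions.

```python
import numpy as np

# Toy "embeddings": each row is a made-up vector standing in for one sentence.
sentences = [
    "The cat sat on the mat",
    "A kitten rested on the rug",
    "Stock prices fell sharply today",
]
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.3],
    [0.0, 0.9, 0.8, 0.1],
])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query vector (also a toy value): find the stored sentence closest in meaning.
query = np.array([0.85, 0.15, 0.05, 0.25])
scores = [cosine_similarity(query, e) for e in embeddings]
best = int(np.argmax(scores))
print(f"Most similar: {sentences[best]!r} (score={scores[best]:.3f})")
```

A real vector database does essentially this at scale, adding index structures so that a query does not have to be compared against every stored vector.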