A Broke B**ch’s Guide to Tech Start-up: Choosing Vector Database — Part 1 Self-Hosted

Soumit Salman Rahman
10 min read · May 21, 2024

If you are poking into the GenAI world for your start-up, vector databases are the hot $hit right after LLMs. Primarily used for augmenting a language model’s knowledge base with custom, proprietary or recent data, vector databases support fuzzy searching using text or media embeddings.
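To make "fuzzy searching using embeddings" a bit more concrete, here is a minimal sketch of what a similarity search boils down to: rank stored vectors by how close they are to the query vector. The 3-dimensional vectors and the `top_k` helper are toy assumptions of mine; real embeddings come from an embedding model and real vector databases use indexes instead of a brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k=2):
    """Return the ids of the k docs whose embeddings are closest to the query."""
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query, d["embedding"]),
        reverse=True,
    )
    return [d["id"] for d in ranked[:k]]

# Toy 3-dimensional "embeddings" (made up for illustration)
docs = [
    {"id": "cats", "embedding": [0.9, 0.1, 0.0]},
    {"id": "dogs", "embedding": [0.8, 0.3, 0.1]},
    {"id": "taxes", "embedding": [0.0, 0.2, 0.9]},
]
print(top_k([1.0, 0.2, 0.0], docs))  # ['cats', 'dogs']
```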

There is a whole bunch of new databases that came out in the last 3 years. Some of the existing popular databases now offer vector storage and similarity search. On top of that there are multiple hosting modes for these: local/on-prem, serverless instance, dedicated instance in your preferred cloud platform and so on. For a database newbie or for someone who is not necessarily a database jedi, it does get kind of confusing. I know it was for me.

In this article I will go through some of the vector databases I have played with to keep costs down, along with my decision points.

GenAI applications are by definition compute and data intensive. Depending on your usage and data volume, the cost can rack up real fast. If you are just starting out or prototyping your app, you don’t really know what the scale in production will be. My suggestion: go with a local/self-hosted instance. Don’t freak out; thanks to Docker this is much easier now than it was back in my days (yes, I may have been alive when dinos used to chill at the soccer fields).

Why?

  • These are docker images. You can put ’em up and pull ’em down with one command. No installation bloat.
  • No need to worry about creating an account, managing API keys or setting up IP rules
  • No worry about rate limits
  • No random vampire costs
  • It may be cheaper for you to self-host the docker image
  • Try before you buy in case you want to go for a hosted instance
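As a rough illustration of the "one command up, one command down" point, the sketch below just assembles the `docker run` and `docker rm` command lines for a couple of the databases. The image names and ports are my assumptions based on the public Docker Hub listings, and the `-local` container names are made up, so check each project's docs before copying them.

```python
import shlex

# Image names and ports are assumptions; verify against each project's docs.
IMAGES = {
    "chroma": {"image": "chromadb/chroma", "ports": [8000]},
    "neo4j": {"image": "neo4j", "ports": [7474, 7687]},
}

def docker_run_command(name):
    """Build the command that starts a detached container for one database."""
    spec = IMAGES[name]
    args = ["docker", "run", "-d", "--name", f"{name}-local"]
    for port in spec["ports"]:
        args += ["-p", f"{port}:{port}"]  # publish the same port on the host
    args.append(spec["image"])
    return shlex.join(args)

def docker_rm_command(name):
    """Build the command that stops and removes that container."""
    return shlex.join(["docker", "rm", "-f", f"{name}-local"])

print(docker_run_command("chroma"))
print(docker_rm_command("chroma"))
```

Paste the printed lines into a terminal, or hand them to `subprocess.run` if you prefer to script it.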
[Image courtesy of DALL-E, whatever this means]

MongoDB

Probably the most popular document DB among web app developers, thanks to its JavaScript-like input and query language (and not having to use SQL). This is definitely my personal favorite general purpose database. Its API and drivers were clearly designed around how web app developers think about managing and accessing data, which makes it quite intuitive for people like me who DO NOT LIKE SQL. MongoDB also supports storing vectors and similarity search as first-class members of the data objects/documents, alongside their scalar fields. Along with vector similarity search it also supports filters using scalar fields in the documents/data objects. On top of that, MongoDB has been around for quite some time and the code base is rock solid.

Special Features:

  • You don’t have to create separate documents/data-objects to hold your vector fields. The vector fields are first-class members of the documents (like the scalar fields).
  • Each document can support multiple vector fields.
  • There are drivers for pretty much every language I could think of.
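To make the filter-plus-similarity idea concrete, here is a rough sketch of an aggregation pipeline built around the `$vectorSearch` stage from MongoDB's Atlas Vector Search. The index name `vector_index`, the `embedding` and `title` fields and the `category` filter are placeholders I made up. I am also assuming your deployment supports this stage and that the filter field is declared in the vector index definition.

```python
def vector_search_pipeline(query_vector, category, limit=5):
    """Build an aggregation pipeline: vector search narrowed by a scalar filter."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",        # hypothetical index name
                "path": "embedding",            # hypothetical vector field
                "queryVector": query_vector,
                "numCandidates": limit * 20,    # candidates scanned before ranking
                "limit": limit,
                "filter": {"category": {"$eq": category}},  # hypothetical scalar field
            }
        },
        # Keep the title and expose the similarity score of each hit
        {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]

pipeline = vector_search_pipeline([0.1, 0.2, 0.3], "news", limit=3)
print(pipeline[0]["$vectorSearch"]["numCandidates"])  # 60
```

With pymongo you would pass the result to `collection.aggregate(pipeline)`. Since a document can hold more than one vector field, switching `path` is all it takes to search a different one, as long as that field is indexed too.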

Caveats:

  • Maximum vector length (number of dimensions) is 2048 (this may have changed by now)

Docker image:

In addition to the docker image, Mongo also offers a cloud-hosted serverless instance and different flavors of dedicated/semi-dedicated instances in Azure, AWS and GCP.

Min Sys Req: 0.25 vCPU and 0.5 Gi RAM

ChromaDB

It is the newest kid on the block of vector databases. It is open source, free to use, designed to run in memory and optimized to be a vector database. Its Python & JavaScript SDKs and REST API function like a simplified document DB (no schema needed), so there is no need to learn/re-learn SQL. It allows scalar search filters to narrow down the vector search scope for efficiency. I ran it with 100,000+ docs and it didn’t even break a sweat.

Special Features:

  • The Python SDK can run embedding models locally (downloaded from HuggingFace) or call external embedding APIs (like OpenAI) under the hood, so you don’t have to write additional code for generating embeddings for search.
  • It supports URIs as first-class citizens: the content gets downloaded and embedded under the hood.
  • If you don’t want vector fields/vector search you can also use it like a simple document DB.
  • Like MongoDB, it supports fuzzy text search on documents with large text fields.

Caveats:

  • ChromaDB is currently in Alpha release. It does not have a cloud-hosted instance. Their website claims that it will be a thing in the future, but the timeline is unclear.
  • The metadata cannot have nested compound fields: the fields can only be simple types or arrays.
  • The query syntax is not as rich as MongoDB’s for supporting various scenarios, but its simplicity and very focused use case are what give it an advantage.
  • Unlike MongoDB, each document/data-object can have ONLY one vector field.
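Since the metadata has to stay flat, one workaround is to flatten nested fields into dotted keys before inserting. This is just a plain Python sketch and not part of the ChromaDB SDK; the `flatten_metadata` helper and the dotted-key convention are my own assumptions.

```python
def flatten_metadata(metadata, parent_key="", sep="."):
    """Collapse nested dicts into a single level using dotted keys."""
    flat = {}
    for key, value in metadata.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten_metadata(value, full_key, sep))
        else:
            # Simple values (and lists) are passed through unchanged
            flat[full_key] = value
    return flat

nested = {"source": "blog", "author": {"name": "Sam", "team": {"id": 7}}}
print(flatten_metadata(nested))
# {'source': 'blog', 'author.name': 'Sam', 'author.team.id': 7}
```

The flattened keys can then be used in scalar filters like any other metadata field.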

Docker image:

Min Sys Req: 0.25 vCPU and 2Gi RAM (this includes the embedding generation)

Weaviate

Similar to ChromaDB, this is another new kid on the block. Weaviate is built to scale horizontally, and each row can have a large number of columns/fields without it breaking a sweat, which makes it performant for feature analytics. Similar to Mongo and Chroma, Weaviate is a NoSQL document DB, which makes development so much more intuitive for non-database folks (as I am writing this article I realize how much I don’t like SQL).

Special Features:

  • It offers a GraphQL API, which is a major differentiating factor, especially for people working with knowledge graphs.
  • It has native integration with Cohere, Hugging Face, PaLM and OpenAI for embedding, reranking and text generation. This is quite a handy feature for quick development.
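For a feel of what the GraphQL API looks like, here is a small sketch that assembles a `nearVector` similarity query as a string. The `Article` class and the `title`/`url` properties are hypothetical, and I am assuming you already have a query embedding to pass in.

```python
import json

def near_vector_query(class_name, vector, fields, limit=3):
    """Build a Weaviate GraphQL Get query that ranks objects by vector distance."""
    field_list = " ".join(fields)
    return (
        "{ Get { "
        f"{class_name}(nearVector: {{vector: {json.dumps(vector)}}}, limit: {limit}) "
        f"{{ {field_list} _additional {{ distance }} }} "
        "} }"
    )

query = near_vector_query("Article", [0.1, 0.2, 0.3], ["title", "url"])
print(query)
```

The string would then be sent as the `query` field of a JSON POST to the instance's `/v1/graphql` endpoint, or through one of the client libraries.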

Caveats:

  • Similar to ChromaDB, this is also quite new and not fully mature yet.

Docker Image:

Weaviate also offers a cloud-hosted serverless instance.

Min Sys Req: 🙅😐

Milvus

Similar to Chroma and Weaviate, Milvus is also a vector primary database. It is primarily designed to handle billions of vectors and thousands of queries per second. The reviews seem to state that it supports both horizontal and vertical scaling quite well. Milvus also allows scalar filtering with vector similarity search.

Caveats:

  • I find Milvus to be extremely heavyweight to run locally, so I didn’t play with it any further and am not planning to use it.
  • Like ChromaDB and Weaviate, it is a new and relatively immature code base.

Docker image:

Min Sys Req: 2 vCPUs and 8 Gi RAM

Neo4j

If you have worked with graph databases or knowledge graphs before, you have either used or heard of Neo4j. Neo4j added support for vector fields as first-class citizens of its nodes, and vector search through Cypher syntax, from its 5.18 release. It is a graph-native database that now supports vector search, which means the primary use case is still a knowledge graph whose nodes may need fuzzy search.

Pros:

  • Vector properties are first-class citizens of the nodes, so there is no need to create new/additional nodes.
  • You can do similarity search as part of Cypher, which to me is the best query language ever made (GQL you can ki$$ it 🍑).
  • Neo4j supports much larger vectors, up to 4096 dimensions, which is not necessarily the case for some of the other vector databases
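Here is a hedged sketch of what a similarity search inside Cypher can look like, using the `db.index.vector.queryNodes` procedure. The index name `article_embeddings` and the `title` property are made up for illustration, and the vector index is assumed to exist already.

```python
def similar_nodes_query(index_name, embedding, k=5):
    """Return a parameterized Cypher query and its parameters for a k-NN lookup."""
    cypher = (
        "CALL db.index.vector.queryNodes($index, $k, $embedding) "
        "YIELD node, score "
        "RETURN node.title AS title, score "  # 'title' is a hypothetical property
        "ORDER BY score DESC"
    )
    params = {"index": index_name, "k": k, "embedding": embedding}
    return cypher, params

cypher, params = similar_nodes_query("article_embeddings", [0.1, 0.2, 0.3], k=3)
print(cypher)
print(params)
```

With the official Python driver this would run as something like `session.run(cypher, params)`. Because it is plain Cypher, you can keep going with a `MATCH` on the returned nodes to walk their relationships.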

Caveats:

  • I have not experienced this first hand, but I have heard from multiple DB gurus that past 100 million nodes Neo4j performance struggles quite a bit.
  • If you need a simple database for a large number of documents where the relationships between the documents do not really matter much, I would say go with something simpler like ChromaDB or Weaviate.
  • I am not sure if the community edition supports vector search. The enterprise edition requires a key to activate.
  • Similar to MongoDB, there is no native integration with embedding generation. But you can use libraries like LangChain or Griptape to get around that.

Docker Image:

Min Sys Req: 2 vCPUs and 2Gi RAM

KDB.AI Server

Originating from the team that created KDB+, KDB.AI is the vector database SKU of KDB+. Unlike Mongo, Chroma and Weaviate, KDB.AI is an RDBMS and needs a schema. The saving grace is that although this is an RDBMS, neither the Python driver nor the API requires you to write SQL (did I say I don’t like SQL?).

Pros:

  • Vector search supports hybrid queries that include both sparse and dense vectors.
  • It is also optimized for time series queries, making it great for temporal anomaly analysis such as fraud detection, product abuse detection and security detection.
  • KDB.AI is an RDBMS, time-series DB and vector DB all in one. This makes it quite a good starter choice as a general purpose database for a GenAI application.
  • KDB.AI is incredibly fast even under a large load, and relatively lightweight.

Docker image:

KDB.AI also offers cloud hosted serverless instance.

Min Sys Req: 1 vCPU and 4 Gi RAM (although I have run it with 2 Gi)

PostgreSQL

Officially the most feature-rich DB, PostgreSQL could not sit out the vector DB war. There is an open source extension for Postgres, pgvector, that is in active development.

Pros:

  • If a part of your code base (the non-vector stuff) is already using Postgres or you are already familiar with the ecosystem, it may make sense to continue with it for now at least for prototyping to reduce the learning time.
  • There is no feature Postgres doesn’t have. So if you have an eclectic storage and usage of data, this is a one stop shop.

Caveats:

  • It’s all SQL baby. If you are not a fan of SQL, don’t even touch this.
  • Performance is meh 😒.
  • Unless you need a relational database I don’t see the point of using Postgres. Even in that case, KDB.AI kicks butt!

Docker image:

When the instance is up and running, enable the extension:

CREATE EXTENSION vector;
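Once the extension is enabled, a vector column and a nearest-neighbor query look roughly like the SQL strings below. I am wrapping them in Python only to show how a query vector gets serialized; the `items` table, its columns and the 3-dimensional size are made up, and the `%s` placeholders assume a psycopg-style driver.

```python
def to_pgvector_literal(vector):
    """Serialize a Python list into pgvector's text format, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(x) for x in vector) + "]"

# Hypothetical table with a 3-dimensional embedding column
CREATE_TABLE = (
    "CREATE TABLE items ("
    "id bigserial PRIMARY KEY, body text, embedding vector(3));"
)

# <-> orders by Euclidean (L2) distance; <=> would order by cosine distance
NEAREST = "SELECT id, body FROM items ORDER BY embedding <-> %s::vector LIMIT %s;"

query_params = (to_pgvector_literal([0.1, 0.2, 0.3]), 5)
print(CREATE_TABLE)
print(NEAREST, query_params)
```

With a driver cursor this would be roughly `cursor.execute(NEAREST, query_params)`.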

Alternatively you can read a great article by Johannes Johansson

SQLite

Last but not least, if you really are a fan of SQL and just want to stick to the familiar world of SQLite, there is an extension.

But then again, why though? If you want to keep everything local, ChromaDB uses SQLite underneath anyway and it’s so much easier to use 😕

Running/Self-hosting in Cloud

Running on your own machine is all nice and good, but let’s say you graduated to putting your service online. Now you need the database to be online as well. Guess what else comes with it:

  • Persistence through restarts
  • Update and upgrade
  • Fault tolerance
  • Disaster recovery

This means you need to back up the data somewhere in cloud so that your database instance can access the content through restarts, updates, upgrades and failures.

Each of these docker images saves its data in a local file system within the container. For example,

  • ChromaDB persists data in a folder called /chroma
  • Neo4j and Weaviate persist data in the /data directory

The other databases also have similar directories. Each of these also has environment variables you can set through the docker-compose.yml file to specify a custom directory.
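To show what mapping that directory to somewhere durable looks like, here is a small sketch that builds a `docker run` command with a volume mount. It uses the Neo4j `/data` directory mentioned above; the host path `/srv/neo4j-data` and the published port are assumptions for illustration.

```python
import shlex

def docker_run_with_volume(image, host_dir, container_dir, port):
    """Build a docker run command that keeps the data directory on the host."""
    args = [
        "docker", "run", "-d",
        "-p", f"{port}:{port}",
        "-v", f"{host_dir}:{container_dir}",  # host path : path inside the container
        image,
    ]
    return shlex.join(args)

print(docker_run_with_volume("neo4j", "/srv/neo4j-data", "/data", 7687))
# docker run -d -p 7687:7687 -v /srv/neo4j-data:/data neo4j
```

Because the data now lives outside the container, removing or upgrading the container does not wipe it. In a cloud setup the host side of the mount is typically replaced by a managed file share.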

If you are using Azure Container Apps or Azure Container Instances to host these, you can use volume mounting to tunnel that directory to an Azure Storage File Share so that the files persist regardless of whether the docker instance is up. Here is an article on how to do that for ChromaDB.

Production Cost

One thing to keep in mind is that when you are running things in a production cloud, it’s not as if you get everything for free by running your own docker images. You still pay for

  1. CPU and RAM: I suggest using Azure Container Apps with the consumption plan to deploy a docker image if you are an Azure aficionado like me
  2. Storage: Azure Storage File Share for persisting your data.

Unless you have a low/sparse workload, it tends to be cheaper to host your own database instance, at least for the first year after your product launch.

What to Choose

I primarily use MongoDB, although I am gradually migrating to ChromaDB and Neo4j. This may not be the right thing for your case. Whatever you go with, when you are building your GenAI start-up from scratch the following will be true:

  • You are going to try more than one.
  • You are going to throw away your initial code.
  • You are going to need more than one database to optimize different functionality/capabilities of your product.
  • Do not try to optimize first.
  • Opt for max features first and then scope it down.
  • Choose a system that makes a potential cloud migration easier.

With that said, this is not remotely an exhaustive list. These are just the ones I played with. There is a whole other list of already hosted serverless instances in case you don’t want to deal with your own hosting. Same time, same channel, next post.
