Setting Up Your First ChromaDB Server

Chris McKenzie
7 min read · Sep 24, 2023

Guide for JavaScript Engineers who want to start using vector databases

Have you ever wondered how Spotify suggests songs that you might like? Or how Netflix knows which movies to recommend? Enter Vector Databases. One such database is ChromaDB.

Background

ChromaDB offers JavaScript developers a concise API for a powerful vector database. It prioritizes productivity and simplicity, allowing the storage of embeddings with their relevant metadata. The database, written in Python, has an intuitive and robust JavaScript client library for seamless document embedding and querying.

Embeddings

If you’re not familiar with the topic of embeddings, I highly recommend you learn more about them before diving into ChromaDB.

TL;DR

Embeddings convert data into fixed-size vectors, preserving its semantic meaning. These vectors can capture intricate relationships, making them pivotal for machine learning tasks such as search or recommendations. Using embeddings, the words “dog” and “puppy” might be translated into similar numerical arrays, allowing systems to recognize their semantic closeness.
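To make the “dog” and “puppy” idea concrete, here is a minimal sketch using invented three-dimensional vectors. Real embedding models return hundreds or thousands of dimensions, and the numbers and the `cosineSimilarity` helper below are illustrative assumptions, not output from any real model.

```javascript
// Toy "embeddings" with made-up values, purely for illustration.
const toyEmbeddings = {
  dog: [0.9, 0.8, 0.1],
  puppy: [0.85, 0.9, 0.15],
  car: [0.1, 0.2, 0.95],
};

// Cosine similarity: dot product divided by the product of the vector lengths.
// Values near 1 mean the vectors point in almost the same direction.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const dogPuppy = cosineSimilarity(toyEmbeddings.dog, toyEmbeddings.puppy);
const dogCar = cosineSimilarity(toyEmbeddings.dog, toyEmbeddings.car);

// "dog" should score much closer to "puppy" than to "car".
console.log({ dogPuppy, dogCar });
```

A vector database does essentially this comparison at scale: it stores the vectors and finds the nearest ones to a query vector.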

For more on embeddings I recommend the following resources:

Prerequisites

Setup Server

For this, we’re just going to use a locally hosted server. However, you can host ChromaDB on AWS (or other cloud providers) by following their docs. This autumn, ChromaDB will launch a hosted service.

git clone https://github.com/chroma-core/chroma.git
cd chroma

Security. By default, ChromaDB is configured to be insecure. This makes it easy to get started locally, but it is also important that you never launch in production with this configuration. ChromaDB supports Basic Auth with the JS client. So, before we do anything, we’ll want to enable that.

Generate server credentials. For this, we’ll use the username “admin” and password “admin”. You should use something more secure in production.

docker run --rm --entrypoint htpasswd httpd:2 -Bbn admin admin > server.htpasswd

Next create a file called `.chroma_env` in the root of the project. This will be used to configure the server.

CHROMA_SERVER_AUTH_CREDENTIALS_FILE="/chroma/server.htpasswd"
CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider"
CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.basic.BasicAuthServerProvider"

Run Server

Let’s start the server!

docker-compose --env-file ./.chroma_env up -d --build

That’s it! You should now have a ChromaDB server running locally. You can verify this by visiting `http://localhost:8000/` in your browser. You should see a response like this:

{"error":"Unauthorized"}

If you don’t see this, check the logs or visit the troubleshooting page.
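If you would rather check from a script than from the browser, something like the following could work. It is a sketch under assumptions: `checkServer` and `basicAuthHeader` are hypothetical helpers, the `Unauthorized` body is assumed to come with an HTTP 401 status, and the server is assumed to accept a standard HTTP Basic `Authorization` header. The `fetchImpl` parameter exists only so the function can be tried without a running server.

```javascript
// Build a standard HTTP Basic Authorization header value (RFC 7617).
function basicAuthHeader(username, password) {
  return "Basic " + Buffer.from(`${username}:${password}`).toString("base64");
}

// Hypothetical helper: GET the server root and summarize the response.
// `fetchImpl` defaults to the global fetch available in Node 18+.
async function checkServer(baseUrl, credentials, fetchImpl = fetch) {
  const headers = credentials
    ? { Authorization: basicAuthHeader(credentials.username, credentials.password) }
    : {};
  const response = await fetchImpl(baseUrl, { headers });
  return {
    status: response.status,
    // Assumption: a 401 without credentials means auth is switched on.
    authRequired: response.status === 401,
  };
}

// Example (needs the server from this guide to be running):
// checkServer("http://localhost:8000/").then(console.log);
```

Seeing `authRequired: true` when no credentials are passed would match the `Unauthorized` response described above.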

Setup Client

For this guide, I’ve created a starting point so you don’t have to set up the boilerplate. Clone this repo and install dependencies.

I suggest you clone the repo to a different folder than the server to avoid confusion.

git clone git@github.com:kenzic/chromadb-demo.git
cd chromadb-demo
git fetch --all --tags
git checkout tags/basic-demo -b sandbox

Install dependencies:

yarn add chromadb openai

Great! Now, let’s start adding code to `upload.js`.

We’ll start by importing the ChromaDB client and creating a new instance.

import { ChromaClient } from 'chromadb'
const client = new ChromaClient({
  auth: { // Provide client with auth options
    provider: "basic", // Tells client to use basic auth
    credentials: "admin:admin" // Tells client to use the username "admin" and password "admin"
  }
});

Do not store credentials in your code!
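One common way to keep credentials out of the source is to read them from environment variables. The sketch below assumes two variable names, `CHROMA_USERNAME` and `CHROMA_PASSWORD`, which are hypothetical and not something ChromaDB defines. `credentialsFromEnv` is likewise an illustrative helper.

```javascript
// Hypothetical variable names; choose whatever suits your deployment.
function credentialsFromEnv(env = process.env) {
  const username = env.CHROMA_USERNAME;
  const password = env.CHROMA_PASSWORD;
  if (!username || !password) {
    throw new Error("Set CHROMA_USERNAME and CHROMA_PASSWORD before running.");
  }
  // Same "username:password" shape that the client config above uses.
  return `${username}:${password}`;
}

// Usage sketch, mirroring the client setup above:
// const client = new ChromaClient({
//   auth: { provider: "basic", credentials: credentialsFromEnv() },
// });
```

Failing fast when a variable is missing is a deliberate choice here: it is easier to debug than a client that silently sends empty credentials.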

Next we’ll create a new collection with `getOrCreateCollection`. A collection is a group of embeddings. For example, you might have a collection of product embeddings and another collection of user embeddings.

`getOrCreateCollection` takes a `name`, and an optional `embeddingFunction`.

  • `name` must contain only valid URL characters, be between 3 and 63 characters long, be unique, not contain two consecutive dots, not be an IP address, and start and end with a lowercase letter or digit.
  • If you provide an `embeddingFunction` you will need to supply that every time you get the collection.
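If you want to catch a bad name before it reaches the server, a rough local check might look like this. It is only an approximation of the rules listed above: `isValidCollectionName` is a hypothetical helper, the allowed character set (letters, digits, `.`, `_`, `-`) is an assumption standing in for “valid URL characters”, and uniqueness cannot be checked locally at all.

```javascript
// Approximate check for the naming rules listed above.
// Uniqueness is skipped: only the server knows which names are taken.
function isValidCollectionName(name) {
  if (typeof name !== "string") return false;
  if (name.length < 3 || name.length > 63) return false;
  // Must start and end with a lowercase letter or digit.
  // The middle character set is an assumption, not the server's exact rule.
  if (!/^[a-z0-9][A-Za-z0-9._-]*[a-z0-9]$/.test(name)) return false;
  // No two consecutive dots.
  if (name.includes("..")) return false;
  // Reject anything shaped like an IPv4 address, e.g. "192.168.0.1".
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(name)) return false;
  return true;
}

console.log(isValidCollectionName("nasaArticles")); // true
console.log(isValidCollectionName("ab")); // false (too short)
console.log(isValidCollectionName("192.168.0.1")); // false (IP address)
```

The server remains the source of truth, so treat a passing result as “probably fine” rather than a guarantee.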

To create a collection you can call the method `createCollection` on the client, but we’re going to use `getOrCreateCollection` instead. This will create the collection if it doesn’t exist, or return the existing collection if it does.

We will also use OpenAI’s Embedding API. To do so, you’ll need an API key, which you can obtain here.

import { OpenAIEmbeddingFunction } from 'chromadb'
// "apiKey" is a placeholder; substitute your own OpenAI API key
const embedder = new OpenAIEmbeddingFunction({openai_api_key: "apiKey"})

async function main() {
  const collection = await client.getOrCreateCollection({
    name: "nasaArticles",
    embeddingFunction: embedder
  });
}

Add Documents

One of the features that makes ChromaDB easy to use is that you can add your documents directly to the database, and ChromaDB will handle the embedding for you. A document is just plain text that you want to store and vectorize for later retrieval. Included in the repo are 5 articles from NASA’s Blog for our demo data.

// add data import
import data from "./data";

// update main
async function main() {
  const embedder = new OpenAIEmbeddingFunction({openai_api_key: "apiKey"});
  const collection = await client.getOrCreateCollection({
    name: "nasaArticles",
    embeddingFunction: embedder
  });

  // add the following:
  const ids = [];
  const documents = [];
  const metadatas = [];
  data.forEach((article) => {
    ids.push(article.id);
    documents.push(article.document);
    metadatas.push({
      title: article.title,
      url: article.url
    });
  });

  // Add documents to collection
  const result = await collection.add({
    ids,
    documents,
    metadatas
  });

  console.log('result', result);
}

Then run the script:

npx babel-node src/upload.js
> result true
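The upload code turns a list of articles into three parallel arrays, where position `i` in each array describes the same article. Here is the same transformation as a small standalone function. `toAddPayload` is a hypothetical helper and the sample records are invented; only the field names (`id`, `title`, `url`, `document`) come from the code above.

```javascript
// Invented sample records using the same field names as the demo data.
const sampleArticles = [
  { id: "1", title: "First post", url: "https://example.com/1", document: "Text of the first post." },
  { id: "2", title: "Second post", url: "https://example.com/2", document: "Text of the second post." },
];

// Build the parallel arrays: index i of each array belongs to the same article.
function toAddPayload(articles) {
  return {
    ids: articles.map((article) => article.id),
    documents: articles.map((article) => article.document),
    metadatas: articles.map((article) => ({ title: article.title, url: article.url })),
  };
}

const payload = toAddPayload(sampleArticles);
console.log(payload.ids); // [ '1', '2' ]
```

With a helper like this, the upload step would reduce to something along the lines of `await collection.add(toAddPayload(data))`.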

If you already have embeddings, you can store those directly by including the `embeddings` option in the call to `add`.
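When supplying your own vectors, it is worth checking that there is one vector per id and that every vector has the same number of dimensions before sending anything. The `assertAlignedEmbeddings` helper below is a hypothetical sketch of such a check, not part of the ChromaDB client.

```javascript
// Throws if ids and embeddings are misaligned; returns the shared dimension.
function assertAlignedEmbeddings(ids, embeddings) {
  if (ids.length !== embeddings.length) {
    throw new Error(`Got ${ids.length} ids but ${embeddings.length} embeddings.`);
  }
  const dimension = embeddings.length > 0 ? embeddings[0].length : 0;
  embeddings.forEach((vector, i) => {
    if (!Array.isArray(vector) || vector.length !== dimension) {
      throw new Error(`Embedding for id "${ids[i]}" does not have ${dimension} dimensions.`);
    }
  });
  return dimension;
}

// Toy vectors with made-up values.
const dimension = assertAlignedEmbeddings(["a", "b"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]);
console.log(dimension); // 3
```

After a check like this passes, the call would presumably look like `collection.add({ ids, embeddings, metadatas })`, with `embeddings` taking the place of (or accompanying) `documents`.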

Now that we have our documents added, let’s query them!

Query Documents

Add the following to `query.js`:

import { ChromaClient, OpenAIEmbeddingFunction } from 'chromadb'

const client = new ChromaClient({
  auth: { // Provide client with auth options
    provider: "basic", // Tells client to use basic auth
    credentials: "admin:admin" // Tells client to use the username "admin" and password "admin"
  }
});

const embedder = new OpenAIEmbeddingFunction({openai_api_key: "apiKey"})

async function main() {
  const collection = await client.getCollection({
    name: "nasaArticles",
    embeddingFunction: embedder
  });
}

Next we’ll create a query. A query is just a piece of text for which we want to find similar documents. For this we’ll ask a question about the space station.

// add to `main` function just under `const collection = ...`
const results = await collection.query({
  nResults: 1,
  queryTexts: ["What's happening on the space station?"]
});
console.log(JSON.stringify(results, null, 2));

Before running this, let’s take a look at the options we’re passing to `query`:

  • nResults: This is the number of results we want to return. In this case we’re asking for 1.
  • queryTexts: This is an array of documents we want to find similar documents to. In this case we’re only passing one document.

Now let’s run the query:

npx babel-node src/query.js

Nice! The result is the most similar document to our query. Change the query to see how it changes the results.

Three important fields to note:

  • distances: This is the distance between the query and the result. The lower the distance the more similar the result is to the query.
  • documents: This is the document whose embedded representation is closest to the query.
  • embeddings: This is null by default. Embeddings are large and can be expensive to return. If you want to return embeddings you’ll need to add `embeddings` to the list of fields in the `include` option.
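Because `queryTexts` is an array, each field in the result holds one inner array per query text. The sketch below assumes that nested shape and flattens the matches for a single query into plain objects. `matchesForQuery` is a hypothetical helper, and the sample result is invented rather than copied from a real run.

```javascript
// Invented result in the assumed shape: one inner array per query text.
const exampleResults = {
  ids: [["article-3"]],
  distances: [[0.31]],
  documents: [["Made-up text about life aboard the space station."]],
  metadatas: [[{ title: "Made-up title", url: "https://example.com/article-3" }]],
  embeddings: null,
};

// Collect the matches for one query text into an array of plain objects.
function matchesForQuery(results, queryIndex = 0) {
  const ids = results.ids[queryIndex] || [];
  return ids.map((id, i) => ({
    id,
    distance: results.distances ? results.distances[queryIndex][i] : null,
    document: results.documents ? results.documents[queryIndex][i] : null,
    metadata: results.metadatas ? results.metadatas[queryIndex][i] : null,
  }));
}

const matches = matchesForQuery(exampleResults);
console.log(matches[0].id, matches[0].distance);
```

Working with one object per match tends to be easier than indexing into four parallel nested arrays, especially once `nResults` is greater than 1.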

Final Thoughts

This was a very basic guide to setting up your first ChromaDB server and client. There are many more features that ChromaDB offers. I highly recommend you check out the docs to learn more. Things you’ll want to check out:

  • Embedding Functions: ChromaDB supports a number of different embedding functions, including OpenAI’s API, Cohere, Google PaLM, and Custom Embedding Functions.
  • Collections: There are a lot of methods and options for collections we didn’t cover. Before building your first app I recommend spending some time here.
  • Querying: Querying in ChromaDB is much more powerful than what we covered here. You can filter by metadata and document content, query with pregenerated embeddings directly, and choose which fields are included in the result.

We’ve just started to scratch the surface. Now, it’s your turn. Set up your own ChromaDB server, experiment with its capabilities, and share your experiences in the comments.

I will follow up this guide with a more in-depth YouTube search engine and recommendation system.

Important Callouts

  • By default, queries don’t return embeddings. If you want to return embeddings you’ll need to add `embeddings` to the list of fields in the `include` option.
  • ChromaDB is still very new. If you run into any issues, please connect with their team on Discord.
  • You can customize the distance method of the embedding space when you create the collection. By default it uses `l2`, but it supports `cosine` and `ip` as well.
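To get a feel for how the three distance methods differ, here is a small sketch that computes each one by hand. It assumes the usual conventions (squared Euclidean distance for `l2`, one minus the inner product for `ip`, one minus cosine similarity for `cosine`), so check the ChromaDB docs for the exact definitions in your version. The function names are illustrative.

```javascript
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// l2 (assumed): squared Euclidean distance.
function squaredL2(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// ip (assumed): one minus the inner product.
function innerProductDistance(a, b) {
  return 1 - dot(a, b);
}

// cosine (assumed): one minus cosine similarity.
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Same direction, different length.
const a = [1, 0];
const b = [2, 0];
console.log(squaredL2(a, b)); // 1
console.log(innerProductDistance(a, b)); // -1
console.log(cosineDistance(a, b)); // 0
```

The example vectors point in the same direction but differ in length: cosine distance ignores the length difference, while `l2` and `ip` do not. As far as I know, the method is chosen through collection metadata (for example `"hnsw:space": "cosine"`), but confirm that against the docs for the version you run.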

Next Steps

This article introduced the fundamentals of setting up a ChromaDB instance. Next, we should apply this knowledge practically. I invite you to follow my tutorial on leveraging ChromaDB to create a RAG app which allows you to have a conversation with your Notion Database:

Or create an app which searches YouTube videos based on the transcript:
