Getting Started with Chroma DB: A Beginner’s Tutorial

Random-long-int
4 min read · Mar 16, 2024

Are you interested in using a vector database for your next project? This tutorial introduces Chroma DB, a vector database that lets you store, retrieve, and manage embeddings. We’ll create a simple collection with hardcoded documents and run a simple query against it, then persist the embeddings to local storage. We’ll also cover how to run Chroma in Docker with persistent local storage, and how to add authentication to your Chroma server.

Prerequisites

To follow this tutorial, you will need Python and Docker installed on your local machine, plus the `chromadb` and `python-dotenv` packages (`pip install chromadb python-dotenv`).

What is Chroma DB?

Chroma DB is a vector database that lets you store, retrieve, and manage embeddings. In Python, the `chromadb` library can run Chroma locally inside your own process. Alternatively, Chroma can run as a separate server in client/server mode, and Python or JavaScript clients then connect to that server backend. Chroma can be configured to persist data on disk, and embeddings are grouped into collections, each identified by a unique name. The client object also provides utility methods such as `heartbeat()`, which checks that the server is reachable, and `reset()`, which empties the database.

A Simple Example

Let’s start by creating a simple collection with hardcoded documents and a simple query.

First, import the chromadb library and create a new client object. `chromadb.Client()` keeps everything in memory, so the data is gone when the script exits:

import chromadb

chroma_client = chromadb.Client()

Next, create a new collection with the create_collection() method:

collection = chroma_client.create_collection(name="personal_collection")

Now, add some documents to the collection using the add() method:

collection.add(
    documents=[
        "This is a document about machine learning",
        "This is another document about data science",
        "A third document about artificial intelligence"
    ],
    metadatas=[
        {"source": "test1"},
        {"source": "test2"},
        {"source": "test3"}
    ],
    ids=[
        "id1",
        "id2",
        "id3"
    ]
)

Finally, query the collection using the query() method:

results = collection.query(
    query_texts=[
        "This is a query about machine learning and data science"
    ],
    n_results=2
)

print(results)

The output should look like this (the exact distances can vary with the version of the default embedding model):

{
'ids': [['id1', 'id2']],
'distances': [[0.5817214250564575, 0.6953163146972656]],
'metadatas': [[{'source': 'test1'}, {'source': 'test2'}]],
'embeddings': None,
'documents': [['This is a document about machine learning',
'This is another document about data science']],
'uris': None,
'data': None
}
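Every field in the result is a list of lists: one inner list per entry in `query_texts`, holding the hits for that query. As a minimal sketch, the snippet below copies the output shown above into a plain dictionary and flattens the hits of the first query into one record per document. The helper name `flatten_results` is hypothetical and not part of the chromadb API.

```python
# Sample result copied from the tutorial output above (no Chroma call needed).
results = {
    "ids": [["id1", "id2"]],
    "distances": [[0.5817214250564575, 0.6953163146972656]],
    "metadatas": [[{"source": "test1"}, {"source": "test2"}]],
    "embeddings": None,
    "documents": [[
        "This is a document about machine learning",
        "This is another document about data science",
    ]],
    "uris": None,
    "data": None,
}


def flatten_results(results, query_index=0):
    """Hypothetical helper: return one dict per hit for the given query."""
    return [
        {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        for doc_id, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = flatten_results(results)
for hit in hits:
    # A smaller distance means the document is closer to the query.
    print(f'{hit["id"]}  {hit["distance"]:.3f}  {hit["document"]}')
```

With a single query text, `query_index=0` is the only valid index; with several query texts, call the helper once per index.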

Set Up Persistent Storage

To keep the generated embeddings between runs, let’s create a local folder and point Chroma DB at it.

First, create the folder for the local storage and save its path in a `.env.local` file:

mkdir -p /abs/path/to/local/db
# run this command from the root of your Python project
echo "STORAGE_PATH=/abs/path/to/local/db" > .env.local

Then, use the following code to create a collection in the persistent storage and add the documents to it:

import os

import chromadb
from dotenv import load_dotenv

# Read STORAGE_PATH from the .env.local file created above
load_dotenv('.env.local')

storage_path = os.getenv('STORAGE_PATH')
if storage_path is None:
    raise ValueError('STORAGE_PATH environment variable is not set')

# PersistentClient saves the collections to disk under storage_path
client = chromadb.PersistentClient(path=storage_path)

collection = client.get_or_create_collection(name="test")

# Add the documents only on the first run, when the collection is still empty
if collection.count() == 0:
    collection.add(
        documents=[
            "This is a document about machine learning",
            "This is another document about data science",
            "A third document about artificial intelligence"
        ],
        metadatas=[
            {"source": "test1"},
            {"source": "test2"},
            {"source": "test3"}
        ],
        ids=["id1", "id2", "id3"]
    )
print(collection.count())

# Listing collections is fine for a local client; if you run an HTTP server,
# treat the equivalent endpoint as private:
print(client.list_collections())

The output should be:

3
[Collection(name=test)]
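The `mkdir -p` step can also be done from Python, so the script does not depend on the folder already existing. The following is a minimal sketch using only the standard library; the helper name `resolve_storage_path` is hypothetical, and it takes the environment as a mapping so it can be tried without touching your real environment variables.

```python
import os
import tempfile
from pathlib import Path


def resolve_storage_path(env=os.environ, env_var="STORAGE_PATH"):
    """Hypothetical helper: read the storage path and create the folder if needed."""
    raw_path = env.get(env_var)
    if raw_path is None:
        raise ValueError(f"{env_var} environment variable is not set")
    path = Path(raw_path).expanduser().resolve()
    # Equivalent of `mkdir -p`: create missing parents, ignore an existing folder.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory instead of a real storage location.
with tempfile.TemporaryDirectory() as tmp:
    fake_env = {"STORAGE_PATH": os.path.join(tmp, "local", "db")}
    storage_path = resolve_storage_path(fake_env)
    print(storage_path.is_dir())
```

In the script above, the returned path could be passed as `path=str(storage_path)` to `chromadb.PersistentClient`.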

Dockerize Chroma DB with Persistent Storage

To run Chroma using Docker with persistent storage, first create a local folder where the embeddings will be stored and pass it as an argument when running the container:

# If you already created the folder in the previous step,
# skip straight to the docker command
mkdir -p /path/to/local/db

docker run -p 8000:8000 -v /path/to/local/db/:/chroma/chroma chromadb/chroma

The -p flag publishes the server’s port 8000 on your machine, and -v mounts the local folder at /chroma/chroma inside the container, where Chroma stores its collections of embeddings.

Open http://localhost:8000/api/v1 to check that the server is running. Once you have added a collection, you can confirm that it exists at http://localhost:8000/api/v1/collections

If you followed the previous example with persistent storage and mount the same local storage path, you will see the `test` collection.
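The same check can be scripted instead of done in a browser. This is a minimal sketch using only the Python standard library, assuming the server from the Docker command above is listening on localhost:8000 without authentication. The helper names `api_url` and `is_server_up` are hypothetical.

```python
import http.client
import json
import urllib.error
import urllib.request


def api_url(path="", host="localhost", port=8000):
    """Build a URL under the /api/v1 prefix used in this tutorial."""
    base = f"http://{host}:{port}/api/v1"
    return f"{base}/{path.lstrip('/')}" if path else base


def is_server_up(host="localhost", port=8000, timeout=2.0):
    """Return True if the API root answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(api_url(host=host, port=port), timeout=timeout) as response:
            json.loads(response.read().decode("utf-8"))
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-HTTP reply, error status, or non-JSON body.
        return False


print(api_url("collections"))
print("Chroma reachable:", is_server_up())
```

If the container is not running, the last line reports `False` rather than raising, which makes the helper convenient as a pre-flight check in scripts.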

Adding Authentication for Security

According to the Chroma Usage Guide for Static API Token Authentication, header-based authentication with tokens requires the following environment variables:

export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"
export CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER="X_CHROMA_TOKEN"
export CHROMA_SERVER_AUTH_CREDENTIALS="test-token"

You can generate a proper token or, for a quick test only, pass an insecure value like the one above. I used a randomly picked online token generator, updated the token value, and added the environment variables to the Docker command. Voilà!

docker run \
-p 8000:8000 \
-e CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider" \
-e CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider" \
-e CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER="X_CHROMA_TOKEN" \
-e CHROMA_SERVER_AUTH_CREDENTIALS="test-token" \
-v /path/to/local/db/:/chroma/chroma \
chromadb/chroma

This command sets up a complete vector database with persistent storage and authentication on your local machine using Docker!
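A token can also be generated locally rather than with a web tool. The sketch below uses Python’s standard `secrets` module to produce a random hexadecimal string; the helper name `generate_token` and the 32-byte length are illustrative choices, not requirements stated by Chroma.

```python
import secrets


def generate_token(num_bytes=32):
    """Return a random token as a hex string (two characters per byte)."""
    return secrets.token_hex(num_bytes)


token = generate_token()
# Print a line that can be pasted into the shell before starting the container.
print(f'export CHROMA_SERVER_AUTH_CREDENTIALS="{token}"')
```

The generated value would replace `test-token` both in the Docker command and in the request header sent by clients.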

To check that authentication works, send a simple curl request without a token once the container is running:

curl http://localhost:8000/api/v1/collections

The output should be:

{"error":"AuthorizationError","message":"Unauthorized"}

And with the token set in the `X-Chroma-Token` header (the header selected by the `X_CHROMA_TOKEN` setting above):

curl -H "X-Chroma-Token: test-token" http://localhost:8000/api/v1/collections

The output should look like this (the id will differ on your machine):

[{
"name":"test",
"id":"b70ea7d9-9950-4a6c-833a-2e42e23acb70",
"metadata":null,
"tenant":"default_tenant",
"database":"default_database"
}]
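The authenticated curl call can be mirrored in Python. This is a minimal sketch with the standard library that only builds the request, so it can be inspected without a running server; the helper name `build_request` is hypothetical, and the URL and `test-token` value are the ones used in the curl commands above.

```python
import urllib.request

COLLECTIONS_URL = "http://localhost:8000/api/v1/collections"


def build_request(url, token=None):
    """Build a GET request, adding the X-Chroma-Token header when a token is given."""
    headers = {"X-Chroma-Token": token} if token else {}
    return urllib.request.Request(url, headers=headers)


request = build_request(COLLECTIONS_URL, token="test-token")
# urllib normalizes the header name's capitalization; HTTP header names are
# case-insensitive, so the server still sees the same header.
print(request.get_method(), request.full_url)
print(request.header_items())
```

Passing the request to `urllib.request.urlopen(request)` sends it; without the token, the server configured above is expected to reject the call, as in the first curl example.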

A Recap

I hope you enjoyed learning about Chroma DB as much as I did! In this tutorial, I introduced you to this vector database, which lets you store, retrieve, and manage embeddings. I showed you how to create a simple collection with hardcoded documents and run a simple query, as well as how to persist the embeddings to local storage. I also covered how to run Chroma in Docker with persistent local storage, and how to add authentication to your Chroma server.

I encourage you to give Chroma DB a try and see how it can benefit your projects. With its easy-to-use API and scalable architecture, Chroma DB is a great choice for any application that requires fast and accurate similarity search. (Also useful for a powerful local LLM… coming soon!)

Thanks for following along and happy coding! 🚀