Hyperfast Vector Search with Binary Quantization

Dave Martin
Qdrant Vector Database Blog
4 min read · Sep 15, 2023

The new release of Qdrant 1.5 brings a significant memory and performance improvement to the already efficient storage and search of high-dimensional vectors through binary quantization. In machine learning, quantization is the transformation of the original vectors into new representations that compress the data while preserving close to the original relative distances between vectors. It lets you reduce the memory footprint and accelerate search over high-dimensional vectors, with a small sacrifice in accuracy. Quantization is crucial for developers working with large datasets, and Qdrant exposes it so you can tune the search engine for your specific use case, striking a balance between accuracy, storage efficiency, and search speed.

For those less familiar, Qdrant is a leading high-performance, scalable, open-source vector database, essential for building the next generation of AI/ML applications. Qdrant is purpose-built to let developers store and search high-dimensional vectors efficiently, and it can be used via ready-made clients for Python and other programming languages, including Go, Rust, and JavaScript. Qdrant easily integrates with your choice of Large Language Model (LLM), framework, and cloud provider, including Cohere, DocArray, Hugging Face, OpenAI, LangChain, LlamaIndex, AWS, Google, and Microsoft.

Benefits of Quantization with Qdrant

We know that vector embeddings are the best current method to turn unstructured data, such as documents, images, audio, or video, into numerical representations. Neural networks learn to create vectors that maximize the amount of information extracted from the data given a chosen vector size.

For many models, the output embedding contains more than 500 32-bit floating-point numbers. For example, OpenAI's text-embedding-ada-002 model produces 1,536-element vectors, which means each vector alone takes about 6 kB of storage. You also want an index for fast search and retrieval, and this consumes additional memory. Qdrant’s guidance on the memory consumed by a standard HNSW index along with the vectors is:

memory_size = 1.5 * number_of_vectors * vector_dimension * 4 bytes

This means that if we have 1,000,000 of these 1,536-dimensional vectors, we would need roughly 9 GB of RAM and disk space. This consumption adds up rapidly as you create multiple collections or add more items to the database.
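As a quick sanity check, here is a small Python sketch of that estimate (the helper name is just for illustration, not part of the Qdrant API):

def estimated_memory_bytes(number_of_vectors: int, vector_dimension: int) -> float:
    """Rough RAM/disk estimate for float32 vectors plus a standard HNSW index,
    following the rule of thumb above (4 bytes per vector element)."""
    return 1.5 * number_of_vectors * vector_dimension * 4

# 1,000,000 vectors of 1,536 float32 elements -> roughly 9.2 GB
print(estimated_memory_bytes(1_000_000, 1536) / 1e9)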

Quantization — going Scalar, then Binary

One of the ways to tame this data growth is to use a technique called quantization. You trade a small amount of precision in your search results for smaller data sizes and faster retrieval times. Our default recommended method is scalar quantization.

Improved Storage Efficiency

With scalar quantization, the floating-point numbers in the vectors are converted to integers: each element goes from a 32-bit number to an 8-bit number, a 4x reduction. This can also be thought of as a lossy compression technique: you are reducing the information encoded in the vector, and with it the storage requirements.

It’s similar to the idea of taking a TIFF photograph and turning it into a GIF image (which allows only 8 bits per pixel). The GIF file will be smaller and faster to open, but you won’t see as much detail or color gradation in the image. Similarly, once you turn the image into a GIF, you can’t convert it back to a TIFF of the same quality.

Example: From left to right, TIFF image, reduced TIFF, and GIF

In vector search, for many types of problems, you sacrifice only a small amount of precision in exchange for large savings in space and a commensurate gain in performance. The process we use in Qdrant reduces the memory footprint, speeds up the search process, and is covered extensively in a previous blog post.
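If you want to try scalar quantization yourself, a minimal sketch with the Qdrant Python client looks roughly like this; the collection name and vector size below are placeholders, and the full set of options is in the Qdrant documentation:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")

# Create a collection whose vectors are additionally stored as int8
client.create_collection(
    collection_name="scalar_demo",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,   # 32-bit floats -> 8-bit integers
            quantile=0.99,          # clip outliers before quantizing
            always_ram=True,        # keep the quantized vectors in RAM
        ),
    ),
)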

Fast Search and Retrieval

Qdrant uses a special set of SIMD CPU instructions to speed up similarity searches. These instructions work with the smaller 8-bit integers, so the conversion reduces the search space and accelerates query responses.

Binary quantization is an extreme case of scalar quantization, representing each vector component as a single bit. With binary quantization, each 32-bit float is reduced to a single bit (either a 1 or a 0), shrinking the memory footprint by a factor of 32, and similarity searches consume fewer CPU instructions, enabling up to a 40x speedup in vector comparisons.

Using our photograph example from above, this is like taking a TIFF photograph and turning it into a black-and-white image (not grayscale). Again you are giving up more precision for decreased storage size and increased performance.

Example: TIFF to Binary (Black & White)
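Enabling binary quantization on a collection follows the same pattern as the scalar example above. Here is a minimal sketch using the Qdrant Python client, again with a placeholder collection name and vector size:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    BinaryQuantization, BinaryQuantizationConfig,
)

client = QdrantClient(url="http://localhost:6333")

# Create a collection that also stores a 1-bit-per-component copy of each vector
client.create_collection(
    collection_name="binary_demo",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True),  # keep binary vectors in RAM
    ),
)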

Here be dragons

Binary quantization showed good accuracy in tests: in our own experiments, binary quantization of the Cohere AI embed-english-v2.0 model's 4,096-dimensional vectors produced 98% recall@50 with 2x oversampling. Models with lower dimensionality or different distributions of vector components will require additional experimentation to find optimized quantization parameters. The oversampling above means Qdrant uses the binary index to return a neighborhood of similar vectors that is 2x the requested size. Then, using this much smaller subset of vectors, it rescores the candidates with the original 32-bit vectors to get much better precision.
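In the Python client, oversampling and rescoring are controlled through the quantization search parameters. A minimal sketch might look like the following, where the query vector and collection name are placeholders:

from qdrant_client import QdrantClient
from qdrant_client.models import SearchParams, QuantizationSearchParams

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="binary_demo",
    query_vector=[0.1] * 1536,  # placeholder query embedding
    limit=50,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            rescore=True,       # re-rank candidates with the original float32 vectors
            oversampling=2.0,   # fetch 2x the candidates from the binary index first
        ),
    ),
)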

Getting Started with Qdrant

In this blog, we’ve covered scalar and binary quantization, the benefits and trade-offs, and we hope you’re ready to start experimenting with your own data.

Getting started is easy: either spin up a container image or start a free Cloud account. Our documentation covers adding data to your Qdrant instance as well as creating your indices. We would love to hear about what you are building, so please connect with our engineering team on GitHub, Discord, or LinkedIn.
