Finding a Vector Database to Search the Earth

Zoe Statman-Weil
Earth Genome
5 min read · Sep 18, 2023

With the vector search landscape changing daily, identifying a vector database to be the backbone for earth observation similarity search at a reasonable cost has proven to be a challenge.

At Earth Genome, we have been busy building the technology behind Earth Index, a platform for searching the earth’s surface for environmental features and change. A key technological component of the infrastructure behind this platform is a vector database that can perform similarity search on earth observation embeddings (satellite imagery translated into low-dimensional vectors). Vector database and search engine options are appearing at a shocking clip, and the individual databases are developing fast. Here is how we approached the problem of choosing the one for us.
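To make the core operation concrete, here is a toy sketch (with made-up dimensions and random vectors, not our actual embedding pipeline) of what similarity search over embeddings boils down to: scoring a query vector against a stored corpus with a metric like cosine similarity and keeping the best matches. A vector database does the same thing at scale, using an index rather than a brute-force scan.

```python
import numpy as np

# Toy corpus: 1,000 hypothetical 512-dimensional tile embeddings.
# In practice these come from an embedding model run over satellite imagery tiles.
rng = np.random.default_rng(0)
tile_embeddings = rng.normal(size=(1000, 512))
query = rng.normal(size=512)  # embedding of a tile we already know contains a mine

# Cosine similarity between the query and every stored tile.
scores = tile_embeddings @ query / (
    np.linalg.norm(tile_embeddings, axis=1) * np.linalg.norm(query)
)

# Keep the most similar tiles; a vector database does this with an ANN index
# instead of a brute-force scan.
top_k = np.argsort(scores)[::-1][:10]
print(top_k, scores[top_k])
```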

Earth Index demo: searching for gold mining in the Amazon

Requirements

Our requirements differ from many other applications of vector similarity search, which often involve culling large amounts of data into a handful of matches at top speed (think product search for a large fashion retailer). In contrast, we need a database / search engine that can return hundreds to tens of thousands of results, so we can build a dataset like mining activity across the Amazon. Additionally, unlike many similarity search applications, search speed is a factor but not a priority: waiting a few seconds to find all new mines in the Amazon is a reasonable expectation.

Another unique requirement is the scale at which we envision our platform operating. With historical satellite imagery readily available and new high-quality data collected daily, our infrastructure needs to scale rapidly and at relatively low cost. Our needs encompass both dynamically scaling the vector database/service and indexing our data at a pace that keeps up with the incoming flow of satellite imagery. Because most vector databases rely on large amounts of RAM to deliver search results quickly, the price of vector search can skyrocket. Scaling options would let us control our resource use, and thus cost, as much as possible.

Some other factors we considered, but that did not make our priority list, were index types and similarity metrics (we have found that common indexes like HNSW and IVF work well with our data), documentation and API usability, and the ability to filter results by geospatial metadata. While we initially treated open source and free as a requirement, as a small team we became open to the idea of using a managed database and offloading that responsibility.
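As an illustration of the kind of geospatial filtering we had in mind, here is a hypothetical sketch using the Qdrant Python client (Qdrant comes up again below); the collection name, payload field, and coordinates are made up:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, GeoPoint, GeoRadius

client = QdrantClient(url="http://localhost:6333")

# Hypothetical filter: only return tiles whose "location" payload falls within
# ~50 km of a point in the Amazon basin.
geo_filter = Filter(
    must=[
        FieldCondition(
            key="location",
            geo_radius=GeoRadius(
                center=GeoPoint(lon=-63.0, lat=-8.5),
                radius=50_000.0,  # meters
            ),
        )
    ]
)

hits = client.search(
    collection_name="tiles",   # hypothetical collection name
    query_vector=[0.0] * 512,  # placeholder query embedding
    query_filter=geo_filter,
    limit=10_000,
)
```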

Process of elimination

The vector database landscape is growing and changing daily. In Summer 2022 we eliminated Milvus as an option because filtering by string metadata was not possible; by September 2022, the VARCHAR scalar data type was available. The community was griping about the latency of PostgreSQL’s pgvector extension a month ago; HNSW indexes have already been added and performance has improved.
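As a rough illustration of how quickly that gap closed, the sketch below shows what an HNSW-backed pgvector setup can look like; it assumes pgvector 0.5.0 or later (which added HNSW), and the connection string, table, and column names are hypothetical:

```python
import psycopg2

# Hypothetical connection and schema; assumes the pgvector extension at
# version 0.5.0 or later, which introduced HNSW indexes.
conn = psycopg2.connect("dbname=earth user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(
    "CREATE TABLE IF NOT EXISTS tile_embeddings ("
    "  id bigserial PRIMARY KEY,"
    "  embedding vector(512)"
    ");"
)
# HNSW index using cosine distance.
cur.execute(
    "CREATE INDEX IF NOT EXISTS tile_embeddings_hnsw "
    "ON tile_embeddings USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()

# Nearest-neighbour query; <=> is pgvector's cosine distance operator.
query_vector = [0.0] * 512  # placeholder query embedding
cur.execute(
    "SELECT id FROM tile_embeddings ORDER BY embedding <=> %s::vector LIMIT 10000;",
    ("[" + ",".join(str(x) for x in query_vector) + "]",),
)
ids = [row[0] for row in cur.fetchall()]
```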

Our initial list of vector databases created in Summer 2022 included Embedding Hub, Milvus, Vald, Vespa, Vertex AI, Pinecone, pgvector, Qdrant, and Weaviate.

Some were eliminated for lack of dynamic scaling, like Weaviate and Embedding Hub; others due to cost, such as Google’s Vertex AI. Pinecone was axed because it caps the number of results returned at 10,000. Eventually we narrowed our options down to Qdrant and Milvus. At the time Qdrant was still a nascent company, while Milvus was much more established in the space.

Comparing the leading contenders

Both Qdrant and Milvus are developing features and stabilizing existing ones at a fast clip. We started out using Qdrant as our vector database because of its ease of use and metadata filtering capabilities (and the lack of string metadata filtering in Milvus at the time of our research, noted above), but we eventually did a deep dive into Milvus, throwing millions of records into both.

Qdrant is open source and offers a cloud-hosted option, giving us flexibility in how involved we are with database management. The Qdrant team was helpful and responsive throughout, a huge bonus when working with a rapidly developing product. Our initial draw to Qdrant was the geospatial filtering option, and while that is no longer a priority for us, Qdrant has added many other intriguing features since then, such as quantization, which could both reduce cost by decreasing memory use and speed up search. As another method of managing cost, Qdrant offers memmap storage, which can reduce the memory footprint without losing much speed. Aside from the lack of horizontal scaling and autoscaling, and some concerns about cost, Qdrant hit our requirements.
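A hypothetical sketch of what those two cost levers look like in the Qdrant Python client; the collection name, vector size, and thresholds are illustrative, not our production settings:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    OptimizersConfigDiff,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="tiles",  # hypothetical collection name
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
    # Segments above this size (in KB) are stored as memory-mapped files on
    # disk, shrinking the RAM footprint at a small cost in latency.
    optimizers_config=OptimizersConfigDiff(memmap_threshold=20000),
    # Scalar int8 quantization keeps compressed vectors in RAM for scoring,
    # roughly a 4x reduction in memory use per vector.
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)
```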

Milvus was very easy to set up in Kubernetes internally, and Zilliz offers a cloud-hosted option. Milvus offers many index and distance metric options and is used by large companies, indicating it can handle the scale at which we will eventually need to process and search data, although results returned max out at ~16K, lower than we were aiming for in our requirements. Milvus’ system is more complicated than Qdrant’s, requiring many different containers rather than a single one, so it is a little harder to operate and debug when managing it yourself. An interesting Milvus feature is the ability to load and unload data in memory, potentially reducing long-term cost, especially as scaling options improve. Milvus also doesn’t offer autoscaling at this time, but manual horizontal scaling was easy enough. While the return limit was not ideal, Milvus also seemed to hit our requirements for now.
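A hypothetical sketch of that load/unload pattern with the pymilvus client; connection details, collection name, and field names are made up:

```python
from pymilvus import Collection, connections

# Hypothetical connection to a self-hosted Milvus instance.
connections.connect(alias="default", host="localhost", port="19530")

collection = Collection("tiles")  # hypothetical collection name

# Load the collection into memory only while we need to search it...
collection.load()
results = collection.search(
    data=[[0.0] * 512],      # placeholder query embedding
    anns_field="embedding",  # hypothetical vector field name
    param={"metric_type": "IP", "params": {"ef": 128}},  # must match the index
    limit=10000,
)

# ...then release it so the memory can be reclaimed between search sessions.
collection.release()
```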

What’s next

Both Qdrant and Milvus were viable options for us: both are easy to use, with regular feature additions and bug fixes. However, it quickly became clear that the cost of using these vector databases at the scale we are aiming for would soon be a limiting factor. For this reason, the team is looking into options such as using the PostgreSQL pg_embedding extension with Neon, so compute resources can scale down to zero and complexity can be reduced, since we already use and are familiar with PostgreSQL.

No matter what vector service we move forward with, it is clear that at the speed the landscape is changing, we need to build and code with the flexibility to switch our database without any substantial disruption. Whether we are building an API for querying data, or a pipeline for processing and indexing embeddings, we will program with the flexibility needed to keep up with the ever changing vector space.
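One way to keep that flexibility is a thin interface between our application code and whatever backend sits behind it, with one adapter per database. A minimal sketch (hypothetical names, not our actual codebase):

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class SearchHit:
    id: str
    score: float
    payload: dict


class VectorStore(Protocol):
    """The minimal surface our API and indexing pipeline are allowed to rely on."""

    def upsert(
        self,
        ids: Sequence[str],
        vectors: Sequence[Sequence[float]],
        payloads: Sequence[dict],
    ) -> None: ...

    def search(
        self,
        vector: Sequence[float],
        limit: int,
        filters: dict | None = None,
    ) -> list[SearchHit]: ...


# Concrete adapters (e.g. a QdrantStore, MilvusStore, or PgVectorStore) implement
# this protocol, so swapping databases means writing one new adapter rather than
# touching every query site.
```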
