Knowledge Graph Search of 60 Million Vectors with Weaviate
Building a scalable Knowledge Graph search for 60+ million academic papers with Weaviate vector search
Keenious is a search engine designed for students, researchers, and the curious! Our add-on app works directly from your text editor, analyzing your entire document and finding highly relevant results as you work. Try it now for free here!
At Keenious we place a lot of value on serendipity in our search results. Unlike traditional search, our academic search engine balances directly relevant results (keywords etc.) with serendipitous results that relate to the input document via their semantic meaning. Serendipitous results facilitate a continuous sense of discovery and exploration of research and topics, both of which are vital for researchers and students of all levels.
Finding the right balance is hard and changes based on user intent. There is no tried and true methodology for finding the best recipe. We use intuition, trial and error, and most importantly feedback from our users to find the best mix.
Recently, we’ve been exploring ways to introduce a form of semantic search that can be performed without text-based semantic vectors. Our reasoning behind this was to enable deep exploration and discovery of research and topics without the user necessarily having a document to search on. We often refer to this as the cold start problem i.e. a situation where a user has no paper or document text to search on.
It’s been a huge focus of ours to develop a solution that allows users to easily discover personalized research recommendations from just a single prompt or input (a paper they like, a topic, etc.), with no document or search query required! The solution we landed on was to use Knowledge Graphs (KGs) in combination with a fast vector search solution (Weaviate).
In this article, we’ll touch briefly on Knowledge Graphs and how we have leveraged them at Keenious while primarily focusing on how we scaled the graph and made its embeddings searchable using Weaviate, including an overview of Weaviate and some information on why Weaviate stood out to us as the right tool for this task.
A Quick Primer on Keenious
Before we get going, you might be unfamiliar with Keenious and what it does. Keenious is a tool that analyzes your writing and shows you the most relevant research from millions of online publications, in seconds.
We believe that learning isn’t a static process, so research shouldn’t be either. With Keenious, your document is the search query. Our add-on analyzes your text — as you write it — and finds the most relevant research for you every step of the way. Keenious unearths hidden research gems by sifting through interdisciplinary topics, fields, and research areas. You can change the direction of your research with a single click.
If you need to search for something more specific you can individually explore every sentence in your paper or document using focus search. We’ll narrow your search down while keeping it relevant to the rest of your document.
- Keenious lives right inside your word processor of choice (Microsoft Word or Google Docs).
- Keenious analyzes your entire document with a single click and finds highly relevant academic research from a collection of over 60 million research papers.
- Focus your searches by highlighting a selection of text to conduct a more precise search that still considers the overall context of your document.
- Easily filter your search by adding research topics, date ranges, keywords and more!
- Bookmark your favorite papers and avoid needless scrolling with our smart library that organizes bookmarks intelligently relative to your current document.
- And much more coming very soon 👀
The Academic Knowledge Graph
Knowledge Graphs (KGs) are a deep subject with a lot of domain-specific terminology; for a great intro to KGs, check out this article:
WTF is a knowledge graph?
Unpicking a tangle of terminology to conclude it’s semantic, smart and alive.
However, if time is of the essence, below are some brief but by no means exhaustive points about Knowledge Graphs.
- AKA Semantic Networks: They are a network that connects multiple real-world entities and concepts and distinguishes the different ways in which they can all relate to one another. Almost any concept/entity/system can be abstracted as a Knowledge Graph.
- More formally, KGs are heterogeneous graphs, in which multiple types of nodes and/or edges can exist. For example, the MovieLens dataset can be abstracted to a KG consisting of nodes of many types, such as movies, actors, genres, and languages, and edges that specify the relationship between two entities, i.e. actors → star in → movies.
- They are in direct contrast to homogeneous graphs where all nodes and edges are of the same type, for example, a friendship (social) network.
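To make the contrast concrete, here is a minimal sketch of a heterogeneous graph in the MovieLens style described above. All names and entities are illustrative, not taken from an actual dataset; the point is that node and edge *types* are explicit, which is exactly what a homogeneous graph lacks.

```python
# Nodes carry an explicit type alongside their attributes.
nodes = {
    "movie:heat":   {"type": "movie", "title": "Heat"},
    "actor:pacino": {"type": "actor", "name": "Al Pacino"},
    "genre:crime":  {"type": "genre", "name": "Crime"},
}

# Edges are (head, relation, tail) triples; the relation type matters.
edges = [
    ("actor:pacino", "stars_in",  "movie:heat"),
    ("movie:heat",   "has_genre", "genre:crime"),
]

# A heterogeneous graph has more than one node type and/or edge type.
node_types = {attrs["type"] for attrs in nodes.values()}
edge_types = {rel for _, rel, _ in edges}
```

In a homogeneous friendship network, `node_types` and `edge_types` would each be a single type; here there are three node types and two relation types.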
Our Knowledge Graph’s primary entities (nodes) are papers, and its primary relations (edges) are citations between them, along with other paper-specific metadata that enriches the graph.
Using these node and edge relationships, we were able to create an academic knowledge graph and train a custom model to generate very rich graph embeddings, where each embedding represents a unique node (papers included) in our dataset. These embeddings form the foundation of this search and are what we feed to Weaviate to make each unique entity in our KG discoverable.
Weaviate Vector Search Engine
Weaviate is a vector search engine helping to power the future of AI search and discovery. Vector search is at an important and really interesting juncture right now, where it’s maturing into the mainstream of search and its benefits are impossible to deny.
Much like how the inverted index changed how we conduct full-text search, vector search engines like Weaviate are powering the next generation of search on unstructured data in text, image, and in our case the knowledge graph.
Data Objects and Mixed Index Search
From the ground up, the architecture of Weaviate is well thought out and considered. The data objects in Weaviate are based on a class-property structure, with vectors attached to each data object. This makes all objects in Weaviate easy to query natively using GraphQL, especially when complex filters and scalar values are part of your query. In fact, the combination of both a traditional inverted index and a vector index is part of what makes Weaviate really stand out. Users can choose to include or exclude data objects with certain scalar values (text, numbers etc.) from the vector search, all in the same query.
We really liked the inclusion of traditional search filtering as Keenious’s search is itself already built upon powerful filtering options which we can mirror in our Weaviate instance.
Plug and Play Design
Another interesting design aspect of Weaviate is that its API is highly modular. Most notably, the vector index API is structured to work as a plugin system, which “future-proofs” Weaviate to adapt to the ongoing improvements in vector search.
Weaviate’s current vector index type is HNSW, a state-of-the-art Approximate Nearest Neighbor (ANN) vector search algorithm. ANN search is a very active field of research, and new index architectures that improve recall and efficiency are being presented all the time. Because Weaviate’s vector index API is backend agnostic, when the latest and greatest ANN index is added to Weaviate’s available plugins, users will likely be able to switch over with minimal changes to their setup, if any.
I think choosing to spend the time designing an API that can be adapted to any vector index in the future was a really excellent choice. Too many text search engines are stuck using retrieval methods from 20+ years ago that have long since been surpassed but can’t be replaced because the code is too tightly coupled.
Integrations and Modules
Adding to Weaviate’s overall modular approach are the functionality modules built on top of the search, including many custom vectorization modules that provide out-of-the-box data-to-vector transforms. There are some really powerful modules to choose from, and it’s also very easy to create your own module if your use case has a couple of routine transformation steps involved in the process. Below are some of the standouts for me from the available modules:
- text2vec-contextionary: A very interesting feature that essentially embeds a contextualized representation of each data object added to the database using both the values being inserted and their relations as the text that is to be vectorized.
- text2vec-transformers: Leverages the rich sentence embedding models from sentence-transformers to create a paragraph/document embedding for each textual object imported into Weaviate, so that the developer doesn’t have to implement the inference code themselves.
- img2vec-neural: Similar to text2vec this module automatically vectorizes images into vector representations using large pre-trained computer vision models to seamlessly enable semantic search on any image.
Note: At the time of writing the horizontal scalability feature of Weaviate has just released its release candidate v.1.8.0-rc0. It is expected to have a stable release by fall 2021.
While our use case currently fits on a single-node instance of Weaviate, we suspect that eventually we’re going to need a vector search solution that scales indefinitely. In the world of vector search this has been tricky: many vector search algorithms hit a performance ceiling once a certain number of vectors have been added. Weaviate is designed to also scale horizontally as a cluster of nodes, much like Elasticsearch currently does for text search. Effectively, the horizontally scalable version of Weaviate consists of an index broken up into many shards, i.e. small ANN indexes that can then be distributed across a number of nodes.
With this setup, there is effectively no limit to how many objects can be added to a Weaviate cluster as it can be scaled to any use case without any performance sacrifices.
Horizontal scalability is probably the most critical feature needed for a vector search engine to be truly production-ready, and Weaviate is really well-positioned for it. Additionally, the entire codebase, including the custom implementation of HNSW, is written in Go, a language that lends itself very well to large scalable systems given its built-in concurrency primitives and networking libraries.
Powering Knowledge Graph Search With Weaviate
Sidenote: One useful piece of information to pass on for anyone considering Weaviate is that it has very different memory requirements depending on which “mode” you need it for. If you are actively indexing and adding new objects, i.e. lots of writes, then memory consumption will likely be at its highest. On the other hand, if, like us, you are doing one big index of a lot of objects, you can restart Weaviate afterwards and use a fraction of the memory, because not all the vectors need to be kept in memory once inserts are done.
Now that we’ve done a deep dive into Weaviate and its nuts and bolts let’s discuss how Keenious is actually using Weaviate to power our upcoming Knowledge Graph search features.
While Weaviate has a number of functional modules as previously mentioned, at its core it is a pure vector-native database and search engine. Since we have trained our own custom model to produce rich embeddings for items in our academic Knowledge Graph all of our vectors were imported directly into Weaviate without any transforms.
Our use case, which started at 50 million items to index, has quickly grown to over 60 million, and we (along with the Weaviate team) learned a lot about how to successfully import this many items into a single-node version of Weaviate while keeping searches fast and memory usage (relatively) low.
Currently, we build the index and database on our own custom workstation, lovingly known by everyone here at Keenious as Goku. Soon we’ll be migrating this step to become a Kubernetes job. Weaviate has a very useful tool for quickly creating a base docker-compose file to get things going, check it out here.
We used the docker-compose file as our base, adding some custom environment variables, most notably Go’s GOGC rate, which controls the aggressiveness of Go’s garbage collection. We found success in setting this to values lower than the default of 100, as this helped keep memory usage down with the trade-off of slower imports. This is something that varies from use case to use case, so experiment as you go. The Weaviate team has recently written a great doc on planning resources with Weaviate.
We went with the Python client as it is probably the most feature-rich and also the one best suited for iterative development. If you’re going to be importing a lot of objects, you almost certainly have to use the bulk API in the client. This API recently had a major overhaul which provides a number of different ways to approach adding objects in bulk, depending on your preference. The client and server also have authentication fully enabled for when you need to set up access control to the instance.
Tips for Tuning Weaviate for Your Use Case
Currently, the vector index has 4 major parameters that can, in a sense, be tuned. These parameters are:
- efConstruction
- maxConnections
- ef
- vectorCacheMaxObjects
The first 3 parameters come directly from HNSW itself and are specific to the algorithm, so it’s worth checking out the original paper for a more detailed explanation of what each parameter does. The main effect of these parameters, specifically efConstruction and maxConnections is a trade-off between recall/precision and resource usage/import time.
Increasing maxConnections will typically improve the quality of the index but will also increase the size of the in-memory HNSW graph. If you can’t afford increased memory usage, then efConstruction may be the parameter you want to increase instead, though doing so will increase import time.
The ef parameter really only comes into play when conducting searches and depends on the number of objects in your index and your latency needs. We found that when using higher values for efConstruction at index time we can afford lower ef values at search time.
Be careful with vectorCacheMaxObjects: you’ll almost certainly want this to be greater than or equal to the number of objects in your dataset at index time. But when running Weaviate for searches only, it can be beneficial for memory to keep this low, as you don’t need to store all the vectors in memory; the HNSW graph does the heavy lifting, and the vectors are just used to calculate the final distance scores.
And that’s a wrap! At Keenious we’re really happy with Weaviate both for the quality of its vector searches and for all the extras built on top that truly make it a game-changer for vector search.
Choosing Weaviate has allowed us to completely focus on developing awesome features for our search engine that involve the 60+ million Knowledge Graph embeddings we store in Weaviate. We’re able to approach solving academic search problems with the product at the forefront, not technical implementation requirements. That’s awesome.
We’ve got some pretty fun features built on this coming very soon (Late Fall 2021) and we’re looking forward to sharing those with you once released, so stay tuned.
P.S. The team behind Weaviate (SeMI Technologies) are really friendly and very engaged with their users. They’ve helped us a lot with getting going with Weaviate and have even focused their time on fixing bugs that directly impacted our use case. If you have any questions, the best place to get in contact with them is their Slack.
Try Keenious today for free by downloading or installing our add-on at keenious.com.