The case for approximate graph search
Introducing Fornax: A new way to explore knowledge graphs using Python
Searching knowledge graphs by stipulating strict labels and relationships between nodes can yield poor results unless the user already has intimate knowledge of the data, especially when it is noisy or inhomogeneous. Fornax is an open source Python library for exploring knowledge graphs by searching for subgraphs which only need to exist approximately. The work came out of a collaboration between Digital Catapult and Blokur, a tech company building an accurate source of global music publishing data.
The idiosyncratic nature of noisy knowledge graph data makes it inherently hard to search. Where the user may reasonably expect some relationships to be present, the representation can be different to expectations. This necessitates a fuzzy approach where labels and relationships need to be specified approximately rather than strictly.
Take for example the social graph below centred around Queen Victoria and Prince Albert produced using data from Wikidata.
Queen Victoria is represented by a node labelled simply as “Victoria” and the graph contains 15 other alternative names (or aliases) such as “Queen Victoria”, “Victoria Hanover”, “Queen of Great Britain” etc. There is no single correct or canonical name for Queen Victoria, just a set of subjective alternatives where the user can be left with no a priori way of knowing how this choice was made. For a larger graph “Victoria” is also unlikely to be unique.
Queen Victoria’s mother also had the given name Victoria but she is somewhat inconsistently represented by the label “Princess Victoria” rather than “Victoria”. One of her alternative names is “Victoria of Saxe-Coburg-Saalfeld” however “Victoria of Saxe Coburg Saalfeld” (without hyphens) is not listed as an alternative name despite seeming like an equally reasonable name.
The bottom line is that, unexpectedly, the following patterns do not exist in the graph and strict queries for those patterns will return an empty result:
A node labelled “Victoria of Saxe Coburg Saalfeld”
An individual called Victoria who has a daughter also called Victoria
An individual Victoria who is the spouse of Albert
In order to search the graph it feels like we need intimate familiarity with its content, a kind of Catch 22 situation which will not scale to large social graphs or any other large graphs for that matter.
Khan et. al. proposed NeMa (Network Match), where a user specifies a query graph to search a large target graph using a neighbourhood approach where a scoring function ranks the top-k most similar subgraphs in the target graph to the query graph. Although the problem is NP-hard, NeMa can find an approximate solution in polynomial time. Inspired by NeMa we designed Fornax to experiment with this style of graph search and different scoring functions. Fornax can search graphs containing millions of nodes and relationships in hundreds of milliseconds.
NeMa and Fornax are best understood by example. Below is a simple query graph which can be interpreted as “I’m looking for an entity ‘Victoria’ who is related to the entity ‘Victoria of Saxe Coburg Saalfeld’”
As we have seen the label “Victoria of Saxe Coburg Saalfeld” does not exist in the graph but there is a similar label “Victoria of Saxe-Coburg-Saalfeld”. This kind of discrepancy amounts to a string discrepancy or label discrepancy. A suitable similarity metric between labels could be Levenshtein distance or the Jaccard index. Fornax allows the user to use any such metric since this choice can be very domain specific.
We have also seen there is no node with a label similar to “Victoria of Saxe Coburg Saalfeld” which has a neighbour “Victoria” (or a distance of 1 between them), but there is a Victoria which has a distance or 2. This amounts to a topological discrepancy.
Fornax is tolerant of label and topological discrepancies and will return the top-k subgraphs in figure 1 (the target graph) that resemble figure 2 (the query graph). For example two sensible results would be
Fornax allows the user to specify a label similarity metric of their choice and to combine it with a topology score where the weighting between the two is an arbitrary parameter. Fornax can then find the top-k subgraphs with the lowest score. If you’re interested in the nitty gritty details about Fornax then read the next section, otherwise feel free to skip to the end or see the docs for a more gentle introduction.
Fornax is a Python library. It exposes a GraphHandle class that can be used to manipulate target and query graphs. The relationships between nodes are stored in either SQLite or Postgres via a Connection interface. Graphs can be loaded directly into the database or via the Python interface provided. When a query is executed Fornax submits a SQL query to the database and then processes the result. This pushes a lot of the work to the database and its index rather than the python interpreter so Fornax is more performant.
The creation of a graph typically looks like this:
The execution of a search typically looks like this
The label similarity scores are provided directly by the user for maximum flexibility. The scores can be calculated by a simple python function, a full-text search engine such as an ElasticSearch cluster running over many nodes, or anything in between.
The results are returned in node link format. A format familiar to networkx and d3 users.
Fornax makes graph search accessible to data scientists and developers who want to mine knowledge graphs for information even when the data is inhomogeneous and noisy.
A special thanks goes to Matt Dean at Digital Catapult for his help.
Khan, Arijit, et al. “Nema: Fast graph search with label similarity.” Proceedings of the VLDB Endowment. Vol. 6. №3. VLDB Endowment, 2013.