Getting started with Provenance and Neo4j
We all want to know “Where does our meat come from?” or “Is this reliable information or fake news?”. Questions like these are always about the provenance of information or physical objects. This article, originally written in 2017, proposes using Neo4j to store provenance information.
What is provenance
Provenance describes the origin and history of “a thing”. The term was originally used for paintings and their ownership history over time. If some provenance information is missing, for example a decade in which nobody can tell who owned the painting, its value drops immediately.
Today the term also covers tracking processes and responsibilities in general, for example the production of food or the history of a file.
W3C — PROV standard
In 2013 the W3C published PROV as an official Recommendation for describing provenance structures. I don’t want to explain the complete standard, only the key concepts: Entities, Activities, Agents, and Relations.
- Activities — Actions that create, modify, or delete Entities
- Entities — The things being tracked, for example pictures, files, meat, or a horse
- Agents — Trigger activities and “own” entities; used to track responsibilities
Furthermore, there is a fixed set of relation types, for example generation or usage. You can find more information in the PROV-DM specification.
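To make the key concepts more tangible, here is a minimal Python sketch of the three element types and a handful of relation types. The names Kind, Element, Relation, and relate are made up for illustration and are not part of any PROV library. The table covers only a small subset of the PROV-DM relations, and the sketch simplifies the standard by giving every element exactly one type.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    ENTITY = "entity"
    ACTIVITY = "activity"
    AGENT = "agent"


@dataclass(frozen=True)
class Element:
    identifier: str
    kind: Kind


@dataclass(frozen=True)
class Relation:
    name: str
    source: Element
    target: Element


# A small subset of PROV-DM relations: name -> (source kind, target kind)
RELATION_TYPES = {
    "wasGeneratedBy": (Kind.ENTITY, Kind.ACTIVITY),
    "used": (Kind.ACTIVITY, Kind.ENTITY),
    "wasAssociatedWith": (Kind.ACTIVITY, Kind.AGENT),
    "wasAttributedTo": (Kind.ENTITY, Kind.AGENT),
    "wasDerivedFrom": (Kind.ENTITY, Kind.ENTITY),
}


def relate(name: str, source: Element, target: Element) -> Relation:
    """Create a relation after checking that it connects the right kinds."""
    if name not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {name}")
    if (source.kind, target.kind) != RELATION_TYPES[name]:
        raise ValueError(
            f"{name} cannot connect {source.kind.value} to {target.kind.value}"
        )
    return Relation(name, source, target)


# Hypothetical example: a report file that was generated by an edit activity
report = Element("ex:report", Kind.ENTITY)
edit = Element("ex:edit", Kind.ACTIVITY)
generation = relate("wasGeneratedBy", report, edit)
print(generation.name)  # wasGeneratedBy
```

Note the direction: PROV relations point from the newer thing to the older one, so the entity points back to the activity that generated it.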
The standard also provides several serialization formats, for example PROV-N, PROV-XML, and PROV-O (RDF).
The next picture is a simple example of this concept with an agent, two activities and two entities:
This example shows that Bob baked a cake, with some additional information such as the start and end time. We can now follow the digraph and determine the activity that produced the ingredients of our cake, in this case the Buy activity. Of course, in a real-world example we should split the ingredients entity into multiple entities and track how each of them was bought or grown separately.
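Following the digraph back to its origin can be sketched in a few lines of Python. The edge list below is my reading of the cake example, so the identifiers and the exact relations are assumptions, and origin_activities is a hypothetical helper, not part of any PROV library.

```python
from collections import deque

# (source, relation, target) triples, assumed from the cake example
EDGES = [
    ("ex:cake", "wasGeneratedBy", "ex:bake"),
    ("ex:bake", "used", "ex:ingredients"),
    ("ex:ingredients", "wasGeneratedBy", "ex:buy"),
    ("ex:bake", "wasAssociatedWith", "ex:bob"),
    ("ex:buy", "wasAssociatedWith", "ex:bob"),
]
ACTIVITIES = {"ex:bake", "ex:buy"}


def origin_activities(start, edges, activities):
    """Return the activities reachable from start that used no earlier entity."""
    # Only these two relations lead further back in the history of a thing.
    history_relations = {"wasGeneratedBy", "used"}
    successors = {}
    for source, relation, target in edges:
        if relation in history_relations:
            successors.setdefault(source, []).append(target)

    seen = {start}
    queue = deque([start])
    origins = set()
    while queue:
        node = queue.popleft()
        earlier = successors.get(node, [])
        if node in activities and not earlier:
            origins.add(node)  # nothing older is recorded for this activity
        for nxt in earlier:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return origins


print(origin_activities("ex:cake", EDGES, ACTIVITIES))  # {'ex:buy'}
```

The helper never needs to know how many steps lie between the cake and the Buy activity; it simply walks until no older record exists.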
The interesting part is that we are able to find the first activities that somehow influenced our cake activity without knowing the exact steps in between. In this example this seems really simple, but these digraphs can get really complex:
This example tracks the production and shipping process of a meat product. It includes every cattle, slaughter, and transportation step. As you can see, it is a complex digraph with many entities, activities, and agents.
PROV and Neo4j
To handle PROV graphs, it is useful to store the information in a database instead of a single file. Neo4j provides the features required to store PROV data in a property graph:
- Nodes — Mapped to Entities, Activities, and Agents
- Relations — For example, used, wasGeneratedBy …
- Properties — Predefined PROV properties as well as custom properties can be attached to both nodes and relations (for example startedAtTime, endedAtTime)
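As a rough illustration of this mapping, the following Python sketch turns PROV elements and relations into parameterised Cypher statements. This is not how prov-db-connector works internally. The function names, the `id` property, the time property names, and the example timestamps are all assumptions of mine.

```python
NODE_LABELS = {"Entity", "Activity", "Agent"}
RELATION_TYPES = {"used", "wasGeneratedBy", "wasAssociatedWith"}


def node_statement(label, identifier, **properties):
    """Return (query, parameters) that create or update one PROV element."""
    if label not in NODE_LABELS:
        raise ValueError(f"unexpected label: {label}")
    # Labels cannot be passed as Cypher parameters, so they are checked
    # against a whitelist before being placed in the query text.
    query = f"MERGE (n:{label} {{id: $id}}) SET n += $props"
    return query, {"id": identifier, "props": properties}


def relation_statement(rel_type, source_id, target_id, **properties):
    """Return (query, parameters) that connect two existing nodes."""
    if rel_type not in RELATION_TYPES:
        raise ValueError(f"unexpected relation type: {rel_type}")
    query = (
        "MATCH (s {id: $source}), (t {id: $target}) "
        f"MERGE (s)-[r:{rel_type}]->(t) SET r += $props"
    )
    return query, {"source": source_id, "target": target_id, "props": properties}


# Made-up timestamps for the bake activity
query, params = node_statement(
    "Activity",
    "ex:bake",
    startedAtTime="2017-01-01T10:00:00",
    endedAtTime="2017-01-01T11:00:00",
)
print(query)  # MERGE (n:Activity {id: $id}) SET n += $props
```

Each returned pair could be handed to a Neo4j driver session as query text plus parameters; keeping values in parameters avoids building them into the query string.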
We developed a simple Python module called prov-db-connector to store PROV documents in a Neo4j database. The following script saves our “bake a cake” example into the Neo4j database.
The script uses the ProvDocument class from the prov library, which provides basic functionality to build and serialize PROV documents.
The result of this script is a graph structure that represents our PROV document with all relations and properties:
If you compare the example diagram with the actual Neo4j graph above, you see they are really similar. With all the information in the graph database, you benefit from Cypher’s powerful query mechanisms to extract information from your graph.
Maybe you ask yourself “Why should I use such a complex standard and put effort into designing a PROV data model?”. You are right, this is not simple and not done in an hour. But if you collect such data, you can do powerful data analysis and answer questions like:
- What is the origin of this order?
- Who was responsible for the installation of product xyz at our client?
- Is this product produced under good working conditions (based on certifications)?
I saved a complex example PROV document that contains information about the production of meat.
As you can see, the graph is huge and complex, with many different steps and activities. Let me show how to answer a question based on this graph:
Who was the slaughterer of order 2?
The Cypher command to get the slaughterer (or slaughterers) of order 2 looks like this:
If I tracked my data with a traditional data structure, I would probably join a few tables to find the slaughterer. That is not complicated if all the data is in one database; if not, it might be complex but still doable. So why use PROV?
- We don’t care about the structures between the order and the slaughterer
- It’s backward compatible when the process changes, e.g. when the transportation company changes
- It’s interoperable, e.g. between companies
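The first two points are what a variable-length path pattern in Cypher gives you. Here is a sketch of such a query, wrapped in a small Python helper. The labels (Entity, Agent), the property names (id, role), and the identifier ex:order2 are assumptions about how the PROV data might be stored, not the exact query or data used for the graph above.

```python
def responsible_agent_query(max_hops: int = 20) -> str:
    """Build a Cypher query that follows any relations from an entity to agents."""
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    # -[*1..n]-> matches a chain of 1 to n relations of any type, so the
    # query does not name the steps between the order and the agent.
    return (
        "MATCH (order:Entity {id: $order_id})"
        f"-[*1..{max_hops}]->"
        "(agent:Agent {role: $role}) "
        "RETURN DISTINCT agent"
    )


query = responsible_agent_query(max_hops=10)
parameters = {"order_id": "ex:order2", "role": "slaughterer"}  # hypothetical values
print(query)
```

Because the pattern does not mention transport, packaging, or any other intermediate step, adding or replacing such a step in the process leaves the query unchanged, as long as the chain stays within the hop limit.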
Conclusion — For whom
This technology is not for every use case. If you just want to show your users a log of the recent changes in a dataset, save those changes in a database and show the list to your users. Keep it simple ;)
The PROV standard is interesting if you have a complex process that results in “a thing”, like a food product or a weather prediction. For example, computing the temperature prediction for the next day is a complex process. The calculation is based on several satellite images and other data. In this case, it is good to know which data the temperature calculation was based on. This could be useful for debugging or for transparency reasons.
Another use case could be the global production process and lifecycle of products: understanding the path a product has taken and answering questions like “Is this product really built only with wood from the EU?”. The PROV standard could be a data format to answer such questions and exchange provenance information.
I have just started writing my first few blog articles, and I’m glad about any feedback, positive or negative! Just let me know if you have any questions or suggestions to improve this article!