by Kurt Cagle, #theCagleReport
Graph databases, and knowledge graphs in particular, have seen an uptick in interest among CIOs, digital transformation managers and others in enterprise circles. Some of this has come as data warehousing and data lake technologies have failed to deliver the anticipated unification of all data within a company. This is due in part to a number of untenable assumptions, not least of which is that the relational model itself is sufficient to capture the rich, complex metadata that must be associated with data properties for those properties to be useful. Column headers are often treated as an afterthought, when in fact those column headers are essential to converting data into knowledge.
Graph databases work from a few basic principles. One of the first is the notion that keys, which identify resources in a data system, are global. In a relational database, this isn't really true: most identifier keys are integers generated by the system to be unique only relative to a table within that database. Typically such a key might start out the same as a row number in a table, though with deletions and insertions there's no real guarantee that two consecutive rows will hold successive integers. If you take the data outside the context of the database, you need additional metadata, which may or may not actually be passed along, to determine which keys associate with what.
To put this another way, let’s say that you have a table called employees in a database called personnel, run by BiglyCo, the big software conglomerate. The table might look something like this:
| emplId | empName | empStartDate | empDivision |
|--------|---------|--------------|-------------|
| 1      | Jane    | 2019-03-12   | 7           |
| 2      | Michael | 2016-04-11   | 2           |
| 3      | Sarah   | 2012-09-08   | 2           |
| 4      | Rashid  | 2017-02-02   | 7           |
| 5      | Laura   | 2013-11-03   | 5           |
A second table gives the divisions at BiglyCo.
| divId | divName  | divManagerId |
|-------|----------|--------------|
| 2     | Sales    | 4            |
| 5     | Enginrng | 5            |
| 7     | Acctng   | 1            |
This is typical of how most relational databases work. Reading these tables, you can see that Jane (emplId #1) is in Accounting (divId #7) and is also the manager of that department (since divManagerId #1 points back, we hope, to the employees table). Laura is in Engineering and is also the manager of that division. By sheer coincidence, both Laura and Engineering have the same index number, #5, but these represent two different things.
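To make the ambiguity concrete, here is a small sketch using Python's built-in sqlite3 module (the table and column names come from the example above). Notice that the meaning of each integer key lives entirely in the join logic, not in the data itself:

```python
import sqlite3

# Build the two BiglyCo tables from the article in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emplId INTEGER PRIMARY KEY, empName TEXT,
                        empStartDate TEXT, empDivision INTEGER);
CREATE TABLE divisions (divId INTEGER PRIMARY KEY, divName TEXT,
                        divManagerId INTEGER);
INSERT INTO employees VALUES
  (1,'Jane','2019-03-12',7),(2,'Michael','2016-04-11',2),
  (3,'Sarah','2012-09-08',2),(4,'Rashid','2017-02-02',7),
  (5,'Laura','2013-11-03',5);
INSERT INTO divisions VALUES (2,'Sales',4),(5,'Enginrng',5),(7,'Acctng',1);
""")

# Nothing in the data says that empDivision refers to divisions.divId, or
# that divManagerId refers back to employees.emplId; only the query does.
row = conn.execute("""
  SELECT e.empName, d.divName, m.empName
  FROM employees e
  JOIN divisions d ON e.empDivision = d.divId
  JOIN employees m ON d.divManagerId = m.emplId
  WHERE e.emplId = 5
""").fetchone()
print(row)  # ('Laura', 'Enginrng', 'Laura')
```

Strip away those join conditions and the bare integers 5 and 5 are indistinguishable, even though one names a person and the other a division.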
In a knowledge graph, one of the central assumptions is that you want to avoid situations where you have the same keys representing potentially different things. To make this happen, you generally try to make the keys unique by qualifying them in some way, usually by creating an authority chain. For instance, to represent the person Laura (employee #5), I'd create an authority ID that looked something like:

uri:biglyco.com/personnel/employees#_5

where uri is shorthand for uniform resource identifier. This uses a scheme (uri:company/database/table#_idnumber) for generating the uri, but the scheme itself is much less important than the fact that there is really only one such identifier anywhere in the world that represents this particular individual. It is globally unique. A similar scheme can be used for the engineering division:

uri:biglyco.com/personnel/divisions#_5
There are several significant things about this. The first is that while both the employee and the division have the same local index, once they are qualified it becomes obvious that they do not represent the same thing, or even the same type of thing. In addition, while there is a temptation to see the above as a path in a hierarchy, the scheme itself is largely arbitrary. If I wanted to refer to the employee table itself, a better URI might look something like:

uri:biglyco.com/table#_personnel-1_employees
Put another way, you want to avoid putting too much in the way of semantics into the uri. Indeed, another perfectly valid (if opaque from a teaching perspective) way of setting up such an identifier is to give up all semantics:

uri:biglyco.com/id#_f47ac10b-58cc-4372-a567-0e02b2c3d479

(The underscore is used simply because some systems dislike identifiers that start with numbers, and it usually helps to identify instances of things.)
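As a sketch, minting such qualified identifiers is a one-line affair. The scheme and the biglyco.com names follow the example above; the function names here are my own invention:

```python
import uuid

def mint_uri(company: str, database: str, table: str, local_id: int) -> str:
    """Mint a globally unique identifier along an authority chain,
    using the uri:company/database/table#_idnumber scheme from the text."""
    return f"uri:{company}/{database}/{table}#_{local_id}"

def mint_opaque_uri() -> str:
    """Give up all semantics: a random, globally unique identifier.
    The leading underscore keeps it from starting with a digit."""
    return f"uri:biglyco.com/id#_{uuid.uuid4()}"

laura = mint_uri("biglyco.com", "personnel", "employees", 5)
engineering = mint_uri("biglyco.com", "personnel", "divisions", 5)
print(laura)        # uri:biglyco.com/personnel/employees#_5
print(engineering)  # uri:biglyco.com/personnel/divisions#_5
# The two share the local index 5, but the qualified URIs can never collide.
assert laura != engineering
```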
Either way, the way that graph databases work is to say that, so long as the identifier for a given resource is unique, then rather than talking about rows and columns, you should talk about the conceptual entity. For instance, I can make a set of assertions based upon the above tables:

<uri:biglyco.com/employee#_5> <uri:biglyco.com/thing#isA> <uri:biglyco.com/class#_Employee>.
<uri:biglyco.com/employee#_5> <uri:biglyco.com/employee#hasName> "Laura".
<uri:biglyco.com/employee#_5> <uri:biglyco.com/employee#isInDivision> <uri:biglyco.com/division#_5>.
<uri:biglyco.com/division#_5> <uri:biglyco.com/thing#isA> <uri:biglyco.com/class#_Division>.
<uri:biglyco.com/division#_5> <uri:biglyco.com/division#hasName> "Engineering".
<uri:biglyco.com/division#_5> <uri:biglyco.com/division#hasManager> <uri:biglyco.com/employee#_5>.

Now, this is hard to read. We can simplify it by means of a lookup table that condenses the URIs into something a little more palatable for humans:
@prefix employee: <uri:biglyco.com/employee#>.
@prefix division: <uri:biglyco.com/division#>.
@prefix thing: <uri:biglyco.com/thing#>.
@prefix class: <uri:biglyco.com/class#>.

employee:_5 thing:isA class:_Employee.
employee:_5 employee:hasName "Laura".
employee:_5 employee:isInDivision division:_5.
division:_5 thing:isA class:_Division.
division:_5 division:hasName "Engineering".
division:_5 division:hasManager employee:_5.
Notice what’s been done here. Besides being easier to read, we’ve created an additional bit of metadata that associates a resource with a class. (We’re using thing: here because all things should generally be classifiable). We could also add a few additional assertions that let us generate identifiers and create provenance trails:
@prefix table: <uri:biglyco.com/table#>.

table:_personnel-1_employees
    table:hasName "Employees";
    table:hasExternalUrl "https://biglyco.com/dbs/Personnel/Employees".
employee:_5 thing:hasIdentifier 5;
    thing:hasSourceTable table:_personnel-1_employees.
The use of the semicolons serves to indicate that the subject (such as employee:_5) should be assumed for each subsequent predicate-object pair. That is to say:

employee:_5 thing:hasIdentifier 5;
    thing:hasSourceTable table:_personnel-1_employees.

is the same as
employee:_5 thing:hasIdentifier 5.
employee:_5 thing:hasSourceTable table:_personnel-1_employees.
In the larger example above, an association is made between an abstract entity (such as the employee Laura) and a table. The identifier from that table is also given, which means that if you know that you’re looking at the context of this particular table, the identifier will retrieve the particular row where that record for that resource exists. A final URL (via table:hasExternalUrl) actually identifies the “physical” location of that table on the network.
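As a sketch of how referential this all is, the example can be reduced to plain Python tuples (the article's prefixed names kept as strings; a real triple store adds indexing, inference and query languages such as SPARQL on top). Both graph navigation and provenance resolution become pattern matches over the same set of triples:

```python
# Every assertion above is just a (subject, predicate, object) triple.
triples = {
    ("employee:_5", "thing:isA", "class:_Employee"),
    ("employee:_5", "employee:hasName", "Laura"),
    ("employee:_5", "employee:isInDivision", "division:_5"),
    ("division:_5", "thing:isA", "class:_Division"),
    ("division:_5", "division:hasName", "Engineering"),
    ("division:_5", "division:hasManager", "employee:_5"),
    # Provenance assertions:
    ("employee:_5", "thing:hasIdentifier", 5),
    ("employee:_5", "thing:hasSourceTable", "table:_personnel-1_employees"),
    ("table:_personnel-1_employees", "table:hasName", "Employees"),
    ("table:_personnel-1_employees", "table:hasExternalUrl",
     "https://biglyco.com/dbs/Personnel/Employees"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

# Graph navigation: which division is Laura in, and who manages it?
(_, _, division), = query("employee:_5", "employee:isInDivision")
(_, _, manager), = query(division, "division:hasManager")
assert manager == "employee:_5"  # Laura manages her own division

# Provenance resolution: which row of which physical table did she come from?
(_, _, table), = query("employee:_5", "thing:hasSourceTable")
(_, _, url), = query(table, "table:hasExternalUrl")
(_, _, row_id), = query("employee:_5", "thing:hasIdentifier")
print(f"Row {row_id} of {url}")
```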
Data Provenance, Blockchains and Knowledge Graphs
The semantic linking to a data source is one aspect of what is usually called provenance information. Provenance tracks the origin of data and is one form of metadata that is rarely captured in typical relational databases. It is relatively easy to capture in a graph database, however, because such a database is made up of assertions just like those above and is fundamentally referential. With the metadata given above, the graph database identifies where the data used to populate the resource in the triple store came from. It may also indicate (again, via a pointer) what transformation function was used to convert the data from that previous store into the knowledge base representation. This is a key component of a data catalog.
A blockchain (or more generally, a distributed ledger system) provides a record of a transaction, typically the transference of ownership between two legal entities of a resource in exchange for another resource (e.g., a car for money). There may be a contract resource bound in this same way that stipulates the terms of this transference.
There are three competing facets involved with trying to create a distributed ledger. The first facet is the need to uniquely identify the resources involved in that transaction. This process ultimately comes down to the need to identify some form of issuing authority, which in turn requires that you have some way of uniquely identifying that authority (implying another issuing authority and so forth).
One example of this is the way that websites establish security through certificates issued by certificate authorities (CAs) that are in turn part of a chain of authorities. Note that this doesn't necessarily guarantee that an entity is who they say they are, but it does establish a chain of liability should some form of due diligence not be performed (in layman's terms, there's always someone who can be sued if they messed up). Typically these identifiers are then cryptographically hashed, and it is the hash that gets compared, not the identifier itself.
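The hash-comparison step can be sketched with Python's standard hashlib. This is a simplification: real PKI compares certificate fingerprints and verifies signatures rather than bare hashes, and the URIs here are the hypothetical ones from earlier in the article:

```python
import hashlib

def fingerprint(identifier: str) -> str:
    """Hash an identifier so that the digest, not the raw identifier,
    is what gets exchanged and compared."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

a = fingerprint("uri:biglyco.com/personnel/employees#_5")
b = fingerprint("uri:biglyco.com/personnel/employees#_5")
c = fingerprint("uri:biglyco.com/personnel/divisions#_5")
assert a == b  # same identifier, same digest
assert a != c  # any change to the identifier changes the digest
```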
The second facet is a need for immutability. Ledgers are important because they are records of transactions, and consequently can be considered legal documents. Altering a ledger is a criminal offense. The best potential way to build such a ledger is a write-once-read-many solution, meaning that once an entry has been written, it cannot be altered, though it is possible that another transaction can be added to indicate that the transaction has been revoked and a new transaction enabled.
The final facet is verbosity and the issue of identity by reference. I do not want to store a huge amount of information about a given resource in the ledger, especially information that is likely to change over time. I more than likely want to know a person's phone number or address as it exists today rather than as it did when the transaction occurred. Moreover, the amount of space in any blockchain is very, very limited. All of this implies that it is better to encode only the barest minimum in the ledger (the public keys of the transactors and the resource in question) and to keep the broader detailed information behind an identifier that includes the public key as a field of information.
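The three facets can be combined into a toy append-only ledger: entries are hash-chained (immutability), carry only keys rather than mutable detail (verbosity), and revocation is itself a new entry rather than an alteration. This is a sketch under those assumptions, not a real distributed ledger; there is no consensus protocol or cryptographic signing here:

```python
import hashlib
import json

class Ledger:
    """A write-once-read-many sketch: entries are appended, never altered.
    Each entry carries the hash of its predecessor, so any tampering
    breaks the chain."""

    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> str:
        """Append a record; return its hash (usable as a transaction id)."""
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps(record, sort_keys=True)
        h = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"prev": prev, "record": record, "hash": h})
        return h

    def verify(self) -> bool:
        """Recompute the chain; False if any entry has been altered."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            if e["prev"] != prev or \
               hashlib.sha256((prev + payload).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

ledger = Ledger()
# Store only keys, never the mutable details behind them (facet three).
# The employee URIs come from the article; the asset URI is hypothetical.
t1 = ledger.append({"from": "employee:_5", "to": "employee:_1",
                    "resource": "uri:biglyco.com/asset#_42"})
ledger.append({"revokes": t1})  # the transfer is revoked, not rewritten
assert ledger.verify()
ledger.entries[0]["record"]["to"] = "employee:_2"  # tampering...
assert not ledger.verify()                          # ...breaks the chain
```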
Given the highly referential nature of these systems, the conclusion is obvious — knowledge graphs need distributed ledgers to secure keys, and distributed ledgers need knowledge graphs to provide the necessary context and provenance for those keys. For this to happen, however, it means that knowledge graphs need to evolve to become at least partially immutable.
Fortunately, knowledge graph vendors are beginning to address this particular problem. Fluree recently debuted its Fluree Data Management Stack, which has started to incorporate distributed ledger technology into its knowledge graph store. It's worth noting that distributed ledgers do not necessarily need to use blockchain: once you have an immutable knowledge graph, constructing a distributed ledger from it is, if not trivial, at least easy enough that it eliminates most of the need for the blockchain architecture. Neo4j has also positioned its property graph store for use with blockchain, similarly recognizing the overlap between the two technologies and the very strong need for trust networks and knowledge networks to integrate, as has graph heavyweight TigerGraph.
Given this momentum (and other vendors who have indicated that they have blockchain or distributed ledger support on the way), it is very likely that you will see an integration of the two technologies becoming routine by 2021, likely in parallel with attempts to better integrate machine learning algorithms and an industry-wide effort to better integrate semantic (RDF) graphs and property graphs, likely via a W3C action such as a new version of SPARQL and standardization of GQL and GraphQL.
Ultimately, my take is that knowledge graphs will end up being the integration point for a number of technologies that have separately been lumped under the rubric of artificial intelligence, because most of them ultimately touch on the nature of global data in distributed environments (distributed ledgers, immutable computing, and the Internet of Things included). This will become increasingly the case as federation, the ability to address, query and update data across multiple data stores of different types, becomes an intrinsic capability of all such systems.
Kurt Cagle is the editor of #theCagleReport, a distributed blog, and managing editor of cognitiveworld.com. He’s a former contributor to Forbes and O’Reilly Media, and also provides consulting services on both the technical and business side of knowledge graphs, enterprise metadata management, and the data side of AI.
#semantics #knowledgeGraphs #graphDatabases #blockChain #distributedLedgers