Datalog Basics and RDFox
Semantic Reasoning Fundamentals
This article aims to provide a basic introduction to Datalog with RDFox. It will explain what Datalog is, why RDFox uses Datalog, how to write Datalog rules, how rules can enhance SPARQL query performance, and touch upon RDFox’s Datalog extensions.
What is Datalog?
Datalog is a rule language for knowledge representation. Rule languages have been in use since the 1980s in the fields of data management and artificial intelligence.
A Datalog rule is a logical implication, where both the “if” part of the implication (the rule body) and the “then” part of the implication (the rule head) consist of a conjunction of conditions. In the context of RDF, a Datalog rule conveys the idea that, from certain combinations of triples in the input RDF graph, we can logically deduce that some other triples must also be part of the graph.
Datalog has various applications including knowledge representation, ontological reasoning, data integration, networking, information extraction, cloud computing, program analysis and security.
Why does RDFox use Datalog?
RDFox is a high-performance knowledge graph and semantic reasoning engine. Reasoning is the ability to calculate the logical consequences of applying a set of rules to a set of facts. RDFox uses the Datalog rule language to express rules. Rules offer an expressive way to process and manipulate a knowledge graph. Rules bring the intelligence layer closer to the data and can also simplify query formulation and data management.
Datalog is a declarative, logic-based language, known for its simplicity and its use in knowledge representation, which makes it a natural fit for RDFox, itself a declarative solution. Declarative tools are useful across a broad range of use cases, for example applying business logic or business rules, machine learning models, in-house DevOps or compliance tools, and streaming attribution.
Datalog is also widely understood: it sits at the core of many rule formalisms and has a wide range of extensions. You can read our introduction to rules here or read the Do’s and Don’ts of Rule and Query Writing article here.
Writing in Datalog
As noted above, Datalog rules are logical implications: each rule says that, from certain combinations of triples in the input RDF graph, we can logically deduce that some other triples must also be part of the graph. Put simply, a rule is an ‘if-then’ statement.
For example, we can say that:
?x has uncle ?z if … ?x has parent ?y and ?y has brother ?z .
(e.g. Lucy has uncle Peter if Lucy has parent Alex and Alex has brother Peter)
This is expressed in Datalog in the following format:
[ ?x , :hasUncle , ?z ] :- [ ?x , :hasParent , ?y], [?y, :hasBrother, ?z] .
- The IF part of the rule is also called the body or antecedent, and is found on the right of the :- operator.
- The THEN part of the rule is called the head or the consequent, and is found to the left of the :- operator.
Intuitively, a rule says “if [ ?x , :hasParent , ?y] , [?y, :hasBrother, ?z] all hold, then [ ?x , :hasUncle , ?z ] holds as well” and this results in new information being added to the graph.
Note that the set of logical consequences is completely independent of the order in which rules are applied, and of the order in which the conditions in a rule body are written. For example, the following rule is equivalent to the one above:
[ ?x , :hasUncle , ?z ] :- [?y, :hasBrother, ?z], [ ?x , :hasParent , ?y].
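To make the order-independence concrete, here is a minimal Python sketch that evaluates the uncle rule as a join over a set of triples, once with the body conditions in each order. It is only an illustration of the idea, not how RDFox evaluates rules; the function names and the plain-tuple representation of triples are assumptions made for this example.

```python
# Triples are modelled as (subject, predicate, object) tuples.
# This representation is an assumption made for illustration only.
facts = {
    ("Lucy", "hasParent", "Alex"),
    ("Alex", "hasBrother", "Peter"),
}

def uncles_parent_first(triples):
    """Match hasParent first, then join with hasBrother on ?y."""
    return {
        (x, "hasUncle", z)
        for (x, p1, y) in triples if p1 == "hasParent"
        for (y2, p2, z) in triples if p2 == "hasBrother" and y2 == y
    }

def uncles_brother_first(triples):
    """Match hasBrother first, then join with hasParent on ?y."""
    return {
        (x, "hasUncle", z)
        for (y, p2, z) in triples if p2 == "hasBrother"
        for (x, p1, y2) in triples if p1 == "hasParent" and y2 == y
    }

derived_a = uncles_parent_first(facts)
derived_b = uncles_brother_first(facts)

print(derived_a)               # the single derived hasUncle triple
print(derived_a == derived_b)  # both body orders derive the same set
```

Both functions derive the same single triple, that Lucy has uncle Peter, because a rule body is a conjunction and the result of a conjunction does not depend on the order of its conditions.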
We can also define a rule to determine if a sibling is a brother or sister:
[ ?x , :hasBrother , ?y ] :- [ ?y , :hasSibling , ?x], [?y, :gender, :male] .
Or
[ ?x , :hasSister , ?y ] :- [ ?y , :hasSibling , ?x], [?y, :gender, :female] .
The following graph illustrates the relationships between Luke, Peter and Meg, which are only partially described in the data. Using rules, we can ensure that the data is complete, and complete data puts us in a better position to make correct business decisions efficiently.
The data in this example is incomplete: Peter is identified as Luke’s brother, but Luke isn’t identified as Peter’s brother. Although humans can readily deduce that if Peter is Luke’s brother then Luke must also be Peter’s, databases are not aware of relationships between data points unless those relationships are stated explicitly using rules. Thus, we can use rules to inform the graph that these relationships exist.
When we import the rules for siblings and uncles into the graph, the implied relations are materialised. This occurs because the rule ranges over all possible nodes in the RDF graph. Whenever the rule is satisfied, i.e. a ‘?x has a brother ?y’ relationship is found, it will propagate this information as a new triple within the graph, enriching the data.
In this example, the sibling rule established the missing link between Peter and Luke, and the uncle rule established that Luke is Meg’s uncle. This makes the graph more complete, which benefits data analysis. When the process is scaled up, it can also be significantly more efficient than other methods of inserting data into a graph (see below).
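The Luke, Peter and Meg example can be sketched in a few lines of Python. This is a toy forward-chaining evaluator written for illustration, not RDFox’s algorithm or API. The input triples are an assumption about what the example data contains, chosen to be consistent with the description above, and all function names are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                    # variable
            if new.setdefault(term, value) != value:
                return None                         # clashes with earlier binding
        elif term != value:                         # constant mismatch
            return None
    return new

def apply_rule(head, body, facts):
    """Return every head triple whose body is satisfied by `facts`."""
    bindings = [{}]
    for pattern in body:
        extended = []
        for binding in bindings:
            for triple in facts:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    # Variables are replaced by their bound values; constants are kept as-is.
    return {tuple(b.get(term, term) for term in head) for b in bindings}

def materialise(rules, facts):
    """Apply all rules repeatedly until no new triples are derived."""
    facts = set(facts)
    while True:
        derived = set()
        for head, body in rules:
            derived |= apply_rule(head, body, facts)
        new = derived - facts
        if not new:
            return facts
        facts |= new

# The brother and uncle rules from the article, as (head, body) pairs.
rules = [
    (("?x", ":hasBrother", "?y"),
     [("?y", ":hasSibling", "?x"), ("?y", ":gender", ":male")]),
    (("?x", ":hasUncle", "?z"),
     [("?x", ":hasParent", "?y"), ("?y", ":hasBrother", "?z")]),
]

# Assumed input data (the original figure is not reproduced here).
data = {
    (":luke", ":hasBrother", ":peter"),
    (":luke", ":hasSibling", ":peter"),
    (":luke", ":gender", ":male"),
    (":meg", ":hasParent", ":peter"),
}

inferred = materialise(rules, data) - data
for triple in sorted(inferred):
    print(triple)
```

Under these assumptions, two triples are inferred: Peter has brother Luke, and Meg has uncle Luke. The second can only be derived after the first, which is why the evaluator keeps applying the rules until nothing new appears.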
Rules can be simple like the one demonstrated above, or layered on top of one another to provide a complex set of instructions. For more resources on the power of semantic reasoning with Datalog rules, check out this article.
Why not just use a SPARQL ‘insert’ query?
By using the SPARQL INSERT query, one can insert triples into the RDF graph. However, with reasoning (i.e. Datalog rules) this process is enhanced.
Fundamentally, the two methods for inserting data into the RDF graph differ because Datalog rules are applied recursively. In this way, the logical consequences of a set of Datalog rules on a graph are captured by the iterative application of the rules until no new information can be added to the graph.
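The difference between a single pass and recursive application can be sketched in Python. The example uses a hypothetical pair of ancestor rules that does not appear elsewhere in this article: ?x has ancestor ?y if ?x has parent ?y, and ?x has ancestor ?z if ?x has parent ?y and ?y has ancestor ?z. The names and data are invented for illustration, and the single pass stands in for a one-off update that evaluates the same patterns once.

```python
def ancestor_step(facts):
    """Apply both (hypothetical) ancestor rules once to the current facts."""
    base = {
        (x, "hasAncestor", y)
        for (x, p, y) in facts if p == "hasParent"
    }
    recursive = {
        (x, "hasAncestor", z)
        for (x, p, y) in facts if p == "hasParent"
        for (y2, q, z) in facts if q == "hasAncestor" and y2 == y
    }
    return base | recursive

def fixpoint(facts):
    """Repeat the step until no new triples appear; also count the rounds."""
    facts = set(facts)
    rounds = 0
    while True:
        new = ancestor_step(facts) - facts
        if not new:
            return facts, rounds
        facts |= new
        rounds += 1

# Illustrative data: a chain of four generations.
data = {
    ("Meg", "hasParent", "Peter"),
    ("Peter", "hasParent", "Rose"),
    ("Rose", "hasParent", "Tom"),
}

single_pass = data | ancestor_step(data)   # patterns evaluated once
closed, rounds = fixpoint(data)            # rules applied until nothing changes

print(("Meg", "hasAncestor", "Tom") in single_pass)
print(("Meg", "hasAncestor", "Tom") in closed)
print(rounds)
```

With this data, the single pass only finds each person’s parents as ancestors, whereas the iterated version needs three rounds to conclude that Tom is Meg’s ancestor. That iteration to a fixpoint is what recursive rule application means here.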
The new triples are added when rules are imported. With RDFox, this process is even more powerful as triples can be materialised incrementally when new data is added to the RDF graph and prior to query-time, dramatically speeding up SPARQL queries.
RDFox Datalog extensions
The rule language supported by RDFox extends the standard Datalog language with stratified negation, stratified aggregation, built-in functions, and more, so as to provide additional data analysis capabilities.
For more information on stratified negation, stratified aggregation, built-in functions, etc, see our documentation.
Datalog Tutorials
If you, or anyone within your organisation, are interested in participating in a Datalog tutorial with Oxford University Professors and Oxford Semantic Technologies’ founders, email info@oxfordsemantic.tech.
Try writing Datalog rules yourself
To try writing Datalog rules, you can request an RDFox license here. You can learn more about RDFox here or on our Medium publication.
Team and Resources
The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University’s investment arm (OUI). The author is proud to be a member of this team.