How to find information that is not there
The light of reason
All men are mortal. Socrates is a man. Therefore, Socrates is mortal.
This is possibly the most famous example of a syllogism ever conceived. It is one of those classic examples of logical thinking that feel completely natural to us, but can be awfully difficult for a computer. Unless you have a reasoner, that is. And that is what this post is about.
Hello. I am Michelangelo, and I work at Grakn Labs, where we develop a software stack providing a flexible object-oriented schema and a knowledge-oriented query language, capable of real-time analytics and reasoning.
As prefaced in the introduction, in this post I’m going to explain how to use an open-source semantic reasoner to work on a free movie dataset and build new knowledge, allowing you, for example, to list popular movies without knowing in advance which movies are popular. It sounds difficult, but I’ll cover the basic concepts as we go.
This is not an introduction to the Grakn software stack or Graql (our own query language). Fear not, though, for I have already published on Medium a series of posts covering the basics of the language and you can refer to those posts if you need a gentle introduction. Or you can have a look at our documentation for a more detailed presentation.
If, on the other hand, you are ready to go or are curious, and not interested in learning the details of Graql just yet, please read on.
Setup and data origin
As is by now something of a tradition, our dataset is going to contain information about movies, actors and several related concepts. The dataset is, in fact, a small subset of the data on which Moogi (a semantic search engine built with the Grakn technology) is based. Given that we are using the same technology and a self-contained subset of the data, you could even build your own mini-version of Moogi, if so desired.
Big disclaimer before proceeding: Moogi is, at the moment, just a proof-of-concept; consequently, the data extracted is quite dirty and you could obtain some weird and redundant results while exploring it. It might be cleaned in the future, but, after all, these errors in the data are not particularly important: the only practical value of the dataset we will be using is for testing purposes.
NOTE: The code below was correct for early versions of Grakn. Since it was published, we have introduced some changes to Graql syntax as the platform has matured, and we have yet to update this blog post.
Where to find the data
The dataset we are going to use can be found in our public repository, in the sample-dataset folder. You just need to download it and follow the instructions in the readme: just a couple of steps, nothing complicated.
About the dataset
The dataset we have just downloaded contains movie-related data, such as actors, directors, original titles, Rotten Tomatoes scores, etc. If you want a more detailed idea of what kind of data you can find in the dataset and how it is organised, I suggest you take some time to have a look at the schema.gql file, which, not surprisingly, contains the information related to the database schema.
Just to get a taste of what is there, try running the following query to show 10 movie titles.
match $x isa movie has title $y; select $y; limit 10
If you have read my introduction to Graql, you should not have trouble understanding how the data is organised, but you might notice a couple of features in the schema that you are not familiar with: namely the ako and abstract keywords. I will not focus too much on what those keywords mean, as this is a topic that could take a whole post series by itself, but there are a few rules that you must know to fully understand the dataset schema.
Very briefly:
Don’t try to make an instance of an abstract type
If a concept type is declared to be abstract, it cannot have instances. In other words, if I try
insert id "test" isa something;
and "something" is an abstract concept type, Graql will complain and throw an error at you. In our schema, for example, this means that you cannot insert instances of production in the graph, as the entity type production is abstract.
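Inserting an instance of a concrete subtype of production, on the other hand, works just fine. A quick sketch (the ids are made up):

```graql
# Fails: production is declared abstract in the schema
insert id "test" isa production;

# Works: movie is a concrete (non-abstract) subtype of production
insert id "test-movie" isa movie;
```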
ako means “a kind of”
The line type-1 ako type-2 (for example dog ako animal) means, very roughly, that type-1 is a subclass (or specialisation) of concept type-2 (i.e., dog is a subclass of animal). For example, all instances of dog will be considered specialised instances of animal (but not vice-versa!). Hence, dog will have all resources that animal has (plus maybe some more) and instances of dog will be able to play all the roles that instances of animal can play and so on.
This means, for example, that instances of the type movie will be able to have a resource of the type title since in the schema file we defined
movie ako production
(which stands for Movie is A Kind Of production) and production has a resource called title.
ako works for roles too
One thing that might not seem obvious at first is what happens with role types and ako. If a role is ako another role, for example
director ako crew-member
then every concept that can play the second role (crew-member) can also play the first role (director). There are technical reasons for this, but most importantly this allows us to declare that, for instance, the entity type person plays the role crew-member, which implies that a person can play the role of director, art-director, sound-editor, etc. without having to specify it, as these are all declared to be ako crew-member. In fact, in the schema file of our dataset you can find exactly these kinds of declarations.
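In early Graql syntax, they look roughly like this (the precise role names in the actual schema file may differ):

```graql
# person can play the generic crew-member role...
person plays-role crew-member;

# ...and the specialised roles are ako crew-member, so instances
# of person can play them too, with no further declarations needed.
director ako crew-member;
art-director ako crew-member;
sound-editor ako crew-member;
```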
Inference rules
Let’s get down to the fun part, beginning with an example. Imagine that you are building a graph database containing geographical information. Say you want to put Italian cities, provinces and regions in your database (think of them as the equivalent of American cities, counties, and states if you are not familiar with the terms). Imagine that you have found your data while browsing the internet, so the data is, at best, incomplete.
Crawling around the interwebs, you have discovered the Wikipedia page of the small town of Policoro. That page tells you that Policoro is in the Basilicata region and it belongs to the province of Matera.
If I tell you that the even smaller town of Bernalda resides in the province of Matera as well, can you tell me in which region Bernalda is?
If you think I have suddenly gone crazy, because I just told you that Bernalda is in Basilicata, please notice that I have not: I haven’t even told you in which region the province of Matera is. You deduced (inferred is the correct logical term) that piece of information from incomplete data. Let’s break it down. I have told you three things:
- Policoro is in the province of Matera
- Policoro is in the Basilicata region
- Bernalda is in the province of Matera
What your brain did is:
- put together the first two points to deduce that the province of Matera is in the Basilicata region
- add this new piece of information to point 3 above to infer, correctly, a fourth fact: Bernalda is in Basilicata
Without even realising it, you have probably gained new knowledge about Italian towns, because you learned a long time ago how “being part of” works. The problem is: while all this is easy for you, because your brain is smart, it is not for your computer, nor for your database system. Unless you have a reasoner, that is.
Introducing the Grakn Reasoner
If you have downloaded or built the Grakn stack, you do have a reasoner.
A reasoner is a piece of software that, given a set of inference rules, is able to infer new information. In the example above, we can tell the reasoner how “being part of” works, i.e. we define the two rules:
- IF a city is part of a province and it is part of a region THEN the province is part of the same region
- IF a city is part of a province and the province is part of a region THEN the city is part of the same region
And the reasoner will be able to deduce, if asked, that Bernalda is in the Basilicata region.
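Expressed in early Graql syntax, and assuming a hypothetical schema in which cities, provinces and regions are linked by a generic is-located-in relation (the names here are made up for the example), the two rules might be sketched like this:

```graql
# Rule 1: IF a city is in a province AND in a region,
# THEN that province is in that region.
insert "province-in-region" isa inference-rule,
lhs {match
    $c isa city; $p isa province; $r isa region;
    ($c, $p) isa is-located-in;
    ($c, $r) isa is-located-in;
    select $p, $r},
rhs {match ($p, $r) isa is-located-in};

# Rule 2: IF a city is in a province AND that province is in a region,
# THEN the city is in that region.
insert "city-in-region" isa inference-rule,
lhs {match
    $c isa city; $p isa province; $r isa region;
    ($c, $p) isa is-located-in;
    ($p, $r) isa is-located-in;
    select $c, $r},
rhs {match ($c, $r) isa is-located-in};
```

With these two rules loaded, asking which region Bernalda is in lets the reasoner chain Rule 1 (Matera is in Basilicata) into Rule 2 (so Bernalda is in Basilicata).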
You might have noticed that the two example rules above have a similar structure: IF something is verified, THEN something else must be true. This is not by chance; in fact, rules must have this kind of structure. The reasoner checks whether all the conditions in the first part of the rule are verified in order to infer the statement in the second part of the rule.
In Graql we call the first part of the rule (the IF part or, if you prefer, the antecedent) simply Left Hand Side and the second part, not surprisingly, Right Hand Side. Both the left and right hand sides of the rule are expressed as a match statement enclosed in curly braces and preceded by, respectively, the keywords lhs and rhs. An inference rule, thus, looks as follows:
insert “rule-id” isa inference-rule,
lhs {match SOME CONDITION},
rhs {match SOME RESULTING CONDITION};
Admittedly, the syntax for the right hand side might look a bit weird (why is there a match statement there?), but bear with me: there are good, but technical and slightly boring, reasons for that.
Let’s then get back to our example movie dataset and think of a couple of interesting rules that we could use to gain new insight from the data we have.
If you browse the schema file of the dataset, you will notice the genre entity type, which has the description resource and gets attached to movies with the has-genre relation. Let’s examine some of the genres that are in the dataset:
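A query along the same lines as the title query earlier would list a few of them (again in early Graql syntax):

```graql
match $g isa genre has description $d; select $d; limit 10
```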
As you can see, since, as I anticipated, the data is quite dirty, there are a few redundant genres here, like “science fiction” and “Sci-Fi”. As we probably want to consider movies with either genre as belonging to the same category, we can use rules to consolidate our data: for example, by telling Graql that movies that have the “Sci-Fi” genre also have the “science fiction” genre. This way, when we look for movies with the “science fiction” genre, we also get the movies tagged as “Sci-Fi”, which is probably what we wanted in the first place.
The rule looks like this:
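Here is a sketch of it in early Graql syntax; the resource and relation names (description, has-genre) follow the schema described above, but the exact form may differ in the current syntax:

```graql
# IF a movie has the "Sci-Fi" genre, and there is a genre
# entity described as "science fiction",
# THEN link that movie to the "science fiction" genre too.
insert "sci-fi-is-science-fiction" isa inference-rule,
lhs {match
    $m isa movie;
    ($m, $g1) isa has-genre;
    $g1 isa genre has description "Sci-Fi";
    $g2 isa genre has description "science fiction";
    select $m, $g2},
rhs {match ($m, $g2) isa has-genre};
```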
As you can see, in the left hand side of the rule we retrieve the movies that have the “Sci-Fi” genre together with the “science fiction” genre entity, then select the movie entity and the genre entity (the select part of the left hand side of the rule is what gets passed to the right hand side), while in the right hand side we link the two retrieved entities into a “has-genre” relation. The syntax may not be immediately obvious, but it is easy to get used to, and the documentation is there to help.
A very similar rule could be used to extract new information about the genre of a movie: for example, to tell Graql that a “romantic comedy” is also a comedy (remember: what sounds obvious to us is not necessarily obvious to the computer), or something similar.
One more thing
Let me show you another example rule before concluding this long post. Suppose we are building a recommendation engine based on our dataset, and we want to give more weight in our recommendations to the more popular movies.
The question is: what is a popular movie? Maybe we can find a list of popular movies on the internet and flag the corresponding entries in our dataset (say, using the “status” resource). But is there a way of deciding whether something outside that list is popular too?
Our dataset contains Rotten Tomatoes scores for our movies, so maybe we could decide that, independently of the entries on the list we found on the internet, any movie that has a lot of votes on Rotten Tomatoes and a high average score should be considered popular.
And here is the code to do that.
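Here is a sketch in early Graql syntax. The resource names (vote-count, average-score) and the thresholds are made up for illustration; check the actual schema file for the real ones:

```graql
# IF a movie has many Rotten Tomatoes votes and a high average score,
# THEN flag it as popular via the status resource.
# Resource names and thresholds are illustrative, not the schema's.
insert "popular-movie" isa inference-rule,
lhs {match
    $m isa movie has vote-count $c has average-score $s;
    $c > 1000; $s > 8.0;
    select $m},
rhs {match $m has status "popular"};
```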
You could go further and define a similar rule to tell Graql that, for example, a “popular actor” is an actor that has starred in at least 5 popular movies, or 1–2 recent popular movies, or whatever, and compound that rule with the one above (yes: inference rules can be chained together, although sometimes the performance can be underwhelming), and go on adding all sorts of rules to infer all sorts of new information from our little dataset. Keep going up to the point where you build your own movie semantic search engine.
Conclusions
It is an exciting time for us at Grakn Labs. We have just released our software stack to the public and it is improving day by day. Things are moving fast and I am still discovering something new that it can do almost every day. But there is only so much that we can imagine by ourselves about how to improve it.
So, if I managed to whet your appetite, I encourage you to download our stack, fork our code, browse the docs and join our community. We are a friendly bunch and we will do what we can to help and make you love our creature as much as we do.
It’s open source: there is nothing to lose.
Stay tuned,
M.
PS: If you are wondering what the picture at the beginning is about, that is the album cover of Pink Floyd’s “A Momentary Lapse of Reason.” I’ll let you deduce its relevance to semantic reasoners.
PPS: Sorry for the lame joke :)