MiB by RadioKirk, Vectorised by MesserWoland [GFDL] via Wikimedia Commons

Grakn Pandas Celebrities

By Sheldon Hall. Published in Vaticle, Nov 11, 2016.

Hi everyone. I am Sheldon, a software engineer at Grakn Labs, where we have developed Grakn: an open-source distributed knowledge graph, and Graql: a knowledge-oriented query language. I have recently been testing out our stack with a few proof-of-concept applications. Inspired by the new Python driver for Grakn, I wanted to do a little statistical analysis of my own using our sample movie data. I think that a knowledge graph can be a really powerful tool for data scientists to query and model their data, and I wanted to try it out for myself.

Tools of the trade

The tools I will be using are Grakn (of course), Python, Pandas, and the Jupyter notebook for recording my progress. The Grakn Python driver takes a Graql query as input and returns a Pandas data frame containing the result, which can be analysed immediately using the tools in Pandas. I am choosing this workflow for exploring data because many data scientists will be familiar with it.

The setup instructions and files that you need are located at the bottom of this article.

Let’s jump straight in at the deep end…

I can easily find some movies using our knowledge graph query language Graql:

match $m isa movie, has title $title;
select $title;
limit 10;

What I really want to find, though, are some famous actors, and I will assume that actors who have starred in high-budget movies are famous. But what counts as a high-budget movie?

match $budget isa budget;
order by $budget desc;
limit 10;

Here I find all of the budgets, sort them in descending order, and limit the output to the ten largest. For this dataset I will assume that a budget above 100,000,000 is big. Finding actors in movies with such a budget is then simply a case of running this query:

match (actor: $x,$y);
$x isa person, has name $name;
$y isa movie, has budget > 100000000.0;
select $name;
distinct;
limit 10;

No offence to these actors, but it doesn’t look like a very credible result. How could I do better?

Finding celebrity

In a graph, the simplest measure of the importance (centrality) of a node is its degree. This is simply the number of edges it has. In Graql we can compute the degree of an entity, which is the number of relationships it is involved in. Better yet, because we have a knowledge graph we can easily restrict the type of relationships that count towards this degree.
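To make the idea of a type-restricted degree concrete, here is a minimal Python sketch. It is not how Grakn computes degrees internally; the relation list, the node names, and the `directed-by` relation type are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy relations as (relation_type, role_players) pairs.
# These are illustrative values, not taken from the movie dataset.
relations = [
    ("has-cast", ("alice", "movie-1")),
    ("has-cast", ("alice", "movie-2")),
    ("has-cast", ("bob", "movie-1")),
    ("directed-by", ("alice", "movie-3")),
]


def degrees(relations, relation_types=None):
    """Count how many relations each node takes part in.

    If relation_types is given, only relations of those types are counted.
    """
    counts = Counter()
    for rel_type, players in relations:
        if relation_types is not None and rel_type not in relation_types:
            continue
        # Every role player in the relation gains one unit of degree.
        counts.update(players)
    return counts


all_degrees = degrees(relations)
cast_degrees = degrees(relations, relation_types={"has-cast"})

print(all_degrees["alice"])   # counts every relation alice is in
print(cast_degrees["alice"])  # counts only has-cast relations
```

Restricting the count to one relation type is what lets the degree mean "number of movies starred in" rather than "number of connections of any kind".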

I will make another assumption: that famous actors star in a lot of movies. In order to figure out how many movies an actor has starred in, we can use a helpful analytics function called degreesAndPersist. We call this command in the Graql shell:

compute degreesAndPersist in person, has-cast;

This function will compute the number of has-cast relations a person has been in and persist it as a resource called degree attached to the person. Let’s confirm that the actors now have a degree resource:

match (actor: $x);
$x isa person, has name $name, has degree $degree;
select $name, $degree;
distinct;
limit 10;

Now we can combine all of the information that we have in the graph to find famous actors:

match (actor: $x,$y);
$x isa person, has name $name, has degree $degree;
$y isa movie, has budget > 100000000.0;
order by $degree desc;
select $name;
distinct;
limit 10;

This Graql query is quite long, but hopefully not too difficult to understand. We are asking for actors:

(actor: $x, $y)

and their degrees:

$x isa person, has name $name, has degree $degree

who have starred in movies with big budgets:

$y isa movie, has budget > 100000000.0

We then sort these actors by their importance ($degree) and display the $name of the top ten.

These look much better! The really nice thing about this is that I have used information from the structure of the graph, the degree, combined with some simple intuition to come up with convincing results.
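For readers more at home in Pandas than in Graql, the same filter, sort, select, distinct, and limit steps can be sketched on a data frame. This is only an analogy, assuming one row per (actor, movie) pair; the column names and all values below are made up for illustration and do not come from the movie data.

```python
import pandas as pd

# Hypothetical rows, one per (actor, movie) pair. All values are invented.
rows = pd.DataFrame({
    "name":   ["A", "A", "B", "C", "D"],
    "degree": [40, 40, 12, 55, 7],
    "budget": [150e6, 200e6, 120e6, 30e6, 180e6],
})

BIG_BUDGET = 100000000.0

# has budget > 100000000.0
big = rows[rows["budget"] > BIG_BUDGET]

# order by $degree desc
ranked = big.sort_values("degree", ascending=False)

# select $name; distinct; limit 10;
top_names = ranked["name"].drop_duplicates().head(10).tolist()

print(top_names)
```

Actor C has the highest degree here but never appears in a big-budget movie, so the budget filter removes them. Actor A appears in two big-budget movies, which is why the distinct step is needed.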

So what?

Well, quite often I will watch a movie because it has a famous actor in it. Does this mean that the movie is actually any good though?

I have “answered” this question by first comparing the means of the ratings of all movies to those with famous actors starring in them:

famousRating = pb.process_graql_query('''
match (actor: $x,$y);
$x isa person, has degree $degree;
$y isa movie, has title $title, has budget > 100000000.0,
has rotten-tomatoes-user-rating $rating;
order by $degree desc;
select $title, $rating;
distinct;
limit 100;
''')
allRating = pb.process_graql_query('''
match isa movie, has title $title,
has rotten-tomatoes-user-rating $rating;
distinct;
''')
# Average only the rating column; the title column holds text.
print(famousRating['$rating'].mean())
print(allRating['$rating'].mean())

and then by running a T-test:

from scipy.stats import ttest_ind
ttest_ind(famousRating['$rating'], allRating['$rating'])
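With its default settings, `ttest_ind` from SciPy performs a two-sample Student's t-test that assumes equal variances. As a rough sketch of the statistic behind that call, the version below uses only the standard library. The function name `pooled_t_statistic` and the ratings are hypothetical, and the p-value is left out because it needs the t distribution, which SciPy supplies.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic with a pooled (equal) variance."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err


# Hypothetical ratings, invented for illustration only.
famous = [3.0, 3.5, 4.0, 3.5]
everything = [2.5, 3.5, 4.5, 3.5, 3.0, 4.0]

t = pooled_t_statistic(famous, everything)
dof = len(famous) + len(everything) - 2  # degrees of freedom of the test

print(t, dof)
```

A t statistic close to zero means the two sample means are indistinguishable relative to the spread of the ratings, which is the situation the analysis ends up in.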

The conclusive answer is that having famous actors in a movie gives no indication of how good the movie is.

Useful? Not really. Interesting? Yes.

Setup

The prerequisites for following this analysis are:

To make things simple it is worth placing all of the downloaded files in the folder in which you unzip your Grakn distribution. Then you can execute these commands in the shell in order:

bin/graql.sh start
bin/graql.sh -f schema.gql
bin/graql.sh -b movie-data.gql

Once the data has finished loading, enter the Graql shell using:

bin/graql.sh

and enter this command:

compute degreesAndPersist in person, has-cast;

The path needs to be set in the Python driver, so open your favourite editor and change this line to point at your Graql shell script:

_GRAQL_PATH = "bin/graql.sh"

Now that you have the data and the degrees successfully loaded and computed you can start the Jupyter notebook and follow my analysis:

jupyter notebook movieAnalysis.ipynb

If you liked this post, please take the time to recommend it or leave us a comment, especially if you try the approach for yourself — we’d love to hear how you get on or what you try out with Grakn. If you’re interested in finding out more about GRAKN.AI, do please consider joining the growing Grakn Community.

I am a software engineer working at Grakn Labs where we are developing a distributed knowledge graph. Applying mathematics to solve problems is what I enjoy.