MiB by RadioKirk, Vectorised by MesserWoland [GFDL] via Wikimedia Commons

Grakn Pandas Celebrities

Using GRAKN.AI and Python to analyse movie data

Sheldon Hall
Nov 11, 2016 · 6 min read

Hi everyone. I am Sheldon, a software engineer at Grakn Labs, where we have developed Grakn: an open-source distributed knowledge graph, and Graql: a knowledge-oriented query language. I have recently been testing out our stack with a few proof-of-concept applications. Inspired by the new Python driver for Grakn, I wanted to do a little statistical analysis of my own using our sample movie data. I think that a knowledge graph can be a really powerful tool for data scientists to query and model their data, and I wanted to try it out for myself.

The tools I will be using are: Grakn (of course), Python, Pandas, and the Jupyter notebook for recording my progress. The Grakn Python driver takes a Graql query as input and returns a Pandas data frame containing the result. This data frame can be analysed immediately using the tools in Pandas. I am choosing this workflow for exploring data because many data scientists will be familiar with it.

The setup instructions and files that you need are located at the bottom of this article.

I can easily find some movies using our knowledge graph query language Graql:

What I really want to find, though, are some famous actors, and I will assume that actors who have starred in high-budget movies are famous. However, what is a high-budget movie?

Here I find all of the budgets and place them in descending order, limiting the output. For this dataset I will assume a budget > 100000000 is big. Then finding actors in movies with this budget is simply a case of running this query:
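For readers who want to see the same idea on the Pandas side, here is a minimal sketch of sorting budgets and applying the threshold. The data frame below is a hypothetical stand-in for what the driver would return; the column names `title` and `budget` and all of the values are assumptions made for illustration, not the real movie data.

```python
import pandas as pd

# Hypothetical stand-in for the data frame returned by the driver.
# Titles and budgets are made up for illustration.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [250_000_000, 40_000_000, 120_000_000, 5_000_000],
})

# Largest budgets first, limited to the top three rows.
top_budgets = movies.sort_values("budget", ascending=False).head(3)

# The threshold assumed in the text for a "big" budget.
BIG_BUDGET = 100_000_000
big_movies = movies[movies["budget"] > BIG_BUDGET]

print(top_budgets)
print(big_movies["title"].tolist())
```

Looking at the sorted budgets first is what makes the choice of threshold a judgment call rather than a guess.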

No offence to these actors, but it doesn’t look like a very credible result. How could I do better?

In a graph, the simplest measure of the importance (centrality) of a node is its degree. This is simply the number of edges it has. In Graql we can compute the degree of an entity, which is the number of relationships it is involved in. Better yet, because we have a knowledge graph we can easily restrict the type of relationships that count towards this degree.
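To make the idea of degree concrete, here is a small sketch in plain Python, independent of Grakn. The edge list, the actor and movie names, and the `directed-by` relation type are all hypothetical; the sketch only counts edges on the person side, and it shows how restricting the count to one relation type changes the result.

```python
from collections import Counter

# Hypothetical edge list: (relation type, person, movie). All names are made up.
edges = [
    ("has-cast", "Actor A", "Movie 1"),
    ("has-cast", "Actor A", "Movie 2"),
    ("has-cast", "Actor B", "Movie 1"),
    ("directed-by", "Actor A", "Movie 3"),
]

def degrees(edges, relation_type=None):
    """Count edges per person, optionally restricted to one relation type."""
    counts = Counter()
    for rel, person, _movie in edges:
        if relation_type is None or rel == relation_type:
            counts[person] += 1
    return counts

all_degrees = degrees(edges)                             # every relation counts
cast_degrees = degrees(edges, relation_type="has-cast")  # only casting relations

print(all_degrees)
print(cast_degrees)
```

Restricting the relation type is what lets the degree mean "number of movies starred in" rather than "number of connections of any kind".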

I will make another assumption: that famous actors star in a lot of movies. In order to figure out how many movies an actor has starred in, we can use a helpful analytics function called degreesAndPersist. We call this command in the Graql shell:

This function will compute the number of has-cast relations a person has been in and persist it as a resource called degree attached to the person. Let’s confirm that the actors now have a degree resource:

Now we can combine all of the information that we have in the graph to find famous actors:

This Graql query is quite long, but hopefully not too difficult to understand. We are asking for actors:

and their degrees:

who have starred in movies with big budgets:

We then sort these actors by their importance ($degree) and display the $name of the top ten.
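The same filter, sort, and limit steps can be sketched in Pandas. This is not the Graql query itself, only an illustration of its logic on a hypothetical data frame with one row per actor and movie pairing; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical rows: one per (actor, movie) pairing, with the actor's degree
# and the movie's budget. All values are made up for illustration.
rows = pd.DataFrame({
    "name": ["Actor A", "Actor A", "Actor B", "Actor C"],
    "degree": [12, 12, 30, 7],
    "budget": [150_000_000, 20_000_000, 200_000_000, 90_000_000],
})

famous = (
    rows[rows["budget"] > 100_000_000]       # starred in a big-budget movie
    .drop_duplicates("name")                 # keep one row per actor
    .sort_values("degree", ascending=False)  # most connected actors first
    .head(10)                                # top ten
)

names = famous["name"].tolist()
print(names)
```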

These look much better! The really nice thing about this is that I have used information from the structure of the graph, the degree, combined with some simple intuition to come up with convincing results.

Well, quite often I will watch a movie because it has a famous actor in it. Does this mean that the movie is actually any good though?

I have “answered” this question by first comparing the means of the ratings of all movies to those with famous actors starring in them:

and then by running a T-test:

The conclusive answer is that having famous actors in a movie gives no indication of how good the movie is.
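As a rough illustration of this kind of comparison, here is a sketch that computes Welch's t statistic for two samples using only the standard library. The choice of Welch's variant is an assumption (the analysis may well have used a different form of the test), and the ratings below are made up rather than taken from the movie data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Made-up ratings for illustration only.
all_ratings = [6.1, 7.4, 5.8, 6.9, 7.0, 6.3]
famous_ratings = [6.5, 6.0, 7.2, 6.8]

t_stat = welch_t(famous_ratings, all_ratings)
print(mean(all_ratings), mean(famous_ratings), t_stat)
```

A t statistic close to zero means the difference between the two means is small relative to the spread of the ratings. With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic along with a p-value.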

Useful? Not really. Interesting? Yes.

The prerequisites for following this analysis are:

To make things simple it is worth placing all of the downloaded files in the folder in which you unzip your Grakn distribution. Then you can execute these commands in the shell in order:

Once the data has finished loading, enter the Graql shell using:

and enter this command:

The path needs to be set in the Python driver, so open your favourite editor and change this line:

Now that you have the data and the degrees successfully loaded and computed you can start the Jupyter notebook and follow my analysis:

If you liked this post, please take the time to recommend it or leave us a comment, especially if you try the approach for yourself — we’d love to hear how you get on or what you try out with Grakn. If you’re interested in finding out more about GRAKN.AI, do please consider joining the growing Grakn Community.
