There R Pandas in my Graph!

Graph-based Data Science with Grakn and Graql

Michelangelo Bucci
Vaticle
4 min read · Oct 18, 2016

“Connectivity becomes a craving” Sherry Turkle

Graql: now with 100% more Pandas (Photo credit: iStock.com/Hung_Chung_Chih)

Simplicity, thy name is Graql

If you have read some of my posts, you know that at Grakn Labs we have built a software stack and query language that help manage and use data in a graph database. The stack also adds a semantic layer on top, which lets you perform inference and other things not particularly related to this post. With Grakn, and particularly with our query language, Graql, we aim to make things simple. Graql itself is easy enough: it won’t take you more than a day or two to learn, even less if you are familiar with other query languages like Cypher or SQL.

Last week we were suddenly struck by the realisation that it would be simple to extract data from a Grakn graph and analyse it with standard data science tools, which is all the more useful since our own analytics component is still at an early stage of development.

When you send a query to Graql, the results returned are, essentially, a table. Wouldn’t it be cool if you could just send a query and easily store the results in any data-frame-like structure?

Turns out, it’s really easy.

Yes, but how?

You will soon be able to call our engine’s REST API directly. If you want to avoid the fuss, though, there is a simpler way: once you have started the Grakn engine and loaded the data you want to query into a graph, the answer lies in this simple command:

graql.sh -e QUERY

If you call the graql script in the shell with the -e option, you can pass it any query, and it will return results that are very easy to parse.

So, to integrate Graql into your favourite data science environment you have to:

  1. Call the graql shell script
  2. Parse the results from the standard output
  3. Store the results into a data frame
  4. Profit!

Believe me, it’s easier than it sounds.
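The steps above can be sketched in a few lines of Python. This is only a rough illustration, not the script we put together: the function names are made up for the example, and the result format in the regular expression (one result per line, written as `$name value literal;` pairs) is an assumption that you should adapt to whatever your version of graql.sh actually prints.

```python
import re
import subprocess

# ASSUMPTION: each result line looks like
#   $title value "Some Movie"; $budget value 1000000;
# Adjust this pattern to the output your Grakn version really produces.
BINDING = re.compile(r'\$(\w+)\s+value\s+("(?:[^"\\]|\\.)*"|[^;\s]+)')


def parse_value(token):
    """Turn one printed literal into a Python str, int, or float."""
    if token.startswith('"'):
        return token[1:-1].replace('\\"', '"')
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token  # anything else (e.g. a boolean) is kept as text


def parse_results(text):
    """Step 2: turn the shell's standard output into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        row = {name: parse_value(tok) for name, tok in BINDING.findall(line)}
        if row:  # skip blank lines and anything that is not a result
            rows.append(row)
    return rows


def run_query(query, graql="graql.sh"):
    """Step 1: call the graql script with -e and capture what it prints."""
    done = subprocess.run([graql, "-e", query],
                          stdout=subprocess.PIPE, text=True, check=True)
    return done.stdout


def query_to_dataframe(query):
    """Step 3: go from a query string to a pandas DataFrame in one call."""
    import pandas as pd  # imported here so the parser works without pandas
    return pd.DataFrame(parse_results(run_query(query)))
```

With graql.sh on your PATH and the engine running, `query_to_dataframe(your_query)` would then give you one column per query variable. Keeping the parser separate from the subprocess call means you can test it on a saved chunk of output without a running engine.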

In fact, we were so excited by this realisation, and it is so easy to do, that we immediately put together a couple of quick and dirty scripts, one in R and one in Python, to do exactly that. They are not wrappers or complete drivers, and they certainly are not polished yet (we will work on them, I promise), but they do run perfectly fine and allow you to interface Graql with Pandas or any R package you like.

Just as an example, imagine that you have stored our sample movie dataset in your graph and you want to extract a list of movies with their budget and a few properties related to their Rotten Tomatoes scores, and store the result in a data frame.

If you are working with R, you just need this:

If, on the other hand, you prefer Python and Pandas:

That’s it. In both cases, after having loaded our scripts, you need just a single command, and you are ready to analyse your data.

And you can check, for example, that the budget does not seem to influence the Rotten Tomatoes score of a movie much.

On the other hand, there seems to be some correlation between the budget and the number of votes, which makes sense, since a bigger budget means more marketing.
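If you prefer a number to a plot, checks like the two above come down to a correlation coefficient. Here is a minimal, dependency-free sketch that works on a list of row dicts; the column names you pass in (say `budget` and `votes`) are assumptions about how you named your query variables, and the helper names are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def column_correlation(rows, a, b):
    """Correlate two columns of a list of row dicts, skipping incomplete rows."""
    pairs = [(r[a], r[b]) for r in rows if a in r and b in r]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])
```

Once the results are in a Pandas data frame, `df["budget"].corr(df["votes"])` computes the same Pearson coefficient for you; the hand-written version is here only to show what is being measured.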

Or maybe you want to use a network analysis package to visualise and analyse the network of actors and movies in the dataset…

Actors are in red, movies in blue

Or maybe you want to do something else entirely; it’s up to you. It only takes one command.

That’s all folks!

So here you have it: you can easily take advantage of our stack to structure your data in a graph database, exploit the reasoner and, now, use your favourite data analysis package.

If you use something different from R or Pandas, if you want more features added to the interface scripts, if you have questions or if you just want to say “Hi”, please join us on our community site or Slack channel: we are more than happy to help.

Happy data science-ing!
M.

Michelangelo Bucci
Discrete mathematician/Theoretical computer scientist, learner, curious about stuff.