Loading data into a Grakn Knowledge Graph using the Java client
This tutorial may be out of date against the latest version of Grakn. For the most up-to-date version of this tutorial, please refer to the Grakn Documentation.
This tutorial illustrates how a dataset in CSV, JSON or XML format can be migrated into a Grakn knowledge graph, using Grakn’s Java Client.
The knowledge graph that we’ll work on in this post is called phone_calls.
The schema for this knowledge graph was defined in a previous post, here.
If you’re already familiar with Grakn, and all you need is a migration example to follow, you’ll find this Github repository useful. If, on the other hand, you’re not familiar with the technology, make sure to first complete defining the schema for the phone_calls knowledge graph, and read on for a detailed guide on migrating data into Grakn using Java.
A Quick Look into the phone_calls Schema
Before we get started with migration, let’s have a quick reminder of what the schema for the phone_calls knowledge graph looks like.
Migrate Data into Grakn
Let’s go through an overview of how the migration takes place.
- First, we need to talk to our Grakn keyspace. To do this, we’ll use Grakn’s Java Client.
- We’ll go through each data file, extracting each data item and parsing it to a JSON object.
- We’ll pass each data item (in the form of a JSON object) to its corresponding template. What the template returns is a Graql query for inserting that item into Grakn.
- We’ll execute each of those queries to load the data into our target keyspace, phone_calls.
Before moving on, make sure you have Java 1.8 installed and the Grakn server running on your machine.
Getting Started
Create a new Maven project
This project uses SDK 1.8 and is named phone_calls. I’ll be using IntelliJ as the IDE.
Set Grakn as a dependency
Modify pom.xml to include the latest version of Grakn (1.4.2) as a dependency.
Configure logging
We’d like to be able to configure what Grakn logs out. To do this, modify pom.xml to exclude the slf4j binding shipped with grakn and add logback as a dependency instead.
Next, add a new file called logback.xml with the content below and place it under src/main/resources.
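As a reference point, a minimal logback.xml that prints INFO-level output to the console might look like this (the pattern and level here are just starting points to adjust):

```xml
<configuration>
    <!-- send all log output to the console -->
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    <!-- INFO keeps the migration progress visible without debug noise -->
    <root level="INFO">
        <appender-ref ref="STDOUT"/>
    </root>
</configuration>
```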
Create the Migration Class
Under src/main, create a new file called Migration.java. This is where we’re going to write all our code.
Including the Data Files
Pick one of the data formats below and download the files. After you download each of the four files, place them under the src/main/resources/data directory. We’ll be using these to load their data into our phone_calls knowledge graph.
CSV: companies | people | contracts | calls
JSON: companies | people | contracts | calls
XML: companies | people | contracts | calls
All code that follows is to be written in Migration.java.
Specifying details for each data file
Before anything, we need a structure to contain the details required for reading data files and constructing Graql queries. These details include:
- The path to the data file, and
- The template function that receives a JSON object and produces a Graql insert query.
For this purpose, we create a new abstract class called Input.
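A sketch of what this class can look like follows. To keep the sketch self-contained, a plain java.util.Map stands in for the mjson Json object that the tutorial passes to template():

```java
import java.util.Map;

// Each Input instance pairs a data-file path with a template function.
// Concrete subclasses supply the template; the migration loop only needs
// these two pieces of information per data file.
abstract class Input {
    private final String path;

    Input(String path) {
        this.path = path;
    }

    public String getDataPath() {
        return path;
    }

    // Receives one data item and returns a Graql insert query for it.
    public abstract String template(Map<String, Object> item);
}
```

Each concrete Input supplies its own template; later, initialiseInputs() will instantiate one per data file.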
Later in this article, we’ll see how an instance of the Input class can be created, but before we get to that, let’s add the mjson dependency to the dependencies tag in our pom.xml file.
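The mjson artifact is published on Maven Central under the org.sharegov group, so the entry inside the dependencies tag looks along these lines (the version shown is illustrative; check Maven Central for the latest):

```xml
<!-- mjson: the small JSON library used to represent each data item -->
<dependency>
    <groupId>org.sharegov</groupId>
    <artifactId>mjson</artifactId>
    <version>1.4.0</version>
</dependency>
```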
Time to initialise the inputs.
The code below calls the initialiseInputs() method, which returns a collection of inputs. We’ll then use each input element in this collection to load each data file into Grakn.
Input instance for a Company
input.getDataPath() will return data/companies.
Given company is
{ name: "Telecom" }
input.template(company) will return
insert $company isa company has name "Telecom";
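A sketch of this template as a standalone method (a plain Map stands in for the mjson Json object, and the class name here is just for illustration):

```java
import java.util.Map;

// Builds the Graql insert query for one company data item.
class CompanyTemplate {
    static String template(Map<String, Object> company) {
        return "insert $company isa company has name \"" + company.get("name") + "\";";
    }
}
```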
Input instance for a Person
input.getDataPath() will return data/people.
Given person is
{ phone_number: "+44 091 xxx" }
input.template(person) will return
insert $person isa person has phone-number "+44 091 xxx";
And given person is
{ first_name: "Jackie", last_name: "Joe", city: "Jimo", age: 77, phone_number: "+00 091 xxx" }
input.template(person) will return
insert $person isa person has phone-number "+00 091 xxx" has first-name "Jackie" has last-name "Joe" has city "Jimo" has age 77;
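A sketch of this template as a standalone method: the phone number is always present, while the remaining attributes are appended only when the data item carries them (a plain Map stands in for the mjson Json object; the class name is illustrative):

```java
import java.util.Map;

// Builds the Graql insert query for one person data item,
// handling both the phone-number-only and the fully-detailed case.
class PersonTemplate {
    static String template(Map<String, Object> person) {
        // isa person is needed so that a new entity gets inserted
        StringBuilder query = new StringBuilder("insert $person isa person has phone-number \"")
                .append(person.get("phone_number")).append("\"");
        if (person.containsKey("first_name")) {
            query.append(" has first-name \"").append(person.get("first_name")).append("\"")
                 .append(" has last-name \"").append(person.get("last_name")).append("\"")
                 .append(" has city \"").append(person.get("city")).append("\"")
                 .append(" has age ").append(person.get("age"));
        }
        return query.append(";").toString();
    }
}
```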
Input instance for a Contract
input.getDataPath() will return data/contracts.
Given contract is
{ company_name: "Telecom", person_id: "+00 091 xxx" }
input.template(contract) will return
match $company isa company has name "Telecom"; $customer isa person has phone-number "+00 091 xxx"; insert (provider: $company, customer: $customer) isa contract;
Input instance for a Call
input.getDataPath() will return data/calls.
Given call is
{ caller_id: "+44 091 xxx", callee_id: "+00 091 xxx", started_at: 2018-08-10T07:57:51, duration: 148 }
input.template(call) will return
match $caller isa person has phone-number "+44 091 xxx"; $callee isa person has phone-number "+00 091 xxx"; insert $call(caller: $caller, callee: $callee) isa call; $call has started-at 2018-08-10T07:57:51; $call has duration 148;
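A sketch of this match-insert pattern as a standalone method (the contract template follows the same shape). The match clause looks up the caller and callee by phone number; the insert clause then creates the call relationship between them. As before, a plain Map stands in for the mjson Json object:

```java
import java.util.Map;

// Builds the Graql match-insert query for one call data item.
class CallTemplate {
    static String template(Map<String, Object> call) {
        return "match $caller isa person has phone-number \"" + call.get("caller_id") + "\"; "
             + "$callee isa person has phone-number \"" + call.get("callee_id") + "\"; "
             + "insert $call(caller: $caller, callee: $callee) isa call; "
             + "$call has started-at " + call.get("started_at") + "; "
             + "$call has duration " + call.get("duration") + ";";
    }
}
```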
Connect and Migrate
Now that we have the datapath and template defined for each of our data files, we can continue to connect with our phone_calls knowledge graph and load the data into it.
connectAndMigrate(Collection<Input> inputs) is the only method that will be fired to initiate migration of the data into the phone_calls knowledge graph.
The following happens in this method:
- A Grakn instance grakn is created, connected to the server we have running locally at localhost:48555.
- A session is created, connected to the keyspace phone_calls.
- For each input object in the inputs collection, we call loadDataIntoGrakn(input, session). This will take care of loading the data as specified in the input object into our keyspace.
- Finally, the session is closed.
Loading the data into phone_calls
Now that we have a session connected to the phone_calls keyspace, we can move on to actually loading the data into our knowledge graph.
In order to load data from each file into Grakn, we need to:
- retrieve an ArrayList of JSON objects, each of which represents a data item. We do this by calling parseDataToJson(input), and
- for each JSON object in items: a) create a transaction tx, b) construct the graqlInsertQuery using the corresponding template, c) run the query, d) commit the transaction, and e) close the transaction.
Note on creating and committing transactions: To avoid running out of memory, it’s recommended that every single query gets created and committed in a single transaction. However, for faster migration of large datasets, this can happen once for every n queries, where n is the maximum number of queries guaranteed to run on a single transaction.
Now that we’ve done all the above, we’re ready to read each file and parse each data item to a JSON object. It’s these JSON objects that will be passed to the template method on each Input object.
We’re going to write the implementation of parseDataToJson(input).
DataFormat-specific implementation
The implementation for parseDataToJson(input) differs based on the format of our data files: .csv, .json or .xml.
But regardless of the data format, we need the right setup to read the files line by line. For this, we’ll use an InputStreamReader.
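The shared file-reading setup can be sketched as follows. The helper below simply collects lines, which is enough to show the shape each format-specific parser builds on (the class and method names here are just for illustration):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Wraps the data file's stream in an InputStreamReader and reads it line by line.
class FileLines {
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```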
Parsing CSV
We’ll use the Univocity CSV Parser for parsing our .csv files. Let’s add the dependency for it. We need to add the following to the dependencies tag in pom.xml.
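The Univocity parsers are published under the com.univocity group, so the entry is along these lines (the version shown is illustrative; check Maven Central for the latest):

```xml
<dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>2.7.6</version>
</dependency>
```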
Having done that, we’ll write the implementation of parseDataToJson(input) for parsing .csv files.
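The flow of that implementation can be sketched without the Univocity dependency: the first row supplies the keys, and every following row becomes one data item. The naive comma split below ignores quoting and escaping, which Univocity handles properly, and a Map stands in for the mjson Json object:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Turns CSV lines into a list of data items keyed by the header row.
class CsvToItems {
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> items = new ArrayList<>();
        String[] columns = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",");
            Map<String, String> item = new HashMap<>();
            for (int i = 0; i < columns.length; i++) {
                // a missing trailing value is stored as null
                item.put(columns[i], i < values.length ? values[i] : null);
            }
            items.add(item);
        }
        return items;
    }
}
```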
Besides this implementation, we need to make one more change.
Given the nature of CSV files, the JSON object produced will have all the columns of the .csv file as its keys; even when a value is missing, the key is present and its value is taken as null.
For this reason, we need to change one line in the template method for the input instance for person:
if (! person.has("first_name")) {...}
becomes
if (person.at("first_name").isNull()) {...}.
Reading JSON
We’ll use Gson’s JsonReader for reading our .json files. Let’s add the dependency for it. We need to add the following to the dependencies tag in pom.xml.
Having done that, we’ll write the implementation of parseDataToJson(input) for reading .json files.
Parsing XML
We’ll use Java’s built-in StAX for parsing our .xml files.
For parsing XML data, we need to know the name of the target tag. This needs to be declared in the Input class and specified when constructing each input object.
And now for the implementation of parseDataToJson(input) for parsing .xml files.
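Since StAX ships with the JDK, the core of that implementation can be sketched self-containedly. Every element nested inside the target tag (for example, each child of a <company> element) becomes one key/value of the current data item; a Map stands in for the mjson Json object and the class name is illustrative:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Streams through the XML and collects one data item per target tag.
class XmlToItems {
    static List<Map<String, String>> parse(String xml, String targetTag) throws XMLStreamException {
        List<Map<String, String>> items = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Map<String, String> item = null;  // the data item currently being built
        String field = null;              // the child element currently being read
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (r.getLocalName().equals(targetTag)) item = new HashMap<>();
                    else if (item != null) field = r.getLocalName();
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (field != null && item != null && !r.isWhiteSpace()) item.put(field, r.getText().trim());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (r.getLocalName().equals(targetTag)) { items.add(item); item = null; }
                    field = null;
                    break;
            }
        }
        return items;
    }
}
```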
Putting it all together
Here is how our Migration.java looks for loading CSV data into Grakn, and find here the ones for JSON and XML files.
Time to Load
Run the main
method, sit back, relax and watch the logs while the data starts pouring into Grakn.
To Recap
- We started off by setting up our project and positioning the data files.
- Next, we went on to set up the migration mechanism, one that was independent of the data format.
- Then, we learned how files with different data formats can be parsed into JSON objects.
- Lastly, we ran the main method, which fired the connectAndMigrate method with the given inputs. This loaded the data into our Grakn knowledge graph.
Next
In the next post (to be published soon), we’ll see how we can get insights over this dataset by querying the phone_calls knowledge graph using the Graql console, Grakn Workbase and the Java client. Stay tuned!