Evolving Data Models with JanusGraph

Published in Enharmonic, Feb 13, 2019

Over the next few weeks, we’ll be taking an episodic look at building a graph data system to tackle interesting problems in interesting ways. There’s no better place to start than right in the middle — the data model. After all, a key feature of a graph data system is the tremendous flexibility it gives us to explicitly describe our data. In its ideal state, our graph data system would offer infinite flexibility for data storage and retrieval. A lot of this flexibility comes from being able to take an existing graph and change its data model to reflect new understandings about our data.

To demonstrate this, we’ll iteratively build our data model, discover some pitfalls to avoid, and check out how to improve the data model in a working database without needing to drop and rebuild the entire thing. We’ll look at some sample code and real data. All our experiments can be run with a simple in-memory JanusGraph instance running from the Gremlin Console. Production systems will be a bit more complex, but this setup works just fine for our data modeling examples.

To begin our applied graph data modeling exercise, we of course start with some data. Let’s choose something from a domain that’s a bit different from most of the examples out in the wild: data on concert performances by U.S. symphony orchestras. The Baltimore Symphony Orchestra has done a great job compiling some of this data, which they use to analyze different aspects of professional orchestras’ program choices. For this example, we’ll be looking at a single season of data from 2016-2017.

As we can see, the raw data is pretty straightforward.

To build an appropriate schema, we start by conceptualizing our data into different logical categories. Broadly, we care about what was played, as well as who played it. The what are Concerts consisting of multiple individual Pieces, each written by a Composer. The who is the Orchestra (a group of many players), and a couple of individual artists of interest: the Conductor and the Soloist.

This means we’re modeling 6 distinct elements:

Concert
Piece
Orchestra
Composer
Conductor
Soloist

But we can immediately find some areas for improvement. It’s clear that some Conductors might also be Composers (like Esa-Pekka Salonen) or Soloists (like Itzhak Perlman). We prefer to not have two separate vertices representing the same distinct individual, so we’ll roll these up under a single Artist label. We’ll use the relationship of that person to a Concert to describe their role in the proceedings. We can keep Orchestra separate because it is an organization made up of different individual artists. We’ll also prefer the more generic term Work over Piece, to encompass a broader variety of music.

This leaves us with 4 vertex labels:

Concert
Work
Orchestra
Artist

We can begin to connect everything together with edges like this:

This is our starting point, though there is certainly room for improvement. If some of the pitfalls of this model have already caught your eye — be patient! Evolving the schema on our graph is the key part of this exercise.

Defining an Initial Data Model

Let’s get to coding. We’ll be defining our schema in the Gremlin Console with the JanusGraph Management API.

We start by opening up the Gremlin Console and providing a simple inline configuration for our JanusGraph instance. We also open the JanusGraph Management API. There are many more things to say about configuring JanusGraph — like removing schema defaults and turning on schema constraints — but we’ll tackle those in a separate episode.

$ bin/gremlin.sh
gremlin> graph = JanusGraphFactory.build().
  set('storage.backend', 'inmemory').open()
gremlin> mgmt = graph.openManagement()

Going forward we’ll assume we’re already working in our Gremlin Console, so we’ll leave out the gremlin> prompt for readability and conciseness.

First we define our Vertex and Edge labels. Edge Multiplicity choices can be tricky, but our choices here — SIMPLE and MANY2ONE — should serve us well for this example.

// Vertices
Orchestra = mgmt.makeVertexLabel('Orchestra').make()
Artist = mgmt.makeVertexLabel('Artist').make()
Work = mgmt.makeVertexLabel('Work').make()
Concert = mgmt.makeVertexLabel('Concert').make()

// Edges
COMPOSER = mgmt.makeEdgeLabel('COMPOSER').
  multiplicity(MANY2ONE).make()
SOLOIST = mgmt.makeEdgeLabel('SOLOIST').
  multiplicity(SIMPLE).make()
CONDUCTOR = mgmt.makeEdgeLabel('CONDUCTOR').
  multiplicity(SIMPLE).make()
ORCHESTRA = mgmt.makeEdgeLabel('ORCHESTRA').
  multiplicity(SIMPLE).make()
INCLUDES = mgmt.makeEdgeLabel('INCLUDES').
  multiplicity(SIMPLE).make()

We successively define all the properties for each Vertex label, and attach them to their respective labels. String is a reasonable choice for a default data type. If it looks like there may be parsing errors, and we just want to begin testing our schema, Strings work just fine! A cardinality of SINGLE is likewise a convenient default, and is perfect for this data model. We can always iteratively improve these over time. Finally, we create the connections between edges and vertex labels.

// Define Vertex Property Keys
// Orchestra
name = mgmt.makePropertyKey('name').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Orchestra, name)

// Artist
lastName = mgmt.makePropertyKey('lastName').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
firstName = mgmt.makePropertyKey('firstName').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
gender = mgmt.makePropertyKey('gender').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
nationality = mgmt.makePropertyKey('nationality').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
deceased = mgmt.makePropertyKey('deceased').
  dataType(Boolean.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Artist, lastName, firstName, gender,
  nationality, deceased)

// Work
title = mgmt.makePropertyKey('title').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
compositionYear = mgmt.makePropertyKey('compositionYear').
  dataType(Integer.class).cardinality(Cardinality.SINGLE).make()
soloInstrument = mgmt.makePropertyKey('soloInstrument').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Work, title, compositionYear, soloInstrument)

// Concert
firstDate = mgmt.makePropertyKey('firstDate').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
numShows = mgmt.makePropertyKey('numShows').
  dataType(Integer.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Concert, name, firstDate, numShows)
// Define connections as (edgeLabel, outVertexLabel, inVertexLabel)
mgmt.addConnection(COMPOSER, Work, Artist)
mgmt.addConnection(SOLOIST, Work, Artist)
mgmt.addConnection(CONDUCTOR, Work, Artist)
mgmt.addConnection(ORCHESTRA, Concert, Orchestra)
mgmt.addConnection(INCLUDES, Concert, Work)

mgmt.commit()

We finish off by committing our management transaction. Don’t forget this last part, or you’re certain to encounter some frustrating errors later. Also note that variables referencing vertex and edge labels and property keys are only valid in the current management API transaction. If we commit or close our transaction, we’ll need to open up another one if we plan on doing anything with these variables.
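As a quick sanity check, the committed schema can be inspected from a fresh management transaction. The following is a minimal sketch, assuming the `graph` variable from above is still open and a JanusGraph version that provides `printSchema()`; the names `check` and `artistLabel` are only illustrative. Because handles such as `Artist` from the committed transaction are no longer valid, everything is looked up again by name.

```groovy
// Open a new management transaction; handles from the committed one are stale
check = graph.openManagement()

// Look the schema elements up again by name
assert check.containsVertexLabel('Artist')
assert check.containsEdgeLabel('COMPOSER')
assert check.containsPropertyKey('lastName')

artistLabel = check.getVertexLabel('Artist')
assert artistLabel.name() == 'Artist'

// The multiplicity we chose for COMPOSER should have been stored with the label
assert check.getEdgeLabel('COMPOSER').multiplicity().toString() == 'MANY2ONE'

// Print a summary of labels and property keys to the console
println check.printSchema()

// Nothing was changed here, so roll back instead of committing
check.rollback()
```

Rolling back at the end keeps this strictly read-only, so it can be repeated at any point while iterating on the schema.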

A Quick Aside on Schema Naming

Choosing a schema naming convention fundamentally comes down to three things:

The particulars of the data you’re modeling
The conventions of other data systems or applications you might be interacting with (don’t change existing conventions just for the sake of changing them — it ruins any existing documentation and just makes for more work all around)
Personal preference

I personally lean toward “PascalCase” vertex labels, “ALL_CAPS_UNDERSCORE” edge labels, and “camelCase” property names. This clearly distinguishes between the different graph elements during discussion, and also happens to align with the common Neo4j style conventions. I’m admittedly a bit obsessive about good naming, but I try not to get too pedantic about the parts of speech I use for edge labels. If a noun is the clearest way to express the relationship — go for it. So we’re using COMPOSER and SOLOIST as some of our edge labels, because those are the simplest and most concise ways of expressing the relationships in question.

Answering Questions with our Graph

For an initial peek into our model’s effectiveness, we’ll load the following data into our graph:

To make this happen, you can run a complete initial setup script (found here), which includes the data model definition and sample data load:

$ bin/gremlin.sh -i InitialSetup.groovy

Let’s see how useful our data model is in answering questions.

What musical activities has Esa-Pekka Salonen been involved in?

g.V().has('Artist', 'lastName', 'Salonen').
  inE().outV().in('INCLUDES').order().
  path().by('lastName').by(label).by('title').by('name')
==>[Salonen,CONDUCTOR,Also sprach Zarathustra,
    Esa-Pekka Salonen Conducts US Premiere by Tansy Davies]
==>[Salonen,COMPOSER,Wing on Wing,
    Premieres by Esa-Pekka Salonen and Anna Thorvaldsdottir]
==>[Salonen,COMPOSER,Cello Concerto,Salonen & Yo-Yo Ma]
==>[Salonen,CONDUCTOR,Cello Concerto,Salonen & Yo-Yo Ma]

Starting with our Artist, Esa-Pekka Salonen, we traverse through any incoming edges, which gets us the Works that Salonen has been involved with, then finish our traversal with the Concerts which featured these pieces. We use path() so that we can view more of the details of this traversal. We see that Salonen conducted several pieces, had several of his compositions performed, and in one case conducted his own piece.

Who conducted Salonen’s pieces?

g.V().has('Artist', 'lastName', 'Salonen').
  in('COMPOSER').out('CONDUCTOR').
  path().by('lastName').by('title').by('lastName')
==>[Salonen,Cello Concerto,Salonen]
==>[Salonen,Wing on Wing,Gilbert]

Great, so it’s easy to find out that Salonen conducted his own Cello Concerto, and Alan Gilbert conducted Salonen’s Wing on Wing. But there’s a huge problem: when did Salonen conduct his Cello Concerto? With which orchestra? While we can find the orchestras that have performed this piece, and the conductors who have conducted it, it’s not really possible to link them up. This means that even a simple question, such as which orchestras someone has conducted, would be very difficult to answer.

Let’s explore this further. We currently have stored a single performance of Salonen’s Cello Concerto, in which he conducted the Chicago Symphony and Yo-Yo Ma performed as soloist. Let’s say we want to include another concert that featured this same piece, but featuring a different orchestra, conductor, and soloist. This means that we will be connecting additional conductors and soloists to a single Work vertex. Unfortunately, this means that we can no longer distinguish which soloist and conductor performed together.

Did Esa-Pekka Salonen perform with Alisa Weilerstein or Yo-Yo Ma? Who knows…

While we could try to tackle this with additional properties on our edges, or metadata, it’s clear that our data model is fundamentally lossy.
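The ambiguity can be reproduced on a throwaway graph. This is a rough sketch, assuming the TinkerGraph plugin is active in the Gremlin Console; the data is a hypothetical toy set (the conductor name `OtherConductor` is a placeholder, not taken from the dataset), and `tg`, `t` and `pairs` are illustrative variable names chosen to avoid clobbering `graph` and `g`.

```groovy
// Hypothetical toy data on a separate in-memory TinkerGraph
tg = TinkerGraph.open()
t = tg.traversal()

// One Work, two conductors and two soloists, all attached directly to the Work
t.addV('Work').property('title', 'Cello Concerto').as('w').
  addV('Artist').property('lastName', 'Salonen').as('c1').
  addV('Artist').property('lastName', 'OtherConductor').as('c2').
  addV('Artist').property('lastName', 'Ma').as('s1').
  addV('Artist').property('lastName', 'Weilerstein').as('s2').
  addE('CONDUCTOR').from('w').to('c1').
  addE('CONDUCTOR').from('w').to('c2').
  addE('SOLOIST').from('w').to('s1').
  addE('SOLOIST').from('w').to('s2').iterate()

// Try to pair conductors with soloists through the shared Work vertex
pairs = t.V().hasLabel('Work').as('w').
  out('CONDUCTOR').as('cond').
  select('w').out('SOLOIST').as('solo').
  select('cond', 'solo').by('lastName').toList()

// Two real performances, but the model yields every combination
assert pairs.size() == 4
```

The traversal cannot tell the two real pairings apart from the two spurious ones, because nothing in the graph records which conductor and soloist shared a stage.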

The problem is that our graph has no concept that distinguishes a Work (a piece of music written once by a composer and, hopefully, performed many times) from a Performance of that work. This isn’t a distinction that is unique to classical music. We’d have the same issue if we confused the song Live and Let Die by Paul and Linda McCartney with a cover performance of Live and Let Die (say by Guns N’ Roses).¹

The easiest way to solve this problem is to define a new vertex that represents the concept of a performance. The Performance vertex serves as an intermediary between the Work and the performing Artists.

A Second Attempt

In the context of the whole data model, this gives us the following:

Now that we have our new schema figured out, let’s define our Performance vertex and connect it to the rest of the graph’s elements.

// We'll need a new Management API transaction
mgmt = graph.openManagement()

// Vertex Label
Performance = mgmt.makeVertexLabel('Performance').make()

// Properties
performanceDate = mgmt.makePropertyKey('performanceDate').
  dataType(String.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Performance, performanceDate)

// Define a new Edge
PERFORMED = mgmt.makeEdgeLabel('PERFORMED').
  multiplicity(ONE2MANY).make()

// We need to retrieve our labels within our open transaction
Orchestra = mgmt.getVertexLabel('Orchestra')
Artist = mgmt.getVertexLabel('Artist')
Work = mgmt.getVertexLabel('Work')
Concert = mgmt.getVertexLabel('Concert')
SOLOIST = mgmt.getEdgeLabel('SOLOIST')
CONDUCTOR = mgmt.getEdgeLabel('CONDUCTOR')
ORCHESTRA = mgmt.getEdgeLabel('ORCHESTRA')
INCLUDES = mgmt.getEdgeLabel('INCLUDES')

// Create new connections
mgmt.addConnection(SOLOIST, Performance, Artist)
mgmt.addConnection(CONDUCTOR, Performance, Artist)
mgmt.addConnection(ORCHESTRA, Performance, Orchestra)
mgmt.addConnection(INCLUDES, Concert, Performance)
mgmt.addConnection(PERFORMED, Work, Performance)

mgmt.commit()

Now, we could drop our data from our graph and re-insert it with new load scripts that reflect our updated schema definition. But that process can be time-consuming, and once we reach a dataset of meaningful scale, it’s pretty wasteful. Plus, don’t we have our ideal of an Infinitely Flexible data system? So let’s try to evolve our schema with a few quick traversals.

Let’s start by creating a single Performance for each Concert. We’ll just find each Concert’s firstDate property (since we were only provided with a single date in our Baltimore Symphony dataset), and use its value as the Performance’s performanceDate. We also connect the Performance to both the Concert and the Work.

g.V().hasLabel('Work').as('w').in('INCLUDES').
  hasLabel('Concert').as('c').
  map(addV('Performance').as('p').
        property('performanceDate', values('firstDate')).
      addE('PERFORMED').from('w').
      select('p').addE('INCLUDES').from('c')).iterate()
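Before moving any more edges around, it can be worth checking that this step did what we expected and then committing it. The sketch below assumes that `g` was created with `graph.traversal()` (so it shares the graph's current transaction) and that no Performance vertices existed before the migration; `expected`, `created` and `malformed` are illustrative names.

```groovy
// Each (Concert, Work) pair should have produced exactly one Performance
expected = g.V().hasLabel('Concert').out('INCLUDES').hasLabel('Work').count().next()
created = g.V().hasLabel('Performance').count().next()
assert created == expected

// Every Performance should hang off exactly one Work and exactly one Concert
malformed = g.V().hasLabel('Performance').
  or(__.in('PERFORMED').count().is(neq(1)),
     __.in('INCLUDES').count().is(neq(1))).count().next()
assert malformed == 0

// Both checks passed, so persist the new vertices and edges
graph.tx().commit()
```

If either assertion fails, nothing is committed. Calling `graph.tx().rollback()` at that point discards everything that is still uncommitted in the current transaction, so it should only be used if the original data load has already been committed.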

Now, connect the conductor and soloist Artists to each Performance and remove their connections from each Work:

g.V().hasLabel('Performance').as('p').in('PERFORMED').
  outE('CONDUCTOR').as('OLD').inV().as('cond').
  addE('CONDUCTOR').from('p').
  select('OLD').drop().iterate()

g.V().hasLabel('Performance').as('p').in('PERFORMED').
  outE('SOLOIST').as('OLD').inV().as('soloist').
  addE('SOLOIST').from('p').
  select('OLD').drop().iterate()

Finally, we connect the Orchestra to each individual Performance.

g.V().hasLabel('Performance').as('p').in('PERFORMED').
  in('INCLUDES').out('ORCHESTRA').
  addE('ORCHESTRA').from('p').iterate()

For this model, we’re also keeping the existing connection between the Orchestra and the Concert. This will make certain query patterns more concise, and the relationship between the Concert and the primary performance group (the Orchestra) more explicit. The short story of course is that there’s no single “right” answer…it all depends on what you’re trying to do with your data, and what questions you’re trying to answer.
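To make the trade-off concrete, here is a small sketch comparing the two routes from a Concert to its Orchestra. It assumes the migration steps above have been run against the sample data, where each Concert has a single Orchestra; `direct` and `viaPerformance` are illustrative names.

```groovy
// Direct: follow the Concert -> Orchestra edge we chose to keep
direct = g.V().hasLabel('Concert').as('c').
  out('ORCHESTRA').as('o').
  select('c', 'o').by('name').toSet()

// Indirect: go through each Performance on the Concert's program
viaPerformance = g.V().hasLabel('Concert').as('c').
  out('INCLUDES').hasLabel('Performance').
  out('ORCHESTRA').as('o').
  select('c', 'o').by('name').toSet()

// Expected to agree for the sample data; a mismatch would point to
// a Performance that was wired to the wrong Orchestra
assert direct == viaPerformance
```

The direct route is one hop shorter, which is the conciseness mentioned above. The cost is that the same fact is now stored in two places, so a check like this one is a cheap way to confirm they stay consistent.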

Our Performances should now have Conductor, Orchestra and Soloist vertices attached by their respective labels:

g.V().hasLabel('Performance').outE().inV().path().by(label)
==>[Performance,CONDUCTOR,Artist]
==>[Performance,ORCHESTRA,Orchestra]
==>[Performance,SOLOIST,Artist]
==>[Performance,CONDUCTOR,Artist]
==>[Performance,ORCHESTRA,Orchestra]
==>[Performance,CONDUCTOR,Artist]
==>[Performance,ORCHESTRA,Orchestra]

Our Works, on the other hand, should only be linked to a composing Artist and specific Performances of the Work:

g.V().hasLabel('Work').outE().inV().path().by(label)
==>[Work,COMPOSER,Artist]
==>[Work,PERFORMED,Performance]
==>[Work,COMPOSER,Artist]
==>[Work,PERFORMED,Performance]
==>[Work,COMPOSER,Artist]
==>[Work,PERFORMED,Performance]

We can also make a few confirmations with some simple assert statements:

// 3 Performances were created
// Each has a Conductor and an Orchestra; only one has a Soloist
assert 3 == g.V().hasLabel('Performance').count().next()
assert 3 == g.V().hasLabel('Performance').
  out('CONDUCTOR').hasLabel('Artist').count().next()
assert 1 == g.V().hasLabel('Performance').
  out('SOLOIST').hasLabel('Artist').count().next()
assert 3 == g.V().hasLabel('Performance').
  out('ORCHESTRA').hasLabel('Orchestra').count().next()

// Conductor, Soloist, Orchestra are NOT directly connected to Works
assert 0 == g.V().hasLabel('Work').outE('CONDUCTOR').count().next()
assert 0 == g.V().hasLabel('Work').outE('SOLOIST').count().next()
assert 0 == g.V().hasLabel('Work').outE('ORCHESTRA').count().next()

Perfect. Our final graph, data and all, should look like this:

The diagram may be a bit crowded, but our model allows for concise access to all of our data. (The INCLUDES edges between Concert and Performance have been excluded for readability.)

We can now easily find composers who have conducted their own works, as well as retrieve the details of the performance.

g.V().hasLabel('Artist').as('a').
  in('COMPOSER').out('PERFORMED').out('CONDUCTOR').
  where(eq('a')).values('lastName')
==>Salonen

// Or more verbosely to view the path
g.V().hasLabel('Artist').as('a').
  inE('COMPOSER').outV().outE('PERFORMED').inV().
  outE('CONDUCTOR').inV().where(eq('a')).
  path().by('lastName').by(label).by('title').
  by(label).by('performanceDate').by(label).by('lastName')
==>[Salonen,COMPOSER,Cello Concerto,
    PERFORMED,3/9/2017,CONDUCTOR,Salonen]

We can also home in even more closely on what Esa-Pekka Salonen has been doing. For example, which orchestras has he conducted?

g.V().has('Artist', 'lastName', 'Salonen').
  in('CONDUCTOR').out('ORCHESTRA').values('name')
==>New York Philharmonic
==>Chicago Symphony Orchestra
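The question the first model could not answer, namely which soloists a conductor actually shared a stage with, now has a direct traversal as well. This is a sketch against the migrated sample graph; `result` is an illustrative name, and the output is omitted because it depends on the loaded data.

```groovy
// For each Performance Salonen conducted, find the soloist on that same
// Performance, plus the Work performed and the date
result = g.V().has('Artist', 'lastName', 'Salonen').
  in('CONDUCTOR').hasLabel('Performance').as('p').
  out('SOLOIST').as('solo').
  select('p').in('PERFORMED').as('w').
  select('solo', 'w', 'p').
    by('lastName').by('title').by('performanceDate').toList()

result.each { println it }
```

Because the conductor and the soloist both hang off the same Performance vertex, adding further performances of the same Work would no longer blur who played with whom.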

Well, that concludes this look into data modeling with JanusGraph. We’ve seen that it’s easy to incrementally improve the schema as we go — and in doing so take full advantage of the unique flexibility that a graph data system provides.

Footnotes

¹ This is a distinction that lies at the heart of the music royalty and performance rights system. That system requires a much longer discussion, but suffice it to say that if we want to use our graph to understand and manage detailed music performance data, we need to have this distinction as a central part of our graph.