And I want to invite you to share your story: how you came across graphs as an amazing data model for today’s connected world, whether as an individual user, open source collaborator or customer of a graph database like Neo4j. Please respond to this post with your own story, and I’ll make sure to get you a personalized t-shirt to celebrate it.
My own journey began a long time ago, and advanced pretty quickly after I took the red pill. That’s what I want to offer to you. Now sit back and enjoy.
In a world far far away …
It started in 1994, while I was playing in an online text-based adventure world (MUD) — where I luckily also met my wife. Getting efficiently from one place to another was important in that large world. I used my coding skills instead of my memory to solve the challenge. Having skipped the CS lessons on graph theory, and knowing nothing, I just made up a bidirectional shortest path search that worked pretty well and earned me much fame with my mudder friends.
A Geek Cruise
A few years later I met this blonde Swedish guy during a geek cruise on the Baltic Sea. He went by the name of Emil Eifrem and spoke passionately about connected data and ways of efficiently storing, managing and querying it with that graph database thing called Neo4j.
Some hours later I understood what he meant. Working with the graph database, it felt so natural to store data in the shape it originally had, without having to force it into structures it doesn’t naturally fit into. And getting it back out again in the different ways I wanted to, and quickly.
Two years later, I joined Neo Technology, which is now a company of almost 90, as the tenth graph addict.
One of my first stunts with Neo4j dealt with storing the large and small structures, including hierarchies and compositions, of source code. An idea that my friend Dirk turned many years later into jQAssistant, an extendable open source tool for software analytics.
Ever since, I have been intrigued by information that doesn’t live in silos but exhibits rich relationships of many kinds.
And fortunately this kind of data is everywhere around us — after all —
The World is a Graph
From complex protein structures and neuronal structures in biology, via relationships in your larger online and offline families and communities to networks of machines, services and users that form our working environments, connections shape the world. You cannot escape that fact.
It doesn’t matter if you deal with people, events, locations, products, documents, knowledge or pop-culture: without the connections holding everything together, the sum would be much less than its parts.
How is it that graphs, besides their use in marketing material and infographics, only showed up so late in the game?
The very old history of everything graph
Actually, the graph model is much older. It stems from the research of Leonhard Euler in the 18th century, who tried to solve some interesting problems (one of which is the famous Seven Bridges of Königsberg) in novel ways. By doing so, he created a new field of mathematics, graph theory, which has excited mathematicians and tortured students ever since.
What, Why and How?
But that’s not what I want to talk about; I’m more concerned with the creative side of things. Whenever we discuss or brainstorm concepts with colleagues or domain experts, or grow mind-maps to connect things we’ve learned, we leverage graph models to capture the relationships that turn the individual bits and pieces into meaningful networks.
Employing our creative, “right-side” hemisphere allows us to doodle and understand connections and their implications visually and conceptually.
But whenever we have to manage data, the logical “left side” takes over and makes us want to work with orderly lists and tables. Even I can’t actually remember the last time I drew a grocery shopping graph.
So there is something about graphs that is inherently creative and visual. No one working with complex information representation will deny that.
Graphs are helpful in many areas, especially when telling stories. Imagine all the characters, plots, arcs, places and events and their (invisible) relationships not just in Game of Thrones but also in our own history. How many lineages, time-lines, impact- and real maps were drawn in the course of it? All to represent how people, locations, actions and reactions related and affected each other.
But what about using this powerful graph model as part of our software systems? What would it take to use this versatile and flexible model of our world and make it part of the software that runs that world?
It would take a new kind of database, one that actually treats relationships as first-class citizens among its data structures, allowing for efficient storage and fast retrieval of connections between entities.
My personal graph journey started slowly but then rolled over my thinking and perception like a wave. And it never left me; actually I have to warn you:
Graph Thinking really is addictive
It is hard to look at complex data and not see the connections that are visible or hidden between the pieces of information. And that makes it so easy to explain it to other people.
Whenever I meet someone who is interested in learning about this “new” graph model, getting them started is a no-brainer. You just chat about the domain they are working in and grab a nearby whiteboard to draw on while they talk. And almost instantly you create an interesting sketch consisting of named circles and arrows that represent the most important concepts of that domain and the relationships between them.
Just as a side-note: these circles and arrows have many names, like nodes & relationships or vertices & edges in graph terminology, or just objects & references or entities & connections. Just so you know, in case people try to confuse you.
And you don’t stop there, you continue to explore the possibilities of what else you could add to the model under discussion, which would allow you to gain new insights or serve new use-cases. And at that point you suddenly see the epiphany light up behind your counterpart’s eyes and they continue to refine and extend the visual representation of their area of work themselves.
The Flexibility of the Graph Model makes it so useful
The model can be extended at any time: just add more and other data at the connection points or edges of your network. You want to add spatial, social, event or product information? Just go ahead and connect the bits. This ability makes even things like data integration and master data management, usually dreadful exercises, fun again.
And there is more. As the graph structure itself is pretty versatile, you can use new kinds of connections, e.g. (multidimensional) trees, lists and other fitting structures, directly to provide new means of accessing your core data.
You can decide ad-hoc from which point of view you want to look at your data, and how you want to navigate through the network to find, aggregate and project results while you do. You are not limited to a single projection or upfront aggregation, you keep the richness of the model while not compromising on performance.
And the performance aspect is where graph databases come in.
You might ask yourself: Isn’t that rich model also something I see in Entity-Relationship-Diagrams of relational databases all the time? And you’re right, that ER-diagram of yours is actually a graph. But try to find that graph in the physical representation of your data. There you will only see tables and tables and nothing else. No notion of real relationships.
And foreign keys don’t count. Why not? Every database that uses a key lookup mechanism (index-based or not) to compute relationships at query time fights a losing battle when it comes to querying along many, complex and arbitrary relationships. That is because foreign-key lookups are expensive and, depending on the database system, exponentially so when many of them are chained.
So how do graph databases deal with this problem? Unfortunately, they can’t do magic, although that would be really cool. They just employ the cheap trick of pre-materializing relationships. As relationships are first class citizens in those databases, it is easy to just store all relationships adjacent to the nodes that they connect. That’s why in graph speak those are called adjacency lists.
And you can imagine that having all these connections immediately available at your command makes it easy and fast to follow them, to reach the nodes on the other side, and to do so transitively. No computation is necessary to find matching records, just following links, much like on the web.
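To make the difference concrete, here is a minimal sketch in Python. It is a toy illustration, not how Neo4j or any other database is implemented: the sample triples and all function names are made up for this example. It contrasts finding neighbors by scanning a table of relationships with reading them from a pre-built adjacency list, and then follows links transitively.

```python
from collections import defaultdict

# Hypothetical sample data: (source, relationship type, target) triples.
EDGES = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "KNOWS", "dave"),
    ("alice", "LIKES", "graphs"),
]


def neighbors_by_scan(edges, node):
    """Foreign-key style: search the whole edge table for matching rows."""
    return [target for source, _, target in edges if source == node]


def build_adjacency(edges):
    """Pre-materialize relationships: store each one next to its start node."""
    adjacency = defaultdict(list)
    for source, rel_type, target in edges:
        adjacency[source].append((rel_type, target))
    return adjacency


def neighbors_by_adjacency(adjacency, node):
    """Graph style: read the links stored with the node, no searching."""
    return [target for _, target in adjacency.get(node, [])]


def reachable(adjacency, start):
    """Follow links transitively and collect every node reached from start."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for _, target in adjacency.get(current, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    seen.discard(start)
    return seen


ADJACENCY = build_adjacency(EDGES)
print(neighbors_by_adjacency(ADJACENCY, "alice"))
print(sorted(reachable(ADJACENCY, "alice")))
```

Both lookups return the same neighbors. The scan touches every relationship in the table for each hop, while the adjacency list only touches the relationships of the node at hand, which is what keeps multi-hop traversals cheap.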
Unlike on the web, having no dangling relationships or broken links is one of the few integrity constraints of graph databases. Otherwise, they are pretty flexible in how you want to store your data. You don’t have to define a schema up-front; you just store the attributes on nodes and relationships that you have available (or not). On relationships that’s mostly qualifying information like time, weights, costs, distances, ratings. Depending on the database, you’re able to tag nodes and relationships with one or more types or labels.
This also explains why modeling complex, connected information is such a hard thing to do in many databases. It is either a chore, unsatisfying, or just ignored altogether. In a graph database, it’s the most natural thing you do, just like connecting the dots in real life.
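As a rough sketch of what that means, here is a toy in-memory property graph in Python. The class and its methods are hypothetical and only meant for illustration; they are not the API of Neo4j or any other product. Nodes carry labels and free-form properties, relationships carry a type and properties, and the only rule enforced is that a relationship can never dangle.

```python
class PropertyGraph:
    """Toy in-memory property graph; not any real database's API."""

    def __init__(self):
        self.nodes = {}          # node id -> {"labels": set, "props": dict}
        self.relationships = []  # dicts with start, type, end, props

    def add_node(self, node_id, *labels, **props):
        # No schema: any labels and any properties are accepted.
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def add_relationship(self, start, rel_type, end, **props):
        # Integrity constraint: both ends must exist, so nothing can dangle.
        for node_id in (start, end):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.relationships.append(
            {"start": start, "type": rel_type, "end": end, "props": props}
        )

    def delete_node(self, node_id):
        # Refuse to delete a node that still has relationships attached.
        if any(node_id in (r["start"], r["end"]) for r in self.relationships):
            raise ValueError(f"node {node_id} still has relationships")
        del self.nodes[node_id]


# Hypothetical sample data.
graph = PropertyGraph()
graph.add_node("me", "Person", "Gamer", name="Me")
graph.add_node("home", "City")
graph.add_relationship("me", "LIVES_IN", "home", distance_km=3)
```

Note that the two nodes have different labels and properties, and nothing had to be declared beforehand. Trying to connect "me" to a node that was never created, or deleting "home" while the LIVES_IN relationship still exists, raises an error in this sketch.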
The most frequent use of a graph database is to begin exploring the bigger neighborhood around a number of initial nodes to gather insight or aggregate information. Working with these local queries makes graph databases resilient towards large dataset sizes. For instance, if I’m only interested in information within my neighborhood, city or state, the other 7bn people in the world are not relevant for the results of that query.
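Such a local query can be sketched as a breadth-first search that stops after a fixed number of hops. The Python below is an illustration under assumptions: the graph is a plain dictionary of neighbor lists, and the names `neighborhood`, `max_hops` and the sample data are made up for this example.

```python
from collections import deque


def neighborhood(adjacency, start, max_hops):
    """Return {node: hops} for every node within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


# Hypothetical sample data: "far1" and "far2" are not connected to "me".
GRAPH = {
    "me": ["ann", "ben"],
    "ann": ["me", "cat"],
    "ben": ["me"],
    "cat": ["ann", "dan"],
    "dan": ["cat"],
    "far1": ["far2"],
    "far2": ["far1"],
}

print(neighborhood(GRAPH, "me", 2))
```

The search never looks at "far1" or "far2", and it stops before reaching "dan". The work done depends on the size of the neighborhood, not on the size of the whole graph.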
And even for questions like LinkedIn’s “How am I connected to this person?”, the answer is only a shortest path search between those two people away, not a global number crunching activity.
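One way to run such a search is a bidirectional breadth-first search, which grows a frontier from each of the two people and stops when the frontiers meet. The Python sketch below is an illustration, not what LinkedIn or any graph database actually runs. It assumes an unweighted, undirected graph stored as a dictionary of neighbor lists, and the function names and sample network are invented for this example.

```python
def _join(parents_fwd, parents_bwd, meeting):
    """Stitch source -> meeting -> target together from both parent maps."""
    path = []
    node = meeting
    while node is not None:
        path.append(node)
        node = parents_fwd[node]
    path.reverse()
    node = parents_bwd[meeting]
    while node is not None:
        path.append(node)
        node = parents_bwd[node]
    return path


def shortest_path(adjacency, source, target):
    """Bidirectional BFS on an undirected graph; returns a node list or None."""
    if source == target:
        return [source]
    # The parent maps double as the visited sets of each direction.
    parents_fwd = {source: None}
    parents_bwd = {target: None}
    frontier_fwd = [source]
    frontier_bwd = [target]

    while frontier_fwd and frontier_bwd:
        # Expand the smaller frontier by one full level.
        forward = len(frontier_fwd) <= len(frontier_bwd)
        frontier = frontier_fwd if forward else frontier_bwd
        parents = parents_fwd if forward else parents_bwd
        others = parents_bwd if forward else parents_fwd

        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor in parents:
                    continue
                parents[neighbor] = node
                if neighbor in others:
                    # The two searches met: this node lies on a shortest path.
                    return _join(parents_fwd, parents_bwd, neighbor)
                next_frontier.append(neighbor)

        if forward:
            frontier_fwd = next_frontier
        else:
            frontier_bwd = next_frontier
    return None  # one side ran out of nodes: no connection exists


# Hypothetical sample network; every connection is listed in both directions.
NETWORK = {
    "me": ["ann", "ben"],
    "ann": ["me", "cat"],
    "ben": ["me", "cat"],
    "cat": ["ann", "ben", "dan"],
    "dan": ["cat"],
    "eve": [],
}

print(shortest_path(NETWORK, "me", "dan"))
```

Because each side only explores its own surroundings until the two meet, the search touches far fewer nodes than a single search radiating out from one person would in a large network.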
And that brings us back to the beginning, with me showing you the path towards a different kind of data model that has the potential of making you a happier person.
Despite me working mainly on Neo4j, it’s not the only graph database out there. The field is big and growing. Some other brand names are OrientDB, Titan and SparkSee. You can check db-engines.com for more details.
Of course, a graph database is no silver bullet. Which database(s) you use these days always depends on the shape of your data, your use-cases and your requirements. You make the choices, not someone on a golf course, so better be prepared to defend them. And that’s done best by choosing a database and then testing it with a vertical slice of your application use-case, including load-testing.
That’s what is called Polyglot Persistence or NoSQL — “Not only SQL”.
And now go off and see graphs, everywhere. ☺
If you enjoyed reading my story and you want to share your own graph epiphany, please go ahead and respond to this post. I won’t forget the promised, personalized t-shirts.
In my next post, I’d like to explore how you get started and model your own graph data, import it from a source and query it, both for insight and visualization.