Graph Databases: Living on the Edge
Welcome to the first entry of Charlotte’s Tangled Web, where we (the “Project Charlotte” team at High Alpha) plan to regularly discuss our challenges, architectural decisions, innovations, and big wins as we have them.
For anyone familiar with the concept of a graph database, the punny title likely brought on an unamused sigh. For those who are not as familiar with graph terms, the word “edge” was the pun in this case. The title has more meaning behind it than this specific post, though, as we believe we are on the leading edge of a major shift in graph database usage. This will become more evident as I dive into some of the technical challenges and decisions of Project Charlotte in terms of our use of a graph approach to mapping data.
So, What Is a Graph Database?
That question has a long-winded answer, but in short: a graph database is a datastore that uses graph structures (nodes, edges, and properties) to store and retrieve data. The key concept to understand about graph databases is that they use edges. To fully understand the basic structure of a graph database, one must understand its two required components: nodes and edges. The Node is a basic entity within a graph (person, place, thing, etc.). A node carries useful information about the entity it represents, similar to a row in a relational database or a document in a document store. Think of a node as a circular container with data in it. The Edge is essentially a line that connects one node to another.
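As a rough illustration of those two components, here is a minimal Python sketch. The class names, field names, and sample data are assumptions made up for this post; they are not how any particular graph database stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph (person, place, thing) plus its data."""
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A labeled line pointing from one node to another."""
    source: str  # id of the node the edge starts at
    label: str   # what kind of relationship this is
    target: str  # id of the node the edge points to
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one person, one place, one edge between them.
person = Node("person-1", {"name": "Person 1"})
office = Node("place-1", {"name": "Office"})
works_at = Edge(person.id, "works_at", office.id)
```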
These edges are what make graph databases special. This type of data abstraction is not realized (easily or at all) in other types of data stores. By analyzing and traversing edges, one can start to see patterns in data, understand relationships between different types of data, and reduce a massive set of data to a manageable subset based on the entry point into the graph. Let’s start with a basic example and build upon it. Let’s say we have six people (Person 1–6) who all work together: some are co-workers, some are friends outside of work, and some report to others. How might we represent this?
With the nodes and edges realized, one can start to draw some insights from the data. For example, every person within this group is a subordinate of Person 6 in one way or another. To deduce this, one can enter the graph at a subordinate level (e.g., Person 4) and recursively follow reports_to, which would lead to the following traversal: (P4) >> (P1) >> (P6). In a similar way, one could query for everyone who reports to Person 6, immediately or anywhere within the chain, by starting at Person 6 and recursively following the reverse edge of reports_to to get a list of all subordinates. Those are basic traversals, but they start to paint the picture of what may be possible with a graph database.
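To make those two traversals concrete, here is a minimal Python sketch that uses plain dictionaries rather than a real graph database. Only the chain (P4) >> (P1) >> (P6) comes from the example; the other reports_to edges and the function names are assumptions for illustration.

```python
from collections import defaultdict

# reports_to edges as (subordinate, manager) pairs.
# Only P4 -> P1 -> P6 comes from the text; the rest are assumed.
REPORTS_TO = [
    ("P4", "P1"),
    ("P1", "P6"),
    ("P2", "P6"),  # assumed
    ("P3", "P2"),  # assumed
    ("P5", "P2"),  # assumed
]

# Forward index: who does each person report to?
manager_of = dict(REPORTS_TO)

# Reverse index: who reports directly to each person?
reports_of = defaultdict(list)
for subordinate, manager in REPORTS_TO:
    reports_of[manager].append(subordinate)


def chain_of_command(person):
    """Follow reports_to upward until nobody is left (assumes no cycles)."""
    chain = [person]
    while chain[-1] in manager_of:
        chain.append(manager_of[chain[-1]])
    return chain


def all_subordinates(person):
    """Follow reports_to in reverse, collecting everyone below."""
    found = []
    stack = list(reports_of.get(person, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reports_of.get(current, []))
    return found


print(chain_of_command("P4"))          # ['P4', 'P1', 'P6']
print(sorted(all_subordinates("P6")))  # ['P1', 'P2', 'P3', 'P4', 'P5']
```

A graph database does this kind of recursive walk for you in the query itself; the sketch only shows the shape of the traversal.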
Edges are not only a means of pointing from node to node; they can also carry data that is relevant to that relationship. For example, on the reports_to edges, we could store a meeting frequency as a score of how often these individuals engage with each other, which tells us how strong their work relationship is. This way, we could find the “shortest path” from point to point based on that type of score. Even beyond that, edge data can also be used for filtering between intermediate nodes.
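Here is one way the score-based “shortest path” idea could look, as a hedged Python sketch. It assumes a made-up “meetings per month” property on each edge and turns it into a cost of 1 / meetings, so frequent meetings make an edge cheaper to cross. The names, numbers, and the cost formula are all assumptions; the search itself is standard Dijkstra.

```python
import heapq

# Undirected edges with an assumed "meetings per month" property.
# All names and numbers here are made up for illustration.
MEETINGS = {
    ("P1", "P6"): 8,
    ("P1", "P4"): 4,
    ("P2", "P6"): 1,
    ("P2", "P4"): 2,
}


def build_graph(edges):
    """Turn edge properties into an adjacency map of traversal costs."""
    graph = {}
    for (a, b), meetings in edges.items():
        cost = 1.0 / meetings  # frequent meetings -> cheaper to traverse
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def strongest_path(graph, start, goal):
    """Dijkstra's algorithm over the edge costs; returns (cost, path)."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (cost + edge_cost, neighbor, path + [neighbor])
                )
    return None  # no route between the two nodes


graph = build_graph(MEETINGS)
print(strongest_path(graph, "P4", "P6"))  # (0.375, ['P4', 'P1', 'P6'])
```

With this sample data the route through P1 wins, because P4 and P6 both meet with P1 far more often than with P2.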
Why Should You Be Using a Graph Database?
In today’s world, there is a vast amount of data. The data itself is valuable to an extent, but the real value comes from understanding the data and its relationship with other data. What hot buzzword does this sound like? Machine Learning! In nearly every way one thinks about data science and machine learning, a graph can assist. We lay no claim to being professional “data scientists” on the Charlotte Team, but we do analyze data, find relations, and use the data we have to make assumptions about future data, all to help the user. That’s what data science is all about: drawing insights from data.
Those insights (edges) between the data (nodes) naturally lend themselves to a graph structure. There are three very important things about most graph databases that help with machine learning and data science in general:
- The relationship-based (non-flat, layered) structure of a graph, which lets us traverse from one insight to the next.
- The dynamic data structure. Data changes, so the schema (or lack thereof) defining how the data is stored should be flexible as we scale systems at the insane rate we do today.
- The structure of a graph is naturally recursive. In order to have a deep understanding of data and its relationships, we need to understand the depth of the data.
What Are Some Other Use Cases?
There are many ways to dream up a use case for a graph database, but some common ones are social networks, recommendation engines, and fraud detection. Social networks are often stored in graphs because the structure of a graph maps directly onto the relationships between people. Recommendation engines are easily modeled as a graph as well since, to make a suggestion, one needs to analyze other similar data related to the user. For example, if user A likes many movies with Arnold Schwarzenegger, and user B has the same disposition toward Sylvester Stallone movies, one could (most likely with more data than just this) start to realize that both user A and user B like action movies, since those stars have a strong relation with the action genre. Fraud detection takes the approach of storing very large datasets and using the graph to find recurring patterns that reveal fraud techniques.
We have had some personal experience using graph databases, both in a production environment and for research and proofs of concept.
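The movie example can be sketched as a three-hop walk: user to movie, movie to star, star to genre. The following Python sketch is only an illustration; the edge names (LIKES, STARS, KNOWN_FOR), the user ids, and the tiny dataset are assumptions, and a real recommendation engine would weigh far more signals.

```python
from collections import Counter

# Edges grouped by type; every entry is illustrative sample data.
LIKES = {
    "user_a": ["The Terminator", "Predator"],
    "user_b": ["Rocky", "First Blood"],
}
STARS = {
    "The Terminator": ["Arnold Schwarzenegger"],
    "Predator": ["Arnold Schwarzenegger"],
    "Rocky": ["Sylvester Stallone"],
    "First Blood": ["Sylvester Stallone"],
}
KNOWN_FOR = {
    "Arnold Schwarzenegger": ["action"],
    "Sylvester Stallone": ["action"],
}


def genre_affinity(user):
    """Walk user -> movie -> star -> genre and count genre hits."""
    counts = Counter()
    for movie in LIKES.get(user, []):
        for star in STARS.get(movie, []):
            counts.update(KNOWN_FOR.get(star, []))
    return counts


print(genre_affinity("user_a"))  # Counter({'action': 2})
print(genre_affinity("user_b"))  # Counter({'action': 2})
```

Neither user shares a single movie or star with the other, yet both walks end at the same genre node, which is the kind of connection a graph surfaces naturally.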
In one instance, we created an enterprise security system, which we dubbed bouncer, to sit on top of an already existing application and manage/interpret granular permissions and security restrictions for all business objects in the existing system. Why did we choose a graph database for this implementation? Well, we had many business objects, many users, many accounts, many “share events”, and many, many, many types of permissions per business object as well as per user/role/group. This problem lends itself perfectly to a graph database, where the edges carry the access permissions and context on an object-to-object basis. Beyond that, it had to be extremely fast since it sat on top of an already existing application, and the graph approach delivered on that.
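To give a feel for permissions living on edges, here is a small Python sketch. It is not how bouncer was built; every name in it (alice, editors, staff, report_42, the member_of and grant edges) is hypothetical, and it only shows the general idea of collecting permissions along the paths from a user to an object.

```python
# Hypothetical edges; none of these names come from the real system.
# member_of: principal -> groups/roles it belongs to
MEMBER_OF = {
    "alice": ["editors"],
    "editors": ["staff"],
    "bob": ["staff"],
}
# grant edges: (principal, object) -> permissions carried on that edge
GRANTS = {
    ("staff", "report_42"): {"read"},
    ("editors", "report_42"): {"write"},
}


def effective_permissions(principal, obj):
    """Union the permissions on every grant edge reachable via member_of."""
    permissions = set()
    seen = set()
    stack = [principal]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against membership cycles
        seen.add(current)
        permissions |= GRANTS.get((current, obj), set())
        stack.extend(MEMBER_OF.get(current, []))
    return permissions


print(sorted(effective_permissions("alice", "report_42")))  # ['read', 'write']
print(sorted(effective_permissions("bob", "report_42")))    # ['read']
```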
Another use case was a proof of concept for tracking trending news events happening around the world. We needed to understand where to look for data, when, and for how long while a topic was trending. Since we had certain API rate limits to abide by, we wanted to optimize our requests by location. One way to do this is by creating buckets of events that fall into geohashes around the world. We could then adjust the precision and longevity of data on a per-bucket basis. The buckets were the finest-precision geohashes (whatever limit we had set), and each event (topic included) would get put into the smallest geohash it occurred in. Then, depending on the level of precision requested by the calling service, we could return the data by following a graph edge from each geohash at each level to its parent and neighbors. This allowed us to traverse the world by these reproducible and predictable areas. Each bucket also had edges going out to topics, so using all of the above connections, we could track topic trends by location around the world.
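The parent edges in that design are cheap to derive, because a geohash's parent cell is simply the same string with its last character removed. The Python sketch below relies on that property to roll topic counts up from fine buckets to coarser cells. The sample geohashes, topics, and function names are made up for illustration, and neighbor edges are left out to keep it short.

```python
from collections import Counter


def parent(geohash):
    """A geohash's parent cell is the same string minus its last character."""
    return geohash[:-1] if len(geohash) > 1 else None


def ancestors(geohash):
    """Follow parent edges from a bucket up to the coarsest cell."""
    chain = []
    current = parent(geohash)
    while current is not None:
        chain.append(current)
        current = parent(current)
    return chain


# (bucket geohash, topic) pairs; illustrative sample data only.
EVENTS = [
    ("dp3wj", "election"),
    ("dp3wm", "election"),
    ("dp3tb", "storm"),
]


def topics_in(cell, events):
    """Count topics for every bucket at or below the given cell."""
    counts = Counter()
    for bucket, topic in events:
        if bucket == cell or cell in ancestors(bucket):
            counts[topic] += 1
    return counts


print(ancestors("dp3wj"))         # ['dp3w', 'dp3', 'dp', 'd']
print(topics_in("dp3w", EVENTS))  # Counter({'election': 2})
print(topics_in("dp3", EVENTS))   # Counter({'election': 2, 'storm': 1})
```

Asking for a coarser cell widens the area and pulls in more buckets, which is how a caller could trade precision for coverage.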
Out With the Old, In With the Graph
The old, in this case, is the trusty relational database. Don’t get me wrong, I believe every piece of technology has its place, but let’s define that. What are relational data stores good at? They are great at defining a strict schema to abide by for data rules, they are tried and true, and there are many implementations to choose from. What a relational data store lacks is natural relations.
I know that sounds like an oxymoron, but think about it: When someone wants to draw a quick representation of a relational datastore, what does it look like? Typically it’s a bunch of nodes and edges. That is NOT how a relational database works, though. You basically put segregated data into different tables, with a column that holds a “key” linking it to another table. These, of course, are the joins. Joins are fine at small scale, but as the data grows and queries chain more of them together, they tend to become a performance bottleneck. We have been trained to think a SQL schema is a natural way to think about related data, but it’s just not (hence the need for ORMs in general).
When you flip sides and look at a graph database, it is loosely defined and changes as the data does, relations are the central idea of the structure, and traversing those relationships is fast, even at scale. Why is that? Well, the way I think about fast traversal is that I enter the graph at a specific, indexed node and travel directly to other nodes across an (indexed) edge between the two. Pretty simple, really. If you would like to dive in more on this comparison, there is a very informative post from the creators of the graph database that we use here.
Project Charlotte: What’s on the Horizon?
We are only scratching the surface on what is possible with Charlotte’s infrastructure. We will talk about many things in the coming weeks/months, but upcoming highlights are:
- Continuing the graph database discussion with our decision to use Dgraph as our graph database and a core component of our system, and what impact it has had on us
- A series on micro-service architecture/implementation featuring inter-service communication, service orchestration, distributed processing, Kubernetes, etc.
- Discussions on Golang and the role it plays in our entire system
Any sufficiently advanced technology is indistinguishable from magic. — Arthur C. Clarke
As the quote by Arthur C. Clarke suggests, advanced technology should be magical. That is certainly something we aim for with Project Charlotte. So what makes, for example, card magic seem so unreal? Typically it is the sleight of hand, which requires immense speed. In a similar fashion, if the data is naturally and deeply related, the graph structure will almost always win with respect to speed when traversing relationships. The relationship is a “first class citizen” in the graph datastore and at the forefront of the design. If, in fact, you have any data that falls into use cases with related data (which I would wager you do), consider adding a graph database to your arsenal. Better yet, consider Dgraph, as I cannot recommend that datastore and team enough, but more on that later!
High Alpha is a venture studio pioneering a new model for entrepreneurship that unites company building and venture capital. To learn more, visit highalpha.com or subscribe to our newsletter.